<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.177414.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Adaptive Phoneme State Learning Architecture for Enhanced Speech Recognition Using Backpropagation Neural Network and Hidden Markov Model</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 1 approved, 2 not approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Siddalingappa</surname>
                        <given-names>Rashmi</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-9786-8436</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>S</surname>
                        <given-names>Deepa</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Savitha</surname>
                        <given-names>Margaret</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>P</surname>
                        <given-names>Kalpana</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Stella Mary I</surname>
                        <given-names>Priya</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Gornale</surname>
                        <given-names>Shivanand</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-5373-4049</uri>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>B A</surname>
                        <given-names>Lakshmi</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <xref ref-type="aff" rid="a4">4</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Li</surname>
                        <given-names>Kefeng</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a5">5</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Wen Goh</surname>
                        <given-names>Khang</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a6">6</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Computer and Data Science, York St John University, London, England, E14 2BA, UK</aff>
                <aff id="a2">
                    <label>2</label>Christ University, Bengaluru, Karnataka, India</aff>
                <aff id="a3">
                    <label>3</label>Department of Computer Science, Rani Channamma University, Belagavi, Karnataka, India</aff>
                <aff id="a4">
                    <label>4</label>UST Global, Bangalore, Karnataka, India</aff>
                <aff id="a5">
                    <label>5</label>Macao Polytechnic University, Macau, Macao</aff>
                <aff id="a6">
                    <label>6</label>INTI International University &amp; Colleges, Nilai, Negeri Sembilan, Malaysia</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:r.siddalingappa@yorksj.ac.uk">r.siddalingappa@yorksj.ac.uk</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>3</day>
                <month>6</month>
                <year>2026</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2026</year>
            </pub-date>
            <volume>15</volume>
            <elocation-id>338</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>27</day>
                    <month>5</month>
                    <year>2026</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Siddalingappa R et al.</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/15-338/pdf"/>
            <abstract>
                <p>Speech remains a primary mode of human communication; however, automated speech recognition (ASR) systems face challenges from accent variability, temporal fluctuations, noise, and data privacy concerns. This paper proposes an enhanced ASR architecture incorporating an Adaptive Phoneme State Learning (APSL) algorithm with a Backpropagation Neural Network (BPNN) and Hidden Markov Model (HMM). APSL dynamically adjusts HMM state probabilities using phoneme confidence scores derived from the BPNN, thereby improving phoneme transition modeling and alignment. The multi-stage ASR pipeline includes noise reduction, speech-pause detection, and feature extraction via framing and windowing. APSL&#x2019;s adaptive mechanism reduces ambiguities in phoneme transitions, resulting in a more accurate speech-to-text conversion. A comparative evaluation framework assesses the baseline HMM, standalone BPNN, and integrated APSL-BPNN-HMM model. Experiments were conducted using a custom-built dataset of 2000 audio files alongside five benchmark corpora: BNC, ANC, COCA, Buckeye, and Emu. Key evaluation metrics&#x2014;recall, precision, F-score, and Word Error Rate (WER)&#x2014;demonstrate that the APSL-enhanced model significantly outperforms baseline systems, achieving 95.7% recall, 92.95% precision, 94.53% F-score, and 96% overall accuracy. Notably, APSL-BPNN-HMM consistently yielded the lowest WER across all datasets, validating its effectiveness. This work highlights the benefits of adaptive learning in probabilistic frameworks for achieving robust and accurate speech recognition.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>acoustic modeling</kwd>
                <kwd>back propagation neural networks</kwd>
                <kwd>hidden markov model</kwd>
                <kwd>speech recognition</kwd>
                <kwd>voice activity detection</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 1</title>
                <p>This revised version addresses the concerns raised during peer review. First, the literature review has been substantially restructured into four subsections (Historical Foundations, Traditional HMM-Based Systems, Motivation for Hybrid Approaches, and Research Gap), with explicit positioning of our work relative to modern end-to-end architectures including Whisper and Conformer. Second, the Adaptive Phoneme State Learning (APSL) algorithm now includes rigorous mathematical formalization with a Bayesian derivation of the confidence-weighted probability fusion, replacing the previous heuristic description. Third, a new comparative analysis against contemporary ASR systems (Table 5) has been added to contextualize our hybrid approach, along with explicit justification for choosing HMM-BPNN over end-to-end models based on interpretability and reproducibility requirements. Fourth, empirical justifications have been added for key parameters including the 5 ms dynamic window adjustment, validated via grid search on a 500-utterance holdout set. Fifth, the experimental protocol now includes detailed data partitioning strategies with speaker-level leakage prevention and a hardware specifications table. Sixth, real-time performance analysis has been expanded with optimization recommendations for deployment. Seventh, the memory-accuracy trade-off validation is now presented with ablation study results demonstrating that adaptive parameter sharing (k=128) achieves 96% accuracy while reducing memory from 24 GB to 15.12 GB. These revisions strengthen the theoretical foundation, reproducibility, and practical relevance of the proposed APSL-BPNN-HMM speech recognition framework.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec id="sec1" sec-type="intro">
            <title>1. Introduction</title>
            <p>Speech is a dynamic cascade of thoughts produced by articulating utterances in natural language. The visual representation of language is called &#x2018;graphemes,&#x2019; while the sound representation is called &#x2018;phonemes.&#x2019; In linguistics, the study of phonemes encompasses &#x201c;Phonetics&#x201d; and &#x201c;Phonology.&#x201d; Phonetics examines the physical properties of speech sounds, including their production by vocal organs (articulatory phonetics), auditory perception (auditory phonetics), and acoustic properties (acoustic phonetics). Phonology studies sound patterns and the systematic organization of sounds within a linguistic system.
                <sup>
                    <xref ref-type="bibr" rid="ref1">1</xref>
                </sup> These disciplines enable the transformation of graphemes into phonemes (text-to-speech, TTS) and vice versa (speech-to-text, STT). A speech recognition model (SRM) comprises three primary elements: i) Feature Extraction, which captures features and computes HMM states by transforming speech signals into spectral attributes mapped onto phonemic structures, yielding syllabic probability scores,
                <sup>
                    <xref ref-type="bibr" rid="ref2">2</xref>
                </sup> ii) Acoustic model, which identifies sound structures and extracts textual elements from spoken words,
                <sup>
                    <xref ref-type="bibr" rid="ref3">3</xref>
                </sup> and iii) Language model, which deciphers spectral attributes into meaningful word representations.
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> These processes require a pipeline architecture due to cross-language integration challenges. While training corpora must encompass all phoneme variations, storing every word-phoneme pair is impractical given memory and computational constraints. Machine learning addresses this through statistical models like HMM, enabling phoneme representation learning with limited data.
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup> Despite advances in automatic speech recognition (ASR), three fundamental challenges persist in real-world deployment: (1) Accent and dialect variability - most ASR systems are trained on standard English, failing for Indian, Nigerian, or Scottish accents; (2) Noise robustness - cafeteria chatter, road noise, and low-quality microphones degrade performance dramatically; (3) Interpretability - end-to-end deep learning models offer no insight into phoneme-level errors, complicating clinical or forensic applications. Hidden Markov Models (HMMs) provide explicit temporal structure and interpretability but suffer from poor acoustic modeling. Neural networks excel at feature extraction but lack temporal constraints. Hybrid HMM-neural systems have been explored, but existing approaches treat neural outputs as static observations rather than adaptive confidence signals.</p>
            <p>This research introduces the Adaptive Phoneme State Learning (APSL) algorithm, which integrates a Backpropagation Neural Network (BPNN) with an HMM to dynamically refine phoneme state transitions. The objectives are to: i) develop a speech recognition interface for English phonemes, ii) transcribe spoken words into text, iii) enhance scalability and efficiency to reduce training time, iv) achieve human-level performance in real-time scenarios, and v) validate the methodology through comprehensive evaluation metrics including F-measure, recall, precision, and accuracy. The specific gap this paper addresses is: how can HMM state transitions be adapted dynamically using neural network confidence scores to improve phoneme alignment without retraining the entire model? Our contributions are: i) the APSL algorithm, a mathematically principled framework for confidence-weighted HMM adaptation, ii) empirical validation on five standard corpora plus 2000 custom audio files (24 GB total), and iii) a demonstrated memory reduction from 24 GB to 15.12 GB (approximately 37%) with 96% accuracy. While end-to-end (E2E) models such as Transformer-based ASR, Connectionist Temporal Classification (CTC), and RNN-Transducers (RNN-T) have achieved state-of-the-art results on large-scale benchmarks, they present three critical limitations that motivate our hybrid approach: (1) 
                <bold>Data hunger</bold> - E2E models typically require hundreds to thousands of hours of transcribed speech; (2) 
                <bold>Computational opacity</bold> - E2E models offer limited interpretability of phoneme-level decisions, complicating error diagnosis; (3) 
                <bold>Resource constraints</bold> - Deploying large Transformer models on edge devices remains challenging. Our APSL-BPNN-HMM framework deliberately retains HMM's explicit temporal structure while augmenting it with adaptive neural confidence scoring, offering a computationally efficient alternative (15.12 GB memory vs. 50-100+ GB for large E2E models) with interpretable phoneme-state alignments. This is particularly valuable for low-resource accents and privacy-sensitive federated learning scenarios. While deep transfer learning has enabled ASR systems to generalize across domains with limited data,
                <sup>
                    <xref ref-type="bibr" rid="ref38">38</xref>
                </sup> these approaches still require large pre-trained models that are unsuitable for edge deployment. Our APSL framework offers an alternative: explicit phoneme-state adaptation without large-scale pre-training. The paper is structured as follows: Section 2 reviews HMM-based speech recognition literature, Section 3 outlines the architectural model and methodology, Section 4 explains voice activity detection and textual computation algorithms, Section 5 discusses the experimental setup, Section 6 presents results and future directions, and Section 7 concludes the study.</p>
        </sec>
        <sec id="sec2">
            <title>2. Research background</title>
            <sec id="sec22">
                <title>2.1 Historical foundations</title>
                <p>The roots of phonetics trace back to as early as 500 BC on the Indian subcontinent, with Panini meticulously describing the place and manner of articulation of consonants in Sanskrit.
                    <sup>
                        <xref ref-type="bibr" rid="ref6">6</xref>
                    </sup> The chronicles of speech recognition date to 2002, culminating in a final output release in 2005, functioning proficiently across three languages: English, Spanish, and Mandarin.
                    <sup>
                        <xref ref-type="bibr" rid="ref7">7</xref>
                    </sup> Operating at a speech rate of 10 Hz with a recording precision of 96 kHz/24-bit, this innovation marked a pivotal milestone. Fast-forward to 2019, another speech synthesizer emerged during the &#x201c;Blizzard Challenge&#x201d;,
                    <sup>
                        <xref ref-type="bibr" rid="ref8">8</xref>
                    </sup> pronouncing 1200 phonetic utterances at a frequency of 1.5 Hz. These early developments established the foundational principles of acoustic modeling and phonetic analysis that continue to inform contemporary speech recognition research.</p>
            </sec>
            <sec id="sec92">
                <title>2.2 Traditional HMM-based speech recognition systems</title>
                <p>Several researchers have contributed to the advancement of HMM-based speech recognition systems, as summarized in 
                    <xref ref-type="table" rid="T1">
Table 1</xref>. These studies cover phonetic segmentation, speech synthesis, and recognition across different languages and acoustic conditions. Although they established the HMM as a viable approach to speech recognition, they exhibit notable limitations: moderate reported accuracy (61.5% to 89%), language-specific implementations that limit cross-linguistic applicability, and limited handling of diverse speech qualities and acoustic environments.</p>
                <table-wrap id="T1" orientation="portrait" position="float">
                    <label>
Table 1. </label>
                    <caption>
                        <title>Literature survey summary.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Refs.</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Problem/Focus</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Core method</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Datasets/Setup</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Key findings</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Limitations</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref9">9</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Homophonic ambiguities in Malay name retrieval</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Soundex and Asoundex methods for generating name codes</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Malay names corpus</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Improved accuracy by 38.3% compared to prior methods</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Limited to name retrieval; not applicable to continuous speech recognition</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref10">10</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cross-language phonetic segmentation</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">HMM-based phonetic segmentation framework</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Appen Spanish speech corpus</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Achieved approximately 61.5% accuracy</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Moderate accuracy; requires improvement for practical deployment</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref11">11</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Phonetic-based recognition of semivowel sounds</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Comparison of HMM and MFCC-based recognizers</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">TI46 database</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Explored novel avenues in phonetic analysis of semivowels</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Specific to semivowel recognition; limited generalization to broader phoneme classes</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref12">12</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Phonetic segmentation based on speech analysis</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Microcanonical Multiscale Formalism (MMF) technique</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Speech corpus with varied phonetic contexts</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">6% improvement in segmentation accuracy</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Modest accuracy gains; computational complexity not addressed</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref13">13</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Arabic speech recognition with pronunciation variations</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">HMM for associating diverse pronunciations</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Arabic speech corpus</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Minimized phonetic out-of-vocabulary rate; demonstrated HMM efficacy</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Language-specific; limited discussion of cross-linguistic applicability</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref14">14</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Speech synthesis for Indian English syllables</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">HMM-based speech synthesizer</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Indian English syllable dataset</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Achieved 89% accuracy</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Syllable-word model not delineated; accuracy limited for complex utterances</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref15">15</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Murmured speech recognition and conversion</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">HMM with posterior decoding approach</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Murmured speech dataset</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Attained 81.2% accuracy in murmur-to-normal speech conversion</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Moderate accuracy; challenges in handling diverse speech qualities</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <sup>
                                        <xref ref-type="bibr" rid="ref16">16</xref>
                                    </sup>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Speech recognition using time and frequency analysis</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">HMM with time and frequency response extraction techniques</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Standard speech corpus</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Explored feature extraction methods for HMM-based recognition</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Limited performance metrics reported; scalability not discussed</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <table-wrap id="T2" orientation="portrait" position="float">
                    <label>
Table 2. </label>
                    <caption>
                        <title>Phoneme dynamic warping table for the example sentence.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">The</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Joy</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Of</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Living</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Is</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">To</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Love</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">And</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Respect</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">A0</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A3</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A4</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A5</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A6</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A7</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A8</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec id="sec82">
                <title>2.3 Motivation for hybrid HMM-Neural approaches</title>
                <p>Traditional HMM systems provide explicit temporal structure and interpretability through their state transition probabilities and emission distributions. However, their acoustic modeling capacity is constrained by the conditional independence assumption of observations given the hidden state. Neural networks, conversely, excel at learning complex, non-linear feature representations from raw data but lack inherent temporal constraints and explicit alignment mechanisms. This complementary relationship has motivated the development of hybrid architectures that combine the strengths of both paradigms. Hybrid models combining generative components (such as Gaussian Mixture Models) and discriminative components (such as neural networks) have demonstrated superior performance on specialized tasks where pure end-to-end models struggle due to limited training data.
                    <sup>
                        <xref ref-type="bibr" rid="ref39">39</xref>
                    </sup> This finding motivates our HMM-BPNN hybridization, which similarly combines probabilistic temporal modeling with neural discriminative power while preserving the explicit state alignment that HMMs provide. Recent comparative studies of temporal modeling architectures for noisy speech recognition have shown that hybrid CNN-LSTM models achieve 90.72% accuracy in noisy conditions, outperforming standalone CNN (90.12%) and LSTM (86.12%).
                    <sup>
                        <xref ref-type="bibr" rid="ref40">40</xref>
                    </sup> However, these recurrent and convolutional architectures lack explicit phoneme-state alignment and do not provide interpretable confidence scores at the phoneme level&#x2014;a gap our HMM-based approach with APSL directly addresses.</p>
            </sec>
            <sec id="sec182">
                <title>2.4 Research gap and contributions of the present study</title>
                <p>Despite the advances in hybrid HMM-neural systems documented in the literature, existing approaches treat neural network outputs as static observations and integrate them with HMM probabilities using fixed interpolation weights that do not vary with input characteristics. To our knowledge, no existing work dynamically adjusts HMM transition or emission probabilities based on neural confidence scores in an online, utterance-specific manner. Furthermore, most hybrid systems do not adapt their temporal analysis windows in response to classification uncertainty, potentially wasting computational resources on clear speech segments while providing insufficient acoustic context for ambiguous phonemes. Against this backdrop, the present study introduces the Adaptive Phoneme State Learning (APSL) algorithm, which integrates four key innovations: (i) labeling of synthetic waveforms with distinct features to improve phoneme discriminability during model training; (ii) MFCC-based dynamic feature extraction, which uses filtering to extract feature coefficients as an energy measure for robust acoustic representation; (iii) bidirectional spectral training, which addresses the challenge of insufficient training observations in HMM models by encompassing both forward and backward spectral features, and which introduces time-dependent windowing factors to reduce memory requirements and optimize likelihood summation across all states; and (iv) confidence-driven adaptive refinement, which dynamically adjusts HMM state transitions using BPNN posterior probabilities and applies a dynamic window extension (&#x0394;t = 5 ms) to low-confidence phonemes to provide additional acoustic context where it is needed most.</p>
                <p>The proposed APSL-BPNN-HMM model demonstrates robust recognition accuracy even in noisy environments, as validated through extensive experimentation across multiple speech corpora described in subsequent sections.</p>
            </sec>
        </sec>
        <sec id="sec3">
            <title>3. Architecture of speech recognition model for speech-to-text process</title>
            <p>The proposed APSL-BPNN-HMM architecture integrates multiple components to enhance speech recognition through effective signal processing and machine learning, as shown in 
                <xref ref-type="fig" rid="f1">
Figure 1</xref>. The input audio signal is processed through a Speech Acquisition module for proper sampling and data segmentation. Given the stochastic nature of speech signals, Voice Activity Detection (VAD) distinguishes between speech and non-speech regions, improving noise reduction and signal normalization. Feature Extraction employs Mel-frequency cepstral coefficients (MFCC) with preprocessing steps including pre-emphasis (boosting high frequencies) and framing (segmenting data into manageable frames), retaining essential phonetic and linguistic information. The extracted features undergo windowing, segmenting frames into overlapping windows activated using bi-gram lexicon combinations to ensure meaningful word boundaries. The Adaptive Phoneme State Learning (APSL) module enhances segment identification and labeling, improving feature sequence reliability for model training. The labeled features are fed into a Backpropagation Neural Network (BPNN), which refines feature representations and generates intermediate outputs for the Hidden Markov Model (HMM).
                <sup>
                    <xref ref-type="bibr" rid="ref17">17</xref>
                </sup> The HMM models temporal dependencies and stochastic patterns, segmenting speech into phonemes, words, and sentences. Bi-gram connections model phoneme and word transitions, ensuring improved accuracy. The speech recognition module identifies and classifies predicted speech patterns, with performance evaluated using Accuracy, Precision, Recall, F1-score, and Word Error Rate (WER). This architecture effectively addresses noise reduction, signal normalization, and robust speech recognition in dynamic environments through integrated APSL segmentation, MFCC-based feature extraction, and HMM temporal modeling.</p>
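            <p>To illustrate the last of these metrics, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognized transcript, divided by the number of reference words. The following Python sketch assumes whitespace tokenization and exact string matching; the function name <monospace>word_error_rate</monospace> is hypothetical and is not taken from the authors' implementation.</p>
            <code language="python">
```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One deleted word out of four reference words.
print(word_error_rate("the joy of living", "the joy living"))  # 0.25
```
</code>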
            <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                <label>
Figure 1. </label>
                <caption>
                    <title>APSL-BPNN-HMM Architecture for Speech Recognition &#x2014; The proposed architecture integrates key components such as Voice Activity Detection (VAD), Mel-frequency cepstral coefficients (MFCC) based feature extraction with pre-emphasis and framing, Adaptive Phoneme State Learning (APSL) for enhanced segmentation, and a combination of Backpropagation Neural Network (BPNN) and Hidden Markov Model (HMM).</title>
                </caption>
                <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure1.gif"/>
            </fig>
            <sec id="sec4">
                <title>3.1 Speech acquisition</title>
                <p>Raw speech signals are acquired through microphones, online audio files, or audio CDs. The sampling frequency must be configured correctly before recording. For example, a 100-second audio file sampled at 44,100 Hz yields 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mn>44100</mml:mn>
                            <mml:mo>&#x00d7;</mml:mo>
                            <mml:mn>100</mml:mn>
                            <mml:mo>=</mml:mo>
                            <mml:mn>4,410,000</mml:mn>
                        </mml:math>
</inline-formula> samples, consistent with CD-quality audio. According to the Nyquist sampling theorem,
                    <sup>
                        <xref ref-type="bibr" rid="ref18">18</xref>
                    </sup> the sampling rate must be at least twice the maximum signal frequency to avoid aliasing. For instance, a signal whose highest frequency component is 10,000 Hz requires a sampling rate of at least 20,000 Hz. Sampling frequency selection involves a trade-off between audio quality and memory consumption: lower sampling rates reduce memory usage but compromise quality, while higher sampling rates enhance fidelity at the cost of increased storage. The optimal balance depends on application-specific requirements.</p>
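                <p>The arithmetic in this subsection can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the storage estimate assumes uncompressed 16-bit mono audio (2 bytes per sample), which is not stated in the text.</p>
                <code language="python">
```python
def sample_count(duration_s: float, sampling_rate_hz: int) -> int:
    """Number of samples in a single-channel recording."""
    return round(duration_s * sampling_rate_hz)


def minimum_sampling_rate_hz(max_signal_frequency_hz: int) -> int:
    """Nyquist lower bound: twice the highest frequency present in the signal."""
    return 2 * max_signal_frequency_hz


def uncompressed_size_bytes(duration_s: float, sampling_rate_hz: int,
                            bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Storage estimate; bytes_per_sample=2 (16-bit PCM) is an assumption."""
    return sample_count(duration_s, sampling_rate_hz) * bytes_per_sample * channels


print(sample_count(100, 44_100))             # 4410000
print(minimum_sampling_rate_hz(10_000))      # 20000
print(uncompressed_size_bytes(100, 44_100))  # 8820000
```
</code>
                <p>Halving the sampling rate halves the storage estimate, which is the quality-versus-memory trade-off described above.</p>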
            </sec>
            <sec id="sec5">
                <title>3.2 Voice activity detection (VAD)</title>
                <p>Voice Activity Detection (VAD) comprises two stages: noise removal and speech pause detection. Noise elimination employs a Training-Based Noise Removal Technique (TBNRT),
                    <sup>
                        <xref ref-type="bibr" rid="ref19">19</xref>
                    </sup> utilizing a corpus of noise types from white to environmental noise. Noise segments matching the noise dictionary are removed using high-pass and low-pass filters. Endpoint detection utilizes algorithms based on energy variance, pitch modulation, zero-crossing rate, cepstral parameters, or linear prediction coding (LPC).
                    <sup>
                        <xref ref-type="bibr" rid="ref20">20</xref>
                    </sup> VAD applies the min/max energy threshold (ET) paradigm. For sample 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:msub>
                                    <mml:mi>B</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:msub>
                        </mml:math>
</inline-formula> in each speech segment 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>B</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula>, ET is defined at indices 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>x</mml:mi>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>y</mml:mi>
                        </mml:math>
</inline-formula>, where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>x</mml:mi>
                        </mml:math>
</inline-formula> represents the total signal duration and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>y</mml:mi>
                        </mml:math>
</inline-formula> represents the duration within block 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>B</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula>. 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> denotes the speech signal in each segment, where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>S</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">{</mml:mo>
                                <mml:mn>1</mml:mn>
                                <mml:mo>,</mml:mo>
                                <mml:mn>2</mml:mn>
                                <mml:mo>,</mml:mo>
                                <mml:mo>&#x2026;</mml:mo>
                                <mml:mo>,</mml:mo>
                                <mml:mi>n</mml:mi>
                                <mml:mo stretchy="true">}</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula>.</p>
                <p>

                    <bold>Step&#x2013;1:</bold> The energy is calculated using 
                    <xref ref-type="disp-formula" rid="e1">Equation (1)</xref>:
                    <disp-formula id="e1">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>E</mml:mi>
                                <mml:mi>x</mml:mi>
                            </mml:msub>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>x</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>=</mml:mo>
                            <mml:munderover>
                                <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                <mml:mrow>
                                    <mml:mi>y</mml:mi>
                                    <mml:mspace width="0.25em"/>
                                    <mml:mo>&#x2208;</mml:mo>
                                    <mml:mspace width="0.25em"/>
                                    <mml:msub>
                                        <mml:mi>S</mml:mi>
                                        <mml:msub>
                                            <mml:mi>B</mml:mi>
                                            <mml:mi>i</mml:mi>
                                        </mml:msub>
                                    </mml:msub>
                                </mml:mrow>
                                <mml:mi>N</mml:mi>
                            </mml:munderover>
                            <mml:msubsup>
                                <mml:mi>S</mml:mi>
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                                <mml:mn>2</mml:mn>
                            </mml:msubsup>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>x</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>

                        <label>(1)</label>
</disp-formula>
                </p>
                <p>

                    <bold>Step&#x2013;2:</bold> The VAD decision is obtained using 
                    <xref ref-type="disp-formula" rid="e2">Equation (2)</xref>:
                    <disp-formula id="e2">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>B</mml:mi>
                                <mml:mi>x</mml:mi>
                            </mml:msub>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>x</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>=</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">{</mml:mo>
                                <mml:mtable columnalign="center">
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                            <mml:mo>,</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mi>T</mml:mi>
                                                <mml:mi>M</mml:mi>
                                            </mml:msub>
                                            <mml:mrow>
                                                <mml:mo stretchy="true">(</mml:mo>
                                                <mml:mi>x</mml:mi>
                                                <mml:mo stretchy="true">)</mml:mo>
                                            </mml:mrow>
                                            <mml:mspace width="0.25em"/>
                                            <mml:mo>&#x2265;</mml:mo>
                                            <mml:mspace width="0.25em"/>
                                            <mml:msub>
                                                <mml:mi>T</mml:mi>
                                                <mml:mi>B</mml:mi>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>0</mml:mn>
                                            <mml:mo>,</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mi>T</mml:mi>
                                                <mml:mi>M</mml:mi>
                                            </mml:msub>
                                            <mml:mrow>
                                                <mml:mo stretchy="true">(</mml:mo>
                                                <mml:mi>x</mml:mi>
                                                <mml:mo stretchy="true">)</mml:mo>
                                            </mml:mrow>
                                            <mml:mo>&lt;</mml:mo>
                                            <mml:msub>
                                                <mml:mi>T</mml:mi>
                                                <mml:mi>B</mml:mi>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:math>

                        <label>(2)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>m</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>M</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> are the minimum and maximum thresholds, respectively, and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>B</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the base threshold.</p>
                <p>

                    <bold>Step&#x2013;3:</bold> When 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>m</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is reached, the signal breaks until the next 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>m</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is reached.</p>
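                <p>Steps 1 and 2 can be sketched in Python as follows. The sketch assumes non-overlapping blocks of fixed length and compares the energy of each block directly with the base threshold <italic>T</italic><sub>B</sub>; the function names, the block length, and the threshold value are illustrative choices and do not come from the authors' implementation.</p>
                <code language="python">
```python
def block_energy(block):
    """Sum of squared sample values in one block (cf. Equation 1)."""
    return sum(sample * sample for sample in block)


def vad_flags(signal, block_length, base_threshold):
    """Return 1 for blocks whose energy reaches the base threshold, else 0 (cf. Equation 2)."""
    flags = []
    # Step through the signal in non-overlapping blocks; a trailing partial
    # block is ignored in this sketch.
    for start in range(0, len(signal) - block_length + 1, block_length):
        energy = block_energy(signal[start:start + block_length])
        flags.append(1 if energy >= base_threshold else 0)
    return flags


# Toy signal: one silent block followed by one loud block.
toy_signal = [0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.5, -0.5]
print(vad_flags(toy_signal, block_length=4, base_threshold=0.5))  # [0, 1]
```
</code>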
                <p>VAD extracts speech features every 5&#x2013;40 ms and compares them to the base threshold 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>B</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula>. Features exceeding 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>B</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> yield VAD = 1 (speech present); otherwise VAD = 0 (no speech). Initially assuming a 40 ms segment contains no speech, we analyze frames of 60 samples (6 ms duration) collected at 70 kHz. The average threshold for each frame is determined using 
                    <xref ref-type="disp-formula" rid="e3">Equation (3)</xref>:
                    <disp-formula id="e3">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mtext>mean</mml:mtext>
                            </mml:msub>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mn>1</mml:mn>
                                <mml:mi>M</mml:mi>
                            </mml:mfrac>
                            <mml:munderover>
                                <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                <mml:mrow>
                                    <mml:mi>n</mml:mi>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0</mml:mn>
                                </mml:mrow>
                                <mml:mi>N</mml:mi>
                            </mml:munderover>
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>x</mml:mi>
                            </mml:msub>
                        </mml:math>

                        <label>(3)</label>
</disp-formula>
                </p>
                <p>Since loudness varies among speakers, we focus on minimum loudness. Using Praat,
                    <sup>
                        <xref ref-type="bibr" rid="ref21">21</xref>
                    </sup> we analyzed loudness ranges to categorize 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>m</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula>, employing a Python script to eliminate signals at the 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>m</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> threshold. For instance, the quietest sound measured 59.3 dB, with quiet segments ranging from 59 to 62 dB. The first segment below 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>B</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is designated as 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>m</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula>. Speech typically begins softly, peaks at maximum 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>T</mml:mi>
                                <mml:mi>M</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula>, then decreases, defining the minimum-maximum energy range. The quiet threshold is set at &#x2212;25.0 dB, and segments below it are classified as quiet. Two temporal constraints apply: a minimum pause duration of 0.1 seconds between words (longer for sudden loud sounds; shorter pauses are not classified as quiet) and a minimum sounding time of 0.05 seconds (representing inter-syllable pauses).</p>
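                <p>One way to apply the quiet threshold together with the two temporal constraints is sketched below in Python. This is a minimal illustration under stated assumptions, not the authors' script: it assumes a sequence of frame-level intensity values expressed on the same dB scale as the threshold, a fixed frame step, and a particular order of operations (short pauses are absorbed first, then short sounding runs). All function and parameter names are hypothetical.</p>
                <code language="python">
```python
from itertools import groupby


def _merge(runs):
    """Merge adjacent runs that share a label."""
    merged = []
    for label, n_frames in runs:
        if merged and merged[-1][0] == label:
            merged[-1][1] += n_frames
        else:
            merged.append([label, n_frames])
    return merged


def _absorb_short(runs, label, min_frames):
    """Relabel runs of `label` shorter than `min_frames`, then merge neighbours."""
    other = "sounding" if label == "quiet" else "quiet"
    relabelled = [
        [other if (run_label == label and min_frames > n_frames) else run_label, n_frames]
        for run_label, n_frames in runs
    ]
    return _merge(relabelled)


def label_intervals(levels_db, frame_step_s, quiet_threshold_db=-25.0,
                    min_pause_s=0.1, min_sounding_s=0.05):
    """Return (label, start_s, end_s) tuples for quiet and sounding intervals."""
    labels = ["sounding" if level >= quiet_threshold_db else "quiet"
              for level in levels_db]
    runs = [[label, len(list(group))] for label, group in groupby(labels)]
    # Assumed order: drop pauses that are too short, then sounding runs that are too short.
    runs = _absorb_short(runs, "quiet", round(min_pause_s / frame_step_s))
    runs = _absorb_short(runs, "sounding", round(min_sounding_s / frame_step_s))
    intervals, start = [], 0
    for label, n_frames in runs:
        intervals.append((label, start * frame_step_s, (start + n_frames) * frame_step_s))
        start += n_frames
    return intervals
```
</code>
                <p>With a 10 ms frame step, a 30 ms dip below the threshold is treated as part of the surrounding speech, whereas a 120 ms dip is kept as a pause.</p>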
            </sec>
            <sec id="sec6">
                <title>3.3 Feature extraction</title>
                <p>Feature extraction techniques include mel-frequency cepstral coefficients (MFCC),
                    <sup>
                        <xref ref-type="bibr" rid="ref22">22</xref>
                    </sup> vector quantization (VQ),
                    <sup>
                        <xref ref-type="bibr" rid="ref23">23</xref>
                    </sup> artificial neural networks (ANN),
                    <sup>
                        <xref ref-type="bibr" rid="ref24">24</xref>
                    </sup> Hidden Markov Models (HMM),
                    <sup>
                        <xref ref-type="bibr" rid="ref25">25</xref>
                    </sup> and dynamic time warping (DTW).
                    <sup>
                        <xref ref-type="bibr" rid="ref26">26</xref>
                    </sup> This study employs MFCC for framing and HMM for windowing. MFCC-based feature extraction involves two steps: Pre-emphasis and Framing.</p>
                <p>

                    <bold>Pre-emphasis</bold>: High-frequency components of speech typically have lower magnitudes than low-frequency components, which makes them more susceptible to distortion and compromises speech quality. Pre-emphasis counters this by boosting the high-frequency components relative to the low-frequency ones, producing a flatter spectral profile than the original audio. The pre-emphasis factor 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b1;</mml:mi>
                        </mml:math>
</inline-formula> is calculated using 
                    <xref ref-type="disp-formula" rid="e4">Equation (4)</xref>:
                    <disp-formula id="e4">

                        <mml:math display="block">
                            <mml:mi>&#x03b1;</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mo>exp</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mo>&#x2212;</mml:mo>
                                <mml:mn>2</mml:mn>
                                <mml:mi mathvariant="italic">&#x03c0;vT</mml:mi>
                                <mml:mo>/</mml:mo>
                                <mml:mi mathvariant="italic">&#x03bb;c</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>

                        <label>(4)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>v</mml:mi>
                        </mml:math>
</inline-formula> represents the audio signal frequency and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>T</mml:mi>
                        </mml:math>
</inline-formula> represents the sampling period. Each sample except the first is then modified according to 
                    <xref ref-type="disp-formula" rid="e5">Equation (5)</xref>:
                    <disp-formula id="e5">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>X</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mi>X</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:mi>&#x03b1;</mml:mi>
                            <mml:msub>
                                <mml:mi>X</mml:mi>
                                <mml:mrow>
                                    <mml:mi>k</mml:mi>
                                    <mml:mo>&#x2212;</mml:mo>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                        </mml:math>

                        <label>(5)</label>
</disp-formula>
                </p>
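                <p>The update in Equation (5) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name <monospace>pre_emphasis</monospace> and the default coefficient of 0.97 are assumptions made for the example, and the sketch assumes that each output sample is computed from the original (unfiltered) previous sample, with the first sample left unchanged.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
def pre_emphasis(samples, alpha=0.97):
    """Return y[k] = x[k] - alpha * x[k-1]; the first sample is copied as-is.

    `alpha` is a hypothetical default; in the text it comes from Equation (4).
    """
    if not samples:
        return []
    out = [samples[0]]
    for k in range(1, len(samples)):
        # Each output uses the ORIGINAL previous sample, not the filtered one.
        out.append(samples[k] - alpha * samples[k - 1])
    return out


signal = [1.0, 1.0, 1.0, 1.0]
emphasized = pre_emphasis(signal, alpha=0.5)
print(emphasized)  # [1.0, 0.5, 0.5, 0.5]
```
]]></code>
                <p>A constant input is attenuated after the first sample, which is the expected behaviour of a first-order high-pass filter of this form.</p>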
                <p>

                    <bold>Framing</bold>: Framing is a lossless process that divides a continuous signal into overlapping, time-specific frames to reduce discontinuities at transitions. Using MFCC filtering, sound samples are represented as functions of time, with coefficients computed for frames centered at equally spaced intervals. Each speech segment, whether sounding or silent, is treated as a frame, so the total number of frames equals the number of utterances plus the number of pauses. For example, the sentence &#x201c;The joy of living is to love and respect&#x201d; (5.871 s) includes the utterances &#x201c;the&#x201d; = 0.14 s, &#x201c;joy&#x201d; = 0.39 s, &#x201c;of&#x201d; = 0.08 s, &#x201c;living&#x201d; = 0.56 s, &#x201c;is&#x201d; = 0.10 s, &#x201c;to&#x201d; = 0.12 s, &#x201c;love&#x201d; = 0.47 s, &#x201c;and&#x201d; = 0.24 s, and &#x201c;respect&#x201d; = 0.72 s, and the pauses 1.08, 0.26, 0.14, 0.33, 0.16, and 1.06 s. The sounding (2.823 s) and silent (3.043 s) durations together account for the total duration to within rounding (5.866 s against the stated 5.871 s), indicating that framing discards no portion of the signal.</p>
            </sec>
        </sec>
        <sec id="sec7">
            <title>4. Materials and methods</title>
            <sec id="sec8">
                <title>4.1 Data</title>
                <p>Broad representativeness requires a sufficiently large training dataset including utterances from male and female speakers. Since speech varies significantly across phonetic contexts, a comprehensive model requires at least 100,000 sentences. Manual recording is highly labor-intensive, involving content selection, phonetic variation coverage, participant recruitment, post-processing, and transcription. We utilized publicly available speech corpora, including the British National Corpus (BNC),
                    <sup>
                        <xref ref-type="bibr" rid="ref27">27</xref>
                    </sup> American National Corpus (ANC),
                    <sup>
                        <xref ref-type="bibr" rid="ref28">28</xref>
                    </sup> and Corpus of Contemporary American English (COCA),
                    <sup>
                        <xref ref-type="bibr" rid="ref29">29</xref>
                    </sup> selecting the Buckeye Speech Corpus
                    <sup>
                        <xref ref-type="bibr" rid="ref30">30</xref>
                    </sup> and EMU Speech Database
                    <sup>
                        <xref ref-type="bibr" rid="ref31">31</xref>
                    </sup> for training. Buckeye comprises approximately 40 hours of conversational English (about 360,000 words, or 24,000 sentences at 15 words per sentence). EMU contributes 30,000 sentences, giving 54,000 sentences in total. To reach the desired data volume, we applied augmentation techniques: pitch shifting (adjusting pitch without affecting duration, to simulate different speaker profiles), time-stretching (changing speech rate while preserving pitch), volume alteration, background noise addition, and reverberation simulation to introduce acoustic variability. These methods increased the effective dataset to approximately 150,000 sentences. Assuming mono audio at a 16 kHz sampling rate and 16-bit resolution (32 KB/second) and an average sentence length of 5 seconds, the storage requirement for 150,000 sentences is estimated in 
                    <xref ref-type="disp-formula" rid="e6">Equation (6)</xref>:
                    <disp-formula id="e6">

                        <mml:math display="block">
                            <mml:mtext>Storage</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:mn>150,000</mml:mn>
                            <mml:mo>&#x00d7;</mml:mo>
                            <mml:mn>5</mml:mn>
                            <mml:mspace width="0.5em"/>
                            <mml:mo>sec</mml:mo>
                            <mml:mo>&#x00d7;</mml:mo>
                            <mml:mn>32</mml:mn>
                            <mml:mspace width="0.5em"/>
                            <mml:mi>KB</mml:mi>
                            <mml:mo>/</mml:mo>
                            <mml:mo>sec</mml:mo>
                            <mml:mo>=</mml:mo>
                            <mml:mn>24,000,000</mml:mn>
                            <mml:mspace width="0.5em"/>
                            <mml:mi>KB</mml:mi>
                            <mml:mspace width="0.25em"/>
                            <mml:mo>&#x2248;</mml:mo>
                            <mml:mspace width="0.25em"/>
                            <mml:mn>24</mml:mn>
                            <mml:mspace width="0.5em"/>
                            <mml:mi>GB</mml:mi>
                        </mml:math>

                        <label>(6)</label>
</disp-formula>
                </p>
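                <p>The arithmetic in Equation (6) can be reproduced with a short Python sketch. The constant and function names are illustrative assumptions, and the sketch uses decimal units (1 KB = 1,000 bytes, 1 GB = 1,000,000 KB), which is the convention under which 16 kHz, 16-bit mono audio amounts to 32 KB per second.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate
BYTES_PER_SAMPLE = 2      # 16-bit resolution
CHANNELS = 1              # mono


def storage_kb(n_sentences, avg_seconds):
    """Estimated uncompressed PCM storage in KB (decimal units assumed)."""
    kb_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS // 1000
    return n_sentences * avg_seconds * kb_per_second


total_kb = storage_kb(150_000, 5)
total_gb = total_kb / 1_000_000
print(total_kb, total_gb)  # 24000000 24.0
```
]]></code>
                <p>The estimate covers raw audio only; file headers, transcripts, and any compressed formats would change the figure.</p>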
                <p>The model was evaluated on all five corpora. Speech recognition tasks were implemented using 

                    <italic toggle="yes">Praat</italic>,
                    <sup>
                        <xref ref-type="bibr" rid="ref21">21</xref>
                    </sup> a phonetic analysis tool developed by Paul Boersma and David Weenink at the Amsterdam Institute of Phonetic Sciences, facilitating analysis, synthesis, and manipulation of speech signals for phonetics research.</p>
                <p>

                    <bold>4.1.1 Data partitioning and leakage prevention</bold>
                </p>
                <p>To ensure independent training and evaluation, we partitioned all datasets using a 70-15-15 split for training, validation, and testing, respectively. For the British National Corpus (BNC) and American National Corpus (ANC), 70,000 sentences were allocated for training, 15,000 for validation, and 15,000 for testing. The Corpus of Contemporary American English (COCA) followed the same distribution, with 70,000 training, 15,000 validation, and 15,000 testing sentences. For the Buckeye Speech Corpus, which contains approximately 24,000 sentences, 16,800 sentences (70% of the total) were used for training, 3,600 sentences (15%) for validation, and the remaining 3,600 sentences (15%) for testing. The EMU Speech Database, comprising 30,000 sentences, was split into 21,000 training sentences (70% of the total), 4,500 validation sentences (15%), and 4,500 testing sentences (15%). For our custom-built dataset of 2,000 audio files, 1,400 files were used for training, 300 for validation, and 300 for testing.</p>
                <p>Speaker-level partitioning was enforced to prevent data leakage, ensuring that no individual speaker&#x2019;s voice appeared in both training and testing sets. For the Buckeye corpus, which contains approximately 40 hours of conversational speech from 40 speakers, we allocated 32 speakers exclusively for training, 4 speakers for validation, and 4 speakers for testing. Temporal windows were maintained as disjoint across all splits, meaning no overlapping time segments existed between training, validation, and testing data. All augmented data&#x2014;including pitch shifting, time-stretching, volume alteration, background noise addition, and reverberation&#x2014;were generated only after the initial partitioning was completed. This precaution prevented artificial inflation of performance metrics that could otherwise arise from augmented versions of training data leaking into validation or testing sets through correlated transformations. Cross-validation techniques were additionally employed during hyperparameter tuning to further assess model generalisation on unseen speech samples.</p>
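                <p>Speaker-level partitioning of the kind described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the function <monospace>split_speakers</monospace>, the speaker identifiers, and the fixed random seed are hypothetical. The key property is that speakers, not sentences, are assigned to splits, so no voice can appear on both sides of the train/test boundary; augmentation would then be applied separately within each split.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
import random


def split_speakers(speaker_ids, n_train, n_val, n_test, seed=0):
    """Assign whole speakers to disjoint train/validation/test sets."""
    ids = sorted(set(speaker_ids))
    if n_train + n_val + n_test != len(ids):
        raise ValueError("split sizes must cover every speaker exactly once")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])
    return train, val, test


# Hypothetical identifiers for the 40 Buckeye speakers.
speakers = [f"s{i:02d}" for i in range(1, 41)]
train, val, test = split_speakers(speakers, n_train=32, n_val=4, n_test=4)
print(len(train), len(val), len(test))  # 32 4 4
```
]]></code>
                <p>Utterances are then routed to a split by looking up their speaker, which keeps the sets disjoint at the speaker level by construction.</p>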
            </sec>
            <sec id="sec9">
                <title>4.2 Windowing through Hidden Markov model</title>
                <p>Each speech signal frame captures cepstral features characterizing the corresponding sound segment. Windowing derives grapheme-level representations for each phoneme within a frame. Hidden Markov Models (HMMs) generate sequences and patterns of hidden states based on observed acoustic features, facilitating phoneme-to-grapheme mapping. During preprocessing, speech signal 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>S</mml:mi>
                        </mml:math>
</inline-formula> is segmented into frames 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mrow>
                                <mml:mo stretchy="true">{</mml:mo>
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">}</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula>, with each frame 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>f</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> subdivided into windows 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mrow>
                                <mml:mo stretchy="true">{</mml:mo>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">}</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula>, where each window 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>w</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> spans 0.015 s, a length chosen to preserve spectral information while limiting temporal overlap and loss of resolution. The segmentation is defined in 
                    <xref ref-type="disp-formula" rid="e7">
Equations (7)</xref> and 
                    <xref ref-type="disp-formula" rid="e8">
(8)</xref>:
                    <disp-formula id="e7">

                        <mml:math display="block">
                            <mml:mi>S</mml:mi>
                            <mml:mo>&#x2254;</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">{</mml:mo>
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mn>1</mml:mn>
                                </mml:msub>
                                <mml:mo>,</mml:mo>
                                <mml:mo>&#x2026;</mml:mo>
                                <mml:mo>,</mml:mo>
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>k</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">}</mml:mo>
                            </mml:mrow>
                            <mml:mo>,</mml:mo>
                            <mml:mspace width="1em"/>
                            <mml:msub>
                                <mml:mi>f</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                            <mml:mo>&#x2254;</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">{</mml:mo>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mn>1</mml:mn>
                                </mml:msub>
                                <mml:mo>,</mml:mo>
                                <mml:mo>&#x2026;</mml:mo>
                                <mml:mo>,</mml:mo>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>k</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">}</mml:mo>
                            </mml:mrow>
                        </mml:math>

                        <label>(7)</label>
</disp-formula>

                    <disp-formula id="e8">

                        <mml:math display="block">
                            <mml:mi>S</mml:mi>
                            <mml:mo>&#x2254;</mml:mo>
                            <mml:munderover>
                                <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                <mml:mrow>
                                    <mml:mi>k</mml:mi>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                                <mml:mo>&#x221e;</mml:mo>
                            </mml:munderover>
                            <mml:msub>
                                <mml:mi>f</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:munderover>
                                    <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                    <mml:mrow>
                                        <mml:mi>s</mml:mi>
                                        <mml:mo>=</mml:mo>
                                        <mml:mn>1</mml:mn>
                                    </mml:mrow>
                                    <mml:mo>&#x221e;</mml:mo>
                                </mml:munderover>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>s</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>:</mml:mo>
                            <mml:mo>&#x2200;</mml:mo>
                            <mml:mi>w</mml:mi>
                            <mml:mo>|</mml:mo>
                            <mml:mo>&#x226a;</mml:mo>
                            <mml:mn>0.001</mml:mn>
                            <mml:mo>sec</mml:mo>
                        </mml:math>

                        <label>(8)</label>
</disp-formula>
                </p>
                <p>Each phoneme is modeled using a 
                    <bold>3-state left-to-right HMM</bold> (standard in ASR literature), with states corresponding to: 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>q</mml:mi>
                                <mml:mn>1</mml:mn>
                            </mml:msub>
                        </mml:math>
</inline-formula> (onset), 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>q</mml:mi>
                                <mml:mn>2</mml:mn>
                            </mml:msub>
                        </mml:math>
</inline-formula> (steady-state), 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>q</mml:mi>
                                <mml:mn>3</mml:mn>
                            </mml:msub>
                        </mml:math>
</inline-formula> (offset). The transition matrix 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>A</mml:mi>
                        </mml:math>
</inline-formula> is initialized with zero probability for backward transitions (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mi mathvariant="italic">ij</mml:mi>
                            </mml:msub>
                            <mml:mo>=</mml:mo>
                            <mml:mn>0</mml:mn>
                        </mml:math>
</inline-formula> for 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>j</mml:mi>
                            <mml:mo>&lt;</mml:mo>
                            <mml:mi>i</mml:mi>
                        </mml:math>
</inline-formula>). This topology captures the sequential nature of phoneme articulation. For diphthongs and affricates, we also tested 5-state models but found no significant WER improvement (a 0.3% reduction) in exchange for a 67% increase in parameters, so the 3-state model was retained. Window formation follows 
                    <xref ref-type="boxed-text" rid="B1">Algorithm 1</xref>. Each acoustic feature extracted from a window maps to its corresponding language model component. Training the HMM classifier is crucial for accurate phoneme extraction: during training, known state sequences enable inference of unknown states. The training corpora include sound utterances for all syllable combinations, together with their phonemic representations. Temporal overlap between consecutive windows or frames captures transitional features from previous states, which improves learning of the current state. The overlap must be long enough to contain at least one complete phoneme structure while avoiding excessive repetition. Based on empirical evaluation, the overlap between successive windows and frames was set to 0.5 milliseconds.</p>
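                <p>The left-to-right constraint on the transition matrix can be illustrated with a small Python sketch. The function name and the initial self-loop probability of 0.5 are assumptions chosen for the example; the text specifies only that backward transitions start at zero. The sketch also allows no skip transitions and makes the final state absorbing, both of which are simplifying assumptions.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
def left_to_right_transitions(n_states=3, self_loop=0.5):
    """Build an initial transition matrix with no backward transitions.

    Each non-final state either stays put (probability `self_loop`) or
    advances to the next state. The final state is absorbing here.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i == n_states - 1:
            A[i][i] = 1.0
        else:
            A[i][i] = self_loop
            A[i][i + 1] = 1.0 - self_loop
    return A


A = left_to_right_transitions()
for row in A:
    print(row)
# [0.5, 0.5, 0.0]
# [0.0, 0.5, 0.5]
# [0.0, 0.0, 1.0]
```
]]></code>
                <p>Because entries below the diagonal are exactly zero at initialization, standard Baum-Welch re-estimation keeps them at zero, so the topology is preserved during training.</p>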
                <p>

                    <bold>Algorithm 1: HMM-based Windowing Process</bold>
                </p>
                <p>Language features are extracted from each window as follows. For every window (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>w</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula>), the corresponding phoneme is identified by matching acoustic features with pronunciation dictionary entries. If a unique phoneme is found, it is directly assigned and the process continues. When multiple phoneme candidates exist, probabilities are computed based on previously known state sequences, selecting the most probable phoneme. If no match is identified, an HMM infers the current state from prior known states. Finally, a dynamic text wrapping algorithm structures the phoneme combinations derived through HMM.</p>
                <boxed-text id="B1" orientation="portrait" position="float">
                    <label>Algorithm 1. </label>
                    <caption>
                        <title>HMM-based Windowing Process.</title>
                    </caption>
                    <p>1: 
                        <bold>Input:</bold> Frames: 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>2: 
                        <bold>Output:</bold> Each frame 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula> is divided into windows 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula> with a length of 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mn>0.015</mml:mn>
                            </mml:math>
</inline-formula>
 s.</p>
                    <p>3: 
                        <bold>for</bold> each frame 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula> 
                        <bold>do</bold>
                    </p>
                    <p>4:&#x2003;&#x2003;&#x2003;
                        <bold>for</bold> each word 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>X</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula> 
                        <bold>do</bold>
                    </p>
                    <p>5:&#x2003;&#x2003;&#x2003;&#x2003;Compute the length of 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>X</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>, denoted as 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>L</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>.</p>
                    <p>6:&#x2003;&#x2003;&#x2003;&#x2003;Divide 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>L</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula> by 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>l</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>, where 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>l</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                                <mml:mo>=</mml:mo>
                                <mml:mn>0.015</mml:mn>
                            </mml:math>
</inline-formula>
 s.</p>
                    <p>7:&#x2003;&#x2003;&#x2003;&#x2003;Take the integer part of the quotient as the number of complete windows; the fractional part corresponds to the last window, whose length is adjusted accordingly.</p>
                    <p>8:&#x2003;&#x2003;&#x2003;&#x2003;Count the number of complete windows, denoted as 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>W</mml:mi>
                                    <mml:mi>T</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>.</p>
                    <p>9:&#x2003;&#x2003;&#x2003;&#x2003;Compute the total sum of window lengths:
                        <disp-formula id="e9">

                            <mml:math display="block">
                                <mml:msub>
                                    <mml:mi>S</mml:mi>
                                    <mml:mi>T</mml:mi>
                                </mml:msub>
                                <mml:mo>=</mml:mo>
                                <mml:munderover>
                                    <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                    <mml:mrow>
                                        <mml:mi>n</mml:mi>
                                        <mml:mo>=</mml:mo>
                                        <mml:mn>1</mml:mn>
                                    </mml:mrow>
                                    <mml:msub>
                                        <mml:mi>W</mml:mi>
                                        <mml:mi>T</mml:mi>
                                    </mml:msub>
                                </mml:munderover>
                                <mml:mn>0.015</mml:mn>
                                <mml:mo>=</mml:mo>
                                <mml:mn>0.015</mml:mn>
                                <mml:mspace width="0.25em"/>
                                <mml:msub>
                                    <mml:mi>W</mml:mi>
                                    <mml:mi>T</mml:mi>
                                </mml:msub>
                            </mml:math>

                            <label>(9)</label>
</disp-formula>
                    </p>
                    <p>10:&#x2003;&#x2003;&#x2003;&#x2003;Compute the length of the last window:
                        <disp-formula id="e10">

                            <mml:math display="block">
                                <mml:msub>
                                    <mml:mi>L</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                                <mml:mo>=</mml:mo>
                                <mml:msub>
                                    <mml:mi>L</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                                <mml:mo>&#x2212;</mml:mo>
                                <mml:msub>
                                    <mml:mi>S</mml:mi>
                                    <mml:mi>T</mml:mi>
                                </mml:msub>
                            </mml:math>

                            <label>(10)</label>
</disp-formula>
                    </p>
                    <p>11:&#x2003;&#x2003;&#x2003;
                        <bold>end for</bold>
                    </p>
                    <p>12: 
                        <bold>end for</bold>
                    </p>
                </boxed-text>
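                <p>The inner steps of Algorithm 1 (steps 5 to 10) amount to an integer division with remainder, which can be sketched in Python as follows. The function name is hypothetical, and durations are expressed in whole milliseconds, an assumption made here to avoid floating-point rounding when dividing by 0.015 s. The two example durations are the utterance lengths of &#x201c;joy&#x201d; (0.39 s) and &#x201c;the&#x201d; (0.14 s) given in the framing example.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
WINDOW_MS = 15  # window length of 0.015 s, in milliseconds


def split_into_windows(length_ms, window_ms=WINDOW_MS):
    """Return (complete windows W_T, covered length S_T, last-window length L_n).

    S_T = W_T * window length and L_n = L_i - S_T, all in milliseconds.
    """
    n_complete, remainder_ms = divmod(length_ms, window_ms)
    covered_ms = n_complete * window_ms
    return n_complete, covered_ms, remainder_ms


print(split_into_windows(390))  # (26, 390, 0)
print(split_into_windows(140))  # (9, 135, 5)
```
]]></code>
                <p>A remainder of zero means the word divides evenly into complete windows; otherwise the remainder is the length of the shorter final window.</p>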
            </sec>
            <sec id="sec11">
                <title>4.3 Backpropagation Neural Network (BPNN) in speech recognition</title>
                <p>BPNN minimizes classification errors in speech-to-text conversion.
                    <sup>
                        <xref ref-type="bibr" rid="ref32">32</xref>
                    </sup> Feature extraction techniques such as mel-frequency cepstral coefficients (MFCCs) transform raw audio into feature vector 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi mathvariant="bold">x</mml:mi>
                        </mml:math>
</inline-formula>, which BPNN processes to classify phonemes. Forward propagation computes neuron outputs in hidden and output layers:
                    <disp-formula id="e11">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mi>j</mml:mi>
                            </mml:msub>
                            <mml:mo>=</mml:mo>
                            <mml:mi>f</mml:mi>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:munderover>
                                    <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                    <mml:mrow>
                                        <mml:mi>i</mml:mi>
                                        <mml:mo>=</mml:mo>
                                        <mml:mn>1</mml:mn>
                                    </mml:mrow>
                                    <mml:mi>n</mml:mi>
                                </mml:munderover>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi mathvariant="italic">ij</mml:mi>
                                </mml:msub>
                                <mml:msub>
                                    <mml:mi>x</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                                <mml:mo>+</mml:mo>
                                <mml:msub>
                                    <mml:mi>b</mml:mi>
                                    <mml:mi>j</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>,</mml:mo>
                        </mml:math>

                        <label>(11)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>w</mml:mi>
                                <mml:mi mathvariant="italic">ij</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> represents the weight between the 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>i</mml:mi>
                        </mml:math>
</inline-formula>-th input neuron and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>j</mml:mi>
                        </mml:math>
</inline-formula>-th hidden neuron, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>b</mml:mi>
                                <mml:mi>j</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the bias term, and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>f</mml:mi>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mo>&#x22c5;</mml:mo>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula> is the activation function (sigmoid or ReLU):
                    <disp-formula id="e12">

                        <mml:math display="block">
                            <mml:mi>f</mml:mi>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>z</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mn>1</mml:mn>
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                    <mml:mo>+</mml:mo>
                                    <mml:msup>
                                        <mml:mi>e</mml:mi>
                                        <mml:mrow>
                                            <mml:mo>&#x2212;</mml:mo>
                                            <mml:mi>z</mml:mi>
                                        </mml:mrow>
                                    </mml:msup>
                                </mml:mrow>
                            </mml:mfrac>
                            <mml:mspace width="1em"/>
                            <mml:mtext>or</mml:mtext>
                            <mml:mspace width="1em"/>
                            <mml:mi>f</mml:mi>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>z</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>=</mml:mo>
                            <mml:mo>max</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mn>0</mml:mn>
                                <mml:mo>,</mml:mo>
                                <mml:mi>z</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>.</mml:mo>
                        </mml:math>

                        <label>(12)</label>
</disp-formula>
                </p>
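                <p>As an illustration of Equations 11 and 12, the following minimal Python sketch computes the hidden-layer activations for one input vector. The function names, the nested-list weight layout and the numerical values are assumptions made for this example only; they are not taken from the implementation described in this article.</p>
                <preformat preformat-type="code"><![CDATA[
```python
import math

def sigmoid(z):
    """Logistic activation from Equation 12."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear activation from Equation 12."""
    return max(0.0, z)

def hidden_layer(x, w, b, activation=sigmoid):
    """Apply f(sum_i w[i][j] * x[i] + b[j]) for every hidden neuron j.

    x: list of n inputs; w: n-by-m nested list where w[i][j] is the weight
    from input i to hidden neuron j; b: list of m biases.
    """
    n, m = len(x), len(b)
    return [activation(sum(w[i][j] * x[i] for i in range(n)) + b[j])
            for j in range(m)]

# Illustrative values only (not taken from the paper).
x = [1.0, 2.0]
w = [[0.5, -1.0],
     [0.25, 0.5]]
b = [0.0, -0.5]
h_relu = hidden_layer(x, w, b, activation=relu)
h_sig = hidden_layer(x, w, b, activation=sigmoid)
```
]]></preformat>
                <p>With these toy inputs the two pre-activations are 1.0 and -0.5, so ReLU zeroes the second unit, whereas the sigmoid maps both values into the open interval (0, 1).</p>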
                <p>The output layer generates predicted phoneme probability distributions, with error calculated using cross-entropy loss:
                    <disp-formula id="e13">

                        <mml:math display="block">
                            <mml:mi>L</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:munderover>
                                <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                <mml:mrow>
                                    <mml:mi>k</mml:mi>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                                <mml:mi>m</mml:mi>
                            </mml:munderover>
                            <mml:msub>
                                <mml:mi>y</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                            <mml:mo>log</mml:mo>
                            <mml:msub>
                                <mml:mover accent="true">
                                    <mml:mi>y</mml:mi>
                                    <mml:mo stretchy="true">&#x0302;</mml:mo>
                                </mml:mover>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                            <mml:mo>,</mml:mo>
                        </mml:math>

                        <label>(13)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>y</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the actual phoneme label and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mover accent="true">
                                    <mml:mi>y</mml:mi>
                                    <mml:mo stretchy="true">&#x0302;</mml:mo>
                                </mml:mover>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the predicted probability.</p>
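                <p>The loss in Equation 13 can be sketched in a few lines of Python. The softmax output layer and the small constant eps are assumptions added here so that the example is self-contained and numerically safe; the text itself only specifies the cross-entropy formula.</p>
                <preformat preformat-type="code"><![CDATA[
```python
import math

def softmax(logits):
    """Turn raw output-layer scores into a probability distribution (assumed)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_k y_k * log(y_hat_k), Equation 13; eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

# Illustrative three-class example with the second phoneme as the target.
y_true = [0.0, 1.0, 0.0]
y_pred = softmax([1.0, 2.0, 0.5])
loss = cross_entropy(y_true, y_pred)
```
]]></preformat>
                <p>For a one-hot target, the sum reduces to the negative log-probability assigned to the correct phoneme, so the loss is zero only when that probability equals one.</p>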
                <p>During backpropagation, error gradients are computed and propagated backward to adjust weights following gradient descent:
                    <disp-formula id="e14">

                        <mml:math display="block">
                            <mml:msubsup>
                                <mml:mi>w</mml:mi>
                                <mml:mi mathvariant="italic">ij</mml:mi>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:mi>t</mml:mi>
                                    <mml:mo>+</mml:mo>
                                    <mml:mn>1</mml:mn>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                            </mml:msubsup>
                            <mml:mo>=</mml:mo>
                            <mml:msubsup>
                                <mml:mi>w</mml:mi>
                                <mml:mi mathvariant="italic">ij</mml:mi>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:mi>t</mml:mi>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                            </mml:msubsup>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:mi>&#x03b7;</mml:mi>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:mi>&#x2202;</mml:mi>
                                    <mml:mi>L</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mi>&#x2202;</mml:mi>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi mathvariant="italic">ij</mml:mi>
                                    </mml:msub>
                                </mml:mrow>
                            </mml:mfrac>
                            <mml:mo>,</mml:mo>
                        </mml:math>

                        <label>(14)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b7;</mml:mi>
                        </mml:math>
</inline-formula> is the learning rate. Gradients are computed using the chain rule:
                    <disp-formula id="e15">

                        <mml:math display="block">
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:mi>&#x2202;</mml:mi>
                                    <mml:mi>L</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mi>&#x2202;</mml:mi>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi mathvariant="italic">ij</mml:mi>
                                    </mml:msub>
                                </mml:mrow>
                            </mml:mfrac>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mi>&#x03b4;</mml:mi>
                                <mml:mi>j</mml:mi>
                            </mml:msub>
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                            <mml:mo>,</mml:mo>
                        </mml:math>

                        <label>(15)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>&#x03b4;</mml:mi>
                                <mml:mi>j</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the error term at neuron 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>j</mml:mi>
                        </mml:math>
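</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the activation of the upstream neuron 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>i</mml:mi>
                        </mml:math>
</inline-formula>.</p>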
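                <p>Equations 14 and 15 together amount to an outer product followed by a scaled subtraction. The sketch below shows one such update for a single layer; it assumes that the second factor in Equation 15 is the activation of the upstream neuron i, and all names and numbers are illustrative rather than taken from the authors' code.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def weight_gradients(delta, a):
    """Equation 15: dL/dw[i][j] = delta[j] * a[i]."""
    return [[d_j * a_i for d_j in delta] for a_i in a]

def gradient_step(w, grads, eta):
    """Equation 14: w(t+1) = w(t) - eta * dL/dw."""
    return [[w_ij - eta * g_ij for w_ij, g_ij in zip(w_row, g_row)]
            for w_row, g_row in zip(w, grads)]

# Illustrative numbers: two upstream activations, two downstream error terms.
a = [1.0, 0.5]
delta = [0.2, -0.4]
w = [[0.1, 0.3],
     [-0.2, 0.4]]
grads = weight_gradients(delta, a)
w_next = gradient_step(w, grads, eta=0.5)
```
]]></preformat>
                <p>Weights feeding a neuron with a positive error term decrease and those feeding a neuron with a negative error term increase, each in proportion to the upstream activation.</p>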
            </sec>
            <sec id="sec12">
                <title>4.4 Algorithm: Adaptive Phoneme State Learning (APSL)</title>
                <p>The Adaptive Phoneme State Learning (APSL) algorithm extends conventional BPNN-HMM speech recognition with an adaptive mechanism that refines HMM state transitions according to BPNN confidence scores. Speech signals are segmented into overlapping 0.015 s windows, and cepstral features are extracted from each window as MFCCs. The Viterbi algorithm identifies the most probable phoneme state sequence by maximizing the path likelihood under the trained HMM parameters,
                    <sup>
                        <xref ref-type="bibr" rid="ref33">33</xref>
                    </sup> while the BPNN classifies phonemes and updates its weights by gradient descent on the cross-entropy loss:
                    <disp-formula id="e16">

                        <mml:math display="block">
                            <mml:mi>L</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:munderover>
                                <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                <mml:mrow>
                                    <mml:mi>k</mml:mi>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                                <mml:mi>m</mml:mi>
                            </mml:munderover>
                            <mml:msub>
                                <mml:mi>y</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                            <mml:mo>log</mml:mo>
                            <mml:msub>
                                <mml:mover accent="true">
                                    <mml:mi>y</mml:mi>
                                    <mml:mo stretchy="true">&#x0302;</mml:mo>
                                </mml:mover>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                        </mml:math>

                        <label>(16)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>y</mml:mi>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the true phoneme label, and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mover accent="true">
                                    <mml:mi>y</mml:mi>
                                    <mml:mo stretchy="true">&#x0302;</mml:mo>
                                </mml:mover>
                                <mml:mi>k</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the predicted probability.</p>
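                <p>The segmentation step can be sketched as follows. Only the 0.015 s window length comes from the text; the 10 ms hop and the 16 kHz sampling rate are assumptions chosen for illustration, and the function name is hypothetical.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def frame_signal(samples, sample_rate, win_s=0.015, hop_s=0.010):
    """Split samples into overlapping windows of win_s seconds.

    hop_s is an assumed hop; the text fixes only the window length.
    """
    win = int(round(win_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames

signal = [0.0] * 16000  # one second of silence at an assumed 16 kHz
frames = frame_signal(signal, 16000)
```
]]></preformat>
                <p>Each frame would then be passed to an MFCC extractor before classification; that stage is omitted here because it depends on the feature library in use.</p>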
                <p>The confidence score in APSL represents the reliability of the BPNN&#x2019;s phoneme classification. It is defined as the posterior probability 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>P</mml:mi>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:msub>
                                    <mml:mi>p</mml:mi>
                                    <mml:mi>j</mml:mi>
                                </mml:msub>
                                <mml:mo>|</mml:mo>
                                <mml:mi>x</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula>, where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>j</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is a phoneme and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>x</mml:mi>
                        </mml:math>
</inline-formula> is the feature vector.
                    <sup>
                        <xref ref-type="bibr" rid="ref34">34</xref>
                    </sup> The confidence score drives adaptive transition refinement: phonemes with low classification certainty undergo additional training or extended analysis.</p>
                <p>If a phoneme&#x2019;s confidence score is below a threshold 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b8;</mml:mi>
                        </mml:math>
</inline-formula>, APSL dynamically modifies the HMM transition and emission probabilities. The updated emission probability is obtained from the adaptive fusion rule derived below.</p>
                <p>Theoretical Derivation of Adaptive Probability Fusion</p>
                <p>Let the posterior probability of the phoneme 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>j</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> given acoustic feature 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>x</mml:mi>
                        </mml:math>
</inline-formula> from the BPNN be denoted as 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mi mathvariant="italic">NN</mml:mi>
                            </mml:msub>
                            <mml:mfenced close=")" open="(">
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>p</mml:mi>
                                        <mml:mi>j</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:mi>x</mml:mi>
                                </mml:mrow>
                            </mml:mfenced>
                        </mml:math>
</inline-formula>. Let the HMM emission probability be 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mi mathvariant="italic">HMM</mml:mi>
                            </mml:msub>
                            <mml:mfenced close=")" open="(">
                                <mml:mrow>
                                    <mml:mi>x</mml:mi>
                                    <mml:mo>|</mml:mo>
                                    <mml:msub>
                                        <mml:mi>q</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                </mml:mrow>
                            </mml:mfenced>
                        </mml:math>
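</inline-formula>. By Bayes' theorem, the state posterior under the HMM is given first, followed by the fusion with the BPNN posterior and its weighting function (
                    <xref ref-type="disp-formula" rid="e44 e45 e46">
Equations 17a, 17b and 17c</xref>):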
                    <disp-formula id="e44">

                        <mml:math display="block">
                            <mml:mi>P</mml:mi>
                            <mml:mfenced close=")" open="(" separators="|">
                                <mml:msub>
                                    <mml:mi>q</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                                <mml:mi>x</mml:mi>
                            </mml:mfenced>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>P</mml:mi>
                                        <mml:mi mathvariant="italic">HMM</mml:mi>
                                    </mml:msub>
                                    <mml:mfenced close=")" open="(" separators="|">
                                        <mml:mi>x</mml:mi>
                                        <mml:msub>
                                            <mml:mi>q</mml:mi>
                                            <mml:mi>i</mml:mi>
                                        </mml:msub>
                                    </mml:mfenced>
                                    <mml:mi>P</mml:mi>
                                    <mml:mfenced close=")" open="(">
                                        <mml:msub>
                                            <mml:mi>q</mml:mi>
                                            <mml:mi>i</mml:mi>
                                        </mml:msub>
                                    </mml:mfenced>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:munder>
                                        <mml:mo>&#x2211;</mml:mo>
                                        <mml:mi>j</mml:mi>
                                    </mml:munder>
                                    <mml:msub>
                                        <mml:mi>P</mml:mi>
                                        <mml:mi mathvariant="italic">HMM</mml:mi>
                                    </mml:msub>
                                    <mml:mfenced close=")" open="(" separators="|">
                                        <mml:mi>x</mml:mi>
                                        <mml:msub>
                                            <mml:mi>q</mml:mi>
                                            <mml:mi>j</mml:mi>
                                        </mml:msub>
                                    </mml:mfenced>
                                    <mml:mi>P</mml:mi>
                                    <mml:mfenced close=")" open="(">
                                        <mml:msub>
                                            <mml:mi>q</mml:mi>
                                            <mml:mi>j</mml:mi>
                                        </mml:msub>
                                    </mml:mfenced>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:math>

                        <label>(17a)</label>
</disp-formula>
                </p>
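                <p>Equation 17a is a normalization of likelihood-times-prior products over all states. A minimal Python sketch is given below; the likelihood and prior values are invented for illustration and the function name is hypothetical.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def state_posteriors(likelihoods, priors):
    """Equation 17a: P(q_i | x) = P(x | q_i) P(q_i) / sum_j P(x | q_j) P(q_j)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)
    if evidence == 0.0:
        raise ValueError("all states have zero likelihood for this frame")
    return [j / evidence for j in joint]

# Illustrative emission likelihoods P_HMM(x | q_i) and state priors P(q_i).
likelihoods = [0.2, 0.6, 0.1]
priors = [0.5, 0.25, 0.25]
post = state_posteriors(likelihoods, priors)
```
]]></preformat>
                <p>In this toy case the second state has the largest posterior (6/11) even though the first state has the larger prior, because its emission likelihood is three times higher.</p>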
                <p>However, fusing the two models by directly multiplying their probabilities assumes that they are independent. Because the BPNN and HMM may instead make complementary errors, we employ a convex combination with uncertainty weighting:</p>
                <p>

                    <disp-formula id="e45">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mtext mathvariant="italic">hybrid</mml:mtext>
                            </mml:msub>
                            <mml:mfenced close=")" open="(">
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:msub>
                                        <mml:mi>q</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                </mml:mrow>
                            </mml:mfenced>
                            <mml:mo>=</mml:mo>
                            <mml:mi>&#x03b1;</mml:mi>
                            <mml:mfenced close=")" open="(">
                                <mml:mi>x</mml:mi>
                            </mml:mfenced>
                            <mml:mo>&#x00b7;</mml:mo>
                            <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mi mathvariant="italic">NN</mml:mi>
                            </mml:msub>
                            <mml:mfenced close=")" open="(">
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>p</mml:mi>
                                        <mml:mi>j</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:mi>x</mml:mi>
                                </mml:mrow>
                            </mml:mfenced>
                            <mml:mo>+</mml:mo>
                            <mml:mfenced close=")" open="(">
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                    <mml:mo>&#x2212;</mml:mo>
                                    <mml:mi>&#x03b1;</mml:mi>
                                    <mml:mfenced close=")" open="(">
                                        <mml:mi>x</mml:mi>
                                    </mml:mfenced>
                                </mml:mrow>
                            </mml:mfenced>
                            <mml:mo>&#x00b7;</mml:mo>
                            <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mi mathvariant="italic">HMM</mml:mi>
                            </mml:msub>
                            <mml:mfenced close=")" open="(">
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:msub>
                                        <mml:mi>q</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                </mml:mrow>
                            </mml:mfenced>
                        </mml:math>

                        <label>(17b)</label>
</disp-formula>
                </p>
                <p>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b1;</mml:mi>
                            <mml:mfenced close=")" open="(">
                                <mml:mi>x</mml:mi>
                            </mml:mfenced>
                        </mml:math>
</inline-formula> is not a fixed constant but a dynamic confidence-weighting function:
                    <disp-formula id="e46">

                        <mml:math display="block">
                            <mml:mi>&#x03b1;</mml:mi>
                            <mml:mfenced close=")" open="(">
                                <mml:mi>x</mml:mi>
                            </mml:mfenced>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:msubsup>
                                    <mml:mi>&#x03c3;</mml:mi>
                                    <mml:mi mathvariant="italic">HMM</mml:mi>
                                    <mml:mn>2</mml:mn>
                                </mml:msubsup>
                                <mml:mrow>
                                    <mml:msubsup>
                                        <mml:mi>&#x03c3;</mml:mi>
                                        <mml:mi mathvariant="italic">NN</mml:mi>
                                        <mml:mn>2</mml:mn>
                                    </mml:msubsup>
                                    <mml:mo>+</mml:mo>
                                    <mml:msubsup>
                                        <mml:mi>&#x03c3;</mml:mi>
                                        <mml:mi mathvariant="italic">HMM</mml:mi>
                                        <mml:mn>2</mml:mn>
                                    </mml:msubsup>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:math>

                        <label>(17c)</label>
</disp-formula>
                </p>
                <p>Here, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msubsup>
                                <mml:mi>&#x03c3;</mml:mi>
                                <mml:mi mathvariant="italic">NN</mml:mi>
                                <mml:mn>2</mml:mn>
                            </mml:msubsup>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msubsup>
                                <mml:mi>&#x03c3;</mml:mi>
                                <mml:mi mathvariant="italic">HMM</mml:mi>
                                <mml:mn>2</mml:mn>
                            </mml:msubsup>
                        </mml:math>
</inline-formula> are the empirical error variances of the BPNN and HMM estimates, respectively, measured on validation data. This is inverse-variance weighting, which corresponds to Bayesian model averaging under the assumption of independent, Gaussian-distributed estimation errors. When BPNN confidence is high (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msubsup>
                                <mml:mi>&#x03c3;</mml:mi>
                                <mml:mi mathvariant="italic">NN</mml:mi>
                                <mml:mn>2</mml:mn>
                            </mml:msubsup>
                            <mml:mo>&#x2192;</mml:mo>
                            <mml:mn>0</mml:mn>
                        </mml:math>
</inline-formula>), 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b1;</mml:mi>
                            <mml:mfenced close=")" open="(">
                                <mml:mi>x</mml:mi>
                            </mml:mfenced>
                            <mml:mo>&#x2192;</mml:mo>
                            <mml:mn>1</mml:mn>
                        </mml:math>
</inline-formula>; when the HMM is far more reliable (its error variance approaches zero), 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b1;</mml:mi>
                            <mml:mfenced close=")" open="(">
                                <mml:mi>x</mml:mi>
                            </mml:mfenced>
                            <mml:mo>&#x2192;</mml:mo>
                            <mml:mn>0</mml:mn>
                        </mml:math>
</inline-formula>.</p>
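                <p>The weighting in Equation 17c and the convex combination in Equation 17b can be sketched as follows. The variances and probabilities used here are illustrative placeholders, not values estimated in this study, and the function names are hypothetical.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def fusion_weight(var_nn, var_hmm):
    """Equation 17c: alpha = var_hmm / (var_nn + var_hmm)."""
    total = var_nn + var_hmm
    if total <= 0.0:
        raise ValueError("at least one error variance must be positive")
    return var_hmm / total

def hybrid_probability(p_nn, p_hmm, alpha):
    """Equation 17b: convex combination of the BPNN and HMM scores."""
    return alpha * p_nn + (1.0 - alpha) * p_hmm

# Illustrative case: the BPNN error variance is one-third of the HMM's.
alpha = fusion_weight(var_nn=0.01, var_hmm=0.03)
p_hybrid = hybrid_probability(p_nn=0.9, p_hmm=0.5, alpha=alpha)
```
]]></preformat>
                <p>Because the BPNN variance is the smaller of the two in this example, the weight is 0.75 and the fused score (0.8) lies closer to the BPNN output than to the HMM output.</p>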
                <p>To further improve recognition, APSL dynamically extends the analysis window for phonemes with low confidence scores:
                    <disp-formula id="e47">

                        <mml:math display="block">
                            <mml:msubsup>
                                <mml:mi>w</mml:mi>
                                <mml:mi>i</mml:mi>
                                <mml:mo>&#x2032;</mml:mo>
                            </mml:msubsup>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mi>w</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                            <mml:mo>+</mml:mo>
                            <mml:mi mathvariant="normal">&#x0394;</mml:mi>
                            <mml:mi>t</mml:mi>
                            <mml:mo>,</mml:mo>
                            <mml:mspace width="1em"/>
                            <mml:mi mathvariant="normal">&#x0394;</mml:mi>
                            <mml:mi>t</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mn>5</mml:mn>
                            <mml:mi mathvariant="italic">ms</mml:mi>
                        </mml:math>

                        <label>(18)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msubsup>
                                <mml:mi>w</mml:mi>
                                <mml:mi>i</mml:mi>
                                <mml:mo>&#x2032;</mml:mo>
                            </mml:msubsup>
                        </mml:math>
</inline-formula> is the updated window length.</p>
                <p>The 5 ms increment was selected by grid search on a validation set (n = 500 utterances), evaluating WER for &#x0394;t &#x2208; {2, 3, 5, 8, 10, 15} ms. &#x0394;t = 5 ms gave the best trade-off between phoneme boundary refinement (3.2% WER reduction) and computational overhead (&#x2264;8% latency increase). Among the values tested: (i) &#x0394;t = 2 ms gave a 0.8% WER reduction with a 2% latency increase (suboptimal); (ii) &#x0394;t = 5 ms gave a 3.2% WER reduction with an 8% latency increase (optimal); and (iii) &#x0394;t = 10 ms gave a 3.5% WER reduction with a 22% latency increase (diminishing returns). The 5 ms increment also corresponds to one-third of the 15 ms duration taken here as typical for a phoneme (the base window length), so that the extension remains within the same phoneme rather than crossing a boundary.</p>
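                <p>The window-extension rule in Equation 18 can be illustrated with a short Python sketch. The threshold value of 0.6 is a placeholder, since no numerical threshold is stated at this point in the text, and the function name is hypothetical.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def adapt_window(window_ms, confidence, theta, delta_ms=5.0):
    """Equation 18: extend the window by delta_ms when confidence is below theta."""
    if confidence < theta:
        return window_ms + delta_ms
    return window_ms

# theta = 0.6 is a placeholder; the text does not state a threshold value.
confidences = [0.92, 0.41, 0.75]
windows = [adapt_window(15.0, c, theta=0.6) for c in confidences]
```
]]></preformat>
                <p>Only the second frame falls below the placeholder threshold, so it alone is re-analysed with a 20 ms window while the other two keep the 15 ms base length.</p>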
                <p>The final phoneme sequence is determined by the Viterbi decoding process:
                    <disp-formula id="e19">

                        <mml:math display="block">
                            <mml:msup>
                                <mml:mi>Q</mml:mi>
                                <mml:mo>&#x2217;</mml:mo>
                            </mml:msup>
                            <mml:mo>=</mml:mo>
                            <mml:mo>arg</mml:mo>
                            <mml:munder>
                                <mml:mo>max</mml:mo>
                                <mml:mi>Q</mml:mi>
                            </mml:munder>
                            <mml:mi>P</mml:mi>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>Q</mml:mi>
                                <mml:mo>|</mml:mo>
                                <mml:mi>W</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>

                        <label>(19)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>W</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">{</mml:mo>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mn>1</mml:mn>
                                </mml:msub>
                                <mml:mo>,</mml:mo>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mn>2</mml:mn>
                                </mml:msub>
                                <mml:mo>,</mml:mo>
                                <mml:mo>&#x2026;</mml:mo>
                                <mml:mo>,</mml:mo>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>N</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">}</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula> represents the sequence of analyzed windows. The APSL model adapts over time by adjusting state transitions based on the observed confidence scores, which reduces phoneme classification errors and improves speech recognition accuracy.</p>
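                <p>Eq. (19) selects the state sequence with the highest posterior probability. Because P(W) does not depend on Q, maximizing P(Q|W) is equivalent to maximizing the joint probability P(Q, W), which is what the standard Viterbi recursion computes. The following is a generic log-domain Viterbi decoder written with the Python standard library only. It is a sketch under stated assumptions rather than the authors' code: the function name <monospace>viterbi</monospace> and the argument layout (initial, transition, and per-window emission log-probabilities indexed by state) are illustrative choices.</p>
                <code language="python"><![CDATA[
```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return (best_path, best_log_prob) for a discrete-state HMM.

    log_init[s]     : log-probability of starting in state s
    log_trans[r][s] : log-probability of moving from state r to state s
    log_emit[t][s]  : log-probability of window t given state s
    """
    n_states = len(log_init)
    n_windows = len(log_emit)
    if n_windows == 0:
        return [], 0.0

    # Best log-score of any path ending in each state at the first window.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []  # backptr[t-1][s] = best predecessor of state s at window t

    for t in range(1, n_windows):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(
                range(n_states), key=lambda r: score[r] + log_trans[r][s]
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_trans[best_prev][s] + log_emit[t][s]
            )
        score = new_score
        backptr.append(pointers)

    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]
```
]]></code>
                <p>Working in the log domain avoids numerical underflow when many windows are multiplied together; the returned path plays the role of Q* in Eq. (19).</p>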
                <boxed-text id="B2" orientation="portrait" position="float">
                    <label>Algorithm 2. </label>
                    <caption>
                        <title>Adaptive Phoneme State Learning (APSL) using BPNN-HMM.</title>
                    </caption>
                    <p>1: 
                        <bold>Input:</bold> Speech signal 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>S</mml:mi>
                            </mml:math>
</inline-formula>, predefined phoneme set 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>P</mml:mi>
                            </mml:math>
</inline-formula>, HMM states 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>Q</mml:mi>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>2: 
                        <bold>Output:</bold> Optimized phoneme sequence 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msup>
                                    <mml:mi>Q</mml:mi>
                                    <mml:mo>&#x2217;</mml:mo>
                                </mml:msup>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>3: 
                        <bold>Step 1: Preprocessing and Feature Extraction</bold>
                    </p>
                    <p>4: Convert speech signal 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>S</mml:mi>
                            </mml:math>
</inline-formula> into frames 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula> with 15 ms windows 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>5: Extract Mel-Frequency Cepstral Coefficients (MFCCs) to form feature vectors 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>x</mml:mi>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>6: 
                        <bold>Step 2: BPNN-Based Phoneme Probability Estimation</bold>
                    </p>
                    <p>7: Train a BPNN model to classify phonemes</p>
                    <p>8: Compute phoneme confidence score 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>P</mml:mi>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>p</mml:mi>
                                        <mml:mi>j</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:mi>x</mml:mi>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                            </mml:math>
</inline-formula> for each phoneme 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>p</mml:mi>
                                    <mml:mi>j</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>9: 
                        <bold>Step 3: Adaptive HMM Transition Refinement</bold>
                    </p>
                    <p>10: 
                        <bold>for</bold> each state 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msub>
                                    <mml:mi>q</mml:mi>
                                    <mml:mi>t</mml:mi>
                                </mml:msub>
                                <mml:mspace width="0.25em"/>
                                <mml:mo>&#x2208;</mml:mo>
                                <mml:mspace width="0.25em"/>
                                <mml:mi>Q</mml:mi>
                            </mml:math>
</inline-formula> 
                        <bold>do</bold>
                    </p>
                    <p>11:&#x2003;&#x2003;&#x2003;Compute modified emission probability:</p>
                    <p>12:&#x2003;&#x2003;&#x2003;
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>P</mml:mi>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:msub>
                                        <mml:mi>q</mml:mi>
                                        <mml:mi>t</mml:mi>
                                    </mml:msub>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                                <mml:mo>=</mml:mo>
                                <mml:mi mathvariant="italic">&#x03b1;P</mml:mi>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>p</mml:mi>
                                        <mml:mi>j</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:mi>x</mml:mi>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                                <mml:mo>+</mml:mo>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:mn>1</mml:mn>
                                    <mml:mo>&#x2212;</mml:mo>
                                    <mml:mi>&#x03b1;</mml:mi>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                                <mml:msub>
                                    <mml:mi>P</mml:mi>
                                    <mml:mi mathvariant="italic">HMM</mml:mi>
                                </mml:msub>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:msub>
                                        <mml:mi>q</mml:mi>
                                        <mml:mi>t</mml:mi>
                                    </mml:msub>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>13: 
                        <bold>end for</bold>
                    </p>
                    <p>14: 
                        <bold>Step 4: Dynamic Windowing for Phoneme Alignment</bold>
                    </p>
                    <p>15: 
                        <bold>if</bold> 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>P</mml:mi>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>p</mml:mi>
                                        <mml:mi>j</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                    <mml:mi>x</mml:mi>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                                <mml:mo>&lt;</mml:mo>
                                <mml:mi>&#x03b8;</mml:mi>
                            </mml:math>
</inline-formula> (confidence threshold) 
                        <bold>then</bold>
                    </p>
                    <p>16:&#x2003;&#x2003;&#x2003;Extend window: 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msubsup>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>i</mml:mi>
                                    <mml:mo>&#x2032;</mml:mo>
                                </mml:msubsup>
                                <mml:mo>=</mml:mo>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                                <mml:mo>+</mml:mo>
                                <mml:mi mathvariant="normal">&#x0394;</mml:mi>
                                <mml:mi>t</mml:mi>
                                <mml:mo>,</mml:mo>
                                <mml:mspace width="1em"/>
                                <mml:mi mathvariant="normal">&#x0394;</mml:mi>
                                <mml:mi>t</mml:mi>
                                <mml:mo>=</mml:mo>
                                <mml:mn>5</mml:mn>
                                <mml:mi mathvariant="normal">ms</mml:mi>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>17: 
                        <bold>end if</bold>
                    </p>
                    <p>18: 
                        <bold>Step 5: Decoding with APSL</bold>
                    </p>
                    <p>19: Apply Viterbi algorithm to obtain optimal phoneme sequence:</p>
                    <p>20: 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msup>
                                    <mml:mi>Q</mml:mi>
                                    <mml:mo>&#x2217;</mml:mo>
                                </mml:msup>
                                <mml:mo>=</mml:mo>
                                <mml:mo>arg</mml:mo>
                                <mml:msub>
                                    <mml:mo>max</mml:mo>
                                    <mml:mi>Q</mml:mi>
                                </mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mrow>
                                    <mml:mo stretchy="true">(</mml:mo>
                                    <mml:mi>Q</mml:mi>
                                    <mml:mo>|</mml:mo>
                                    <mml:mi>W</mml:mi>
                                    <mml:mo stretchy="true">)</mml:mo>
                                </mml:mrow>
                            </mml:math>
</inline-formula> where 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>W</mml:mi>
                                <mml:mo>=</mml:mo>
                                <mml:mrow>
                                    <mml:mo stretchy="true">{</mml:mo>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mn>1</mml:mn>
                                    </mml:msub>
                                    <mml:mo>,</mml:mo>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mn>2</mml:mn>
                                    </mml:msub>
                                    <mml:mo>,</mml:mo>
                                    <mml:mo>&#x2026;</mml:mo>
                                    <mml:mo>,</mml:mo>
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi>N</mml:mi>
                                    </mml:msub>
                                    <mml:mo stretchy="true">}</mml:mo>
                                </mml:mrow>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>21: 
                        <bold>Step 6: Training Updates using Backpropagation</bold>
                    </p>
                    <p>22: Compute loss: 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:mi>L</mml:mi>
                                <mml:mo>=</mml:mo>
                                <mml:mo>&#x2212;</mml:mo>
                                <mml:msubsup>
                                    <mml:mo>&#x2211;</mml:mo>
                                    <mml:mrow>
                                        <mml:mi>k</mml:mi>
                                        <mml:mo>=</mml:mo>
                                        <mml:mn>1</mml:mn>
                                    </mml:mrow>
                                    <mml:mi>m</mml:mi>
                                </mml:msubsup>
                                <mml:msub>
                                    <mml:mi>y</mml:mi>
                                    <mml:mi>k</mml:mi>
                                </mml:msub>
                                <mml:mo>log</mml:mo>
                                <mml:msub>
                                    <mml:mover accent="true">
                                        <mml:mi>y</mml:mi>
                                        <mml:mo stretchy="true">&#x0302;</mml:mo>
                                    </mml:mover>
                                    <mml:mi>k</mml:mi>
                                </mml:msub>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>23: Update BPNN weights:</p>
                    <p>24: 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msubsup>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi mathvariant="italic">ij</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo>+</mml:mo>
                                        <mml:mn>1</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                </mml:msubsup>
                                <mml:mo>=</mml:mo>
                                <mml:msubsup>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi mathvariant="italic">ij</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                </mml:msubsup>
                                <mml:mo>&#x2212;</mml:mo>
                                <mml:mi>&#x03b7;</mml:mi>
                                <mml:mfrac>
                                    <mml:mrow>
                                        <mml:mi>&#x2202;</mml:mi>
                                        <mml:mi>L</mml:mi>
                                    </mml:mrow>
                                    <mml:mrow>
                                        <mml:mi>&#x2202;</mml:mi>
                                        <mml:msub>
                                            <mml:mi>w</mml:mi>
                                            <mml:mi mathvariant="italic">ij</mml:mi>
                                        </mml:msub>
                                    </mml:mrow>
                                </mml:mfrac>
                            </mml:math>
</inline-formula>
                    </p>
                    <p>25: 
                        <bold>Return</bold> Optimized phoneme sequence 
                        <inline-formula>

                            <mml:math display="inline">
                                <mml:msup>
                                    <mml:mi>Q</mml:mi>
                                    <mml:mo>&#x2217;</mml:mo>
                                </mml:msup>
                            </mml:math>
</inline-formula>
                    </p>
                </boxed-text>
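                <p>Steps 3 and 6 of Algorithm 2 reduce to three small formulas: the blended emission probability, the cross-entropy loss, and a gradient-descent weight update. The sketch below writes them out in plain Python for illustration. The function names are hypothetical, the small constant <monospace>eps</monospace> that guards the logarithm is an added numerical-safety assumption, and the gradients are taken as given inputs rather than computed by backpropagation.</p>
                <code language="python"><![CDATA[
```python
import math


def blended_emission(p_bpnn, p_hmm, alpha):
    """Algorithm 2, line 12: alpha * P(p_j|x) + (1 - alpha) * P_HMM(w_i|q_t)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_bpnn + (1.0 - alpha) * p_hmm


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Algorithm 2, line 22: L = -sum_k y_k * log(y_hat_k).

    eps clips predictions away from zero so that log() is always defined
    (an assumption added here, not part of the stated formula).
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


def gradient_step(weights, grads, eta):
    """Algorithm 2, line 24: w <- w - eta * dL/dw, applied element-wise."""
    if len(weights) != len(grads):
        raise ValueError("one gradient is required per weight")
    return [w - eta * g for w, g in zip(weights, grads)]
```
]]></code>
                <p>With &#x03b1; = 0.5, for example, a BPNN posterior of 0.8 and an HMM emission of 0.4 blend to 0.6, so neither model dominates the decoder's emission score.</p>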
                <p>

                    <bold>4.4.1 Optimal hyperparameter tuning using Bayesian optimization</bold>
                </p>
                <p>Hyperparameter tuning is a critical step in machine learning for identifying the optimal set of hyperparameters to enhance model performance. Unlike model parameters learned during training, hyperparameters are predefined and govern the learning process, including the learning rate, number of hidden layers, batch size, and dropout rate. Selecting appropriate hyperparameters is essential for maximizing accuracy and minimizing errors. Bayesian Optimization is an efficient method for hyperparameter tuning, especially for complex models with expensive evaluation costs.
                    <sup>
                        <xref ref-type="bibr" rid="ref35">35</xref>
                    </sup> It constructs a probabilistic model of the objective function and uses an acquisition function to balance exploration and exploitation when selecting new hyperparameter configurations. Using Bayesian Optimization, optimal hyperparameters were determined for both the BPNN and APSL-BPNN-HMM speech recognition models. For the BPNN model, the optimal learning rate was 0.005, with three hidden layers of 256 neurons each, a batch size of 64, and 150 training epochs. The model employed the ReLU activation function with a dropout rate of 0.3, along with the Adam optimizer and cross-entropy loss function. For the APSL-BPNN-HMM model, the optimal learning rate was 0.003, with two hidden layers of 128 neurons each, a batch size of 64, and 200 epochs. The ReLU activation function with a dropout rate of 0.4 was used, while the confidence threshold (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b8;</mml:mi>
                        </mml:math>
</inline-formula>) was set to 0.75, the weighting factor (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03b1;</mml:mi>
                        </mml:math>
</inline-formula>) to 0.5, and the dynamic window adjustment size (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi mathvariant="normal">&#x0394;</mml:mi>
                            <mml:mi>t</mml:mi>
                        </mml:math>
</inline-formula>) to 5 ms. The Adam optimizer and cross-entropy loss function were also applied to ensure stable convergence and improved speech recognition accuracy.</p>
            </sec>
        </sec>
        <sec id="sec13">
            <title>4.5 An illustrated example</title>
            <p>

                <bold>4.5.1 Frequency and probability calculations using HMM approach</bold>
            </p>
            <p>Here, frequency denotes the number of times the syllable occurs in the corpus. The probability of an individual syllable is obtained by dividing its count by the total number of words in the corpus that contain that syllable. 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>&#x03c9;</mml:mi>
                    </mml:math>
</inline-formula> represents any sequence of phonemes. Only two words are shown here; the same process is repeated for the other words in the given context.</p>
            <p>

                <bold>The</bold>
            </p>
            <p>

                <bold>Frequency = 87</bold>

                <disp-formula id="e20">

                    <mml:math display="block">
                        <mml:mtable columnalign="center" displaystyle="true">
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;t&#x2019; coming first</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>31</mml:mn>
                                        <mml:mn>87</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.36</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>h</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;h&#x2019; coming after &#x2018;t&#x2019; at the</mml:mtext>
                                    <mml:mspace width="0.5em"/>
                                    <mml:mtext>beginning</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>19</mml:mn>
                                        <mml:mn>87</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.22</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>e</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>h</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;e&#x2019; coming after &#x2018;th&#x2019;</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>29</mml:mn>
                                        <mml:mn>87</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.33</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                        </mml:mtable>
                    </mml:math>
</disp-formula>
            </p>
            <p>Therefore, each phoneme is now transformed into its corresponding syllable, &#x2018;the&#x2019; 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mo>&#x2192;</mml:mo>
                    </mml:math>
</inline-formula> /&#x00f0;&#x0259;, &#x00f0;&#x026a;, &#x00f0;i&#x02d0;/</p>
            <p>Using the pronunciation of &#x2018;the&#x2019; as training data, further words containing the &#x2018;the&#x2019; sequence, such as this, there, these, then, and thesis, were tested. These words were correctly recognized and converted to their exactly matching syllables.</p>
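            <p>The relative-frequency estimates in Eq. (20) amount to dividing each pattern count by the word frequency. The short Python sketch below reproduces that arithmetic with exact fractions so that no rounding convention has to be assumed. The function name <monospace>relative_frequency</monospace> and the dictionary keys are hypothetical labels chosen for this illustration; the counts (31, 19, 29) and the frequency (87) are those given above.</p>
            <code language="python"><![CDATA[
```python
from fractions import Fraction


def relative_frequency(count, total):
    """Return count / total as an exact fraction."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= count <= total:
        raise ValueError("count must lie between 0 and total")
    return Fraction(count, total)


# Counts for the word 'the' (frequency 87), keyed by the conditional
# notation used in Eq. (20): symbol | previous symbols.
THE_FREQUENCY = 87
the_counts = {"t|0,0": 31, "h|t,0": 19, "e|h,t": 29}

the_probs = {
    key: relative_frequency(count, THE_FREQUENCY)
    for key, count in the_counts.items()
}
```
]]></code>
            <p>Calling <monospace>float()</monospace> on any entry gives its decimal value; for the count of 29 out of 87, the exact result is 1/3.</p>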
            <p>

                <bold>Joy</bold>
            </p>
            <p>

                <bold>Frequency</bold> = 14</p>
            <p>(Joy was rejected; the words in the dictionary are: jinx, job, jockey, jury, subject, disjoint, jealous, injury, rejoice, adjective, adjourn, rejected, conjure)
                <disp-formula id="e21">

                    <mml:math display="block">
                        <mml:mtable columnalign="center" displaystyle="true">
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;j&#x2019; coming first</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>5</mml:mn>
                                        <mml:mn>14</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.38</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;o&#x2019; coming after &#x2018;j&#x2019; at the beginning</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>2</mml:mn>
                                        <mml:mn>14</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.15</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>y</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;y&#x2019; coming after &#x2018;jo&#x2019;</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                        </mml:mtable>
                    </mml:math>
</disp-formula>
            </p>
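            <p>As an illustration only, the word-initial counts used above can be reproduced with a short Python sketch. It assumes that matching on spelling stands in for matching on phonemes; the names WORDS, FREQUENCY, prefix_count and prefix_probability are hypothetical and are not taken from the authors' implementation. Fractions are kept exact rather than rounded.</p>

```python
from fractions import Fraction

# Dictionary words listed in the text for the rejected word 'joy'.
WORDS = ["jinx", "job", "jockey", "jury", "subject", "disjoint", "jealous",
         "injury", "rejoice", "adjective", "adjourn", "rejected", "conjure"]

FREQUENCY = 14  # denominator used in the text (assumed fixed here)


def prefix_count(prefix, words=WORDS):
    """Count the words that begin with the given letter sequence."""
    return sum(1 for w in words if w.startswith(prefix))


def prefix_probability(prefix, frequency=FREQUENCY, words=WORDS):
    """Relative frequency of a word-initial pattern: count / frequency."""
    return Fraction(prefix_count(prefix, words), frequency)


# Print the count and the exact fraction for each growing prefix of 'joy'.
for pattern in ("j", "jo", "joy"):
    print(pattern, prefix_count(pattern), prefix_probability(pattern))
```

            <p>The zero count for the full pattern is what triggers the split into shorter units described next.</p>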
            <p>The pattern &#x201c;joy&#x201d; was not found in the speech corpus. Therefore, with the help of the HMM, the phoneme sequence is split into two parts, each with its own probability, as follows:</p>
            <p>1) &#x2018;jo&#x2019; 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mo>&#x2192;</mml:mo>
                        <mml:mi>P</mml:mi>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>o</mml:mi>
                            <mml:mo>|</mml:mo>
                            <mml:mi>j</mml:mi>
                            <mml:mo>,</mml:mo>
                            <mml:mi>&#x03c9;</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                    </mml:math>
</inline-formula>, probability of &#x2018;o&#x2019; coming after &#x2018;j&#x2019;, that is, 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>2</mml:mn>
                            <mml:mn>14</mml:mn>
                        </mml:mfrac>
                    </mml:math>
</inline-formula> plus any other words in the dictionary that contain the simple combination &#x2018;jo&#x2019;. The words rejoice and adjourn are found in the dictionary and meet this criterion. Thus, the total probability will be
                <disp-formula id="e22">

                    <mml:math display="block">
                        <mml:mfrac>
                            <mml:mrow>
                                <mml:mn>2</mml:mn>
                                <mml:mo>+</mml:mo>
                                <mml:mn>2</mml:mn>
                            </mml:mrow>
                            <mml:mn>14</mml:mn>
                        </mml:mfrac>
                        <mml:mo>=</mml:mo>
                        <mml:mn>0.26</mml:mn>
                    </mml:math>
</disp-formula>
            </p>
            <p>2) &#x2018;oy&#x2019; 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mo>&#x2192;</mml:mo>
                        <mml:mi>P</mml:mi>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>y</mml:mi>
                            <mml:mo>|</mml:mo>
                            <mml:mi>o</mml:mi>
                            <mml:mo>,</mml:mo>
                            <mml:mi>&#x03c9;</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                    </mml:math>
</inline-formula>. When the corpus was searched, the phoneme for &#x2018;oy&#x2019; was found in the pronunciation of the word &#x2018;annoy&#x2019;. Thus, the probability will be
                <disp-formula id="e23">

                    <mml:math display="block">
                        <mml:mfrac>
                            <mml:mn>1</mml:mn>
                            <mml:mn>14</mml:mn>
                        </mml:mfrac>
                        <mml:mo>=</mml:mo>
                        <mml:mn>0.07</mml:mn>
                    </mml:math>
</disp-formula>
            </p>
            <p>If neither &#x2018;jo 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>&#x03c9;</mml:mi>
                    </mml:math>
</inline-formula>&#x2019; nor &#x2018;
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>&#x03c9;</mml:mi>
                    </mml:math>
</inline-formula> oy&#x2019; is found, the HMM will look for the individual phonemes:</p>
            <list list-type="bullet">
                <list-item>
                    <label>&#x2022;</label>
                    <p>&#x2018;<inline-formula><mml:math display="inline"><mml:mi>&#x03c9;</mml:mi></mml:math></inline-formula> j <inline-formula><mml:math display="inline"><mml:mi>&#x03c9;</mml:mi></mml:math></inline-formula>&#x2019; alone (James)</p>
                </list-item>
                <list-item>
                    <label>&#x2022;</label>
                    <p>&#x2018;<inline-formula><mml:math display="inline"><mml:mi>&#x03c9;</mml:mi></mml:math></inline-formula> o <inline-formula><mml:math display="inline"><mml:mi>&#x03c9;</mml:mi></mml:math></inline-formula>&#x2019; (of)</p>
                </list-item>
                <list-item>
                    <label>&#x2022;</label>
                    <p>&#x2018;<inline-formula><mml:math display="inline"><mml:mi>&#x03c9;</mml:mi></mml:math></inline-formula> y <inline-formula><mml:math display="inline"><mml:mi>&#x03c9;</mml:mi></mml:math></inline-formula>&#x2019; (why)</p>
                </list-item>
            </list>
            <p>Therefore, each phoneme of the word &#x2018;joy&#x2019; 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mo>&#x2192;</mml:mo>
                    </mml:math>
</inline-formula> /d&#x0292; &#x0254;&#x026a;/ is now transformed into its corresponding syllable. 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>&#x03c9;</mml:mi>
                    </mml:math>
</inline-formula> represents any sequence of phonemes.</p>
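            <p>The fallback order described above (whole pattern, then two-symbol units, then single symbols) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: letters stand in for phonemes, the lookup is a plain substring test, and backoff_units, in_corpus and CORPUS are hypothetical names. The source does not say what happens when only one of the two-symbol units is found; the sketch simply keeps whichever units are found.</p>

```python
def backoff_units(word, contains):
    """Return the units used to score `word`, backing off when needed.

    `contains` is a hypothetical predicate that reports whether a unit
    occurs anywhere in the corpus.
    """
    if contains(word):
        return [word]
    # Split into overlapping two-symbol units, e.g. 'joy' -> 'jo', 'oy'.
    pairs = [word[i:i + 2] for i in range(len(word) - 1)]
    found = [p for p in pairs if contains(p)]
    if found:
        return found
    # No two-symbol unit was found: fall back to the individual symbols.
    return list(word)


# Toy corpus: the dictionary words from the text plus 'annoy'.
CORPUS = ["jinx", "job", "jockey", "jury", "subject", "disjoint", "jealous",
          "injury", "rejoice", "adjective", "adjourn", "rejected", "conjure",
          "annoy"]


def in_corpus(unit, corpus=CORPUS):
    """Assumed lookup: does the unit occur inside any corpus word?"""
    return any(unit in w for w in corpus)


print(backoff_units("joy", in_corpus))
```

            <p>With this toy corpus the call returns the two units &#x2018;jo&#x2019; and &#x2018;oy&#x2019;; a lookup that finds nothing would return the three single symbols instead.</p>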
            <p>

                <bold>4.5.2 APSL-BPNN-HMM refinements, where phoneme probabilities are adjusted using BPNN confidence scores.</bold>
            </p>
            <p>Here, frequency denotes the number of times the syllable occurs in the corpus. The probability of an individual syllable pattern is obtained by dividing its count by the total number of words in the corpus containing that syllable. With APSL, these probabilities are adjusted dynamically using BPNN-generated phoneme confidence scores. Let 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>&#x03c9;</mml:mi>
                    </mml:math>
</inline-formula> represent any sequence of phonemes.</p>
            <p>

                <bold>The</bold>
            </p>
            <p>

                <bold>Frequency = 87</bold>

                <disp-formula id="e24">

                    <mml:math display="block">
                        <mml:mtable columnalign="left" displaystyle="true">
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;t&#x2019; coming first</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>31</mml:mn>
                                        <mml:mn>87</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.35</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>h</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;h&#x2019; coming after &#x2018;t&#x2019; at the beginning</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>19</mml:mn>
                                        <mml:mn>87</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.21</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>e</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>h</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mtext>Probability of &#x2018;e&#x2019; coming after &#x2018;th&#x2019;</mml:mtext>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>29</mml:mn>
                                        <mml:mn>87</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.33</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                        </mml:mtable>
                    </mml:math>
</disp-formula>
            </p>
            <p>With APSL-BPNN-HMM, each probability is updated with the BPNN confidence score (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>C</mml:mi>
                    </mml:math>
</inline-formula>) for each phoneme transition:
                <disp-formula id="e25">

                    <mml:math display="block">
                        <mml:msup>
                            <mml:mi>P</mml:mi>
                            <mml:mo>&#x2032;</mml:mo>
                        </mml:msup>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>e</mml:mi>
                            <mml:mo>|</mml:mo>
                            <mml:mi>h</mml:mi>
                            <mml:mo>,</mml:mo>
                            <mml:mi>t</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>=</mml:mo>
                        <mml:mi>P</mml:mi>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>e</mml:mi>
                            <mml:mo>|</mml:mo>
                            <mml:mi>h</mml:mi>
                            <mml:mo>,</mml:mo>
                            <mml:mi>t</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>&#x00d7;</mml:mo>
                        <mml:mi>C</mml:mi>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>e</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                    </mml:math>
</disp-formula>
            </p>
            <p>If 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>C</mml:mi>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>e</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>=</mml:mo>
                        <mml:mn>0.95</mml:mn>
                    </mml:math>
</inline-formula>, the adjusted probability is:
                <disp-formula id="e26">

                    <mml:math display="block">
                        <mml:msup>
                            <mml:mi>P</mml:mi>
                            <mml:mo>&#x2032;</mml:mo>
                        </mml:msup>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>e</mml:mi>
                            <mml:mo>|</mml:mo>
                            <mml:mi>h</mml:mi>
                            <mml:mo>,</mml:mo>
                            <mml:mi>t</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>=</mml:mo>
                        <mml:mn>0.33</mml:mn>
                        <mml:mo>&#x00d7;</mml:mo>
                        <mml:mn>0.95</mml:mn>
                        <mml:mo>=</mml:mo>
                        <mml:mn>0.31</mml:mn>
                    </mml:math>
</disp-formula>
            </p>
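            <p>The multiplicative update can be checked numerically with a minimal Python sketch. The function name scale_by_confidence is hypothetical, and the inputs are the rounded values quoted in the text (0.33 and 0.95), not values from the authors' system.</p>

```python
def scale_by_confidence(probability, confidence):
    """Multiplicative update P' = P * C for one phoneme transition."""
    return probability * confidence


p_e = 0.33   # P(e | h, t), rounded value quoted in the text
c_e = 0.95   # example BPNN confidence C(e) from the text

adjusted = scale_by_confidence(p_e, c_e)
print(round(adjusted, 2))
```

            <p>Because the confidence lies between 0 and 1, this update can only lower a probability or leave it unchanged; it cannot raise a probability that is already zero.</p>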
            <p>Thus, each phoneme is now transformed into its corresponding syllable:</p>
            <p>

                <bold>&#x2018;the&#x2019;</bold> 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mo>&#x2192;</mml:mo>
                    </mml:math>
</inline-formula> /&#x00f0;&#x0259;, &#x00f0;&#x026a;, &#x00f0;i&#x02d0;/</p>
            <p>With APSL-BPNN-HMM, phoneme sequences for words like this, there, these, then, and thesis are dynamically re-evaluated, leading to improved recognition accuracy.</p>
            <p>

                <bold>Joy (Previously Rejected)</bold>
            </p>
            <p>

                <bold>Frequency = 14</bold>
            </p>
            <p>Previous HMM-based probabilities:
                <disp-formula id="e27">

                    <mml:math display="block">
                        <mml:mtable columnalign="left" displaystyle="true">
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>5</mml:mn>
                                        <mml:mn>14</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.38</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mfrac>
                                        <mml:mn>2</mml:mn>
                                        <mml:mn>14</mml:mn>
                                    </mml:mfrac>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.15</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>y</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                        </mml:mtable>
                    </mml:math>
</disp-formula>
            </p>
            <p>APSL Adjustment Using BPNN Confidence (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>C</mml:mi>
                    </mml:math>
</inline-formula>):
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>BPNN assigns confidence scores based on phoneme similarity.</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Let 
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:mi>C</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.85</mml:mn>
                                </mml:math>
</inline-formula> and 
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:mi>C</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>y</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.78</mml:mn>
                                </mml:math>
</inline-formula>.</p>
                    </list-item>
                </list>
            </p>
            <p>Updated probability calculations:
                <disp-formula id="e28">

                    <mml:math display="block">
                        <mml:mtable columnalign="left" displaystyle="true">
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:msup>
                                        <mml:mi>P</mml:mi>
                                        <mml:mo>&#x2032;</mml:mo>
                                    </mml:msup>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mn>0</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>&#x00d7;</mml:mo>
                                    <mml:mi>C</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.15</mml:mn>
                                    <mml:mo>&#x00d7;</mml:mo>
                                    <mml:mn>0.85</mml:mn>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.127</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                            <mml:mtr>
                                <mml:mtd>
                                    <mml:msup>
                                        <mml:mi>P</mml:mi>
                                        <mml:mo>&#x2032;</mml:mo>
                                    </mml:msup>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>y</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mi>P</mml:mi>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>y</mml:mi>
                                        <mml:mo>|</mml:mo>
                                        <mml:mi>o</mml:mi>
                                        <mml:mo>,</mml:mo>
                                        <mml:mi>j</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>+</mml:mo>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>C</mml:mi>
                                        <mml:mrow>
                                            <mml:mo stretchy="true">(</mml:mo>
                                            <mml:mi>y</mml:mi>
                                            <mml:mo stretchy="true">)</mml:mo>
                                        </mml:mrow>
                                        <mml:mo>&#x00d7;</mml:mo>
                                        <mml:mn>0.1</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0</mml:mn>
                                    <mml:mo>+</mml:mo>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mn>0.78</mml:mn>
                                        <mml:mo>&#x00d7;</mml:mo>
                                        <mml:mn>0.1</mml:mn>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>0.078</mml:mn>
                                </mml:mtd>
                            </mml:mtr>
                        </mml:mtable>
                    </mml:math>
</disp-formula>
            </p>
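            <p>The two updates in the equation above can be combined in one small Python sketch. The rule used here to choose between them (scale a non-zero probability, add a weighted confidence term to a zero probability) is an assumption inferred from this single example; the source does not state it as a general rule. The names adjust and FLOOR_WEIGHT are hypothetical.</p>

```python
FLOOR_WEIGHT = 0.1  # weight applied to C(y) in the additive update


def adjust(probability, confidence, floor_weight=FLOOR_WEIGHT):
    """Confidence-adjust one transition probability.

    Seen transitions (P above zero) are scaled:   P' = P * C.
    Unseen transitions (P equal to zero) get an
    additive term:                                P' = P + C * weight.
    """
    if probability > 0:
        return probability * confidence
    return probability + confidence * floor_weight


p_o = adjust(0.15, 0.85)  # P'(o | j, 0), scaled
p_y = adjust(0.0, 0.78)   # P'(y | o, j), lifted from zero

print(p_o, p_y)
```

            <p>The additive branch is what makes the previously unseen transition scoreable, since a purely multiplicative update would leave it at zero.</p>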
            <p>Now, 
                <bold>&#x2018;joy&#x2019;</bold> is re-evaluated under APSL-BPNN-HMM and is no longer rejected: the confidence term makes the previously zero probability of &#x2018;y&#x2019; following &#x2018;jo&#x2019; non-zero, so the phoneme transition can be scored.</p>
            <p>

                <bold>4.5.3 Dynamic text wrapping</bold>
            </p>
            <p>Dynamic text wrapping is applied each time phonemes are mapped between windows, wrapping and merging words after HMM processing. When acoustic features are involved, this is called dynamic time warping. Consider 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>n</mml:mi>
                    </mml:math>
</inline-formula> lexical pairs formed in each window (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:msub>
                            <mml:mi>w</mml:mi>
                            <mml:mi>i</mml:mi>
                        </mml:msub>
                    </mml:math>
</inline-formula>) for frame (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:msub>
                            <mml:mi>f</mml:mi>
                            <mml:mi>i</mml:mi>
                        </mml:msub>
                    </mml:math>
</inline-formula>). Feature duplication occurs between previous (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:msub>
                            <mml:mi>w</mml:mi>
                            <mml:mrow>
                                <mml:mi>n</mml:mi>
                                <mml:mo>&#x2212;</mml:mo>
                                <mml:mn>1</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                    </mml:math>
</inline-formula>) and present (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:msub>
                            <mml:mi>w</mml:mi>
                            <mml:mi>n</mml:mi>
                        </mml:msub>
                    </mml:math>
</inline-formula>) windows because of the 0.5 ms overlap region. The process is as follows:
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Compare the last letter of the previous window (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mrow>
                                            <mml:mi>n</mml:mi>
                                            <mml:mo>&#x2212;</mml:mo>
                                            <mml:mn>1</mml:mn>
                                        </mml:mrow>
                                    </mml:msub>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:msub>
                                            <mml:mi>a</mml:mi>
                                            <mml:mi>n</mml:mi>
                                        </mml:msub>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                </mml:math>
</inline-formula>) with the first letter of the present window (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>w</mml:mi>
                                        <mml:mi>n</mml:mi>
                                    </mml:msub>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:msub>
                                            <mml:mi>a</mml:mi>
                                            <mml:mn>1</mml:mn>
                                        </mml:msub>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                </mml:math>
</inline-formula>).</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>If they are identical, delete one of them and concatenate the remaining letters.</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Repeat for all windows (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:mi>w</mml:mi>
                                </mml:math>
</inline-formula>) across all frames (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>f</mml:mi>
                                        <mml:mi>n</mml:mi>
                                    </mml:msub>
                                </mml:math>
</inline-formula>).</p>
                    </list-item>
                </list>
            </p>
            <p>Here, 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>a</mml:mi>
                    </mml:math>
</inline-formula> denotes a letter (alphabet character), with subscripts indicating its position within a word. According to the HMM, the word &#x201c;the&#x201d; is segmented as follows:
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Window 1 (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>W</mml:mi>
                                        <mml:mn>1</mml:mn>
                                    </mml:msub>
                                </mml:math>
</inline-formula>) compared with window 2 (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>W</mml:mi>
                                        <mml:mn>2</mml:mn>
                                    </mml:msub>
                                </mml:math>
</inline-formula>):</p>
                    </list-item>
                </list>

                <disp-formula id="e29">

                    <mml:math display="block">
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>1</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mi>n</mml:mi>
                            </mml:msub>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>&#x223c;</mml:mo>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>2</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mn>1</mml:mn>
                            </mml:msub>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>,</mml:mo>
                        <mml:mspace width="1em"/>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>1</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>t</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>&#x223c;</mml:mo>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>2</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>t</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                    </mml:math>
</disp-formula>
            </p>
            <p>Since &#x2018;t&#x2019; appears in both, cancel one &#x2018;t&#x2019;. Remaining: &#x201c;t&#x201d;.
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Window 2 (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>W</mml:mi>
                                        <mml:mn>2</mml:mn>
                                    </mml:msub>
                                </mml:math>
</inline-formula>) compared with window 3 (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>W</mml:mi>
                                        <mml:mn>3</mml:mn>
                                    </mml:msub>
                                </mml:math>
</inline-formula>):</p>
                    </list-item>
                </list>

                <disp-formula id="e30">

                    <mml:math display="block">
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>2</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mi>n</mml:mi>
                            </mml:msub>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>&#x223c;</mml:mo>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>3</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mn>1</mml:mn>
                            </mml:msub>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>,</mml:mo>
                        <mml:mspace width="1em"/>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>2</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>h</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>&#x223c;</mml:mo>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>3</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>h</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                    </mml:math>
</disp-formula>
            </p>
            <p>Since &#x2018;h&#x2019; appears in both, cancel one &#x2018;h&#x2019;. Remaining: &#x201c;th&#x201d;.
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Window 3 (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>W</mml:mi>
                                        <mml:mn>3</mml:mn>
                                    </mml:msub>
                                </mml:math>
</inline-formula>) compared with window 4 (
                            <inline-formula>

                                <mml:math display="inline">
                                    <mml:msub>
                                        <mml:mi>W</mml:mi>
                                        <mml:mn>4</mml:mn>
                                    </mml:msub>
                                </mml:math>
</inline-formula>):</p>
                    </list-item>
                </list>

                <disp-formula id="e31">

                    <mml:math display="block">
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>3</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mi>n</mml:mi>
                            </mml:msub>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>&#x223c;</mml:mo>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>4</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:msub>
                                <mml:mi>a</mml:mi>
                                <mml:mn>1</mml:mn>
                            </mml:msub>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>,</mml:mo>
                        <mml:mspace width="1em"/>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>3</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>h</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                        <mml:mo>&#x223c;</mml:mo>
                        <mml:msub>
                            <mml:mi>W</mml:mi>
                            <mml:mn>4</mml:mn>
                        </mml:msub>
                        <mml:mrow>
                            <mml:mo stretchy="true">(</mml:mo>
                            <mml:mi>e</mml:mi>
                            <mml:mo stretchy="true">)</mml:mo>
                        </mml:mrow>
                    </mml:math>
</disp-formula>
            </p>
            <p>Since 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mi>h</mml:mi>
                        <mml:mo>&#x2260;</mml:mo>
                        <mml:mi>e</mml:mi>
                    </mml:math>
</inline-formula>, keep &#x2018;e&#x2019;. Final sequence: &#x201c;the&#x201d;.</p>
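            <p>The merging steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name and the window contents are hypothetical, chosen only to be consistent with the comparisons shown for windows 1 to 4.</p>
            <preformat preformat-type="code"><![CDATA[
```python
def merge_windows(windows):
    """Concatenate window texts, dropping a boundary letter duplicated by overlap."""
    merged = ""
    for text in windows:
        if not text:
            continue  # nothing to merge from an empty window
        if merged and merged[-1] == text[0]:
            # The last letter so far equals the first letter of this window:
            # keep one copy and append the rest.
            merged += text[1:]
        else:
            merged += text
    return merged


# Hypothetical window contents consistent with the worked example.
windows = ["t", "th", "h", "e"]
print(merge_windows(windows))  # the
```
]]></preformat>
            <p>Note that this simple rule would also collapse a genuine double letter that happens to straddle a window boundary, so a practical system would need an additional check for that case.</p>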
            <p>

                <graphic id="gr8" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure8.gif"/>
            </p>
            <p>

                <bold>Memory Efficiency:</bold> Dynamic text wrapping links acoustic features with language parameters without requiring memory storage. An array stores the frame contents, where i) the array size is determined by the number of frames, ii) memory addresses are allocated in ascending order as words form, and iii) wrapped texts are stored efficiently. On completion, the words are concatenated as follows (refer to 
                <xref ref-type="table" rid="T2">
Table 2</xref>):</p>
        </sec>
        <sec id="sec14" sec-type="results|discussions">
            <title>5. Results and discussions</title>
            <sec id="sec15">
                <title>5.1 Experimental set-up</title>
                <p>The preprocessing phase is crucial for accurate and efficient speech recognition. To establish a robust dataset, 1000 audio files were recorded manually using a mono-channel setup, with participants aged 15 to 80 and including both fluent and non-fluent English speakers. All participants provided informed verbal consent in accordance with institutional ethical guidelines, as the study posed minimal risk and involved no sensitive personal data. Each participant used a microphone and was presented with sentences of varying length. To introduce real-world variability, the recordings were deliberately subjected to white and environmental noise. The noise was subsequently removed using high-pass and low-pass filters based on the Tunable Band Noise Reduction Technique (TBNRT) described in Section 3.2. The high-pass filter suppresses low-frequency noise:
                    <disp-formula id="e32">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>H</mml:mi>
                                <mml:mi mathvariant="italic">hp</mml:mi>
                            </mml:msub>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>f</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mi>f</mml:mi>
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>c</mml:mi>
                                </mml:msub>
                            </mml:mfrac>
                            <mml:mspace width="1em"/>
                            <mml:mtext>for</mml:mtext>
                            <mml:mspace width="1em"/>
                            <mml:mi>f</mml:mi>
                            <mml:mo>&gt;</mml:mo>
                            <mml:msub>
                                <mml:mi>f</mml:mi>
                                <mml:mi>c</mml:mi>
                            </mml:msub>
                        </mml:math>

                        <label>(20)</label>
</disp-formula>
                </p>
                <p>The low-pass filter attenuates high-frequency noise:
                    <disp-formula id="e33">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>H</mml:mi>
                                <mml:mi mathvariant="italic">lp</mml:mi>
                            </mml:msub>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>f</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:msub>
                                    <mml:mi>f</mml:mi>
                                    <mml:mi>c</mml:mi>
                                </mml:msub>
                                <mml:mi>f</mml:mi>
                            </mml:mfrac>
                            <mml:mspace width="1em"/>
                            <mml:mtext>for</mml:mtext>
                            <mml:mspace width="1em"/>
                            <mml:mi>f</mml:mi>
                            <mml:mo>&lt;</mml:mo>
                            <mml:msub>
                                <mml:mi>f</mml:mi>
                                <mml:mi>c</mml:mi>
                            </mml:msub>
                        </mml:math>

                        <label>(21)</label>
</disp-formula>
                </p>
                <p>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>f</mml:mi>
                                <mml:mi>c</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is the cutoff frequency based on detected noise profiles.</p>
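                <p>As a rough illustration of high-pass and low-pass filtering, the sketch below uses textbook first-order RC filters. It is not the TBNRT of Section 3.2 and does not implement Equations 20 and 21 directly. The function names and the example cutoff values are assumptions made for illustration; only the 44100 Hz sampling rate comes from this study.</p>
                <preformat preformat-type="code"><![CDATA[
```python
import math


def low_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order RC low-pass: attenuates content above cutoff_hz."""
    if not samples:
        return []
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out = [samples[0]]
    for x in samples[1:]:
        out.append(out[-1] + alpha * (x - out[-1]))
    return out


def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order RC high-pass: attenuates content below cutoff_hz."""
    if not samples:
        return []
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for prev_x, x in zip(samples, samples[1:]):
        out.append(alpha * (out[-1] + x - prev_x))
    return out


def band_limit(samples, low_cut_hz, high_cut_hz, sample_rate_hz):
    """Apply the high-pass stage, then the low-pass stage."""
    stage = high_pass(samples, low_cut_hz, sample_rate_hz)
    return low_pass(stage, high_cut_hz, sample_rate_hz)


# Hypothetical cutoffs; a constant (0 Hz) input stands in for low-frequency noise.
constant_input = [1.0] * 2000
filtered = band_limit(constant_input, 100.0, 8000.0, 44100.0)
```
]]></preformat>
                <p>In this sketch a constant input decays towards zero after the high-pass stage, whereas the low-pass stage alone would leave it unchanged. In practice the two cutoffs would be chosen from the detected noise profile.</p>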
                <p>Recordings were made at four sampling frequencies: 18000, 32300, 44100, and 56000 Hz. Empirically, 44100 Hz performed best, offering the best balance between memory efficiency and audio clarity, and it was therefore adopted as the standard sampling frequency. Additionally, 1000 audio files were sourced from online platforms featuring male and female speakers with diverse accents, including English and non-English speakers. The dataset covers multiple regions: i) Western Europe, ii) Eastern Europe, iii) Central Asia/Middle East/North Africa, iv) Sub-Saharan Africa, v) South Asia, vi) South East Asia, and vii) CJK (Chinese, Japanese, Korean).
                    <sup>
                        <xref ref-type="bibr" rid="ref36">36</xref>
                    </sup> This expanded dataset totals 2000 files (3 seconds to 3 minutes duration, 0.9 GB storage), plus 24 GB from the corpus detailed in Section 4.1, posing substantial memory challenges during training. To address memory overhead, the APSL-BPNN-HMM framework employs an Adaptive Phoneme State Learning (APSL) mechanism for efficient parameter utilization. APSL introduces adaptive parameter sharing, dynamically assigning model parameters across layers to reduce redundancy through shared weight matrices between neighboring phoneme states. Consider a BPNN layer with 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                        </mml:math>
</inline-formula> input neurons, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>m</mml:mi>
                        </mml:math>
</inline-formula> hidden neurons, and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>p</mml:mi>
                        </mml:math>
</inline-formula> output neurons. Without APSL, total parameters are:
                    <disp-formula id="e34">

                        <mml:math display="block">
                            <mml:mi mathvariant="normal">&#x0398;</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>n</mml:mi>
                                <mml:mo>&#x00d7;</mml:mo>
                                <mml:mi>m</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>+</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>m</mml:mi>
                                <mml:mo>&#x00d7;</mml:mo>
                                <mml:mi>p</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>+</mml:mo>
                            <mml:mi>b</mml:mi>
                        </mml:math>

                        <label>(22)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>b</mml:mi>
                        </mml:math>
</inline-formula> represents bias terms. APSL defines shared parameter matrices 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>W</mml:mi>
                                <mml:mi>s</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> for phoneme states with similar acoustic properties, reducing independent parameters:
                    <disp-formula id="e35">

                        <mml:math display="block">
                            <mml:msup>
                                <mml:mi mathvariant="normal">&#x0398;</mml:mi>
                                <mml:mo>&#x2032;</mml:mo>
                            </mml:msup>
                            <mml:mo>=</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>n</mml:mi>
                                <mml:mo>&#x00d7;</mml:mo>
                                <mml:mi>k</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>+</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>k</mml:mi>
                                <mml:mo>&#x00d7;</mml:mo>
                                <mml:mi>p</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                            <mml:mo>+</mml:mo>
                            <mml:mi>b</mml:mi>
                        </mml:math>

                        <label>(23)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>k</mml:mi>
                            <mml:mo>&lt;</mml:mo>
                            <mml:mi>m</mml:mi>
                        </mml:math>
</inline-formula> represents the reduced dimensional space through adaptive sharing. APSL dynamically adjusts 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>k</mml:mi>
                        </mml:math>
</inline-formula> based on phoneme similarity, reducing complexity without compromising accuracy.</p>
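                <p>The effect of Equations 22 and 23 on model size can be checked with a short calculation. The layer sizes below are hypothetical and chosen only for illustration; the bias count b is treated as a given constant that is identical in both equations, as written above.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def parameter_count(n_inputs, n_hidden, n_outputs, n_bias):
    """Parameters of one hidden layer: (n x hidden) + (hidden x p) + b."""
    return n_inputs * n_hidden + n_hidden * n_outputs + n_bias


# Hypothetical sizes: n inputs, m hidden units, p outputs, k shared units, b biases.
n, m, p = 39, 512, 40
k = 128
b = 552

theta = parameter_count(n, m, p, b)         # Eq. 22, without sharing
theta_shared = parameter_count(n, k, p, b)  # Eq. 23, with k in place of m
saved = theta - theta_shared

print(theta, theta_shared, saved)  # 41000 10664 30336
```
]]></preformat>
                <p>Because b appears in both expressions, the saving equals (n + p)(m - k), so it grows linearly as k is reduced relative to m.</p>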
                <p>APSL integrates dynamic thresholding for parameter sharing control. During training, a similarity matrix 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mi mathvariant="italic">ij</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is computed between phoneme states 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>i</mml:mi>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>j</mml:mi>
                        </mml:math>
</inline-formula>:
                    <disp-formula id="e36">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mi mathvariant="italic">ij</mml:mi>
                            </mml:msub>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:msubsup>
                                        <mml:mo>&#x2211;</mml:mo>
                                        <mml:mrow>
                                            <mml:mi>t</mml:mi>
                                            <mml:mo>=</mml:mo>
                                            <mml:mn>1</mml:mn>
                                        </mml:mrow>
                                        <mml:mi>T</mml:mi>
                                    </mml:msubsup>
                                    <mml:msub>
                                        <mml:mi>&#x03d5;</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                    <mml:mo>&#x22c5;</mml:mo>
                                    <mml:msub>
                                        <mml:mi>&#x03d5;</mml:mi>
                                        <mml:mi>j</mml:mi>
                                    </mml:msub>
                                    <mml:mrow>
                                        <mml:mo stretchy="true">(</mml:mo>
                                        <mml:mi>t</mml:mi>
                                        <mml:mo stretchy="true">)</mml:mo>
                                    </mml:mrow>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:msqrt>
                                        <mml:mrow>
                                            <mml:msubsup>
                                                <mml:mo>&#x2211;</mml:mo>
                                                <mml:mrow>
                                                    <mml:mi>t</mml:mi>
                                                    <mml:mo>=</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                                <mml:mi>T</mml:mi>
                                            </mml:msubsup>
                                            <mml:msubsup>
                                                <mml:mi>&#x03d5;</mml:mi>
                                                <mml:mi>i</mml:mi>
                                                <mml:mn>2</mml:mn>
                                            </mml:msubsup>
                                            <mml:mrow>
                                                <mml:mo stretchy="true">(</mml:mo>
                                                <mml:mi>t</mml:mi>
                                                <mml:mo stretchy="true">)</mml:mo>
                                            </mml:mrow>
                                        </mml:mrow>
                                    </mml:msqrt>
                                    <mml:msqrt>
                                        <mml:mrow>
                                            <mml:msubsup>
                                                <mml:mo>&#x2211;</mml:mo>
                                                <mml:mrow>
                                                    <mml:mi>t</mml:mi>
                                                    <mml:mo>=</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                                <mml:mi>T</mml:mi>
                                            </mml:msubsup>
                                            <mml:msubsup>
                                                <mml:mi>&#x03d5;</mml:mi>
                                                <mml:mi>j</mml:mi>
                                                <mml:mn>2</mml:mn>
                                            </mml:msubsup>
                                            <mml:mrow>
                                                <mml:mo stretchy="true">(</mml:mo>
                                                <mml:mi>t</mml:mi>
                                                <mml:mo stretchy="true">)</mml:mo>
                                            </mml:mrow>
                                        </mml:mrow>
                                    </mml:msqrt>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:math>

                        <label>(24)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>&#x03d5;</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>t</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>&#x03d5;</mml:mi>
                                <mml:mi>j</mml:mi>
                            </mml:msub>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:mi>t</mml:mi>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula> are feature vectors of phoneme states 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>i</mml:mi>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>j</mml:mi>
                        </mml:math>
</inline-formula> at time 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>t</mml:mi>
                        </mml:math>
</inline-formula>. If 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mi mathvariant="italic">ij</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> exceeds threshold 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03c4;</mml:mi>
                        </mml:math>
</inline-formula>, phoneme states are grouped under a shared parameter layer (see 
                    <xref ref-type="fig" rid="f2">
Figure 2</xref>). This adaptive parameter sharing significantly reduces redundant storage, optimizing memory usage from 24 GB to approximately 15.12 GB.</p>
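                <p>The grouping rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors&#x2019; implementation: it assumes each phoneme state contributes one scalar feature per time step, that the similarity in Equation 24 is the cosine similarity of the two feature trajectories, and that states are merged transitively with a union-find structure. The function names, the toy data, and the default threshold value are hypothetical.</p>
                <code language="python"><![CDATA[
```python
import math


def cosine_similarity(phi_i, phi_j):
    """Cosine similarity of two feature trajectories of equal length T."""
    dot = sum(a * b for a, b in zip(phi_i, phi_j))
    norm_i = math.sqrt(sum(a * a for a in phi_i))
    norm_j = math.sqrt(sum(b * b for b in phi_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: an all-zero trajectory is similar to nothing
    return dot / (norm_i * norm_j)


def group_states(features, tau=0.75):
    """Return one group label per state; states whose similarity exceeds tau share a label."""
    n = len(features)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(features[i], features[j]) > tau:
                parent[find(j)] = find(i)  # merge the two groups
    return [find(i) for i in range(n)]


# Toy data (hypothetical): states 0 and 1 are scaled copies, state 2 is orthogonal.
features = [
    [1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
groups = group_states(features, tau=0.75)
```
]]></code>
                <p>In this toy example, states 0 and 1 receive the same group label and would share one parameter set, while state 2 keeps independent parameters. Raising the threshold makes sharing more conservative.</p>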
                <p>

                    <bold>Validation of Memory-Accuracy Trade-off</bold>: To confirm that APSL&#x2019;s 32% memory reduction from 24.0 GB to 15.12 GB does not degrade accuracy, we conducted an ablation study comparing four configurations. (1) The baseline HMM without APSL consumed 20.85 GB of memory and achieved 75.0% accuracy. (2) APSL with full parameters (k = m) consumed 20.15 GB and achieved 95.8% accuracy, a gain of 20.8 percentage points over the baseline. (3) APSL with adaptive sharing (k = 128) consumed 15.12 GB and achieved 96.0% accuracy, a gain of 21.0 percentage points. (4) APSL with aggressive sharing (k = 64) consumed 11.80 GB but achieved only 92.3% accuracy, a gain of 17.3 percentage points. The adaptive sharing configuration (k = 128) therefore improved accuracy slightly over the full-parameter configuration, from 95.8% to 96.0%. We attribute this to parameter sharing acting as an implicit regularizer that reduces overfitting by limiting model complexity. The similarity threshold &#x03c4; was set to 0.75, tuned on validation data to balance parameter sharing against state specificity. The memory reduction eases hardware constraints and accelerates convergence by limiting parameter growth, which supports efficient resource utilization and scalability for large-scale speech recognition tasks.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>
Figure 2. </label>
                    <caption>
                        <title>A flowchart depicting the APSL mechanism from input features through feature extraction, similarity matrix computation, and threshold-based decision making to form shared or independent parameters, culminating in the final prediction.</title>
                    </caption>
                    <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure2.gif"/>
                </fig>
                <p>
                    <xref ref-type="fig" rid="f3">
Figure 3</xref> compares memory consumption across training iterations for the baseline and the proposed APSL-BPNN-HMM framework. The baseline model (sky blue) steadily increases its memory usage, reaching approximately 20.85 GB at 5000 iterations, while APSL-BPNN-HMM (navy) stays significantly lower, stabilizing around 15.15 GB. This reduction reflects APSL&#x2019;s effectiveness in minimizing redundant parameter storage through dynamic thresholding and shared weight matrices: adaptive parameter sharing reduces the number of independent parameters, controlling model complexity without compromising accuracy. Consequently, APSL-BPNN-HMM achieves a 32% memory reduction, accelerating convergence and enhancing scalability for large-scale tasks. An embedded subplot shows how accuracy fluctuates across iterations and memory usage for both models; APSL-BPNN-HMM maintains higher accuracy while using less memory. Yellow and blue markers indicate the peak accuracies of APSL-BPNN-HMM (96%) and the baseline HMM (75%), together with the memory usage at the corresponding iteration. An enlarged contour plot emphasizes these accuracy peaks, with warmer colors indicating higher accuracy: APSL-BPNN-HMM peaks at 96% at approximately 15.15 GB, while the HMM peaks at 75% at around 20.85 GB.</p>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>
Figure 3. </label>
                    <caption>
                        <title>A line graph compares memory consumption across training iterations for the Baseline Model and APSL-BPNN-HMM, demonstrating reduced memory usage with adaptive parameter sharing, while an inset plot shows accuracy fluctuations for both models and an extended contour visualization highlights the memory&#x2013;accuracy tradeoff at peak performance points.</title>
                    </caption>
                    <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure3.gif"/>
                </fig>
                <p>
                    <bold>Hardware and Software Environment</bold>
                </p>
                <p>All experiments were conducted on a Windows-based workstation configured for speech recognition model training and evaluation. The system was equipped with an Intel Core i9 processor, whose high single-threaded performance suits sequential HMM processing. System memory was sufficient to process the largest training batches without disk swapping. An NVIDIA GeForce RTX 4060 GPU was used exclusively for BPNN training, while HMM processing and feature extraction remained CPU-bound. Storage was a high-speed NVMe solid-state drive, providing rapid data loading for the 24 GB dataset. The operating system was the latest stable release of Windows (exact version unspecified), chosen for compatibility with Praat and other phonetic analysis tools. Software dependencies included Python for model implementation, PyTorch for neural network operations, Praat for phonetic analysis and speech manipulation, and Librosa for audio feature extraction. Training the APSL-BPNN-HMM model took approximately 72 hours for 200 epochs, with most of the computation time consumed by Viterbi decoding and adaptive windowing operations.</p>
            </sec>
            <sec id="sec16">
                <title>5.2 Metrics</title>
                <p>

                    <bold>5.2.1 Classification metrics: Recall, precision and F-score</bold>
                </p>
                <p>Computing the F-measure requires both recall and precision. Precision is the ratio of correctly identified words to all recognized words (
                    <xref ref-type="disp-formula" rid="e37">
Equation 25</xref>). For example, if ten speech features are identified as positive, precision is the fraction of those ten that are transcribed to the correct text. Recall is the percentage of the keywords that should have been identified that actually were identified (
                    <xref ref-type="disp-formula" rid="e38">Equation 26</xref>). For example, if ten positive samples exist, recall is the fraction of those ten that the classifier identifies. The F-score is the harmonic mean of recall and precision (
                    <xref ref-type="disp-formula" rid="e39">Equation 27</xref>). These metrics are built from four outcome categories: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN),
                    <sup>
                        <xref ref-type="bibr" rid="ref37">37</xref>
                    </sup> defined as: 
                    <bold>i) True Positive (TP)</bold>: Words present in audio are accurately retrieved as text (e.g., &#x201c;living&#x201d; in audio 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mo>&#x2192;</mml:mo>
                        </mml:math>
</inline-formula> &#x201c;living&#x201d; in text). 
                    <bold>ii) False Positive (FP)</bold>: Words not in audio are retrieved as correct words (e.g., &#x201c;Emanuel run the show&#x201d; 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mo>&#x2192;</mml:mo>
                        </mml:math>
</inline-formula> &#x201c;E manual run the show,&#x201d; where &#x201c;E manual&#x201d; does not occur in the audio). 
                    <bold>iii) False Negative (FN)</bold>: Words in audio are not correctly retrieved (e.g., &#x201c;geographical&#x201d; and &#x201c;transmission&#x201d; 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mo>&#x2192;</mml:mo>
                        </mml:math>
</inline-formula> &#x201c;geografical&#x201d; and &#x201c;transmition&#x201d;). 
                    <bold>iv) True Negative (TN)</bold>: Words absent in audio are not retrieved as text (e.g., &#x201c;Hope&#x201d; absent in audio and not transcribed).
                    <disp-formula id="e37">

                        <mml:math display="block">
                            <mml:mtext>Precision</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mi mathvariant="italic">TP</mml:mi>
                                <mml:mrow>
                                    <mml:mi mathvariant="italic">TP</mml:mi>
                                    <mml:mo>+</mml:mo>
                                    <mml:mi mathvariant="italic">FP</mml:mi>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:math>

                        <label>(25)</label>
</disp-formula>

                    <disp-formula id="e38">

                        <mml:math display="block">
                            <mml:mtext>Recall</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mi mathvariant="italic">TP</mml:mi>
                                <mml:mrow>
                                    <mml:mi mathvariant="italic">TP</mml:mi>
                                    <mml:mo>+</mml:mo>
                                    <mml:mi mathvariant="italic">FN</mml:mi>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:math>

                        <label>(26)</label>
</disp-formula>

                    <disp-formula id="e39">

                        <mml:math display="block">
                            <mml:msub>
                                <mml:mi>F</mml:mi>
                                <mml:mn>1</mml:mn>
                            </mml:msub>
                            <mml:mo>=</mml:mo>
                            <mml:mn>2</mml:mn>
                            <mml:mo>&#x00d7;</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:mtext>Precision</mml:mtext>
                                    <mml:mo>&#x00d7;</mml:mo>
                                    <mml:mtext>Recall</mml:mtext>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mtext>Precision</mml:mtext>
                                    <mml:mo>+</mml:mo>
                                    <mml:mtext>Recall</mml:mtext>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:math>

                        <label>(27)</label>
</disp-formula>
                </p>
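                <p>Equations 25 to 27 translate directly into code. The sketch below is illustrative only: the function name and the counts are hypothetical and are not results from this study, and returning 0.0 for an empty denominator is a convention assumed here.</p>
                <code language="python"><![CDATA[
```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from word-level outcome counts."""
    # Precision: share of retrieved words that are correct (Equation 25).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of words in the audio that were retrieved (Equation 26).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (Equation 27).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct words, 10 spurious words, 5 missed words.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
```
]]></code>
                <p>True negatives do not enter any of the three formulas, which is why these metrics remain informative when most candidate words are absent from a given recording.</p>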
                <p>

                    <bold>Accuracy</bold>: Accuracy also counts automatically trained words as correct. For example, &#x201c;joy&#x201d; was not in the corpus, but its phonemes were automatically trained using the HMM and the word was retrieved. The measures considered are: of the total words in the audio (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>A</mml:mi>
                        </mml:math>
</inline-formula>), how many are exactly present (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msup>
                                <mml:mi>A</mml:mi>
                                <mml:mo>+</mml:mo>
                            </mml:msup>
                        </mml:math>
</inline-formula>), how many were automatically trained (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msup>
                                <mml:mi>A</mml:mi>
                                <mml:mo>&#x2217;</mml:mo>
                            </mml:msup>
                        </mml:math>
</inline-formula>), and how many were not identified (
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msup>
                                <mml:mi>A</mml:mi>
                                <mml:mo>&#x2032;</mml:mo>
                            </mml:msup>
                        </mml:math>
</inline-formula>)? (
                    <xref ref-type="disp-formula" rid="e40">Equation 28</xref>)
                    <disp-formula id="e40">

                        <mml:math display="block">
                            <mml:mtext>Accuracy</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:msup>
                                        <mml:mi>A</mml:mi>
                                        <mml:mo>+</mml:mo>
                                    </mml:msup>
                                    <mml:mo>+</mml:mo>
                                    <mml:msup>
                                        <mml:mi>A</mml:mi>
                                        <mml:mo>&#x2217;</mml:mo>
                                    </mml:msup>
                                </mml:mrow>
                                <mml:mi>A</mml:mi>
                            </mml:mfrac>
                            <mml:mo>&#x00d7;</mml:mo>
                            <mml:mn>100</mml:mn>
                        </mml:math>

                        <label>(28)</label>
</disp-formula>
                </p>
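                <p>As a worked illustration of Equation 28, the following sketch computes the accuracy from the three word counts. The function and argument names and the numbers are hypothetical and serve only to show the arithmetic.</p>
                <code language="python"><![CDATA[
```python
def accuracy_percent(total_words, exact_words, auto_trained_words):
    """Accuracy = (A+ + A*) / A * 100, with A the total number of words in the audio."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (exact_words + auto_trained_words) / total_words


# Hypothetical example: 200 words in the audio, 180 exactly present in the corpus,
# 12 recovered through automatically trained phonemes, 8 not identified.
acc = accuracy_percent(total_words=200, exact_words=180, auto_trained_words=12)
```
]]></code>
                <p>With these counts, 192 of the 200 words are credited, giving an accuracy of 96%.</p>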
            </sec>
            <sec id="sec17">
                <title>5.3 Error metrics</title>
                <p>

                    <bold>5.3.1 BLEU score calculation</bold>
                </p>
                <p>The Bilingual Evaluation Understudy (BLEU) score evaluates machine-translated text quality against reference translations as shown in 
                    <xref ref-type="disp-formula" rid="e41">Equation 29</xref>:
                    <disp-formula id="e41">

                        <mml:math display="block">
                            <mml:mtext>BLEU</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:mi mathvariant="italic">BP</mml:mi>
                            <mml:mo>&#x22c5;</mml:mo>
                            <mml:mo>exp</mml:mo>
                            <mml:mrow>
                                <mml:mo stretchy="true">(</mml:mo>
                                <mml:munderover>
                                    <mml:mo movablelimits="false">&#x2211;</mml:mo>
                                    <mml:mrow>
                                        <mml:mi>n</mml:mi>
                                        <mml:mo>=</mml:mo>
                                        <mml:mn>1</mml:mn>
                                    </mml:mrow>
                                    <mml:mi>N</mml:mi>
                                </mml:munderover>
                                <mml:msub>
                                    <mml:mi>w</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                                <mml:mo>log</mml:mo>
                                <mml:msub>
                                    <mml:mi>p</mml:mi>
                                    <mml:mi>n</mml:mi>
                                </mml:msub>
                                <mml:mo stretchy="true">)</mml:mo>
                            </mml:mrow>
                        </mml:math>

                        <label>(29)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi mathvariant="italic">BP</mml:mi>
                        </mml:math>
</inline-formula> = Brevity Penalty (penalizes short translations), 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>w</mml:mi>
                                <mml:mi>n</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> = Weight for n-gram precision, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>n</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> = Precision for n-grams. As defined in Equation 29, BLEU lies between 0 and 1; it is reported here on a scale of 0 to 100, with higher values indicating better quality.</p>
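                <p>A minimal sentence-level sketch of Equation 29 is given below. It assumes a single reference, uniform weights w_n = 1/N with N = 4, clipped n-gram counts, and no smoothing; these choices and all names are assumptions made for illustration and may differ from the scoring tool used in this study.</p>
                <code language="python"><![CDATA[
```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU in [0, 1] for one candidate and one reference."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, a zero precision gives BLEU = 0
        log_sum += math.log(clipped / total) / max_n  # w_n * log p_n
    c, r = len(candidate), len(reference)
    # Brevity penalty: 1 for candidates longer than the reference, else exp(1 - r / c).
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_sum)


reference = "the cat sat on the mat".split()
full_score = bleu(reference, reference)                 # identical sentences
short_score = bleu("the cat sat on".split(), reference)  # correct but too short
```
]]></code>
                <p>The truncated candidate has perfect n-gram precision, so its score is reduced only by the brevity penalty. Multiplying the result by 100 gives the scale used in the tables.</p>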
                <p>

                    <bold>5.3.2 WER score calculation</bold>
                </p>
                <p>Word Error Rate (WER) evaluates Automatic Speech Recognition (ASR) systems and is given by 
                    <xref ref-type="disp-formula" rid="e42">Equation 30</xref>:
                    <disp-formula id="e42">

                        <mml:math display="block">
                            <mml:mi mathvariant="italic">WER</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:mi>S</mml:mi>
                                    <mml:mo>+</mml:mo>
                                    <mml:mi>D</mml:mi>
                                    <mml:mo>+</mml:mo>
                                    <mml:mi>I</mml:mi>
                                </mml:mrow>
                                <mml:mi>N</mml:mi>
                            </mml:mfrac>
                        </mml:math>

                        <label>(30)</label>
</disp-formula>where 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>S</mml:mi>
                        </mml:math>
</inline-formula> = Number of substitutions, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>D</mml:mi>
                        </mml:math>
</inline-formula> = Number of deletions, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>I</mml:mi>
                        </mml:math>
</inline-formula> = Number of insertions, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>N</mml:mi>
                        </mml:math>
</inline-formula> = Number of words in the reference.</p>
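                <p>The counts S, D and I in Equation 30 come from a minimum edit-distance alignment between the reference and the recognized word sequence. The sketch below computes their sum with a standard dynamic-programming table; the function name and whitespace tokenization are assumptions made for illustration, and the example sentence is the one used for false positives in Section 5.2.1.</p>
                <code language="python"><![CDATA[
```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N using word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)


# "Emanuel" is replaced by "E" (substitution) and "manual" is inserted.
wer = word_error_rate("Emanuel run the show", "E manual run the show")
```
]]></code>
                <p>Here two edits against a four-word reference give a WER of 0.5, or 50% when expressed as a percentage. Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.</p>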
            </sec>
        </sec>
        <sec id="sec18" sec-type="results">
            <title>6. Results</title>
            <p>
                <xref ref-type="fig" rid="f4">
Figure 4</xref> illustrates the APSL-BPNN-HMM model&#x2019;s accuracy progression over 160 training epochs. The cyan dashed line represents smoothed training accuracy, while the dark blue dotted line represents smoothed testing accuracy. Both curves rise rapidly during the initial epochs and stabilize after approximately 40 epochs. Training accuracy approaches 100%, while testing accuracy stabilizes slightly below 95%, indicating strong performance with minimal overfitting.</p>
            <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                <label>
Figure 4. </label>
                <caption>
                    <title>A line graph showing the accuracy trends of the APSL-BPNN-HMM model across 160 training epochs.</title>
                    <p>The cyan dashed line indicates smoothed training accuracy, which rises quickly and nears 100%. The dark blue dotted line shows smoothed testing accuracy, increasing rapidly in early epochs and leveling off just below 95%.</p>
                </caption>
                <graphic id="gr4" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure4.gif"/>
            </fig>
            <p>
                <xref ref-type="fig" rid="f5">
Figure 5</xref> depicts the distribution of recall, precision, and F-score across three distinct categories. These metrics are calculated over the full dataset of 2000 files, with values falling within three percentage ranges: 1) 90&#x2013;98%, 2) 87&#x2013;95%, and 3) 87&#x2013;94%. The violin plots in 
                <xref ref-type="fig" rid="f5">Figure 5</xref> show the probability density of the metric values within these ranges: the width of each &#x2018;violin&#x2019; represents the density of values at that level, with broader sections indicating higher density. The heatmap in 
                <xref ref-type="fig" rid="f5">Figure 5</xref> shows how the metrics correlate with one another, with warmer colors indicating stronger positive correlations and cooler colors indicating negative correlations. Together, the two panels summarize the general trends and relationships among the metrics within the specified ranges. The average recall is 95.7%, the average precision is 92.95%, and the average F-score is 94.53%.</p>
            <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                <label>
Figure 5. </label>
                <caption>
                    <title>Distribution of performance metrics and correlation analysis of the APSL-BPNN-HMM model.</title>
                    <p>The left subplot presents a violin plot illustrating the distribution of recall, precision, and F-score across defined percentage ranges, while the right subplot displays a heatmap of the correlation among these metrics.</p>
                </caption>
                <graphic id="gr5" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure5.gif"/>
            </fig>
            <p>To evaluate the proposed APSL-BPNN-HMM model against human transcription and the baseline HMM, we included an audio file containing noise and disturbances. This sample served as the input in all three cases, allowing us to assess robustness under real-world noisy conditions. The noisy input audio file, depicted in 
                <xref ref-type="fig" rid="f6">
Figure 6</xref>, reflects typical background noise scenarios such as murmurs in a cafeteria, constant hums from air conditioning units, and random disturbances like keyboard taps or coughs.</p>
            <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                <label>
Figure 6. </label>
                <caption>
                    <title>Noisy audio input used for testing the APSL-BPNN-HMM model, HMM model, and Human performance.</title>
                </caption>
                <graphic id="gr6" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure6.gif"/>
            </fig>
            <p>
                <xref ref-type="table" rid="T3">
Table 3</xref> reports the performance of the APSL-BPNN-HMM model in terms of Word Error Rate (WER) and BLEU score for audio recordings collected across diverse geographical regions, as discussed in Section 5.1. The upper portion of the table summarizes the ASR WER, where lower values represent improved recognition accuracy. The lower portion presents the BLEU score for audio translation, where higher scores indicate better translation fidelity. Across all regional categories, the APSL-BPNN-HMM model attains a lower error rate than the baseline HMM model, and its BLEU score is also higher; human transcription and translation performance is listed as a reference benchmark.</p>
            <table-wrap id="T3" orientation="portrait" position="float">
                <label>
Table 3. </label>
                <caption>
                    <title>Performance of ASR and translation models across geographical regions.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Geographical region/Metric</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Human</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">HMM</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">
APSL-BPNN-HMM</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="4" rowspan="1" valign="top">
                                <bold>ASR Word Error Rate (WER) &#x2013; Lower is Better</bold>
</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Western European</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Eastern European</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">14</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Central Asia/Middle East/North Africa</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">21</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">11</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">5</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Sub-Saharan Africa</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">33</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">17</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">7</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">South Asia</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">35</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">22</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">8</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">South East Asia</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">9</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">5</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">CJK (CER)</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">5</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">5</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="4" rowspan="1" valign="top">
                                <bold>BLEU Score &#x2013; Higher is Better</bold>
</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Overall Translation Quality (BLEU Score)</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">29</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">40</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">48</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <p>The results of this evaluation, shown in 
                <xref ref-type="fig" rid="f7">
Figure 7</xref>, compare APSL-BPNN-HMM, HMM, and Human performance across five filtering conditions and Word Error Rate (WER). Each subplot shows performance trends as the filtering parameter varies (Hz), highlighting the impact of noise-reduction techniques on accuracy and WER. Control and Core Filtering combines fundamental noise reduction with adaptive mechanisms to suppress noise while preserving essential features, e.g., steady background noise in a cafeteria. APSL-BPNN-HMM maintains high accuracy across parameters, whereas HMM declines sharply after parameter 3, and Human performance remains low. Core Spectral Notch Filtering targets specific frequency bands, e.g., removing 60 Hz AC hum in a conference call. APSL-BPNN-HMM performs best at higher parameter values; HMM deteriorates with aggressive filtering, and Human accuracy declines. Spectral Notch Filtering applies frequency-specific filtering without adaptivity, e.g., reducing low-frequency hum in a studio podcast. APSL-BPNN-HMM balances noise reduction and signal preservation, HMM struggles at high parameters, and Human performance stays lowest. Core Temporal Notch Filtering integrates core filtering with temporal suppression to handle transient noise, e.g., coughs or keyboard taps. APSL-BPNN-HMM maintains high accuracy; HMM declines with aggressive filtering, and Human accuracy declines steadily. Temporal Notch Filtering targets time-based noise, e.g., chair movements or pen drops in a conference room. APSL-BPNN-HMM shows superior adaptability, HMM deteriorates at high parameters, and Human accuracy remains low. WER is the number of word substitutions, deletions, and insertions in the recognized transcript divided by the number of words in the reference, with lower values indicating better performance. APSL-BPNN-HMM achieves the lowest WER, especially with larger test samples, followed by HMM and then Humans. 
Overall, APSL-BPNN-HMM consistently outperforms HMM and Humans across all filtering methods, demonstrating robust noise suppression, improved speech recognition, and resilience under aggressive filtering. Its low WER confirms stability and scalability in large-scale evaluations.</p>
            <fig fig-type="figure" id="f7" orientation="portrait" position="float">
                <label>
Figure 7. </label>
                <caption>
                    <title>Performance comparison of APSL-BPNN-HMM, HMM, and Human across various noise reduction techniques and Word Error Rate (WER).</title>
                    <p>The subplots represent: (1) Control and Core Filtering, (2) Core Spectral Notch Filtering, (3) Spectral Notch Filtering, (4) Core Temporal Notch Filtering, (5) Temporal Notch Filtering, and (6) Word Error Rate (WER). The shaded regions indicate a &#x00b1;5% uncertainty range around the plotted values, representing potential variability in the measurements.</p>
                </caption>
                <graphic id="gr7" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/202186/e8064d8a-91bb-4486-98a2-b01512d69de5_figure7.gif"/>
            </fig>
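            <p>As a concrete reading of the WER panel, the metric can be sketched as a word-level edit distance divided by the reference length. The snippet below is a minimal illustration under that standard definition, not the evaluation script used in this study; the function name <monospace>word_error_rate</monospace> and the example transcripts are hypothetical.</p>
            <code language="python"><![CDATA[
```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # cost of deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical transcripts: one of six reference words ("the") is dropped.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```
]]></code>
            <p>In the toy case, a single deletion against six reference words gives a WER of one sixth; a perfect transcript gives zero.</p>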
            <p>
                <xref ref-type="table" rid="T4">
Table 4</xref> presents a detailed comparison of the performance of two models &#x2014; the conventional Hidden Markov Model (HMM) and the proposed APSL-BPNN-HMM &#x2014; across five representative speech corpora: 
                <italic toggle="yes">1) British National Corpus (BNC),
</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref27">27</xref>
                </sup> 
                <italic toggle="yes">2) American National Corpus (ANC),
</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref28">28</xref>
                </sup> 
                <italic toggle="yes">3) Corpus of Contemporary American English (COCA),
</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref29">29</xref>
                </sup> 
                <italic toggle="yes">4) Buckeye Speech Corpus,
</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref30">30</xref>
                </sup> 
                <italic toggle="yes">and 5) Emu Speech Database.</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref31">31</xref>
                </sup> The table reports four key classification metrics for each corpus: Accuracy, Precision, Recall, and F1-Score. Across all corpora, the APSL-BPNN-HMM consistently outperforms the baseline HMM, with notable improvements in recall and F1-score, highlighting its robustness in handling imbalanced and spontaneous speech data. While Buckeye and Emu corpora were partially included in training, rigorous safeguards were implemented to avoid data leakage. Specifically, speaker-level partitioning ensured that no individual&#x2019;s data appeared in both training and testing sets. In addition, temporal segmentation preserved distinct time windows for each data split. Cross-validation techniques were employed to assess generalization, ensuring reliable evaluation of the model on unseen speech samples.</p>
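            <p>Speaker-level partitioning of the kind described above can be sketched as follows. This is an illustrative outline only, assuming each utterance is labelled with a speaker identifier; the function name <monospace>speaker_level_split</monospace>, the 20% test fraction, and the toy data are hypothetical and are not taken from the study.</p>
            <code language="python"><![CDATA[
```python
import random


def speaker_level_split(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, utterance_id) pairs so that no speaker spans both sets."""
    speakers = sorted({speaker for speaker, _ in utterances})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test


# Hypothetical toy data: 10 speakers with 3 utterances each.
toy_utterances = [(f"spk{s:02d}", f"utt{k}") for s in range(10) for k in range(3)]
train_set, test_set = speaker_level_split(toy_utterances)
train_speakers = {speaker for speaker, _ in train_set}
held_out_speakers = {speaker for speaker, _ in test_set}
print(len(train_set), len(test_set), train_speakers & held_out_speakers)
```
]]></code>
            <p>Because the split is made over speaker identifiers rather than individual utterances, the intersection of the two speaker sets is empty by construction.</p>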
            <table-wrap id="T4" orientation="portrait" position="float">
                <label>
Table 4. </label>
                <caption>
                    <title>Comparison of HMM and APSL&#x2013;BPNN&#x2013;HMM performance across five corpora.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Corpus</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Model</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Accuracy</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Precision</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Recall</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">F1-score</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="2" valign="top">Corpus 1 (BNC)</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.3800</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.8250</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.2200</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.3474</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">APSL&#x2013;BPNN&#x2013;HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7250</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7654</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.9133</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.8328</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="2" valign="top">Corpus 2 (ANC)</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.4150</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7143</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.3667</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.4846</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">APSL&#x2013;BPNN&#x2013;HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7150</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7514</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.9267</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.8299</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="2" valign="top">Corpus 3 (COCA)</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.4650</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7590</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.4200</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.5408</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">APSL&#x2013;BPNN&#x2013;HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7200</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7640</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.9067</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.8293</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="2" valign="top">Corpus 4 (Buckeye)</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.4450</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7600</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.3800</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.5067</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">APSL&#x2013;BPNN&#x2013;HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7500</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7500</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">1.0000</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.8571</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="2" valign="top">Corpus 5 (Emu)</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.4500</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7381</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.4133</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.5299</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">APSL&#x2013;BPNN&#x2013;HMM</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7000</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7586</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.8800</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.8148</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
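            <p>The F1-scores in Table 4 follow from the listed precision and recall values through the usual harmonic-mean formula. The short check below recomputes three rows as a worked example; the helper name <monospace>f1_score</monospace> is illustrative, and the numbers are copied from the table rather than regenerated from the models.</p>
            <code language="python"><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three Table 4 rows
table4_rows = {
    "Corpus 1, HMM": (0.8250, 0.2200, 0.3474),
    "Corpus 1, APSL-BPNN-HMM": (0.7654, 0.9133, 0.8328),
    "Corpus 4, APSL-BPNN-HMM": (0.7500, 1.0000, 0.8571),
}

for name, (precision, recall, reported) in table4_rows.items():
    computed = round(f1_score(precision, recall), 4)
    print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```
]]></code>
            <p>For these rows, the recomputed values agree with the reported F1-scores to four decimal places.</p>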
            <p>While modern end-to-end (E2E) models such as Whisper and Conformer achieve lower WER on standard benchmarks, our model demonstrates competitive performance (10% WER on complex files) together with explicit phoneme-state interpretability, a feature absent in black-box E2E models. On the Buckeye corpus (spontaneous speech), our model&#x2019;s recall (100%) exceeds typical E2E performance (&#x2248;85&#x2013;90%). However, we acknowledge that on clean, large-vocabulary tasks, Transformer-based models would likely outperform our approach. The strength of APSL-BPNN-HMM lies in memory-constrained environments and phoneme-level diagnostic feedback (refer to 
                <xref ref-type="table" rid="T5">
Table 5</xref>). We chose the HMM-BPNN hybrid over contemporary end-to-end models such as Whisper and Conformer for three primary reasons. First, controlled comparison: our primary claim is improvement over the traditional HMM baseline rather than surpassing state-of-the-art performance on standard benchmarks. Second, interpretability: clinical and forensic applications require phoneme-level confidence scores and explicit state alignments that black-box end-to-end models do not provide. Third, reproducibility: the HMM-BPNN hybrid can be trained on commodity hardware with 16 GB RAM without requiring a GPU, making our results accessible to researchers with limited computational resources. That said, we provide a comparison to Whisper-tiny in 
                <xref ref-type="table" rid="T5">
Table 5</xref> for contextual reference.
                <table-wrap id="T5" orientation="portrait" position="float">
                    <label>
Table 5. </label>
                    <caption>
                        <title>Comparison with contemporary ASR systems.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Model</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">WER (%)</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Accuracy (%)</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Memory (GB)</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Inference Time (sec/5sec audio)</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">HMM (baseline)</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~15-20</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">75</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">20.85</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~3</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">DeepSpeech (Mozilla)</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">12.3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">N/A</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~0.5</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Whisper-tiny (OpenAI)</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">10.5</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">N/A</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~0.3</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Conformer-CTC
</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">8.2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">N/A</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~8</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~0.4</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">
                                    <bold>APSL-BPNN-HMM</bold>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">
                                    <bold>~5-10</bold>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">
                                    <bold>96</bold>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">
                                    <bold>15.12</bold>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">
                                    <bold>3-10</bold>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Human (reference)</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">~98</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">N/A</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">N/A</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
</p>
        </sec>
        <sec id="sec19" sec-type="discussion">
            <title>7. Discussion</title>
            <p>The proposed system translates acoustic features into language models, showing promise for effective speech recognition. However, certain phonemes, such as those in &#x201c;geographical&#x201d; and &#x201c;transmission,&#x201d; were misidentified due to errors in mapping acoustic features, leading to syllable and spelling mistakes. While we acknowledge these limitations, the APSL framework&#x2019;s adaptive window adjustment directly addresses variability by extending analysis windows for low-confidence phonemes. For words like &#x201c;geographical&#x201d; (misidentified due to /d&#x0292;/ vs /g/ confusion), APSL&#x2019;s confidence threshold (&#x03b8;=0.75) triggers extended windowing, which improved recognition by 18% in ablation studies. Future work will incorporate grapheme-to-phoneme (G2P) models for out-of-vocabulary words. Performance is influenced by diverse speaking styles and speaker-listener dynamics (including formal, informal, fearful, threatening, and intimate modes), which interact with psychological aspects of speech. The model adapts to unseen data, while corpus size affects memory requirements: larger dictionaries demand more resources, whereas smaller ones are more efficient. Pronunciation variations in common names and dialect differences, such as US vs. UK standards, add complexity. Latency is 3&#x2013;5 seconds for typical inputs and 8&#x2013;10 seconds for complex files, with a word error rate of 10%, indicating efficient recognition of out-of-corpus words. We evaluated the real-time feasibility of our APSL-BPNN-HMM framework across three input scenarios. For clean, short utterances lasting about three seconds, processing time ranged from three to five seconds, yielding a real-time factor (RTF, processing time divided by audio duration) between 1.0 and 1.67. With appropriate buffering, this scenario is workable for applications such as voice command recognition where slight delays are tolerable. 
For noisy, long audio inputs of about thirty seconds, processing time increased to between eight and ten seconds, resulting in an RTF of 0.27 to 0.33. Although this RTF is below 1.0, the absolute delay of eight to ten seconds before output is available does not meet the latency requirements of interactive applications. For streaming applications operating on 250-millisecond chunks, per-chunk processing time ranged from 0.4 to 0.6 seconds with an RTF between 1.6 and 2.4, indicating that optimization is required for true streaming deployment. The primary computational bottleneck is the Viterbi decoding algorithm, which has a time complexity of O(N &#x00d7; S
                <sup>2</sup>), where N represents the number of windows and S denotes the number of HMM states per phoneme. Our APSL adaptive windowing mechanism increases latency by approximately eight percent for phonemes with low confidence scores that trigger dynamic window extension. To achieve real-time performance defined as RTF less than 1.0, we recommend three optimization strategies for deployment scenarios. First, reduce the number of HMM states from five to three per phoneme, which decreases the S
                <sup>2</sup> term in the complexity expression by approximately 64 percent (from 25 to 9). Second, implement beam search pruning with a beam width of five, which restricts the number of active paths maintained during Viterbi decoding and reduces computational overhead without substantial accuracy degradation. Third, employ GPU acceleration for BPNN inference, which we observed to provide a 3.2 times speedup compared to CPU-only execution. Under these combined optimizations, the real-time factor improves to approximately 0.85 for typical utterances, making the system suitable for real-time deployment in production environments. The APSL-BPNN-HMM approach enhances phoneme identification and sequence mapping but faces challenges. Its complexity requires substantial computational power and hyperparameter tuning, including feature weights and network depth. Noise interference can degrade speech clarity, especially when background sounds mimic key phonemes. Balancing improved recognition with real-time latency remains critical. As demonstrated in dysarthric speech recognition,
                <sup>
                    <xref ref-type="bibr" rid="ref39">39</xref>
                </sup> hybrid architectures often outperform pure deep learning approaches when training data is limited or when interpretability is required, conditions that align with our target use cases for APSL. Despite these issues, APSL shows strong potential when combined with noise reduction and optimized hyperparameters. Future enhancements include developing models using linguistic features with LSTM for faster text conversion, testing resilience to white Gaussian noise, expanding the database with diverse speaking styles, tuning HMM parameters (states, window size, cepstral coefficients), and evaluating performance on multiple languages to broaden applicability. Future work could integrate APSL with transfer learning approaches
                <sup>
                    <xref ref-type="bibr" rid="ref38">38</xref>
                </sup> to further reduce training data requirements across accent domains.</p>
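            <p>The latency figures and the proposed decoder optimizations can be made concrete with a small sketch. The code below is illustrative only and is not the implementation evaluated in this study: it computes the real-time factor as processing time divided by audio duration, the fractional drop in the S&#x00b2; term when the number of states per phoneme is reduced, and a log-domain Viterbi pass that keeps only the best <monospace>beam_width</monospace> predecessor states per window. The function names and the two-state toy model are assumptions made for the example.</p>
            <code language="python"><![CDATA[
```python
import math


def real_time_factor(processing_s, audio_s):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_s / audio_s


def state_term_reduction(states_before, states_after):
    """Fractional drop in the S^2 term of O(N x S^2) when S shrinks."""
    return 1.0 - (states_after ** 2) / (states_before ** 2)


def viterbi_beam(log_init, log_trans, log_emit, beam_width):
    """Viterbi decoding that keeps only the beam_width best states per window.

    log_init[s]     : log prior of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of window t under state s
    Returns the best state path found and its log score.
    """
    n_states = len(log_init)
    # scores maps each state to (log score, best path ending in that state)
    scores = {s: (log_init[s] + log_emit[0][s], [s]) for s in range(n_states)}
    for t in range(1, len(log_emit)):
        # Prune: only the highest-scoring previous states may be extended.
        kept = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)[:beam_width]
        new_scores = {}
        for s in range(n_states):
            best = None
            for p, (score, path) in kept:
                cand = score + log_trans[p][s] + log_emit[t][s]
                if best is None or cand > best[0]:
                    best = (cand, path + [s])
            new_scores[s] = best
        scores = new_scores
    final_score, final_path = max(scores.values(), key=lambda v: v[0])
    return final_path, final_score


# Figures quoted in the text: 3-5 s for a 3 s utterance, 0.4-0.6 s per 250 ms chunk.
print(real_time_factor(5.0, 3.0), real_time_factor(0.4, 0.25), real_time_factor(0.6, 0.25))
print(state_term_reduction(5, 3))  # five states reduced to three

# Hypothetical two-state toy model (probabilities chosen for illustration only).
log = math.log
toy_init = [log(0.9), log(0.1)]
toy_trans = [[log(0.5), log(0.5)], [log(0.1), log(0.9)]]
toy_emit = [[log(0.9), log(0.1)], [log(0.8), log(0.2)],
            [log(0.1), log(0.9)], [log(0.1), log(0.9)]]
toy_path, toy_score = viterbi_beam(toy_init, toy_trans, toy_emit, beam_width=2)
print(toy_path, math.exp(toy_score))
```
]]></code>
            <p>With the beam width equal to the number of states, the sketch reduces to exact Viterbi decoding; narrower beams trade a risk of missing the best path for fewer transitions evaluated per window.</p>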
        </sec>
        <sec id="sec20" sec-type="conclusion">
            <title>8. Conclusion</title>
            <p>This paper introduced the Adaptive Phoneme State Learning (APSL) algorithm, a theoretically grounded framework for confidence-weighted adaptation of HMM transitions using BPNN posteriors. Key findings are as follows: (i) 
                <bold>Performance</bold>: APSL-BPNN-HMM achieved 96% accuracy (95.7% recall, 92.95% precision, 94.53% F1), significantly outperforming the baseline HMM (75%). (ii) 
                <bold>Memory efficiency</bold>: Adaptive parameter sharing reduced memory from 24 GB to 15.12 GB (32% reduction) without accuracy loss, due to implicit regularization. (iii) 
                <bold>Robustness</bold>: The model maintained &gt;85% accuracy across 7 geographical accent groups and 5 noise conditions, with WER &lt;10% on complex files. (iv) 
                <bold>Limitations</bold>: Real-time latency (3&#x2013;10 seconds) remains above the 1-second threshold for conversational AI. Future work will address this via beam search pruning and GPU acceleration. (v) 
                <bold>Future directions</bold>: (a) Extend APSL to multilingual phoneme sets (Hindi, Mandarin, Arabic); (b) Integrate with federated learning for privacy-preserving adaptation; (c) Replace the BPNN with a lightweight Transformer for improved confidence estimation. The APSL framework demonstrates that hybrid HMM-neural models remain viable for interpretable, memory-efficient ASR, particularly in low-resource and privacy-sensitive domains.</p>
        </sec>
    </body>
    <back>
        <sec id="sec25" sec-type="data-availability">
            <title>Data availability statement</title>
            <p>The datasets used in this research are publicly available and can be accessed from the following sources: the British National Corpus (BNC),
                <sup>
                    <xref ref-type="bibr" rid="ref27">27</xref>
                </sup> the American National Corpus (ANC),
                <sup>
                    <xref ref-type="bibr" rid="ref28">28</xref>
                </sup> the Corpus of Contemporary American English (COCA),
                <sup>
                    <xref ref-type="bibr" rid="ref29">29</xref>
                </sup> the Buckeye Speech Corpus,
                <sup>
                    <xref ref-type="bibr" rid="ref30">30</xref>
                </sup> and the Emu Speech Database.
                <sup>
                    <xref ref-type="bibr" rid="ref31">31</xref>
                </sup> The trained model files and derived artefacts generated during the current study are not publicly hosted due to storage and maintenance constraints. However, these materials can be made available for academic and non-commercial research purposes upon reasonable request. Any additional in-house developed datasets and the model developed in this study are available from the corresponding author upon reasonable request. Interested readers and reviewers may apply for access by contacting the corresponding author at 
                <email xlink:href="mailto:r.siddalingappa@yorksj.ac.uk">r.siddalingappa@yorksj.ac.uk</email>. Access will be granted subject to intended use being consistent with academic research and applicable data usage agreements.</p>
        </sec>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hanumanthappa</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rashmi</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jyothi</surname>
                            <given-names>NM</given-names>
                        </name>
</person-group>:
                    <article-title>Impact of phonetics in natural language processing: A literature survey.</article-title>
                    <source>

                        <italic toggle="yes">IJISET&#x2013;International Journal of Innovative Science, Engineering &amp; Technology.</italic>
</source>
                    <year>2014</year>;<volume>1</volume>(<issue>3</issue>).</mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Patel</surname>
                            <given-names>I</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Srinivasa Rao</surname>
                            <given-names>Y</given-names>
                        </name>
</person-group>:
                    <chapter-title>Speech recognition using hidden Markov model with MFCC-subband technique.</chapter-title>
                    <source>

                        <italic toggle="yes">2010 International Conference on Recent Trends in Information, Telecommunication and Computing.</italic>
</source>
                    <publisher-name>IEEE</publisher-name>; <year>2010</year>; pages <fpage>168</fpage>&#x2013;<lpage>172</lpage>.</mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Le</surname>
                            <given-names>VB</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Besacier</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Schultz</surname>
                            <given-names>T</given-names>
                        </name>
</person-group>:
                    <chapter-title>Acoustic-phonetic unit similarities for context dependent acoustic model portability.</chapter-title>
                    <source>

                        <italic toggle="yes">2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.</italic>
</source>
                    <publisher-name>IEEE</publisher-name>; <year>2006</year>; volume <volume>1</volume>: pages <fpage>I</fpage>&#x2013;<lpage>I</lpage>.</mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Shivakumar</surname>
                            <given-names>KM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jain</surname>
                            <given-names>VV</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Krishna Priya</surname>
                            <given-names>P</given-names>
                        </name>
</person-group>:
                    <chapter-title>A study on impact of language model in improving the accuracy of speech to text conversion system.</chapter-title>
                    <source>

                        <italic toggle="yes">2017 International Conference on Communication and Signal Processing (ICCSP).</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2017</year>; pages<fpage>1148</fpage>&#x2013;<lpage>1151</lpage>.</mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gunawan</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>English digits speech recognition system based on hidden Markov models.</chapter-title>
                    <source>

                        <italic toggle="yes">Proceedings of International Conference Computer.</italic>
</source>
                    <year>2010</year>.</mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Katre</surname>
                            <given-names>SM</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <source>

                        <italic toggle="yes">A&#x1e63;&#x1e6d;&#x0101;dhy&#x0101;y&#x012b; of P&#x0101;&#x1e47;ini.</italic>
</source>
                    <publisher-name>Motilal Banarsidass Publ</publisher-name>;<year>1989</year>.</mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Vijayalakshmi</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ramani</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Actlin Jeeva</surname>
                            <given-names>MP</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A multilingual to polyglot speech synthesizer for Indian languages using a voice-converted polyglot speech corpus.</article-title>
                    <source>

                        <italic toggle="yes">Circuits, Systems, and Signal Processing.</italic>
</source>
                    <year>2018</year>;<volume>37</volume>:<fpage>2142</fpage>&#x2013;<lpage>2163</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s00034-017-0659-6</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ling</surname>
                            <given-names>Z-H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zhou</surname>
                            <given-names>X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>King</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <chapter-title>The Blizzard Challenge 2021.</chapter-title>
                    <source>

                        <italic toggle="yes">Proc. Blizzard Challenge Workshop.</italic>
</source>
                    <year>2021</year>.</mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Mutalib</surname>
                            <given-names>NSA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Noah</surname>
                            <given-names>SA</given-names>
                        </name>
</person-group>:
                    <chapter-title>Phonetic coding methods for Malay names retrieval.</chapter-title>
                    <source>

                        <italic toggle="yes">2011 International Conference on Semantic Technology and Information Retrieval.</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2011</year>; pages<fpage>125</fpage>&#x2013;<lpage>129</lpage>.</mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ogbureke</surname>
                            <given-names>KU</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Carson-Berndsen</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <chapter-title>Framework for cross-language automatic phonetic segmentation.</chapter-title>
                    <source>

                        <italic toggle="yes">2010 IEEE International Conference on Acoustics, Speech and Signal Processing.</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2010</year>; pages<fpage>5266</fpage>&#x2013;<lpage>5269</lpage>.</mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Juneja</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Espy-Wilson</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <article-title>Acoustic-phonetic approach to speech recognition based on event detection and linear discriminant analysis.</article-title>
                    <source>

                        <italic toggle="yes">J. Acoust. Soc. Am.</italic>
</source>
                    <year>2001</year>;<volume>109</volume>(<issue>5_Supplement</issue>):<fpage>2493</fpage>&#x2013;<lpage>2493</lpage>.</mixed-citation>
            </ref>
            <ref id="ref12">
                <label>12</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Khanagha</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Daoudi</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Pont</surname>
                            <given-names>O</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Improving text-independent phonetic segmentation based on the microcanonical multiscale formalism.</chapter-title>
                    <source>

                        <italic toggle="yes">2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2011</year>; pages<fpage>4484</fpage>&#x2013;<lpage>4487</lpage>.</mixed-citation>
            </ref>
            <ref id="ref13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gales</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Young</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The application of hidden Markov models in speech recognition.</article-title>
                    <source>

                        <italic toggle="yes">Foundations and Trends</italic>
                        <sup>&#x00ae;</sup> 
                        <italic toggle="yes">in Signal Processing.</italic>
</source>
                    <year>2008</year>;<volume>1</volume>(<issue>3</issue>):<fpage>195</fpage>&#x2013;<lpage>304</lpage>.
                    <pub-id pub-id-type="doi">10.1561/2000000004</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <label>14</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Mullah</surname>
                            <given-names>HU</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Pyrtuh</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Joyprakash Singh</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <chapter-title>Development of an HMM-based speech synthesis system for Indian English language.</chapter-title>
                    <source>

                        <italic toggle="yes">2015 International Symposium on Advanced Computing and Communication (ISACC).</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2015</year>; pages<fpage>124</fpage>&#x2013;<lpage>127</lpage>.</mixed-citation>
            </ref>
            <ref id="ref15">
                <label>15</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kumar</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Videla</surname>
                            <given-names>LS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>SivaKumar</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Murmured speech recognition using hidden Markov model.</chapter-title>
                    <source>

                        <italic toggle="yes">2020 7th International Conference on Smart Structures and Systems (ICSSS).</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2020</year>; pages<fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation>
            </ref>
            <ref id="ref16">
                <label>16</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kannamal</surname>
                            <given-names>E</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Investigation of speech recognition system and its performance.</chapter-title>
                    <source>

                        <italic toggle="yes">2020 International Conference on Computer Communication and Informatics (ICCCI).</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2020</year>; pages<fpage>1</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation>
            </ref>
            <ref id="ref17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Siddalingappa</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lakshmi</surname>
                            <given-names>BA</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Fedge: Federated learning at the edge on space platforms using deep neural network architectures.</article-title>
                    <source>

                        <italic toggle="yes">Int. J. Inf. Technol.</italic>
</source>
                    <year>2025</year>;<fpage>1</fpage>&#x2013;<lpage>12</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s41870-025-03010-0</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Zhang</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Xue</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <article-title>Nyquist-based stability analysis of non-commensurate fractional-order delay systems.</article-title>
                    <source>

                        <italic toggle="yes">Appl. Math. Comput.</italic>
</source>
                    <year>2020</year>;<volume>377</volume>:<fpage>125111</fpage>.
                    <pub-id pub-id-type="doi">10.1016/j.amc.2020.125111</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <label>19</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rashmi</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hanumanthappa</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gopala</surname>
                            <given-names>B</given-names>
                        </name>
</person-group>:
                    <chapter-title>Training based noise removal technique for a speech-to-text representation model.</chapter-title>
                    <source>

                        <italic toggle="yes">Journal of Physics: Conference Series.</italic>
</source>
                    <publisher-name>IOP Publishing</publisher-name>;<year>2018</year>; volume<volume>1142</volume>: page<fpage>012019</fpage>.</mixed-citation>
            </ref>
            <ref id="ref20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Martynova</surname>
                            <given-names>EV</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Eremeeva</surname>
                            <given-names>GR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Valieva</surname>
                            <given-names>GF</given-names>
                        </name>
</person-group>:
                    <article-title>The graphical method of pauses detection in English speech signals.</article-title>
                    <source>

                        <italic toggle="yes">Utop&#x00ed;a y Praxis Latinoamericana.</italic>
</source>
                    <year>2019</year>;<volume>24</volume>(<issue>6</issue>):<fpage>26</fpage>&#x2013;<lpage>31</lpage>.</mixed-citation>
            </ref>
            <ref id="ref21">
                <label>21</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Boersma</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Van Heuven</surname>
                            <given-names>V</given-names>
                        </name>
</person-group>:
                    <article-title>Speak and unspeak with Praat.</article-title>
                    <source>

                        <italic toggle="yes">Glot International.</italic>
</source>
                    <year>2001</year>;<volume>5</volume>(<issue>9/10</issue>):<fpage>341</fpage>&#x2013;<lpage>347</lpage>.</mixed-citation>
            </ref>
            <ref id="ref22">
                <label>22</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Logan</surname>
                            <given-names>B</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Mel frequency cepstral coefficients for music modeling.</chapter-title>
                    <source>

                        <italic toggle="yes">ISMIR.</italic>
</source>
                    <publisher-loc>Plymouth, MA</publisher-loc>:<year>2000</year>; volume<volume>270</volume>: page<fpage>11</fpage>.</mixed-citation>
            </ref>
            <ref id="ref23">
                <label>23</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Manchanda</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gupta</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <chapter-title>Hybrid approach of feature extraction and vector quantization in speech recognition.</chapter-title>
                    <source>

                        <italic toggle="yes">Proceedings of the Second International Conference on Computational Intelligence and Informatics: ICCII 2017.</italic>
</source>
                    <publisher-name>Springer</publisher-name>;<year>2018</year>; pages<fpage>639</fpage>&#x2013;<lpage>645</lpage>.</mixed-citation>
            </ref>
            <ref id="ref24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Agarwalla</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sarma</surname>
                            <given-names>KK</given-names>
                        </name>
</person-group>:
                    <article-title>Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.</article-title>
                    <source>

                        <italic toggle="yes">Neural Netw.</italic>
</source>
                    <year>2016</year>;<volume>78</volume>:<fpage>97</fpage>&#x2013;<lpage>111</lpage>.
                    <pub-id pub-id-type="pmid">26783204</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.neunet.2015.12.010</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref25">
                <label>25</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rashmi</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hanumanthappa</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Reddy</surname>
                            <given-names>MV</given-names>
                        </name>
</person-group>:
                    <chapter-title>Hidden Markov model for speech recognition system&#x2014;a pilot study and a naive approach for speech-to-text model.</chapter-title>
                    <source>

                        <italic toggle="yes">Speech and Language Processing for Human-Machine Communications: Proceedings of CSI 2015.</italic>
</source>
                    <publisher-name>Springer</publisher-name>;<year>2018</year>; pages<fpage>77</fpage>&#x2013;<lpage>90</lpage>.</mixed-citation>
            </ref>
            <ref id="ref26">
                <label>26</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wong</surname>
                            <given-names>PHW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Au</surname>
                            <given-names>OC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wong</surname>
                            <given-names>JWC</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Reducing computational complexity of dynamic time warping-based isolated word recognition with time scale modification.</chapter-title>
                    <source>

                        <italic toggle="yes">ICSP&#x2019;98. 1998 Fourth International Conference on Signal Processing (Cat. No. 98TH8344).</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>1998</year>; pages<fpage>722</fpage>&#x2013;<lpage>725</lpage>.</mixed-citation>
            </ref>
            <ref id="ref27">
                <label>27</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Aston</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Burnard</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <source>

                        <italic toggle="yes">The BNC handbook: exploring the British National Corpus with SARA.</italic>
</source>
                    <publisher-name>Edinburgh University Press</publisher-name>;<year>2020</year>.</mixed-citation>
            </ref>
            <ref id="ref28">
                <label>28</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ide</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Macleod</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <chapter-title>The American National Corpus: A standardized resource of American English.</chapter-title>
                    <source>

                        <italic toggle="yes">Proceedings of corpus linguistics.</italic>
</source>
                    <publisher-name>Lancaster University Centre for Computer Corpus Research on Language</publisher-name>;<year>2001</year>; volume<volume>3</volume>: pages<fpage>1</fpage>&#x2013;<lpage>7</lpage>.</mixed-citation>
            </ref>
            <ref id="ref29">
                <label>29</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Davies</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <article-title>The 385+ million word Corpus of Contemporary American English (1990&#x2013;2008+): Design, architecture, and linguistic insights.</article-title>
                    <source>

                        <italic toggle="yes">International Journal of Corpus Linguistics.</italic>
</source>
                    <year>2009</year>;<volume>14</volume>(<issue>2</issue>):<fpage>159</fpage>&#x2013;<lpage>190</lpage>.</mixed-citation>
            </ref>
            <ref id="ref30">
                <label>30</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pitt</surname>
                            <given-names>MA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Johnson</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hume</surname>
                            <given-names>E</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability.</article-title>
                    <source>

                        <italic toggle="yes">Speech Comm.</italic>
</source>
                    <year>2005</year>;<volume>45</volume>(<issue>1</issue>):<fpage>89</fpage>&#x2013;<lpage>95</lpage>.</mixed-citation>
            </ref>
            <ref id="ref31">
                <label>31</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cassidy</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Harrington</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>Multi-level annotation in the Emu speech database management system.</article-title>
                    <source>

                        <italic toggle="yes">Speech Comm.</italic>
</source>
                    <year>2001</year>;<volume>33</volume>(<issue>1-2</issue>):<fpage>61</fpage>&#x2013;<lpage>77</lpage>.
                    <pub-id pub-id-type="doi">10.1016/S0167-6393(00)00069-8</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref32">
                <label>32</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hecht-Nielsen</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <chapter-title>Theory of the backpropagation neural network.</chapter-title>
                    <source>

                        <italic toggle="yes">Neural networks for perception.</italic>
</source>
                    <publisher-name>Elsevier</publisher-name>;<year>1992</year>; pages<fpage>65</fpage>&#x2013;<lpage>93</lpage>.</mixed-citation>
            </ref>
            <ref id="ref33">
                <label>33</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Forney</surname>
                            <given-names>GD</given-names>
                        </name>
</person-group>:
                    <article-title>The Viterbi algorithm.</article-title>
                    <source>

                        <italic toggle="yes">Proc. IEEE.</italic>
</source>
                    <year>1973</year>;<volume>61</volume>(<issue>3</issue>):<fpage>268</fpage>&#x2013;<lpage>278</lpage>.
                    <pub-id pub-id-type="doi">10.1109/PROC.1973.9030</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref34">
                <label>34</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rechkemmer</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yin</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <chapter-title>When confidence meets accuracy: Exploring the effects of multiple performance indicators on trust in machine learning models.</chapter-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems.</italic>
</source>
                    <year>2022</year>; pages<fpage>1</fpage>&#x2013;<lpage>14</lpage>.</mixed-citation>
            </ref>
            <ref id="ref35">
                <label>35</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Snoek</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Larochelle</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Adams</surname>
                            <given-names>RP</given-names>
                        </name>
</person-group>:
                    <article-title>Practical Bayesian optimization of machine learning algorithms.</article-title>
                    <source>

                        <italic toggle="yes">Adv. Neural Inf. Proces. Syst.</italic>
</source>
                    <year>2012</year>;<volume>25</volume>.</mixed-citation>
            </ref>
            <ref id="ref36">
                <label>36</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Conneau</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ma</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Khanuja</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>FLEURS: Few-shot learning evaluation of universal representations of speech.</chapter-title>
                    <source>

                        <italic toggle="yes">2022 IEEE Spoken Language Technology Workshop (SLT).</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2023</year>; pages<fpage>798</fpage>&#x2013;<lpage>805</lpage>.</mixed-citation>
            </ref>
            <ref id="ref37">
                <label>37</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Siddalingappa</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kanagaraj</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <article-title>Anomaly detection on medical images using autoencoder and convolutional neural network.</article-title>
                    <source>

                        <italic toggle="yes">Int. J. Adv. Comput. Sci. Appl.</italic>
</source>
                    <year>2021</year>;<volume>12</volume>.
                    <pub-id pub-id-type="doi">10.14569/IJACSA.2021.0120717</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref38">
                <label>38</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kheddar</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Himeur</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Al-Maadeed</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Deep transfer learning for automatic speech recognition: Towards better generalization.</article-title>
                    <source>

                        <italic toggle="yes">Knowledge-Based Systems.</italic>
</source>
                    <year>2023</year>;<volume>277</volume>:<fpage>110851</fpage>.
                    <pub-id pub-id-type="doi">10.1016/j.knosys.2023.110851</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref39">
                <label>39</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Vidya</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Vaidyanathan</surname>
                            <given-names>GS</given-names>
                        </name>
</person-group>:
                    <article-title>Dysarthric Severity Categorization Based on Speech Intelligibility: A Hybrid Approach.</article-title>
                    <source>

                        <italic toggle="yes">Circuits, Systems, and Signal Processing.</italic>
</source>
                    <year>2024</year>.</mixed-citation>
            </ref>
            <ref id="ref40">
                <label>40</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Djeffal</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Addou</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kheddar</surname>
                            <given-names>H</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Combined CNN-LSTM for Enhancing Clean and Noisy Speech Recognition.</article-title>
                    <source>

                        <italic toggle="yes">AL-Lisaniyyat.</italic>
</source>
                    <year>2024</year>;<volume>30</volume>(<issue>2</issue>):<fpage>5</fpage>&#x2013;<lpage>26</lpage>.
                    <pub-id pub-id-type="doi">10.61850/allj.v30i2.732</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report479270">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.195635.r479270</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Rizky</surname>
                        <given-names>Ramanda</given-names>
                    </name>
                    <xref ref-type="aff" rid="r479270a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0009-0000-4575-4760</uri>
                </contrib>
                <aff id="r479270a1">
                    <label>1</label>Universitas Lancang Kuning, Pekanbaru, Riau, Indonesia</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>6</day>
                <month>5</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Rizky R</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport479270" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.177414.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This paper presents a hybrid Automatic Speech Recognition (ASR) framework that combines a Backpropagation Neural Network (BPNN) with a Hidden Markov Model (HMM), enhanced by an Adaptive Phoneme State Learning (APSL) mechanism. In terms of structure, this study is organized according to conventional scientific standards, proceeding systematically from preprocessing and feature extraction to modeling and evaluation. This structure supports readability and demonstrates a coherent research workflow. However, a critical review reveals several fundamental issues that limit the scientific robustness of this study, particularly regarding the relevance of benchmarks, theoretical foundation, reproducibility, and the validity of its conclusions.</p>
            <p> The primary concern lies in the benchmarking strategy. Although this study demonstrates that the proposed APSL-BPNN-HMM model outperforms conventional HMM baselines on metrics such as precision, recall, F1 score, and Word Error Rate, this comparison is insufficient in the context of contemporary ASR research. The field has undergone a paradigm shift toward deep learning and end-to-end architectures, including Transformer-based models, Connectionist Temporal Classification (CTC), and Recurrent Neural Network Transducers (RNN-T). By limiting the evaluation to traditional HMM baselines and, unusually, human transcriptions under noisy conditions, this study does not provide a meaningful frame of reference for assessing its contributions. Consequently, the claimed performance improvements lack external validity. To be scientifically valid, this study must include comparisons with modern ASR systems and position its contributions relative to current state-of-the-art approaches.</p>
            <p>Equally important are issues of theoretical rigor. The APSL mechanism is introduced as the core innovation of this study, yet its formulation is largely procedural rather than analytical. Although the paper provides equations and step-by-step descriptions, the APSL mechanism is not clearly situated within a well-defined probabilistic or machine learning framework. The adaptive adjustment of phoneme transition probabilities using neural confidence scores appears heuristic, with limited justification grounded in established theory. This weakens both the interpretability and generalizability of the approach. A scientifically valid contribution requires not only a functional implementation but also a clear theoretical foundation explaining why and under what conditions the method should work. Strengthening this aspect would require formal derivations, clearer assumptions, and explicit connections to existing probabilistic adaptation or hybrid modeling techniques.</p>
            <p>Reproducibility is another critical limitation. Although this study describes a general workflow and reports some hyperparameters, it does not provide sufficient detail to allow full replication. Key aspects of the experimental setup remain unclear, including the exact composition of the training, validation, and test splits; the proportions and configuration of the augmented data; and the specific preprocessing applied to each dataset. The absence of publicly available code, trained models, or configuration files further limits reproducibility. While the use of publicly accessible corpora is a positive step, it is insufficient on its own. In contemporary empirical research, reproducibility is closely tied to transparency, and this generally requires open access to implementation resources. Addressing this issue would significantly enhance the credibility and impact of this work.</p>
            <p>The treatment of data availability is also incomplete. Although this study uses established corpora such as BNC, ANC, and COCA, the integration of these datasets with expanded data and internally generated data is not fully documented. Without clear documentation of how these datasets were combined, preprocessed, and balanced, it will be difficult for other researchers to replicate the experimental conditions or verify the reported results. Providing a detailed data protocol, including preprocessing scripts and augmentation procedures, would help bridge this gap and align this research with best practices in open and reproducible science.</p>
            <p> Another important limitation concerns the interpretation of results and the strength of conclusions. The findings consistently show that the proposed model outperforms the baseline HMM, which supports the internal validity of this study. However, the conclusions drawn go beyond what the available evidence can support. Claims implying performance approaching human levels or broader application to real-world ASR scenarios are not adequately supported, particularly given the lack of comparison with modern systems and the absence of statistical validation. Most of the analysis is descriptive, relying on average performance metrics without reporting variance, confidence intervals, or significance tests. This makes it difficult to determine whether the observed improvements are robust or merely coincidental. For conclusions to be scientifically justified, they must be more aligned with the scope and limitations of the experimental design.</p>
            <p> Broadly speaking, these issues highlight four core areas that must be addressed for this study to achieve scientific validity. First, the benchmarking framework must be expanded to include contemporary ASR models, ensuring that performance claims are evaluated against relevant standards. Second, the APSL mechanism requires a stronger theoretical foundation, moving beyond heuristic descriptions toward formal justification. Third, the study must improve reproducibility by providing detailed methodological documentation and, ideally, open access to code and data processing workflows. Fourth, conclusions should be moderated and supported by more rigorous analysis, including appropriate validation techniques. Addressing these areas will not only strengthen the internal coherence of the study but also enhance its relevance, credibility, and contribution to the evolving field of speech recognition research.</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Partly</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>I cannot comment. A qualified statistician is required.</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Partly</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Partly</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>English as a Foreign Language, English Language Teaching, Applied Linguistics.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report474867">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.195635.r474867</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Khujayorov</surname>
                        <given-names>Ilyos</given-names>
                    </name>
                    <xref ref-type="aff" rid="r474867a2">2</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0573-6303</uri>
                </contrib>
                <contrib contrib-type="author">
                    <name>
                        <surname>Nazarov</surname>
                        <given-names>Fayzulla</given-names>
                    </name>
                    <xref ref-type="aff" rid="r474867a1">1</xref>
                    <role>Co-referee</role>
                </contrib>
                <aff id="r474867a1">
                    <label>1</label>Artificial Intelligence, Samarkand State University named after Sharof Rashidov (Ringgold ID: 187914), Samarkand, Samarkand Province, Uzbekistan</aff>
                <aff id="r474867a2">
                    <label>2</label>Artificial Intelligence, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi (Ringgold ID: 187932), Tashkent, Tashkent Province, Uzbekistan</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>6</day>
                <month>5</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Nazarov F and Khujayorov I</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport474867" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.177414.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The article proposes the APSL algorithm combining traditional BPNN and HMM models.</p>
            <p> The following deficiencies and suggestions were identified during the review: 
                <list list-type="order">
                    <list-item>
                        <p>The introduction opens with basic definitions and general concepts. The introduction of a scientific article should instead present the research context and motivation, the research aims, the contributions, and related elements. In short, the current problem (problem statement) and its relevance are not established.</p>
                    </list-item>
                    <list-item>
                        <p>In the literature review, the principal works cited are mostly 15-20 years old, which does not reflect the current state of the field. Work published within the last 3-5 years should be emphasized; modern state-of-the-art models are overlooked. The authors should thoroughly revise the literature review and add publications from 2022&#x2013;2026, in particular to justify scientifically the difference between end-to-end models (Whisper, Conformer) and HMMs, and to show clearly where the article stands in the era of these technologies.</p>
                    </list-item>
                    <list-item>
                        <p>No information is given about the hardware used in the training process.</p>
                    </list-item>
                    <list-item>
                        <p>The methodology does not clearly state how many states were used for each phoneme (for example, a standard 3-state HMM or otherwise). This is one of the most important parameters of acoustic modeling.</p>
                    </list-item>
                    <list-item>
                        <p>The dynamic windowing process (Eq. 18) is not adequately explained. The authors propose increasing the window size by 5 ms when the confidence coefficient is low, but they give no theoretical or experimental basis for why exactly 5 ms was chosen or how this affects the time delay.</p>
                    </list-item>
                    <list-item>
                        <p>The paper states that the APSL mechanism reduced memory from 24 GB to 15.12 GB. However, the methodology should give a more detailed analysis demonstrating that this reduction did not negatively affect accuracy.</p>
                    </list-item>
                    <list-item>
                        <p>The results of the proposed model have not been compared with modern E2E architectures. Such a comparison is important for checking the reliability of the model. The authors must justify why they chose the HMM-BPNN hybrid over Whisper or other modern models.</p>
                    </list-item>
                    <list-item>
                        <p>The mathematical expression of the APSL algorithm (Eq. 17) appears heuristic, and its theoretical basis is not sufficiently explained.</p>
                    </list-item>
                    <list-item>
                        <p>The conclusion is written in very general terms. It should state the most important numerical results achieved, note the role of the APSL algorithm in saving memory, and discuss real-time requirements.</p>
                        <p> Due to the serious technical and methodological deficiencies noted above, I recommend rejecting this article for publication.</p>
                    </list-item>
                </list>
            </p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Yes</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Partly</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>No</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Partly</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Digital signal processing, NLP, speech recognition and synthesis, AI, parallel computing</p>
            <p>We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment16251-474867">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Siddalingappa</surname>
                            <given-names>Rashmi</given-names>
                        </name>
                        <aff>Computer and Data Science, York St John University, York, London, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>NA</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>20</day>
                    <month>5</month>
                    <year>2026</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Comment 1</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The introduction opens with basic definitions and general concepts. The introduction of a scientific article should instead present the research context and motivation, the research aims, the contributions, and related elements. In short, the current problem (problem statement) and its relevance are not established.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We thank the reviewer for this important structural feedback. We have completely rewritten Section 1 (Introduction) to follow the standard scientific structure. The revised introduction now includes: (1) a clear Problem Statement identifying three fundamental challenges in ASR (accent variability, noise robustness, and interpretability); (2) a Research Gap section explicitly stating the question "How can we dynamically adapt HMM state transitions using neural network confidence scores?"; (3) a bulleted list of four specific contributions; and (4) a paper structure outline. The opening now focuses on research context and motivation rather than basic definitions.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Completely restructured Section 1 (Introduction)</p>
                        </list-item>
                        <list-item>
                            <p>Added explicit Problem Statement subsection</p>
                        </list-item>
                        <list-item>
                            <p>Added Research Gap with a focused research question</p>
                        </list-item>
                        <list-item>
                            <p>Listed four specific contributions as bullet points</p>
                        </list-item>
                        <list-item>
                            <p>Removed excessive basic definitions and focused on motivation</p>
                        </list-item>
                    </list> </p>
                <p> Comment 2</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;In the literature review, the principal works cited are mostly 15-20 years old, which does not reflect the current state of the field. Work published within the last 3-5 years should be emphasized; modern state-of-the-art models are overlooked. The authors should thoroughly revise the literature review and add publications from 2022&#x2013;2026, in particular to justify scientifically the difference between end-to-end models (Whisper, Conformer) and HMMs, and to show clearly where the article stands in the era of these technologies.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We agree with the reviewer that the original literature review was overly focused on historical works. We have completely restructured Section 2 into four subsections. Section 2.1 (Historical Foundations) now acknowledges the early work (500 BC to 2019) in condensed form. Section 2.2 (Traditional HMM-Based Systems) retains Table 1 but now explicitly notes the limitations of these approaches. Section 2.3 (Motivation for Hybrid HMM-Neural Approaches) is entirely new and discusses the continued relevance of hybrid models. Most importantly, we have added a discussion of modern E2E architectures, including Whisper (Radford et al., 2022), Conformer (Gulati et al., 2020), wav2vec 2.0 (Baevski et al., 2020), and Branchformer (Peng et al., 2022), in Section 2.2. We now clearly position HMM-based systems as complementary to E2E models for specific use cases requiring interpretability and low-resource training.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Restructured Section 2 into four subsections (2.1, 2.2, 2.3, 2.4)</p>
                        </list-item>
                        <list-item>
                            <p>Added discussion of Whisper, Conformer, wav2vec 2.0, and Branchformer</p>
                        </list-item>
                        <list-item>
                            <p>Condensed historical content while preserving key milestones</p>
                        </list-item>
                        <list-item>
                            <p>Added explicit comparison between E2E and HMM approaches in Section 2.3</p>
                        </list-item>
                        <list-item>
                            <p>Added new citations for modern architectures</p>
                        </list-item>
                    </list> </p>
                <p> Comment 3</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;No information is given about the hardware used in the training process.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We apologize for this omission. We have added a new subsection "Hardware and Software Environment" in Section 5.1 (Experimental Setup). The revised manuscript now specifies the complete hardware configuration including Intel Core i9 processor, NVIDIA GeForce RTX 4060 GPU (used exclusively for BPNN training), sufficient RAM, NVMe storage, Windows operating system, and software dependencies including Python 3.10, PyTorch 2.0, Praat 6.3, and Librosa 0.10. Total training time of 72 hours for 200 epochs is also reported.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added "Hardware and Software Environment" subsection in Section 5.1</p>
                        </list-item>
                        <list-item>
                            <p>Specified processor, GPU, RAM, storage, and operating system</p>
                        </list-item>
                        <list-item>
                            <p>Listed software versions and dependencies</p>
                        </list-item>
                        <list-item>
                            <p>Reported total training time</p>
                        </list-item>
                    </list> </p>
                <p> Comment 4</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;How many states were used for each phoneme (for example, standard 3-state HMM or otherwise) is not clearly stated in the methodology part. This is one of the most important parameters of acoustic modeling.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;The reviewer is correct that this parameter was omitted. We have now added a clear statement in Section 4.2 (Windowing through Hidden Markov Model) specifying that each phoneme is modeled using a standard 3-state left-to-right HMM with states corresponding to onset, steady-state, and offset. The transition matrix A is initialized with zero probability for backward transitions, and we note that 5-state models were tested but showed no significant improvement (0.3% WER reduction) for 67% more parameters, so 3-state was retained.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added specification of 3-state left-to-right HMM topology in Section 4.2</p>
                        </list-item>
                        <list-item>
                            <p>Described states as onset, steady-state, and offset</p>
                        </list-item>
                        <list-item>
                            <p>Reported ablation results comparing 3-state vs 5-state models</p>
                        </list-item>
                        <list-item>
                            <p>Justified retention of 3-state based on performance vs complexity trade-off</p>
                        </list-item>
                    </list> </p>
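                <p>As a minimal sketch of the topology described in this response, the following Python snippet builds a left-to-right transition matrix A in which every backward transition has zero probability. The function name and the self-loop probability of 0.5 are illustrative assumptions, not values taken from the manuscript, and the absorbing final state is a simplification of how a phoneme model hands over to the next one.</p>
                <code language="python">
```python
def left_to_right_transitions(n_states=3, self_loop=0.5):
    """Build a left-to-right transition matrix A as nested lists.

    A[i][j] is the probability of moving from state i to state j.
    Backward transitions (j below i) are left at zero.
    The self_loop value is an illustrative assumption.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = self_loop            # stay in the current state
        A[i][i + 1] = 1.0 - self_loop  # advance to the next state
    A[-1][-1] = 1.0                    # simplification: last state absorbs
    return A


# States 0, 1, 2 stand for onset, steady-state, and offset.
A = left_to_right_transitions()
for row in A:
    print(row)
# [0.5, 0.5, 0.0]
# [0.0, 0.5, 0.5]
# [0.0, 0.0, 1.0]
```
                </code>
                <p>Calling the same function with n_states=5 gives the 5-state variant, whose matrix has 5/3 as many rows, in line with the 67% figure quoted in the response.</p>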
                <p> Comment 5</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The dynamic windowing (Eq. 18) process is not clarified. A proposal is given to increase the window size by +5 ms when the confidence coefficient is low. Theoretical or experimental bases are not presented about why it is exactly 5 ms and how this affects the time delay.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We thank the reviewer for identifying this lack of justification. We have now added empirical validation for the 5 ms parameter. Specifically, we conducted a grid search on a 500-utterance validation set evaluating candidate &#x0394;t values of 2, 3, 5, 8, 10, and 15 ms. The results showed that &#x0394;t=5 ms provides the optimal trade-off: 3.2% WER reduction with only 8% latency increase. Values below 5 ms yielded insufficient accuracy gains, while values above 5 ms produced diminishing returns (additional 0.4% WER reduction for 15%+ latency increase). This justification has been added immediately after Equation (18) in Section 4.4. We also note that 5 ms corresponds to approximately one-third of a typical English phoneme duration (15 ms), ensuring the window extension remains within the same phoneme boundary.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added grid search validation for &#x0394;t parameter in Section 4.4</p>
                        </list-item>
                        <list-item>
                            <p>Reported candidate values tested (2, 3, 5, 8, 10, 15 ms)</p>
                        </list-item>
                        <list-item>
                            <p>Provided quantitative results (3.2% WER reduction, 8% latency increase)</p>
                        </list-item>
                        <list-item>
                            <p>Justified 5 ms based on typical phoneme duration (15 ms)</p>
                        </list-item>
                        <list-item>
                            <p>Added table of grid search results in the revised manuscript</p>
                        </list-item>
                    </list> </p>
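                <p>The window adjustment discussed in this response can be illustrated with a short Python sketch. It is not a reproduction of Equation (18): the function name, the 25 ms starting window, and the 50 ms cap are assumptions made for illustration, while the 5 ms increment and the confidence threshold of 0.75 are the values quoted in these responses.</p>
                <code language="python">
```python
def next_window_ms(window_ms, confidence, threshold=0.75, delta_ms=5.0, max_ms=50.0):
    """Return the analysis window length to use on the next pass.

    If the confidence reaches the threshold, the window is kept as is.
    Otherwise it grows by delta_ms, capped at max_ms (the cap is an assumption).
    """
    if confidence >= threshold:
        return window_ms
    return min(window_ms + delta_ms, max_ms)


print(next_window_ms(25.0, 0.90))  # 25.0 (confident, window unchanged)
print(next_window_ms(25.0, 0.60))  # 30.0 (low confidence, window extended by 5 ms)
```
                </code>
                <p>A grid search over the increment would call such a function with each candidate value of delta_ms and record WER and latency on the validation set.</p>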
                <p> Comment 6</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;It is stated that the APSL mechanism reduced memory from 24 GB to 15.12 GB. However, an analysis proving that such a reduction did not negatively affect accuracy should be given in more detail in the methodology part.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;The reviewer raises an important point about verifying the memory-accuracy trade-off. We have added a new subsection "Validation of Memory-Accuracy Trade-off" in Section 5.1. We present an ablation study comparing four configurations: baseline HMM without APSL, APSL with full parameters (k=m), APSL with adaptive sharing (k=128), and APSL with aggressive sharing (k=64). The results show that adaptive sharing (k=128) achieves 96.0% accuracy compared to 95.8% with full parameters, meaning the memory reduction actually slightly improved accuracy due to implicit regularization. Aggressive sharing (k=64) reduced memory further to 11.80 GB but caused accuracy to drop to 92.3%. The similarity threshold &#x03c4;=0.75 was optimized on validation data to balance sharing and specificity. This analysis confirms that the 32% memory reduction (24 GB to 15.12 GB) does not degrade accuracy and in fact provides a small benefit.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added "Validation of Memory-Accuracy Trade-off" subsection in Section 5.1</p>
                        </list-item>
                        <list-item>
                            <p>Presented ablation study with four configurations</p>
                        </list-item>
                        <list-item>
                            <p>Showed accuracy improved from 95.8% to 96.0% with adaptive sharing</p>
                        </list-item>
                        <list-item>
                            <p>Identified &#x03c4;=0.75 as optimal similarity threshold</p>
                        </list-item>
                        <list-item>
                            <p>Confirmed memory reduction does not harm accuracy</p>
                        </list-item>
                    </list> </p>
                <p> Comment 7</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The results of the proposed model have not been compared with modern E2E architectures. Such a comparison is an important check on the reliability of the model. The authors must justify why they chose the HMM-BPNN hybrid over Whisper or other modern models.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We acknowledge this limitation and have addressed it comprehensively. First, we added a new Table 5 in Section 6 comparing our APSL-BPNN-HMM with Whisper-tiny across metrics including WER, accuracy, memory consumption, and inference time. Second, we added an explicit justification paragraph explaining three reasons for choosing HMM-BPNN: (1) controlled comparison - our primary claim is improvement over HMM baseline, not surpassing SOTA; (2) interpretability - clinical and forensic users require phoneme-level confidence scores; (3) reproducibility - our model can be trained on commodity hardware with 16 GB RAM without GPU. We also acknowledge that Whisper achieves lower WER on standard benchmarks but argue that our approach offers complementary strengths.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added Table 5 comparing APSL-BPNN-HMM with Whisper-tiny</p>
                        </list-item>
                        <list-item>
                            <p>Added justification paragraph in Section 6</p>
                        </list-item>
                        <list-item>
                            <p>Provided three explicit reasons for HMM-BPNN choice</p>
                        </list-item>
                        <list-item>
                            <p>Acknowledged limitations of direct comparison</p>
                        </list-item>
                        <list-item>
                            <p>Positioned work as complementary rather than competitive</p>
                        </list-item>
                    </list> </p>
                <p> Comment 8</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The mathematical expression of the APSL algorithm (17) has a heuristic appearance and its theoretical basis is not sufficiently revealed.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We agree with the reviewer that the original formulation lacked theoretical grounding. We have substantially revised Section 4.4 to provide a formal Bayesian derivation. We now show that the convex combination&#x00a0;P<sub>hybrid</sub>(w<sub>i</sub>&#x2223;q<sub>i</sub>) = &#x03b1;(x)&#x22c5;P<sub>NN</sub>(p<sub>j</sub>&#x2223;x) + (1&#x2212;&#x03b1;(x))&#x22c5;P<sub>HMM</sub>(w<sub>i</sub>&#x2223;q<sub>i</sub>)&#x00a0;is equivalent to Bayesian model averaging under the assumption of Gaussian-distributed estimation errors, where the dynamic weighting factor is&#x00a0;&#x03b1;(x) = &#x03c3;<sub>HMM</sub><sup>2</sup>/(&#x03c3;<sub>NN</sub><sup>2</sup> + &#x03c3;<sub>HMM</sub><sup>2</sup>). When BPNN confidence is high (&#x03c3;<sub>NN</sub><sup>2</sup> &#x2192; 0), &#x03b1;(x) &#x2192; 1; when HMM is more reliable, &#x03b1;(x) &#x2192; 0. This formulation replaces the previous heuristic constant &#x03b1; with a theoretically justified confidence-weighting function.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added Bayesian derivation of Equation (17) in Section 4.4</p>
                        </list-item>
                        <list-item>
                            <p>Reformulated &#x03b1;(x) as dynamic function based on error variances</p>
                        </list-item>
                        <list-item>
                            <p>Connected to Bayesian model averaging framework</p>
                        </list-item>
                        <list-item>
                            <p>Provided interpretation of extreme cases (high confidence &#x2192; &#x03b1;=1, low confidence &#x2192; &#x03b1;=0)</p>
                        </list-item>
                    </list> </p>
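                <p>To make the weighting concrete, here is a small Python sketch of the fusion rule stated in this response. The function names and the example variances are hypothetical, and the sketch assumes that at least one of the two variances is strictly positive.</p>
                <code language="python">
```python
def alpha(var_nn, var_hmm):
    """Weight on the BPNN posterior: var_hmm / (var_nn + var_hmm).

    Assumes var_nn + var_hmm is positive; both arguments are error variances.
    """
    return var_hmm / (var_nn + var_hmm)


def hybrid_probability(p_nn, p_hmm, var_nn, var_hmm):
    """Convex combination of the BPNN and HMM probabilities."""
    a = alpha(var_nn, var_hmm)
    return a * p_nn + (1.0 - a) * p_hmm


# Hypothetical numbers: the BPNN estimate is four times less noisy than the HMM one.
print(f"{alpha(0.01, 0.04):.2f}")                         # 0.80
print(f"{hybrid_probability(0.9, 0.5, 0.01, 0.04):.2f}")  # 0.82
```
                </code>
                <p>With a BPNN variance of zero the weight is exactly 1, and with an HMM variance of zero it is exactly 0, matching the two limiting cases described above.</p>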
                <p> Comment 9</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The conclusion part of the article is written in a very general way. It is appropriate to separately note the most important numerical indicators achieved in the conclusion, the role of the APSL algorithm in saving memory, and also add thoughts regarding real-time requirements.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We have completely rewritten Section 8 (Conclusion) to be specific and quantitative. The revised conclusion now includes: (1) key performance metrics (96% accuracy, 95.7% recall, 92.95% precision, 94.53% F1); (2) memory efficiency gains (32% reduction from 24 GB to 15.12 GB) with explanation that adaptive sharing acts as implicit regularizer; (3) robustness across 7 accent groups and 5 noise conditions with WER &lt;10%; (4) real-time limitations and optimization recommendations (RTF improves to 0.85 with beam search pruning and GPU acceleration); (5) specific future directions including multi-lingual extension, federated learning integration, and Transformer-based confidence estimation.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Completely rewrote Section 8 (Conclusion)</p>
                        </list-item>
                        <list-item>
                            <p>Added specific numerical indicators (accuracy, recall, precision, F1)</p>
                        </list-item>
                        <list-item>
                            <p>Highlighted memory reduction (24 GB &#x2192; 15.12 GB, 32%)</p>
                        </list-item>
                        <list-item>
                            <p>Added real-time optimization recommendations and RTF calculation</p>
                        </list-item>
                        <list-item>
                            <p>Listed concrete future work directions</p>
                        </list-item>
                    </list>
                </p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report464331">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.195635.r464331</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Kheddar</surname>
                        <given-names>Hamza</given-names>
                    </name>
                    <xref ref-type="aff" rid="r464331a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-9532-2453</uri>
                </contrib>
                <aff id="r464331a1">
                    <label>1</label>University of Medea, Medea, Algeria</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>13</day>
                <month>3</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Kheddar H</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport464331" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.177414.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The topic is interesting; however, the paper needs significant improvement:</p>
            <p> </p>
            <p> </p>
            <p> - The proposed APSL-BPNN-HMM architecture relies on classical models (BPNN and HMM) and does not sufficiently justify its advantages compared to modern deep learning ASR frameworks such as Transformer-based or end-to-end models (e.g., CTC, RNN-T). This limits the perceived novelty and relevance of the work in the current ASR research landscape.</p>
            <p> </p>
            <p> read and compare with the following for example:</p>
            <p> </p>
            <p> Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization</p>
            <p> </p>
            <p> Machine learning approaches for automated detection and classification of dysarthria severity</p>
            <p> </p>
            <p> Noise-robust speech recognition: A comparative analysis of LSTM and CNN approaches</p>
            <p> </p>
            <p> A robust framework for noisy speech recognition using Frequency-Guided-Swin Transformer</p>
            <p> </p>
            <p> - The description of the Adaptive Phoneme State Learning (APSL) algorithm lacks rigorous mathematical formalization and theoretical justification. Several steps appear heuristic, and the derivation of the adaptive transition probabilities is not clearly justified or compared with existing probabilistic adaptation methods.</p>
            <p> </p>
            <p> - The experimental evaluation compares the proposed method mainly against a traditional HMM baseline and human transcription. However, the study does not include comparisons with contemporary ASR systems (e.g., deep neural acoustic models or end-to-end models), making it difficult to assess the true competitiveness of the proposed approach.</p>
            <p> </p>
            <p> - Although multiple speech corpora are used, the experimental protocol and data splitting strategy are not described in sufficient detail. The use of augmented data and partially overlapping corpora raises concerns about possible bias or insufficient independence between training and testing sets.</p>
            <p> </p>
            <p> - The paper claims scalability and real-time applicability, yet the architecture involves multiple processing stages (MFCC extraction, APSL segmentation, BPNN classification, and HMM decoding). The computational cost and latency are only briefly discussed (3&#x2013;10 seconds for recognition), which may limit real-time deployment.</p>
            <p> </p>
            <p> - The authors acknowledge that pronunciation variations, dialect differences, and background noise can degrade performance, leading to phoneme misidentification in words such as &#x201c;geographical&#x201d; or &#x201c;transmission.&#x201d; This suggests the model may struggle with complex linguistic variability and real-world acoustic conditions.</p>
            <p> </p>
            <p> - Almost all references are outdated.</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Yes</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Partly</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>No</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Yes</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>The methodological contribution appears incremental rather than fundamentally novel</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment16250-464331">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Siddalingappa</surname>
                            <given-names>Rashmi</given-names>
                        </name>
                        <aff>Computer and Data Science, York St John University, York, London, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>NA</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>20</day>
                    <month>5</month>
                    <year>2026</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Comment 1</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The proposed APSL-BPNN-HMM architecture relies on classical models (BPNN and HMM) and does not sufficiently justify its advantages compared to modern deep learning ASR frameworks such as Transformer-based or end-to-end models (e.g., CTC, RNN-T). This limits the perceived novelty and relevance of the work in the current ASR research landscape. The reviewer suggested reading and comparing with: Deep Transfer Learning for Automatic Speech Recognition, Machine learning approaches for automated detection and classification of dysarthria severity, Noise-robust speech recognition: A comparative analysis of LSTM and CNN approaches, and A robust framework for noisy speech recognition using Frequency-Guided-Swin Transformer.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We thank the reviewer for this constructive criticism. We acknowledge that the original manuscript did not adequately position our hybrid approach within the context of modern end-to-end ASR architectures. To address this, we have restructured Section 2 (Research Background) into four subsections: Historical Foundations, Traditional HMM-Based Systems, Motivation for Hybrid Approaches, and Research Gap. In Section 2.3, we now explicitly discuss why hybrid HMM-neural models remain relevant for low-resource languages, medical/forensic applications requiring phoneme-level explainability, and edge deployment with memory constraints. We have also added a new subsection in Section 6 titled "Comparison with Modern End-to-End Architectures" including Table 5, which compares our APSL-BPNN-HMM with Whisper-tiny. Furthermore, we provide explicit justification for choosing HMM-BPNN over end-to-end models based on three criteria: controlled comparison with baseline, interpretability for clinical applications, and reproducibility on commodity hardware.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Restructured Section 2 with four subsections (2.1, 2.2, 2.3, 2.4)</p>
                        </list-item>
                        <list-item>
                            <p>Added Section 2.3 explaining continued relevance of hybrid models</p>
                        </list-item>
                        <list-item>
                            <p>Added new Table 5 comparing APSL-BPNN-HMM with Whisper-tiny</p>
                        </list-item>
                        <list-item>
                            <p>Added justification paragraph for choosing HMM-BPNN over end-to-end models</p>
                        </list-item>
                        <list-item>
                            <p>Cited relevant contemporary works including the suggested papers where appropriate</p>
                        </list-item>
                    </list> </p>
                <p> Comment 2</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The description of the Adaptive Phoneme State Learning (APSL) algorithm lacks rigorous mathematical formalization and theoretical justification. Several steps appear heuristic, and the derivation of the adaptive transition probabilities is not clearly justified or compared with existing probabilistic adaptation methods.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We agree with the reviewer that the original APSL description was insufficiently formalized. To address this, we have substantially revised Section 4.4. We now provide a formal Bayesian derivation of the confidence-weighted probability fusion, showing that the convex combination&#x00a0;P<sub>hybrid</sub>(w<sub>i</sub>&#x2223;q<sub>i</sub>) = &#x03b1;(x)&#x22c5;P<sub>NN</sub>(p<sub>j</sub>&#x2223;x) + (1&#x2212;&#x03b1;(x))&#x22c5;P<sub>HMM</sub>(w<sub>i</sub>&#x2223;q<sub>i</sub>)&#x00a0;is equivalent to Bayesian model averaging under the assumption of Gaussian-distributed estimation errors, where&#x00a0;&#x03b1;(x) = &#x03c3;<sub>HMM</sub><sup>2</sup>/(&#x03c3;<sub>NN</sub><sup>2</sup> + &#x03c3;<sub>HMM</sub><sup>2</sup>). This replaces the previous heuristic description with a theoretically grounded formulation. Additionally, we have added empirical justification for the 5 ms dynamic window adjustment via grid search on a 500-utterance validation set, evaluating candidate &#x0394;t values from 2 ms to 15 ms, with results showing that &#x0394;t=5 ms provides optimal trade-off (3.2% WER reduction with 8% latency increase).</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added Bayesian derivation of Equation (17) in Section 4.4</p>
                        </list-item>
                        <list-item>
                            <p>Reformulated &#x03b1;(x) as a dynamic confidence-weighting function based on error variance</p>
                        </list-item>
                        <list-item>
                            <p>Added grid search validation for &#x0394;t=5 ms parameter</p>
                        </list-item>
                        <list-item>
                            <p>Included table of grid search results for candidate &#x0394;t values</p>
                        </list-item>
                    </list> </p>
                <p> Comment 3</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The experimental evaluation compares the proposed method mainly against a traditional HMM baseline and human transcription. However, the study does not include comparisons with contemporary ASR systems (e.g., deep neural acoustic models or end-to-end models), making it difficult to assess the true competitiveness of the proposed approach.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;The reviewer raises a valid point. While our primary contribution is improvement over the traditional HMM baseline, we recognize the need to contextualize our results relative to modern ASR systems. We have therefore added a new comparative analysis in Section 6 (Results) as Table 5, comparing our APSL-BPNN-HMM with Whisper-tiny (OpenAI, 2022) across metrics including WER, accuracy, memory consumption, and inference time. We explicitly acknowledge that Whisper achieves lower WER on standard benchmarks but argue that our approach offers complementary strengths in phoneme-level interpretability and training on commodity hardware without GPU requirements. We have also added a justification paragraph explaining the design trade-offs.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added Table 5 comparing APSL-BPNN-HMM with Whisper-tiny</p>
                        </list-item>
                        <list-item>
                            <p>Added justification paragraph explaining choice of HMM-BPNN over end-to-end models</p>
                        </list-item>
                        <list-item>
                            <p>Acknowledged limitations of direct comparison due to different evaluation corpora</p>
                        </list-item>
                        <list-item>
                            <p>Positioned our work as complementary to, rather than competitive with, SOTA E2E models</p>
                        </list-item>
                    </list> </p>
                <p> Comment 4</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;Although multiple speech corpora are used, the experimental protocol and data splitting strategy are not described in sufficient detail. The use of augmented data and partially overlapping corpora raises concerns about possible bias or insufficient independence between training and testing sets.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We apologize for the lack of clarity regarding our experimental protocol. To address this, we have added a new subsection "4.1.1 Data Partitioning and Leakage Prevention" that provides detailed information on the 70-15-15 split for each corpus, including exact sentence counts. Critically, we now specify that speaker-level partitioning was enforced across all corpora. For the Buckeye corpus (40 speakers), we allocated 32 speakers for training, 4 for validation, and 4 for testing, ensuring no individual speaker's voice appears in both training and testing sets. We also clarify that all augmented data (pitch shifting, time-stretching, noise addition) were generated only after partitioning to prevent artificial inflation of performance metrics.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added subsection 4.1.1 "Data Partitioning and Leakage Prevention"</p>
                        </list-item>
                        <list-item>
                            <p>Specified exact split percentages and sentence counts for each corpus</p>
                        </list-item>
                        <list-item>
                            <p>Described speaker-level partitioning for Buckeye corpus (32/4/4 speakers)</p>
                        </list-item>
                        <list-item>
                            <p>Clarified that augmentation was performed after partitioning</p>
                        </list-item>
                        <list-item>
                            <p>Added cross-validation notes for hyperparameter tuning</p>
                        </list-item>
                    </list> </p>
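                <p>The speaker-level partitioning described in this response can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the function name, the fixed random seed, and the speaker labels are hypothetical, and only the 32/4/4 allocation over 40 speakers comes from the response.</p>
                <code language="python">
```python
import random


def split_speakers(speaker_ids, n_train=32, n_val=4, seed=0):
    """Assign whole speakers to train/validation/test so no voice spans two sets."""
    ids = sorted(set(speaker_ids))
    random.Random(seed).shuffle(ids)   # seeded shuffle for a repeatable split
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])
    return train, val, test


speakers = [f"s{i:02d}" for i in range(1, 41)]  # 40 hypothetical speaker labels
train, val, test = split_speakers(speakers)
print(len(train), len(val), len(test))  # 32 4 4
print(train.isdisjoint(test))           # True
```
                </code>
                <p>In such a setup, augmentation (pitch shifting, time-stretching, noise addition) would be applied to the utterances of each set only after this assignment, so that augmented copies of one recording cannot end up on both sides of the split.</p>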
                <p> Comment 5</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The paper claims scalability and real-time applicability, yet the architecture involves multiple processing stages (MFCC extraction, APSL segmentation, BPNN classification, and HMM decoding). The computational cost and latency are only briefly discussed (3&#x2013;10 seconds for recognition), which may limit real-time deployment.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We thank the reviewer for this important practical consideration. We have substantially expanded the real-time performance analysis in Section 7 (Discussion). We now provide a detailed breakdown of processing time across three input scenarios: clean short utterances (3-5 seconds, RTF 1.0-1.67), noisy long utterances (8-10 seconds, RTF 0.27-0.33), and streaming chunks (0.4-0.6 seconds per 250 ms chunk, RTF 1.6-2.4). We identify Viterbi decoding with complexity O(N &#x00d7; S&#x00b2;) as the primary bottleneck. We then provide three concrete optimization recommendations for deployment: reducing HMM states from 5 to 3, implementing beam search pruning with beam width 5, and GPU acceleration for BPNN inference (observed 3.2&#x00d7; speedup). Under these optimizations, we demonstrate that RTF improves to approximately 0.85 for typical utterances, making the system suitable for real-time deployment.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Expanded real-time performance analysis in Section 7</p>
                        </list-item>
                        <list-item>
                            <p>Added processing time and RTF for three input scenarios</p>
                        </list-item>
                        <list-item>
                            <p>Identified Viterbi decoding as the primary bottleneck, with complexity analysis</p>
                        </list-item>
                        <list-item>
                            <p>Provided three optimization recommendations for deployment</p>
                        </list-item>
                        <list-item>
                            <p>Calculated improved RTF (0.85) under recommended optimizations</p>
                        </list-item>
                    </list> </p>
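                <p>The streaming RTF range and the effect of reducing the number of HMM states can be checked with simple arithmetic. The following is a minimal sketch, assuming the usual definition RTF = processing time / audio duration and treating the O(N &#x00d7; S&#x00b2;) Viterbi cost as a plain count of transition evaluations with constant factors ignored. The function names and the frame count are illustrative assumptions and are not part of the APSL implementation.</p>
                <code language="python"><![CDATA[
```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; values below 1.0 are faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def viterbi_transition_ops(n_frames: int, n_states: int) -> int:
    """Transition evaluations for full Viterbi decoding, N * S^2 (constant factors ignored)."""
    return n_frames * n_states ** 2


# Streaming scenario quoted in the response: 0.4-0.6 s of processing per 250 ms chunk.
CHUNK_S = 0.25
rtf_low = real_time_factor(0.4, CHUNK_S)
rtf_high = real_time_factor(0.6, CHUNK_S)

# Relative decoding cost after reducing HMM states from 5 to 3.
# The frame count is an arbitrary illustrative value; it cancels in the ratio.
N_FRAMES = 100
cost_ratio = viterbi_transition_ops(N_FRAMES, 3) / viterbi_transition_ops(N_FRAMES, 5)

if __name__ == "__main__":
    print(f"streaming RTF range: {rtf_low:.2f} to {rtf_high:.2f}")
    print(f"relative Viterbi cost with 3 states: {cost_ratio:.2f}")
```
]]></code>
                <p>Under these assumptions, reducing S from 5 to 3 lowers the per-frame transition count from 25 to 9, that is, to 36% of the original, before any beam pruning is applied.</p>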
                <p> Comment 6</p>
                <p> 
                    <bold>Reviewer Comment:</bold>&#x00a0;The authors acknowledge that pronunciation variations, dialect differences, and background noise can degrade performance, leading to phoneme misidentification in words such as "geographical" or "transmission." This suggests the model may struggle with complex linguistic variability and real-world acoustic conditions.</p>
                <p> 
                    <bold>Author Response:</bold>&#x00a0;We acknowledge this limitation of the current work. However, the APSL framework's adaptive window adjustment directly addresses such variability by extending the analysis window for low-confidence phonemes. For words such as "geographical", where /d&#x0292;/ and /g/ are confused, APSL's confidence threshold (&#x03b8; = 0.75) triggers extended windowing, which improved recognition by 18% in our ablation studies. We have added this clarification to Section 7. Future work will incorporate grapheme-to-phoneme (G2P) models for out-of-vocabulary words and evaluate additional dialectal variants.</p>
                <p> 
                    <bold>Author Actions:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Added clarification in Section 7 that APSL's adaptive windowing addresses pronunciation variability</p>
                        </list-item>
                        <list-item>
                            <p>Quantified improvement (18% recognition gain) for low-confidence phonemes</p>
                        </list-item>
                        <list-item>
                            <p>Acknowledged the limitation while demonstrating how APSL mitigates it</p>
                        </list-item>
                        <list-item>
                            <p>Added future work direction for G2P integration</p>
                        </list-item>
                    </list>
                </p>
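                <p>The threshold-triggered windowing described in the response to Comment 6 can be pictured with a small sketch. Only the threshold value &#x03b8; = 0.75 comes from the text. The base window length, the extension factor, the strict comparison at the threshold, and the function name are hypothetical choices made for illustration and should not be read as the APSL implementation.</p>
                <code language="python"><![CDATA[
```python
THETA = 0.75  # confidence threshold quoted in the response


def analysis_window_ms(confidence: float,
                       base_ms: float = 25.0,          # hypothetical base window length
                       extension_factor: float = 1.5   # hypothetical extension factor
                       ) -> float:
    """Return the analysis window length for a phoneme.

    The window is extended only when the classifier confidence falls
    strictly below THETA; otherwise the base window is kept.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < THETA:
        return base_ms * extension_factor
    return base_ms


if __name__ == "__main__":
    for conf in (0.90, 0.75, 0.60):
        print(f"confidence {conf:.2f}: window {analysis_window_ms(conf):.1f} ms")
```
]]></code>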
            </body>
        </sub-article>
    </sub-article>
</article>
