<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.178913.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Towards Reliable Prosody Detection in Teaching Practical Phonetics: A Sustainable Digital Approach</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 2 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Tolstykh</surname>
                        <given-names>Olesya</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-7444-8500</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Svatov</surname>
                        <given-names>Alexey</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Oshchepkova</surname>
                        <given-names>Tamara</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-0309-6442</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>The Department of Modern Languages and Communication, National University of Science and Technology MISIS, Moscow, Russian Federation</aff>
                <aff id="a2">
                    <label>2</label>Liberal Arts Department, American University of the Middle East, Egaila, Kuwait, 54200, Kuwait</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:tamara.oshchepkova@aum.edu.kw">tamara.oshchepkova@aum.edu.kw</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>30</day>
                <month>4</month>
                <year>2026</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2026</year>
            </pub-date>
            <volume>15</volume>
            <elocation-id>648</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>8</day>
                    <month>4</month>
                    <year>2026</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Tolstykh O et al.</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/15-648/pdf"/>
            <abstract>
                <p>This research focuses on providing quality education to specialists in the field of linguistic studies, phonology, theoretical and practical phonetics by creating an innovative automated service that can ease linguists&#x2019; efforts to transcribe and examine the suprasegmental features of human speech prosody. The authors attempted to design a Docker-based architecture for prosody labelling. The pre-processing stage compiled 47 manually marked recordings. The data-processing stage produced a 1.9&#x00a0;GB Parquet corpus of 882 denoised, labelled clips via 
                    <italic toggle="yes">librosa</italic>, 
                    <italic toggle="yes">noisereduce</italic>, and 
                    <italic toggle="yes">Label Studio.</italic> The model selection stage compared 4909 scikit-learn models and 120 CNNs. The strongest classical approach (Random Forest) reached 0.373 accuracy, whereas the best CNN scored 0.455. Having recognised the prototype&#x2019;s limitations, the researchers have outlined a roadmap for improving the tool. The suggested model applies the principles of machine learning to automatically generate prosodic analysis that extends beyond individual sound segments and reflects suprasegmental aspects of prosody including rhythm, intonation, and phrasal stress.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>automated prosody detection system</kwd>
                <kwd>machine learning in phonetics</kwd>
                <kwd>suprasegmental analysis</kwd>
                <kwd>Teaching English as a Foreign Language (TEFL)</kwd>
                <kwd>convolutional neural networks (CNN)</kwd>
                <kwd>Label Studio annotation</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <def-list>
            <title>Abbreviations. The following abbreviations are used in this manuscript:</title>
            <def-item>
                <term id="G1">AI</term>
                <def>
                    <p>Artificial Intelligence</p>
                </def>
            </def-item>
            <def-item>
                <term id="G2">AMQP</term>
                <def>
                    <p>Advanced Message Queuing Protocol</p>
                </def>
            </def-item>
            <def-item>
                <term id="G3">CEFR</term>
                <def>
                    <p>Common European Framework of Reference</p>
                </def>
            </def-item>
            <def-item>
                <term id="G4">CNN</term>
                <def>
                    <p>Convolutional Neural Networks</p>
                </def>
            </def-item>
            <def-item>
                <term id="G5">IPA</term>
                <def>
                    <p>International Phonetic Alphabet</p>
                </def>
            </def-item>
            <def-item>
                <term id="G6">L1</term>
                <def>
                    <p>First Language, Mother Tongue</p>
                </def>
            </def-item>
            <def-item>
                <term id="G7">L2</term>
                <def>
                    <p>Second Language</p>
                </def>
            </def-item>
            <def-item>
                <term id="G8">LMS</term>
                <def>
                    <p>Learning Management System</p>
                </def>
            </def-item>
            <def-item>
                <term id="G9">ML</term>
                <def>
                    <p>Machine Learning</p>
                </def>
            </def-item>
            <def-item>
                <term id="G10">RP</term>
                <def>
                    <p>Received Pronunciation</p>
                </def>
            </def-item>
            <def-item>
                <term id="G11">TEFL</term>
                <def>
                    <p>Teaching English as a Foreign Language</p>
                </def>
            </def-item>
        </def-list>
        <sec id="sec1" sec-type="intro">
            <title>Introduction</title>
            <p>Foreign language teaching, despite being an integrative process, is often subdivided into separate aspects such as teaching vocabulary and grammar and developing productive and receptive skills. Teaching phonetics is often treated as a minor, subordinate strand whose main purpose is to support the development of foreign language skills. This position might be partially explained by the fact that many educators misinterpret the communicative approach and place excessive emphasis on getting the message across at the expense of the accuracy of utterances (
                <xref ref-type="bibr" rid="ref25">Levis, 2022</xref>; 
                <xref ref-type="bibr" rid="ref27">Low, 2021</xref>). As a result, teaching pronunciation has become a neglected and increasingly marginalized area. The core philosophy of the communicative approach, however, does not imply abandoning pronunciation teaching, because a lack of attention to pronunciation may result in less effective communication due to misinterpretation of spoken messages and poorer listening comprehension.</p>
            <p>By integrating pronunciation instruction into language education, teachers promote more inclusive and equitable learning opportunities, ensuring that learners develop the communicative competence needed to participate fully in academic, professional, and social contexts. Clear pronunciation enhances confidence and employability, thereby supporting lifelong learning and social mobility, while also fostering cross-cultural understanding. In this way, teaching pronunciation contributes not only to individual empowerment but also to the broader sustainability agenda of reducing barriers to participation in an interconnected world.</p>
            <p>Given the variety of existing accents, there is no agreement among educators about which model of intelligible articulation should be selected for teaching. In the updated edition of the Common European Framework of Reference (CEFR), the requirement of native-speaker pronunciation as a standard has been replaced by the intelligibility criterion (
                <xref ref-type="bibr" rid="ref9">Council of Europe, 2020</xref>). This implies the ability to articulate phonemes and convey prosody, including intonation, rhythm and stress. Foreign language learners are expected to develop natural pronunciation and intonation that are comprehensible to interlocutors and to recognise regional and sociolinguistic varieties of pronunciation. In other words, the mastery of pronunciation should be sufficient to meet the needs of English as a lingua franca, which undermines the notion of native speakerism and promotes the concept of intelligibility (
                <xref ref-type="bibr" rid="ref1">Almusharraf, 2024</xref>; 
                <xref ref-type="bibr" rid="ref22">Jeong, &amp; Lindemann, 2025</xref>). This approach remains a subject of debate because it is difficult to discern and measure what exactly constitutes an intelligible accent (
                <xref ref-type="bibr" rid="ref13">Dillon, &amp; Wells, 2023</xref>).</p>
            <p>Although there has been a shift toward prioritizing intelligibility over standard pronunciation, a preference for a recognised language norm known as Received Pronunciation (RP) persists among educators and students (
                <xref ref-type="bibr" rid="ref35">Pitychoutis, 2024</xref>). Within the context of the current research, the authors have also selected RP, for reasons explained in the Methodology section.</p>
            <p>Despite the fact that pronunciation is a significant part of language instruction, some teachers exclude it from their lessons (
                <xref ref-type="bibr" rid="ref10">Couper, 2021</xref>). In classes that address pronunciation, explicit instruction is relatively uncommon and is typically included as part of remedial work. Taking this a step further, teaching pronunciation as an autonomous subject or skill is even less frequent (
                <xref ref-type="bibr" rid="ref34">Pennington, 2021</xref>). Pronunciation-related classroom activities often emphasize the correctness of segmentals such as phonemes or word stress, while many teachers seem to be less concerned with suprasegmentals such as word junctions, intonation, and the meaningful use of chunking and phrasing (
                <xref ref-type="bibr" rid="ref6">B&#x00f8;hn, &amp; Hansen, 2017</xref>; 
                <xref ref-type="bibr" rid="ref10">Couper, 2021</xref>; 
                <xref ref-type="bibr" rid="ref17">Foote, et al., 2016</xref>). Segmental features are believed to be &#x201c;more teachable&#x201d; (
                <xref ref-type="bibr" rid="ref15">Elnagar, 2020</xref>) while suprasegmentals are difficult to describe without reference to specialised terminology. Segmental errors are perceived as more salient and easier to correct. However, rhythm and intonation are especially significant for achieving acceptable pronunciation, with sentence stress accounting for almost 36% of the speaking variance (
                <xref ref-type="bibr" rid="ref28">Ma, et al., 2018</xref>). The context of this research requires explicit instruction with particular emphasis on suprasegmental language elements because mastering them is a requirement for specialists in the field of linguistic studies, phonology, theoretical and practical phonetics (
                <xref ref-type="bibr" rid="ref19">Gordon, 2023</xref>; 
                <xref ref-type="bibr" rid="ref39">Stratton, 2023</xref>).</p>
            <p>Most pronunciation teaching methods are rooted in the audio-lingual approach and are aimed at developing accuracy via repetition and imitation. The most frequently applied classroom techniques are listening to target sounds, controlled mechanical imitation, comparison of pronunciation models, choral repetition, explanation of the sound&#x2013;spelling correspondence, and corrective feedback.</p>
            <p>
Modern technologies are widely used in foreign language education, with AI-powered tools gaining popularity and proving to be effective for second language (L2) pronunciation teaching. Almost 70% of the participants in the study conducted by 
                <xref ref-type="bibr" rid="ref1">Almusharraf (2024)</xref> reported using technology for teaching pronunciation. Namely, teachers are using ChatGPT for practicing, obtaining explanations and examples of L2 phonetic features (
                <xref ref-type="bibr" rid="ref30">Mompean, 2024</xref>). Among potent AI-powered applications for improving English pronunciation and speaking abilities experts name a few prominent speech recognition technologies including Speechling, Duolingo, and Google Assistant (
                <xref ref-type="bibr" rid="ref12">Dennis, 2024</xref>). AI-powered chatbots leverage voice recognition technology to simulate structured conversations with users and offer immediate feedback on their spoken input (
                <xref ref-type="bibr" rid="ref40">Sonsaat-Hegelheimer, &amp; Kurt, 2024</xref>). These developments are supported by machine learning, which has enabled automated scoring of learners&#x2019; performance, including pronunciation accuracy. For instance, 
                <xref ref-type="bibr" rid="ref33">Pearson Education Inc. (2022)</xref> claims that their scoring system can measure &#x201c;the position and length of pauses, the stress and segmental forms of the words, and the pronunciation of the segments in the words within their lexical and phrasal context&#x201d; (p. 19). The observations of some researchers (
                <xref ref-type="bibr" rid="ref34">Pennington, 2021</xref>), however, point to the limited value of the available technology for teaching and learning pronunciation, especially in the area of suprasegmental features such as intonation, rhythm, and pitch. Besides, educators report multiple inaccuracies in the error detection and feedback provided by automated speech recognition systems, as well as issues of reliability and validity with AI-generated automated scoring (
                <xref ref-type="bibr" rid="ref36">Rogerson-Revell, 2021</xref>).</p>
            <p>One of the applications that has been used by phoneticians for over two decades is Praat authored by 
                <xref ref-type="bibr" rid="ref4">Paul Boersma and David Weenink (2025)</xref> from the University of Amsterdam. It is free software that allows users to automate routine analyses of the acoustic parameters of speech (
                <xref ref-type="bibr" rid="ref21">Jadoul, et al., 2024</xref>). The core functionalities of Praat include detailed examination of pitch, formants, intensity, voice quality, and the production of spectrograms and cochleagrams. The programme also supports acoustic- and articulatory-based speech synthesis, signal manipulation (e.g., controlled changes in pitch, intensity, and duration), annotation based on the International Phonetic Alphabet (IPA), and segmentation into words or phonemes. Advanced modules address Optimality Theory, neural-network modelling, and multivariate statistics (
                <xref ref-type="bibr" rid="ref5">Boersma, et al., 2020</xref>).</p>
            <p>Interestingly, Praat has been extensively used in medical research as a tool helping patients with speech disorders (
                <xref ref-type="bibr" rid="ref18">G&#x00f6;la&#x00e7;, et al., 2025</xref>; 
                <xref ref-type="bibr" rid="ref23">Karia, 2023</xref>; 
                <xref ref-type="bibr" rid="ref37">Sonkaya, et al., 2024</xref>). The number of empirical research papers that discuss the use of Praat for educational purposes is considerably lower. Still, researchers who have analysed the impact of Praat-based language learning activities have reported both benefits and limitations of applying Praat in the English-learning classroom. Several researchers and practitioners have found Praat to be an effective tool for teaching various aspects of pronunciation. In particular, educators found the visual feedback provided by Praat useful. Automatically generated spectrograms and pitch graphs, like the ones presented in 
                <xref ref-type="fig" rid="f1">
Figure 1</xref>, allow learners to compare their pitch contours to the provided samples, identifying discrepancies in intonation (
                <xref ref-type="bibr" rid="ref20">Guo, 2025</xref>; 
                <xref ref-type="bibr" rid="ref24">Larassati et al., 2022</xref>; 
                <xref ref-type="bibr" rid="ref44">Wang, 2021</xref>; 
                <xref ref-type="bibr" rid="ref46">Zeng, &amp; Huang, 2025</xref>). This feature helps in overcoming the challenges that the abstract nature of pronunciation poses for language learners (
                <xref ref-type="bibr" rid="ref41">Topal, 2024</xref>).</p>
            <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                <label>
Figure 1. </label>
                <caption>
                    <title>A sample of a spectrogram created by Praat.</title>
                </caption>
                <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure1.gif"/>
            </fig>
            <p>Another benefit created by Praat is the opportunity for autonomous practice since students are empowered to self-monitor their pronunciation outside classrooms by recording and analyzing their speech (
                <xref ref-type="bibr" rid="ref32">Osatananda, &amp; Thinchan, 2021</xref>). Moreover, empirical studies have confirmed that students who were introduced to Praat as a tool for enhancing their pronunciation demonstrated significantly greater progress than the control group (
                <xref ref-type="bibr" rid="ref8">Chen, 2022</xref>; 
                <xref ref-type="bibr" rid="ref14">El-Garawany, 2021</xref>). Researchers believe that this software also helps teachers to analyse pronunciation difficulties that learners face and plan remedial work accordingly (
                <xref ref-type="bibr" rid="ref43">Wang, 2024</xref>).</p>
            <p>Despite these benefits, there are distinct limitations that restrict the wide use of Praat in the classroom. First of all, although the software is simple to operate and has intuitive graphics, understanding the generated spectrograms requires some knowledge of acoustics and phonetics (
                <xref ref-type="bibr" rid="ref36">Rogerson-Revell, 2021</xref>; 
                <xref ref-type="bibr" rid="ref45">Yang, &amp; Zhao, 2021</xref>). Besides, teaching students to interpret spectrograms might be time-consuming. If teachers decide to encourage learners to use Praat for self-study, they have to provide preliminary training in reading the visuals provided by the software.</p>
            <p>While its specialised focus constitutes a major strength, Praat&#x2019;s interface and architecture remain relatively archaic. Since Praat was first released in 1991, there have been multiple attempts to improve it, not only by its developers but also by its users. For instance, 
                <xref ref-type="bibr" rid="ref11">de Jong, et al. (2021)</xref> endeavoured to create a Praat script for measuring fluency by detecting filled pauses. Another group of researchers has been working on making Praat functionality available in Python, which could provide alternative means of interaction with Praat&#x2019;s algorithms (
                <xref ref-type="bibr" rid="ref21">Jadoul, et al., 2024</xref>). However, none of these attempts meet the needs of educators and linguists who intend to use Praat for deeper understanding of pronunciation phenomena.</p>
            <p>To sum up, there is clear evidence of extensive research in the area of teaching foreign language pronunciation. Researchers emphasize the importance of intelligible articulation for enhancing communication and mutual understanding. Instruction in suprasegmental elements contributes significantly to producing meaningful discourse. Mastering pronunciation becomes considerably easier when modern computer technologies and AI-powered tools are applied. However, none of the research projects that the authors reviewed analyses the specific setting of teaching pronunciation to students who specialize in the field of linguistic studies, phonology, theoretical and practical phonetics. The requirement of intelligibility is not sufficient at this level. These learners are required to recognize different varieties of pronunciation, analyse acceptable and unacceptable deviations from the standard, back up their explanations with theoretical justifications, and serve as models of pronunciation that is close to the language norm. Therefore, some specialized tools might be required in addition to those already used for teaching pronunciation. The current research is an attempt to develop a tool suitable for the stated purposes.</p>
        </sec>
        <sec id="sec2">
            <title>Research context and objectives</title>
            <p>The current research is being implemented at the National University of Science and Technology (NUST &#x201c;MISIS&#x201d;) within the context of the course &#x201c;Practical Phonetics of English&#x201d;. The original syllabus for the course was developed by Professor PhD Sukhova N. V., with the academic support of her colleagues Professor PhD Tolstykh O. M. and Professor Gendelev I. D. This course is compulsory for the students who major in linguistics and specialize in translation, Teaching English as a Foreign Language (TEFL), or media and communication.</p>
            <p>RP has been selected as a reference point for teaching. When students aim to build a career in language &#x2013; whether as a linguist, educator, or interpreter &#x2013; their command of precise articulation and accurate perception is fundamental to professional competence. Moreover, in these professions, intonation serves as an indispensable tool that guides meaning, complements words and grammar, and organizes spoken discourse. Although some linguists question the status of RP as a model in L2 learning and there is a general shift towards acceptance of other accents of English (
                <xref ref-type="bibr" rid="ref3">Baratta, &amp; Halenko, 2022</xref>), the course developers find it important to focus on RP. Mastering this type of articulation, which is recognised as a language norm, might enhance the learners&#x2019; intelligibility and help them fulfil their professional duties by serving as models of standard pronunciation. As professional linguists, they need to do more than produce intelligible utterances and recognize regional accents as required by the CEFR. They should be able to recognize, analyse, and explain the causes of such phonological phenomena as sound assimilation, reduction, palatalization, individual pronunciation irregularities, cross-linguistic interference from the L1, etc. Such a focus on RP does not mean, however, that course instructors intentionally restrict exposure to other language varieties.</p>
            <p>The course is structured across two semesters. It combines face-to-face sessions with guided independent study, providing students with flexibility while maintaining consistent academic support. In 
                <bold>the first term</bold>, students focus on mastering sound articulation, transcription, and basic intonation models. The instructor-led sessions are complemented by self-paced practice, which includes both independent training and the use of the AI Pronunciation Trainer (
                <xref ref-type="bibr" rid="ref26">Lobato, 2025</xref>). This free service supports learners by offering immediate, phoneme-level feedback and visualizations via IPA transcription, encouraging repeated practice of individual sounds and sentence-level prosody. Although the functionality of the tool is limited, it is still helpful at the early stages, particularly for learners whose native language interferes with accurate sound production. The main drawbacks are the preset nature of the sentence bank and the rigid scoring system that rewards hyper-articulation without recognizing the natural use of reduction and assimilation. Thus, the usage of this AI-driven tool is complemented by corrective guidance from the professors during classes.</p>
            <p>Another supplementary tool used at this stage of the course is 
                <xref ref-type="bibr" rid="ref38">Speechace (n.d.)</xref>, a speech recognition platform designed for evaluating and giving feedback on pronunciation and fluency. Its AI-driven analysis allows learners to independently assess their pronunciation and share their results with instructors for progress tracking. The system provides a grade and a descriptive evaluation of the spoken input; however, it does not offer detailed analysis of suprasegmental features essential to our syllabus, such as rhythm, various types of stress, or intonation contours.</p>
            <p>The speech analyzer 
                <xref ref-type="bibr" rid="ref16">Elsa Speak (2025)</xref> allows the evaluation of both segmental and suprasegmental features of real-time user recordings. It provides scoring for vowel and consonant sounds and marks mispronounced phonemes in a color-coded transcript, which helps students visually track deviations from RP. The service also evaluates fluency through measurable metrics such as speaking pace, pause duration, and overall flow, and provides a graphical pitch line that reflects intonation control. Intonation feedback is expressed through both numeric ratings and visual indicators, supporting the development of pitch range awareness and rhythmical delivery. This detailed and accessible feedback makes this tool particularly useful for pronunciation assessment.</p>
            <p>During 
                <bold>the second term</bold>, instruction shifts toward more complex aspects of connected speech, including style-specific intonation and pitch variation. Students work extensively with authentic texts, which they first read aloud, then practise using the shadowing technique to refine pronunciation and prosody. As part of their work, students also transcribe the texts, mark phonetic phenomena such as assimilation, reduction, and intonation contours, and draw tonograms. This intensive work culminates in memorized recitation of the original texts. Through this progression students gradually develop the skills needed to create and perform their own phonetically accurate and communicatively meaningful texts.</p>
            <p>None of the currently available AI-powered pronunciation assessment tools is suitable for the advanced prosodic analysis expected in specialized university-level phonetics education. The available services focus primarily on segmental features and offer limited feedback on suprasegmental elements like stress placement and pitch movement. The students of the target group are trained to become professional linguists who will be required to produce utterances that reflect the phonetic features of the language norm and to perceive and analyze intonation contours and rhythm in natural speech. The spectrograms, pitch traces, and timelines generated by Praat and other available tools enable students to visualize speech melody but do not provide sufficient data for detailed, theoretically informed linguistic interpretation and instruction. Therefore, the main research question this project seeks to answer is how to develop an advanced automated prosody detection system based on machine learning.</p>
        </sec>
        <sec id="sec3">
            <title>Materials and methods</title>
            <p>The project implementation was organised in three stages.</p>
            <p>The aim of the 
                <bold>preprocessing stage</bold> was to generate a high-quality dataset suitable for training a machine-learning model that can provide automatically generated prosodic analysis. Because our learners are native speakers of Russian, we drew on the extensive experience of Russian linguists, who have established a distinctive scholarly tradition of analysing the prosodic aspects of a language. Intonational analysis in contemporary Russian phonetics studies relies extensively on formally defined symbol sets proposed by phoneticians (
                <xref ref-type="bibr" rid="ref2">Antipova et al., 1985</xref>; 
                <xref ref-type="bibr" rid="ref29">Mitrofanova, 2012</xref>; 
                <xref ref-type="bibr" rid="ref42">Vereninova, 2011</xref>). Some examples of intonation markers and graphical representations of tonograms used for teaching pronunciation are illustrated in 
                <xref ref-type="fig" rid="f2">
Figure 2</xref>. In this study, we draw upon this scheme with slight modifications that accommodate both the instructional needs of the above-mentioned practical phonetics course and the algorithmic requirements of the system.</p>
            <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                <label>
Figure 2. </label>
                <caption>
                    <title>Examples of intonation markers and pitch contour diagrams.</title>
                </caption>
                <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure2.gif"/>
            </fig>
            <p>This set of symbols was used to examine forty-seven audio samples. Each recording was analyzed and fully transcribed, and intonational markers were placed in alignment with the speakers&#x2019; prosodic patterns. 
                <xref ref-type="fig" rid="f3">
Figure 3</xref> shows an excerpt of such a text with the completed intonational analysis.</p>
            <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                <label>
Figure 3. </label>
                <caption>
                    <title>An example of text with phonetic markers reflecting the speaker&#x2019;s prosody.</title>
                </caption>
                <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure3.gif"/>
            </fig>
            <p>The 
                <bold>data processing stage</bold> involved applying software libraries suited for analysing the acoustic properties of audio files. Because audio recordings are unstructured data, the standard data-science stack (
                <italic toggle="yes">pandas, NumPy, Matplotlib</italic>) was augmented with 
                <italic toggle="yes">librosa.</italic> Data exploration and preparation were performed in 
                <italic toggle="yes">JupyterLab 4.3.6</italic> running 
                <italic toggle="yes">Python 3.13.</italic> The following packages were employed: 
                <italic toggle="yes">librosa 0.11.0</italic> for audio input, time-series slicing, mel-spectrogram construction, and on-screen display via 
                <italic toggle="yes">librosa.display</italic>; 
                <italic toggle="yes">Matplotlib 3.10.1</italic> for exploratory plots; 
                <italic toggle="yes">NumPy 2.1.3</italic> for numerical arrays; and 
                <italic toggle="yes">pandas 2.2.3</italic> for tables. Background noise was suppressed with 
                <italic toggle="yes">noisereduce 3.0.3.</italic>
            </p>
            <p>The 
                <bold>model selection stage</bold> involved training machine learning models and evaluating their accuracy. For modelling, 
                <italic toggle="yes">scikit-learn 1.6.1</italic> was utilised to train classical algorithms, while 
                <italic toggle="yes">PyTorch 2.3</italic> (CPU-only) was employed for the convolutional network. Each training iteration was tracked with a 
                <italic toggle="yes">tqdm</italic> progress bar. Initially, annotation dashboards were created with 
                <italic toggle="yes">Bokeh</italic>, 
                <italic toggle="yes">HoloViews</italic>, and 
                <italic toggle="yes">Panel.</italic> Eventually, they were replaced with 
                <italic toggle="yes">Label Studio 1.17.0</italic> running in 
                <italic toggle="yes">Docker</italic> for manual verification. The prototype comprised several containerised services, including 
                <italic toggle="yes">FastAPI</italic>, 
                <italic toggle="yes">RabbitMQ</italic>, 
                <italic toggle="yes">MinIO</italic>, 
                <italic toggle="yes">PostgreSQL</italic>, and a 
                <italic toggle="yes">React SPA</italic>, which communicated through 
                <italic toggle="yes">Docker Compose</italic> and were prepared for deployment on 
                <italic toggle="yes">Kubernetes.</italic> The system architecture was documented in 
                <italic toggle="yes">PlantUML</italic> and visualised using 
                <italic toggle="yes">C4 Model</italic> diagrams.</p>
            <p>The project architecture presented in 
                <xref ref-type="fig" rid="f4">
Figure 4</xref> employs a modular, microservice-style design in which data ingestion, storage, task routing, and presentation layers are separated into distinct components, following the principles described by 
                <xref ref-type="bibr" rid="ref31">Newman (2021)</xref>. To document this design clearly, the C4 Model (
                <xref ref-type="bibr" rid="ref7">Brown, 2023</xref>) was used as the guiding framework for architectural diagrams, as it provides hierarchical views (context, containers, components, code) that align with our service-based architecture. The main advantage of the C4 Model lies in its support for incremental elaboration, as not all layers need to be formalised simultaneously.</p>
            <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                <label>
Figure 4. </label>
                <caption>
                    <title>Container-level view of the software architecture for intonation processing.</title>
                </caption>
                <graphic id="gr4" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure4.gif"/>
            </fig>
            <p>At the container level, the C4 Model diagram depicts a microservice back-end in which each functional task is isolated in its own Docker container. Inter-service communication is managed by a 
                <italic toggle="yes">RabbitMQ</italic> message queue, with all connecting arrows annotated as AMQP. When a service needs to read or write an audio file or its derived spectrogram, it issues a simple HTTP request to 
                <italic toggle="yes">MinIO</italic>, an S3-compatible object store. The HTTP label is deliberately retained for clarity because S3 may refer either to the storage engine or to its protocol.</p>
            <p>Database operations are routed through a specialized repository service that submits SQL queries to a PostgreSQL server. The connector is annotated as SQL because the language of the request conveys more meaningful information than the transport protocol, which mirrors how message exchange is tagged as AMQP. The remaining containers, which represent the microservices responsible for the application logic, are described in 
                <xref ref-type="table" rid="T1">
Table 1</xref>.</p>
            <table-wrap id="T1" orientation="portrait" position="float">
                <label>
Table 1. </label>
                <caption>
                    <title>Functional roles of containers in the back-end architecture.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Container</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Main function</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="center" colspan="1" rowspan="1" valign="top">Connector</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Sends audio to SaluteSpeech (speech-to-text) and posts the transcript back to the queue.</td>
                        </tr>
                        <tr>
                            <td align="center" colspan="1" rowspan="1" valign="top">Frequencies</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Computes and plots the recording&#x2019;s frequency spectrum.</td>
                        </tr>
                        <tr>
                            <td align="center" colspan="1" rowspan="1" valign="top">Markers</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Applies the ML model to label intonation markers.</td>
                        </tr>
                        <tr>
                            <td align="center" colspan="1" rowspan="1" valign="top">Syntagmas</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Splits the transcribed text into syntagmatic units.</td>
                        </tr>
                        <tr>
                            <td align="center" colspan="1" rowspan="1" valign="top">Intonation Contour</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Aligns predicted markers with the transcript, producing a pitch-contour overlay.</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <p>The steps below trace the path of an audio recording through the containers shown in 
                <xref ref-type="fig" rid="f4">
Figure 4</xref>.
                <list list-type="order">
                    <list-item>
                        <label>1.</label>
                        <p>The user uploads an audio file through the web interface (
                            <italic toggle="yes">SPA</italic>).</p>
                    </list-item>
                    <list-item>
                        <label>2.</label>
                        <p>The file object reaches the 
                            <italic toggle="yes">API-Gateway</italic> endpoint.</p>
                    </list-item>
                    <list-item>
                        <label>3.</label>
                        <p>The 
                            <italic toggle="yes">API-Gateway</italic> stores the file in the object repository (
                            <italic toggle="yes">Storage</italic>).</p>
                    </list-item>
                    <list-item>
                        <label>4.</label>
                        <p>Using the 
                            <italic toggle="yes">Message Bus</italic>, the gateway enqueues three work items, routing them to 
                            <italic toggle="yes">Connector</italic>, 
                            <italic toggle="yes">Frequencies</italic>, and 
                            <italic toggle="yes">Markers</italic>, respectively.</p>
                    </list-item>
                    <list-item>
                        <label>5.</label>
                        <p>

                            <italic toggle="yes">Connector</italic> retrieves the transcription task, downloads the audio file from 
                            <italic toggle="yes">Storage</italic>, and sends it to 
                            <italic toggle="yes">SaluteSpeech.</italic> When the transcript is returned, 
                            <italic toggle="yes">Connector</italic> posts a new task for 
                            <italic toggle="yes">Syntagmas.</italic>
                        </p>
                    </list-item>
                    <list-item>
                        <label>6.</label>
                        <p>

                            The <italic toggle="yes">Frequencies</italic> container processes its queue entry, downloads the audio file, generates spectral plots, saves the plots back to 
                            <italic toggle="yes">Storage</italic>, and queues a 
                            <italic toggle="yes">Repository</italic> task that carries the plot IDs.</p>
                    </list-item>
                    <list-item>
                        <label>7.</label>
                        <p>

                            The <italic toggle="yes">Markers</italic> container downloads the audio file, runs the ML model to identify intonation markers, and publishes the results to 
                            <italic toggle="yes">Repository</italic> via 
                            <italic toggle="yes">Message Bus.</italic>
                        </p>
                    </list-item>
                    <list-item>
                        <label>8.</label>
                        <p>

                            The <italic toggle="yes">Syntagmas</italic> container retrieves the segmentation task and transcript, splits the text into syntagmatic units, and enqueues the segmented output for <italic toggle="yes">Intonation Contour</italic>.</p>
                    </list-item>
                    <list-item>
                        <label>9.</label>
                        <p>

                            <italic toggle="yes">Intonation Contour</italic> combines the marker data (retrieved via 
                            <italic toggle="yes">Repository</italic>) with the segmented text, creates the final prosodic overlay, and sends the package to 
                            <italic toggle="yes">Repository.</italic>
                        </p>
                    </list-item>
                    <list-item>
                        <label>10.</label>
                        <p>

                            <italic toggle="yes">Repository</italic> executes all read/write requests against 
                            <italic toggle="yes">PostgreSQL.</italic>
                        </p>
                    </list-item>
                    <list-item>
                        <label>11.</label>
                        <p>Once every task for the given recording is complete, 
                            <italic toggle="yes">API-Gateway</italic> notifies the SPA over 
                            <italic toggle="yes">WebSocket</italic>; the client then requests the results.</p>
                    </list-item>
                </list>
            </p>
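            <p>The routing described in the steps above can be sketched as a small in-memory simulation. It is an illustration only, under stated assumptions: a single first-in, first-out queue stands in for the RabbitMQ message bus, the <italic toggle="yes">NEXT_TASKS</italic> table and the <italic toggle="yes">process_upload</italic> function are hypothetical names that do not come from the prototype, and the real services consume their queues concurrently rather than in the strict order shown here.</p>

```python
from collections import deque

# Hypothetical routing table: for each container, the container(s) that
# receive a follow-up task. Container names follow Table 1 and the steps above.
NEXT_TASKS = {
    "Connector": ["Syntagmas"],
    "Frequencies": ["Repository"],
    "Markers": ["Repository"],
    "Syntagmas": ["Intonation Contour"],
    "Intonation Contour": ["Repository"],
    "Repository": [],
}


def process_upload(recording_id):
    """Return the order in which containers would handle one recording."""
    bus = deque()  # in-memory stand-in for the message bus (assumption)
    # Step 4: the gateway enqueues three work items.
    for container in ("Connector", "Frequencies", "Markers"):
        bus.append((container, recording_id))
    trace = []
    while bus:
        container, rec = bus.popleft()
        trace.append(container)
        # Each container posts follow-up tasks for the same recording.
        for follow_up in NEXT_TASKS[container]:
            bus.append((follow_up, rec))
    return trace


trace = process_upload("rec-001")
print(trace)
```

            <p>In this sketch, <italic toggle="yes">Repository</italic> is reached three times, once each for the spectral plots, the marker predictions, and the final overlay, which corresponds to steps 6, 7, and 9.</p>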
        </sec>
        <sec id="sec4" sec-type="results">
            <title>Results</title>
            <sec id="sec5">
                <title>Data-processing stage</title>
                <p>As mentioned earlier, forty-seven audio samples were transcribed and annotated at the preprocessing stage to serve as input for machine learning tasks.</p>
                <p>
The initial data-processing stage involved applying software libraries suited to analysing the acoustic properties of audio files. As the recordings constituted unstructured data, the standard data-science stack (
                    <italic toggle="yes">pandas</italic>, 
                    <italic toggle="yes">NumPy</italic>, 
                    <italic toggle="yes">Matplotlib</italic>) was augmented with 
                    <italic toggle="yes">librosa.</italic> Data exploration and preparation were conducted in 
                    <italic toggle="yes">JupyterLab 4.3.6</italic> running 
                    <italic toggle="yes">Python 3.13.</italic> The employed package versions were: 
                    <italic toggle="yes">librosa 0.11.0</italic>; 
                    <italic toggle="yes">Matplotlib 3.10.1</italic>; 
                    <italic toggle="yes">NumPy 2.1.3</italic>; and 
                    <italic toggle="yes">pandas 2.2.3.</italic> With the 
                    <italic toggle="yes">librosa.load</italic> function, two values were obtained from each audio track: an array representing the waveform and the sampling frequency, which defaults to 22,050&#x00a0;Hz. This function accommodates full-length loading as well as partial loading through the optional duration and offset parameters. 
                    <xref ref-type="fig" rid="f5">
Figure 5</xref> illustrates the loading command and the data structures it returned.</p>
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>
Figure 5. </label>
                    <caption>
                        <title>

                            <italic toggle="yes">Librosa</italic> workflow in 
                            <italic toggle="yes">JupyterLab</italic> illustrating the output of the load method.</title>
                    </caption>
                    <graphic id="gr5" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure5.gif"/>
                </fig>
                <p>Data returned by 
                    <italic toggle="yes">librosa.load</italic> were used to create visualisations that revealed the characteristics of the analysed waveform. Just as the 
                    <italic toggle="yes">head()</italic> method of a 
                    <italic toggle="yes">pandas DataFrame</italic> offers a preliminary view of a dataset, 
                    <italic toggle="yes">librosa.display.waveshow</italic> provided a quick visual inspection of the audio signal, a sample of which is shown in 
                    <xref ref-type="fig" rid="f6">
Figure 6</xref>.</p>
                <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                    <label>
Figure 6. </label>
                    <caption>
                        <title>Displaying and exporting an audio waveform using visualization tools.</title>
                    </caption>
                    <graphic id="gr6" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure6.gif"/>
                </fig>
                <p>
For the current investigation, a logarithmically scaled spectrogram, rendered as a heat map, provided the most informative visual support. As can be seen in 
                    <xref ref-type="fig" rid="f7">
Figure 7</xref>, the hottest regions in this representation &#x2013; corresponding to the highest decibel values &#x2013; form patterns that resemble the tonograms phoneticians use for intonation annotation.</p>
                <fig fig-type="figure" id="f7" orientation="portrait" position="float">
                    <label>
Figure 7. </label>
                    <caption>
                        <title>Spectrogram of the selected audio excerpt plotted on a logarithmic scale.</title>
                    </caption>
                    <graphic id="gr7" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure7.gif"/>
                </fig>
                <p>After the trial with separate recordings, a Python script was developed to automatically extract an acoustic dataset from all forty-seven audio files. Each record included the following attributes: timestamp, amplitude, harmonic signal component, percussive signal component, denoised amplitude, sample number, recording ID, total duration of the recording, and sampling frequency.</p>
                <p>Special attention was paid to the noise-filtered amplitude and the sample index. Background noise was suppressed with 
                    <italic toggle="yes">noisereduce</italic> 3.0.3 (function 
                    <italic toggle="yes">reduce_noise</italic>), which takes the raw-amplitude array and the sampling frequency as input. Then, 
                    <italic toggle="yes">librosa.effects.split</italic> was applied to the denoised signal. This utility segmented a 
                    <italic toggle="yes">NumPy</italic> amplitude vector wherever the signal level fell more than the silence threshold (60&#x00a0;dB by default) below the peak level. Sample code for denoising an audio signal is presented in 
                    <xref ref-type="fig" rid="f8">
Figure 8</xref>. The combined use of 
                    <italic toggle="yes">noisereduce</italic> and 
                    <italic toggle="yes">librosa.effects.split</italic> produced a noise-free waveform and temporal segments, which were later reused for raw, harmonic, or percussive amplitude streams. Noise removal resulted in silence gaps at previously noisy segments.</p>
                <fig fig-type="figure" id="f8" orientation="portrait" position="float">
                    <label>
Figure 8. </label>
                    <caption>
                        <title>Code snippet demonstrating the use of 
                            <italic toggle="yes">noisereduce</italic> and 
                            <italic toggle="yes">librosa.effects.split</italic> (threshold&#x00a0;=&#x00a0;60&#x00a0;dB).</title>
                    </caption>
                    <graphic id="gr8" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure8.gif"/>
                </fig>
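                <p>The silence-based segmentation can be illustrated with a simplified plain-Python stand-in. It is not the <italic toggle="yes">librosa.effects.split</italic> implementation: the hypothetical function <italic toggle="yes">split_on_silence</italic> uses short non-overlapping frames, whereas <italic toggle="yes">librosa</italic> uses overlapping ones, and the toy signal is invented for the example. The sketch only shows the underlying idea of keeping frames whose level lies within the threshold of the loudest frame.</p>

```python
import math


def split_on_silence(samples, frame_length=4, top_db=60.0):
    """Return (start, end) sample intervals of non-silent audio.

    A frame counts as non-silent when its RMS level is within top_db
    decibels of the loudest frame. Frames do not overlap (simplification).
    """
    rms = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return []  # the whole signal is digital silence
    floor = 1e-10  # avoids log10(0) for all-zero frames
    intervals = []
    current_start = None
    for i, value in enumerate(rms):
        level_db = 20.0 * math.log10(max(value, floor) / peak)
        loud = level_db > -top_db
        if loud and current_start is None:
            current_start = i * frame_length
        elif not loud and current_start is not None:
            intervals.append((current_start, i * frame_length))
            current_start = None
    if current_start is not None:
        intervals.append((current_start, len(samples)))
    return intervals


# Toy signal: a loud burst, eight silent samples, then a quieter burst.
signal = [0.5, -0.5, 0.5, -0.5] + [0.0] * 8 + [0.25, -0.25, 0.25, -0.25]
segments = split_on_silence(signal)
print(segments)
```

                <p>The quieter burst is kept because it lies only about 6&#x00a0;dB below the loudest frame, well inside the 60&#x00a0;dB threshold.</p>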
                <p>As a result, the script generated logarithmically scaled spectrograms for each of the forty-seven recordings and exported them as 300 dpi 
                    <italic toggle="yes">PNG</italic> images that were compressed into a 413&#x00a0;MB ZIP archive. The dataset was stored in 
                    <italic toggle="yes">Parquet</italic> format, which reduced its size from 6.8&#x00a0;GB (CSV) to 1.9&#x00a0;GB. This format was selected due to its high efficiency and broad compatibility. Specifically, 
                    <italic toggle="yes">Parquet</italic> can be seamlessly integrated with both 
                    <italic toggle="yes">pandas</italic> and 
                    <italic toggle="yes">PySpark</italic>, and it can be efficiently utilized as a data source within the 
                    <italic toggle="yes">ClickHouse</italic> environment. Its column-oriented architecture not only optimizes storage space but also facilitates column-wise data retrieval and predicate push-down filtering during data loading, thereby improving overall performance and analytical efficiency.</p>
                <p>After audio preprocessing, the dataset was segmented into a primary subset containing filenames, sampling frequencies, and file durations, and secondary subsets comprising raw, denoised, percussive, and harmonic amplitude data.</p>
                <p>The subsequent step of data processing involved normalization and intonation labelling. Normalization was performed using the 
                    <italic toggle="yes">StandardScaler.fit_transform</italic> method from 
                    <italic toggle="yes">scikit-learn 1.6.1.</italic> For labelling, a 
                    <italic toggle="yes">CategoricalDtype</italic> was defined in 
                    <italic toggle="yes">pandas</italic>, with categories corresponding to the specified intonation contours (
                    <xref ref-type="bibr" rid="ref2">Antipova et al., 1985</xref>). This categorical type was then assigned to a newly added field, 
                    <italic toggle="yes">mark</italic>, where the default value 
                    <italic toggle="yes">plain</italic> indicated the absence of a distinctive pitch contour, as illustrated in 
                    <xref ref-type="fig" rid="f9">
Figure 9</xref>.</p>
                <fig fig-type="figure" id="f9" orientation="portrait" position="float">
                    <label>
Figure 9. </label>
                    <caption>
                        <title>Code snippet for defining intonation categories.</title>
                    </caption>
                    <graphic id="gr9" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure9.gif"/>
                </fig>
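                <p>The normalization step can be illustrated without <italic toggle="yes">scikit-learn</italic>. <italic toggle="yes">StandardScaler</italic> standardises each feature by subtracting its mean and dividing by its population standard deviation; the plain-Python sketch below applies the same formula to a single feature. The function name <italic toggle="yes">standardise</italic> and the toy values are assumptions made for the example and are not taken from the study.</p>

```python
from statistics import fmean, pstdev


def standardise(values):
    """Return z-scores computed with the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant feature has no spread; map every value to zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


amplitudes = [2, 4, 4, 4, 5, 5, 7, 9]  # toy values (mean 5, deviation 2)
z_scores = standardise(amplitudes)
print(z_scores)
```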
                <p>After defining the intonation categories and adding the 
                    <italic toggle="yes">mark</italic> field, the labelling process was carried out according to a nine-step pipeline.
                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>Retrieving the normalized data corresponding to the selected sample and file from the relevant dataset.</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>Generating a spectrogram with an accompanying playback cursor.</p>
                        </list-item>
                        <list-item>
                            <label>3.</label>
                            <p>Loading the primary dataset into memory.</p>
                        </list-item>
                        <list-item>
                            <label>4.</label>
                            <p>Identifying the segment of interest within the primary dataset.</p>
                        </list-item>
                        <list-item>
                            <label>5.</label>
                            <p>Verifying the audio by cross-checking the spectrogram with the marked text.</p>
                        </list-item>
                        <list-item>
                            <label>6.</label>
                            <p>Determining the temporal boundaries of the intonation marker.</p>
                        </list-item>
                        <list-item>
                            <label>7.</label>
                            <p>Assigning intonation labels to all data points within the defined interval.</p>
                        </list-item>
                        <list-item>
                            <label>8.</label>
                            <p>Reviewing the newly assigned labels to ensure consistency.</p>
                        </list-item>
                        <list-item>
                            <label>9.</label>
                            <p>Saving the updated primary dataset.</p>
                        </list-item>
                    </list>
                </p>
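                <p>Steps 6 and 7 of the pipeline above amount to assigning one category to every data point inside a chosen interval. The sketch below shows this with plain Python dictionaries instead of a <italic toggle="yes">pandas</italic> data frame. Apart from <italic toggle="yes">plain</italic>, the category names are hypothetical placeholders, as are the function name <italic toggle="yes">assign_mark</italic> and the use of sample indices as interval boundaries.</p>

```python
# Hypothetical category names; only "plain" (the default) appears in the text.
ALLOWED_MARKS = ("plain", "low_fall", "high_rise")


def assign_mark(rows, start_sample, end_sample, mark):
    """Label rows whose sample index lies in the half-open interval
    from start_sample (inclusive) to end_sample (exclusive).

    Returns the number of rows that were labelled.
    """
    if mark not in ALLOWED_MARKS:
        raise ValueError(f"unknown intonation category: {mark}")
    labelled = 0
    for row in rows:
        if row["sample"] >= start_sample and end_sample > row["sample"]:
            row["mark"] = mark
            labelled += 1
    return labelled


# Ten data points, all carrying the default label.
rows = [{"sample": i, "mark": "plain"} for i in range(10)]
changed = assign_mark(rows, 3, 6, "low_fall")
marks = [row["mark"] for row in rows]
print(changed, marks)
```

                <p>Rejecting unknown names mirrors the role of the categorical type, which restricts the <italic toggle="yes">mark</italic> field to the predefined contours.</p>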
                <p>The implementation of the approach described above required the use of additional libraries, namely 
                    <italic toggle="yes">bokeh 3.7.2; holoviews 1.20.2; panel 1.6.2; scipy 1.15.2.</italic> With this toolset, labelling one file took 6&#x2013;8&#x00a0;hours on average. The process was not only time-consuming but also revealed a critical memory issue because 
                    <italic toggle="yes">pandas</italic> stored data frames in 
                    <italic toggle="yes">RAM</italic>, and 
                    <italic toggle="yes">Python</italic> did not trigger the 
                    <italic toggle="yes">garbage collector</italic> automatically when a variable was reassigned. In a series of labelling cycles, the system exhausted its available RAM and swap space, which caused the operating system to terminate 
                    <italic toggle="yes">JupyterLab.</italic> Preventive measures required continuous monitoring of memory consumption and either explicit garbage collection calls or regular 
                    <italic toggle="yes">JupyterLab kernel</italic> restarts.</p>
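                <p>The explicit garbage collection mentioned above could look roughly like the following sketch. It assumes that each labelling cycle loads one large object, processes it, and no longer needs it afterwards; the function <italic toggle="yes">label_in_batches</italic> and the toy batches are hypothetical and only show where the <italic toggle="yes">del</italic> statement and the <italic toggle="yes">gc.collect()</italic> call would be placed.</p>

```python
import gc


def label_in_batches(batches, label_batch):
    """Apply label_batch to each batch and free memory between cycles."""
    results = []
    for batch in batches:
        frame = list(batch)  # placeholder for loading a large data frame
        results.append(label_batch(frame))
        del frame            # drop the reference before the next cycle
        gc.collect()         # request an explicit collection pass
    return results


# Toy usage: the "labelling" function simply counts the rows in each batch.
counts = label_in_batches([[1, 2, 3], [4, 5]], len)
print(counts)
```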
                <p>Because of these delays, the toolset was replaced with 
                    <italic toggle="yes">Label Studio 1.17.0</italic> to complete the data-processing stage. The service&#x2019;s web interface was modified, including adjustments to waveform height, playback controls, and default zoom. Annotation was conducted in 
                    <italic toggle="yes">Chromium 136.0.7103.113.</italic> All audio files and intonation labels were uploaded to a project in 
                    <italic toggle="yes">Label Studio.</italic> The main features of this annotation process are shown in 
                    <xref ref-type="fig" rid="f10">
Figure 10</xref>. The new environment reproduced the stages formerly executed in 

                    <italic toggle="yes">JupyterLab</italic>, reducing the average labelling time to 1&#x00a0;hour per file as resource monitoring was not required.</p>
                <fig fig-type="figure" id="f10" orientation="portrait" position="float">
                    <label>
Figure 10. </label>
                    <caption>
                        <title>

                            <italic toggle="yes">Label Studio</italic> annotation process.</title>
                    </caption>
                    <graphic id="gr10" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure10.gif"/>
                </fig>
            </sec>
            <sec id="sec6">
                <title>Model selection stage</title>
                <p>Having processed the raw audio into an annotated corpus of 882 manually labelled audio clips with durations from 0.5&#x00a0;seconds to 4&#x00a0;seconds, the researchers could shift from data processing to model selection. The objective of this stage was to determine which machine learning approach could replicate manual labelling.</p>
                <p>To facilitate comparison, the algorithms were grouped into two categories &#x2013; classical models and deep-learning models. Classical models refer to methods that rely on algorithms other than neural networks, such as 
                    <italic toggle="yes">k-nearest-neighbours</italic>, 
                    <italic toggle="yes">support-vector machines</italic>, and single-layer perceptron variants. Deep-learning models rely on multi-layer neural architectures. Such grouping allowed us to determine whether observed accuracy gains were a consequence of the learning architecture or the set of descriptors supplied to the models.</p>
                <p>Model generation and testing were automated to perform a hyper-parameter grid search and log performance metrics for each configuration. Each run recorded the model ID, the hyper-parameter configuration, and the accuracy score. Experiments were conducted in 
                    <italic toggle="yes">JupyterLab</italic> with a 
                    <italic toggle="yes">tqdm</italic> progress bar on an 
                    <italic toggle="yes">AMD Ryzen 7 5700G</italic> processor with 8 cores, 16 threads, and a 4.6&#x00a0;GHz turbo frequency.</p>
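                <p>A grid search of this kind can be sketched with the standard library alone. The example below enumerates every combination of a small grid and records a model ID, the configuration, and an accuracy score for each run, as described above. The grid values, the <italic toggle="yes">run_grid</italic> function, and the dummy scoring function are assumptions made for illustration; they do not reproduce the ranges or the training code used in the study.</p>

```python
from itertools import product


def run_grid(model_name, grid, evaluate):
    """Evaluate every hyper-parameter combination and log one record per run."""
    names = sorted(grid)
    log = []
    combinations = product(*(grid[name] for name in names))
    for run_number, values in enumerate(combinations, start=1):
        params = dict(zip(names, values))
        log.append({
            "model_id": f"{model_name}-{run_number}",
            "params": params,
            "accuracy": evaluate(params),
        })
    return log


# Hypothetical grid; the scorer is a dummy that would normally train a model
# with the given parameters and return its accuracy on held-out data.
grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}
log = run_grid("knn", grid, evaluate=lambda params: 0.0)
best = max(log, key=lambda record: record["accuracy"])
print(len(log), best["model_id"])
```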
                <p>All classical models were trained in 
                    <italic toggle="yes">scikit-learn 1.6.1</italic> using four low-level acoustic descriptors: mel-spectrogram amplitudes, mel-frequency cepstral coefficients (MFCCs), spectral slope, and zero-crossing rate. 
                    <italic toggle="yes">GaussianNB</italic> was selected as the baseline algorithm because of its minimal hyper-parameter set. It produced the lowest pilot accuracy, reaching 0.13208. In total, 4909 classical configurations were evaluated: 
                    <italic toggle="yes">MLPClassifier</italic> (4332), 
                    <italic toggle="yes">KNeighborsClassifier</italic> (380), 
                    <italic toggle="yes">RandomForestClassifier</italic> (150), 
                    <italic toggle="yes">LogisticRegression</italic> (36), 
                    <italic toggle="yes">SVC</italic> (10), and 
                    <italic toggle="yes">GaussianNB</italic> (1). An overview is provided in 
                    <xref ref-type="table" rid="T2">
Table 2</xref>.</p>
                <table-wrap id="T2" orientation="portrait" position="float">
                    <label>
Table 2. </label>
                    <caption>
                        <title>Hyperparameters and their tested ranges for classical machine learning classifiers in 
                            <italic toggle="yes">scikit-learn.</italic>
</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Model (scikit-learn class)</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Hyper-parameter
</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Values/Range</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">GaussianNB</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">&#x2013;</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">&#x2013;</td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="3" valign="top">KNeighborsClassifier</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <italic toggle="yes">n_neighbors</italic>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">5&#x2013;100</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <italic toggle="yes">weights</italic>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">uniform, distance</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <italic toggle="yes">p</italic> (Minkowski metric)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1, 2</td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="2" valign="top">Support-Vector Classifier (SVC)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <italic toggle="yes">
decision_function_shape
</italic>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">ovo, ovr</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <italic toggle="yes">kernel</italic>
</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">linear, poly, rbf, sigmoid, precomputed</td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="3" valign="top">LogisticRegression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">penalty</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">l1, l2, elasticnet</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">solver</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">lbfgs, liblinear, newton-cg, newton-cholesky, sag, saga</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">multi_class</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">multinomial, ovr</td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="3" valign="top">RandomForestClassifier</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">n_estimators</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">100&#x2013;1000 (step 100)</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">max_depth</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">10&#x2013;50 (step 10)</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">criterion</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">gini, entropy, log_loss</td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="4" valign="top">Multilayer Perceptron (MLPClassifier)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">solver</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">adam, lbfgs, sgd</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">max_iter</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2000</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">alpha</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.00001, 0.0001, 0.001, 0.01</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">hidden_layer_sizes
</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2 layers; 10&#x2013;100 (step 5)</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
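                <p>The configuration counts reported above can be reproduced from the ranges in Table 2. The sketch below is only an arithmetic check under stated assumptions: that n_neighbors covers the integers 5 to 99 (95 values), that each of the two hidden layers takes the sizes 10 to 100 in steps of 5 (19 values), and that every row is combined as a full Cartesian product. These readings match the reported totals but are not confirmed by the text, and the product counts combinations regardless of whether scikit-learn accepts each one.</p>
                <code language="python">
```python
from math import prod

# Value lists read off Table 2; endpoint handling is an assumption (see lead-in).
grids = {
    "GaussianNB": {},
    "KNeighborsClassifier": {
        "n_neighbors": list(range(5, 100)),  # assumed 5..99, i.e. 95 values
        "weights": ["uniform", "distance"],
        "p": [1, 2],
    },
    "SVC": {
        "decision_function_shape": ["ovo", "ovr"],
        "kernel": ["linear", "poly", "rbf", "sigmoid", "precomputed"],
    },
    "LogisticRegression": {
        "penalty": ["l1", "l2", "elasticnet"],
        "solver": ["lbfgs", "liblinear", "newton-cg",
                   "newton-cholesky", "sag", "saga"],
        "multi_class": ["multinomial", "ovr"],
    },
    "RandomForestClassifier": {
        "n_estimators": list(range(100, 1001, 100)),
        "max_depth": list(range(10, 51, 10)),
        "criterion": ["gini", "entropy", "log_loss"],
    },
    "MLPClassifier": {
        "solver": ["adam", "lbfgs", "sgd"],
        "max_iter": [2000],
        "alpha": [0.00001, 0.0001, 0.001, 0.01],
        # Two hidden layers, each assumed to range over 10..100 in steps of 5.
        "hidden_layer_sizes": [
            (first, second)
            for first in range(10, 101, 5)
            for second in range(10, 101, 5)
        ],
    },
}

# Size of each grid is the product of the value-list lengths (1 for an empty grid).
sizes = {
    name: prod(len(values) for values in grid.values())
    for name, grid in grids.items()
}
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name}: {size}")
print("total:", total)
```
                </code>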
                <p>

                    <italic toggle="yes">GaussianNB</italic> yielded the lowest performance, with an accuracy of approximately 0.13. 
                    <italic toggle="yes">KNeighborsClassifier</italic> achieved a peak accuracy of 0.31638 across multiple configurations, with the value of 
                    <italic toggle="yes">n_neighbors</italic> being the primary driver of computational cost and performance variation. 
                    <italic toggle="yes">SVC</italic> produced the same accuracy (0.31638) under the 
                    <italic toggle="yes">poly</italic> and 
                    <italic toggle="yes">rbf</italic> kernels. 
                    <italic toggle="yes">LogisticRegression</italic> achieved an accuracy of 0.32203 when trained with the 
                    <italic toggle="yes">saga</italic> solver and the 
                    <italic toggle="yes">multinomial</italic> option. The strongest results were obtained with 
                    <italic toggle="yes">RandomForestClassifier</italic>. An accuracy of 0.37288 was achieved with 
                    <italic toggle="yes">n_estimators</italic>&#x00a0;=&#x00a0;600 and also with 
                    <italic toggle="yes">n_estimators</italic>&#x00a0;=&#x00a0;900 (using max_depth&#x00a0;=&#x00a0;10, criterion&#x00a0;=&#x00a0;&#x2018;gini&#x2019;). Since the larger forest incurs higher computational costs without improving performance, the 600-tree configuration is preferred. The 
                    <italic toggle="yes">MLPClassifier</italic> reached an accuracy of 0.36158 with 
                    <italic toggle="yes">solver</italic>&#x00a0;=&#x00a0;
                    <italic toggle="yes">lbfgs</italic>, 
                    <italic toggle="yes">alpha</italic>&#x00a0;=&#x00a0;0.01, and a 2-layer topology of (50, 10) neurons. Experiments at the largest hidden-layer sizes saturated all CPU threads on the hardware which limited stability. In general, no classical model exceeded 0,38 accuracy. Feature-only approaches proved insufficient for accurately capturing prosodic markings. Consequently, the experiment was shifted to convolutional neural networks (CNNs), which can automatically learn relevant features from raw data.</p>
                <p>The CNN developed for this experiment was implemented in PyTorch as a subclass of 
                    <italic toggle="yes">nn.Module</italic> (
                    <xref ref-type="fig" rid="f11">
Figure 11</xref>). The network comprises up to five identical blocks, each applying a 5&#x00d7;5 filter to the mel-spectrogram, normalizing the output, and then passing the result through a non-linear activation function, either ReLU or LogSigmoid. After the last block, the network compresses the entire time-frequency representation into a vector and feeds this vector into a linear layer that produces 19 output scores, one for each intonation class. In this way, the CNN learns to transform a colored mel-spectrogram into a set of scores for the intonation labels.</p>
                <fig fig-type="figure" id="f11" orientation="portrait" position="float">
                    <label>
Figure 11. </label>
                    <caption>
                        <title>Code snippet defining a CNN.</title>
                    </caption>
                    <graphic id="gr11" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/197356/bbe22f99-3ba4-42d7-88c7-3ae4394b5933_figure11.gif"/>
                </fig>
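                <p>A network of the kind described above could look roughly like the following sketch. It is not the code shown in Figure 11. The class name IntonationCNN, the three input channels (assuming the colored spectrogram is an RGB image), the channel width of 16, batch normalisation as the normalising step, and global average pooling as the step that compresses the representation into a vector are all assumptions made for illustration.</p>
                <code language="python">
```python
import torch
from torch import nn


class IntonationCNN(nn.Module):
    """Stack of 1-5 conv blocks followed by a 19-way linear classifier."""

    def __init__(self, n_blocks=2, activation="relu",
                 in_channels=3, width=16, n_classes=19):
        super().__init__()
        if n_blocks not in range(1, 6):
            raise ValueError("n_blocks must be between 1 and 5")
        activations = {"relu": nn.ReLU, "logsigmoid": nn.LogSigmoid}
        act = activations[activation]

        layers = []
        channels = in_channels
        for _ in range(n_blocks):
            layers += [
                # 5x5 filter; padding=2 keeps the time-frequency size unchanged.
                nn.Conv2d(channels, width, kernel_size=5, padding=2),
                nn.BatchNorm2d(width),  # assumed normalisation layer
                act(),
            ]
            channels = width
        self.features = nn.Sequential(*layers)
        # Assumed compression step: average over the whole time-frequency plane.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(width, n_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)  # (batch, width)
        return self.classifier(x)    # (batch, n_classes) raw scores


model = IntonationCNN(n_blocks=2, activation="relu")
batch = torch.randn(4, 3, 64, 128)     # four synthetic spectrogram images
scores = model(batch)                  # one score per intonation class
probs = torch.softmax(scores, dim=1)   # scores normalised to probabilities
print(scores.shape)
```
                </code>
                <p>Global pooling makes the classifier independent of clip length, which may be convenient for clips of varying duration; flattening a fixed-size spectrogram would be an equally plausible reading of the description.</p>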
                <p>To evaluate the model&#x2019;s performance under different configurations, six key training parameters were systematically varied, as summarised in 
                    <xref ref-type="table" rid="T3">
Table 3</xref>.</p>
                <table-wrap id="T3" orientation="portrait" position="float">
                    <label>
Table 3. </label>
                    <caption>
                        <title>Hyperparameter grid defining the range of values explored during CNN training.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Hyper-parameter
</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Values tested</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Activation function</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">ReLU, LogSigmoid</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loss function</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">MSELoss, CrossEntropyLoss</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Optimiser (update rule)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">SGD, Adam</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Learning rate</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.1, 0.01, 0.001</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Number of convolutional blocks</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1&#x2013;5</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Training epochs</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">5&#x2013;100 (in steps of 5)</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>Systematically varying these hyper-parameters produced 120 distinct CNN configurations. Each configuration was trained and evaluated under identical conditions to isolate the factors contributing to accuracy improvements, such as network depth, choice of optimiser, and duration of training. As shown in 
                    <xref ref-type="table" rid="T4">
Table 4</xref>, the highest accuracy of 0.45455 was achieved by two configurations, both employing the Adam optimiser but differing in activation function, loss function (MSELoss versus CrossEntropyLoss), learning rate, number of convolutional blocks, and number of training epochs. The accuracy margin between these top-performing models and the next-best configurations exceeded three percentage points, suggesting that model depth and learning rate strongly influence performance.</p>
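                <p>The figure of 120 configurations can be related to the ranges in Table 3 with a short calculation. Multiplying all six rows gives 2400 combinations, whereas the first five rows alone give 120. One reading consistent with the reported number is that the epoch count was tracked within each training run rather than treated as a separate grid dimension; this is an assumption and is not stated in the text.</p>
                <code language="python">
```python
from math import prod

# Values as listed in Table 3.
cnn_grid = {
    "activation": ["ReLU", "LogSigmoid"],
    "loss": ["MSELoss", "CrossEntropyLoss"],
    "optimiser": ["SGD", "Adam"],
    "learning_rate": [0.1, 0.01, 0.001],
    "n_blocks": [1, 2, 3, 4, 5],
    "epochs": list(range(5, 101, 5)),  # 5, 10, ..., 100
}

# Full Cartesian product over all six hyper-parameters.
full = prod(len(values) for values in cnn_grid.values())

# Product over everything except the epoch count.
without_epochs = prod(
    len(values) for name, values in cnn_grid.items() if name != "epochs"
)

print(full, without_epochs)  # 2400 120
```
                </code>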
                <table-wrap id="T4" orientation="portrait" position="float">
                    <label>
Table 4. </label>
                    <caption>
                        <title>Hyper-parameters of the two best-scoring CNNs.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Hyper-parameter
</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">ReLU/Adam</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">LogSigmoid/Adam</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Learning rate</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.001</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.100</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Number of convolutional blocks</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Epochs</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">25</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">70</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Accuracy</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.45455</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.45455</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>In summary, the CNN applied to the audio data served as the deep-learning approach in this comparison. Although it outperformed the classical, feature-based classifiers, even the best-performing CNN configurations did not achieve sufficient accuracy for reliable automatic intonation marking.</p>
            </sec>
        </sec>
        <sec id="sec7" sec-type="discussion|conclusions">
            <title>Discussion and conclusions</title>
            <p>Mastering pronunciation remains an important aspect of foreign language learning because it helps speakers convey shades of meaning and serves as a strong non-verbal cue for comprehension. Modern AI-powered tools are often capable of serving as a pronunciation model and providing considerable assistance in analysing prosodic aspects of utterances produced by learners. However, such functionality does not suffice for the more profound analysis required by professional linguists or by students who are training to be phoneticians.</p>
            <p>The project described here aims to create a platform that uses machine learning to generate prosodic analysis automatically, extending beyond individual sound segments to suprasegmental aspects of prosody, including rhythm, intonation, and phrasal stress. The experiment involved training and evaluating two types of models: classical machine-learning classifiers and a deep convolutional neural network (CNN). All models were trained on 882 annotated audio fragments containing prosodic information. The classical models relied on manually engineered prosodic features, requiring detailed contour statistics derived from the speech signal to represent intonation patterns. In contrast, the CNN operated directly on mel-spectrograms, learning relevant features automatically from the audio data. To improve generalization, the CNN was trained on an augmented dataset that included pitch shifts, noise overlays, and time-frequency masks. Both model types were trained and validated under identical conditions to compare their effectiveness in automatic intonation marking.</p>
            <p>The experimental results showed that the CNN model achieved an accuracy of 0.45455, while the best classical alternative, 
                <italic toggle="yes">RandomForestClassifier</italic>, reached 0.37288. This represents an approximate 22% relative improvement in classification accuracy, confirming the advantage of a deeper architecture. However, despite this gain, the performance remains insufficient for reliable automatic intonation marking.</p>
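            <p>The relative improvement quoted above follows directly from the two reported accuracies, as the short calculation below shows. The function name relative_gain is illustrative only.</p>
            <code language="python">
```python
def relative_gain(baseline, improved):
    """Improvement of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline


best_classical = 0.37288  # RandomForestClassifier
best_cnn = 0.45455        # best CNN configurations

absolute = best_cnn - best_classical
relative = relative_gain(best_classical, best_cnn)

print(f"{absolute * 100:.2f} percentage points, {relative:.1%} relative")
```
            </code>
            <p>The absolute gain is about 8.2 percentage points, which corresponds to roughly 21.9% of the random-forest baseline.</p>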
            <p>Overall, the transition from classical, feature-based classifiers to convolutional neural networks (CNNs) led to a measurable improvement in accuracy and demonstrated the potential of deep learning for prosodic analysis. However, despite outperforming traditional models, the CNN configurations tested in this study did not yet achieve sufficient reliability for practical intonation marking. These results highlight both the promise of deep-learning approaches and the need for further optimisation and data enrichment.</p>
            <p>In light of the identified prototype limitations, a roadmap for further improvement has been developed. The first stage of improvement will involve finalising an authenticated minimum viable product (MVP) capable of ingesting audio, storing it in MinIO, routing processing tasks through RabbitMQ, and displaying basic intonation labels in the single-page application (SPA). The subsequent development stages will introduce stress detection, tone-scale plotting, and script-based annotation, followed by the integration of rhythm analytics to enhance prosodic analysis capabilities. A further step will involve releasing a containerised version of the tool with user control features, PDF export, change-approval tracking, and learning management system (LMS) integration to support deployment and classroom use.</p>
            <p>An accompanying development task is to expand the training corpus. This involves the continuous addition of new audio-text pairs manually labelled with the same 19 intonation markers used in the experiment. For each text already included in the experimental dataset, multiple readings will be collected from new speakers. This procedure will introduce diversity in timbre and speech rate while preserving the prosodic pitch contour. The enlarged corpus is expected to provide the statistical power necessary to improve accuracy beyond the current level of approximately 0.45.</p>
            <p>Despite these planned improvements, several limitations of the current project should be acknowledged. Although the proposed tool incorporates additional functionality, it may be of interest only to a limited group of linguists and speech analysts. English language teachers might find it less appealing since the tonogram is not widely used in contemporary pedagogical practice; however, its potential use in linguistic research might lead to valuable observations. It might also find a place in language classrooms designed for philologists, interpreters, and language teachers.</p>
            <p>Moreover, the development of an automatic prosody detection tool aligns with the principles of the Sustainable Development Goals by enhancing access to quality education and promoting innovation and infrastructure. Such technology provides learners with equitable opportunities to receive consistent and personalized feedback on their speech, reducing barriers caused by limited instructional time or resources. By supporting effective language learning and communication skills, automatic prosody detection tools contribute to lifelong learning, employability, and social inclusion, thereby reinforcing broader commitments to sustainable and inclusive development in a digitally driven society.</p>
        </sec>
        <sec id="sec8">
            <title>Ethical statement</title>
            <p>This study was approved by the NUST MISIS Ethics Committee, and written informed consent was obtained from all participants prior to data collection. No reference number was assigned.</p>
        </sec>
        <sec id="sec9">
            <title>Software availability</title>
            <p>Source code available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/DoctorKutuzov/ProjectLinguist">https://github.com/DoctorKutuzov/ProjectLinguist</ext-link>; 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/DoctorKutuzov/project-linguist-jupyter">https://github.com/DoctorKutuzov/project-linguist-jupyter</ext-link>; 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/DoctorKutuzov/project-linguist-api-gateway">https://github.com/DoctorKutuzov/project-linguist-api-gateway
</ext-link>
            </p>
            <p>
Archived software available from: 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.19186923">https://doi.org/10.5281/zenodo.19186923</ext-link>; 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.19186941">https://doi.org/10.5281/zenodo.19186941</ext-link>; 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.19186929">https://doi.org/10.5281/zenodo.19186929</ext-link>
            </p>
            <p>License: MIT License.</p>
            <p>The GitHub repository associated with this study provides access to the codebase developed to date. However, the software as a complete standalone tool is not yet finalized, since development is still in progress. As noted in the article, the present study describes the current stage of development rather than a fully completed software product.</p>
        </sec>
    </body>
    <back>
        <sec id="sec13" sec-type="data-availability">
            <title>Data availability statement</title>
            <p>No underlying data are associated with this article.</p>
        </sec>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Almusharraf</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>Pronunciation instruction in the context of world English: Exploring university EFL instructors&#x2019; perceptions and practices.</article-title>
                    <source>

                        <italic toggle="yes">Humanit Soc Sci Commun.</italic>
</source>
                    <year>2024</year>;<volume>11</volume>:<fpage>847</fpage>.
                    <pub-id pub-id-type="doi">10.1057/s41599-024-03365-y</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Antipova</surname>
                            <given-names>EY</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kanevskaya</surname>
                            <given-names>SL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Pigulevskaya</surname>
                            <given-names>GA</given-names>
                        </name>
</person-group>:
                    <source>

                        <italic toggle="yes">Posobie po angliiskoy intonatsii (na angliiskom yazyke). English Intonation Manual (in English).</italic>
</source>
                    <publisher-name>Prosveshchenie</publisher-name>;<year>1985</year>.</mixed-citation>
            </ref>
            <ref id="ref3">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Baratta</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Halenko</surname>
                            <given-names>N</given-names>
                        </name>
</person-group>:
                    <article-title>Attitudes toward regional British accents in EFL teaching: Student and teacher perspectives.</article-title>
                    <source>

                        <italic toggle="yes">Linguist. Educ.</italic>
</source>
                    <year>2022</year>;<volume>67</volume>:<fpage>101018</fpage>.
                    <pub-id pub-id-type="doi">10.1016/j.linged.2022.101018</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Boersma</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Weenink</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <article-title>PRAAT: Doing Phonetics by Computer (Version 6.4.34) [Computer software].</article-title>
                    <year>2025</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.fon.hum.uva.nl/praat/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Boersma</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Benders</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Seinhorst</surname>
                            <given-names>K</given-names>
                        </name>
</person-group>:
                    <article-title>Neural network models for phonology and phonetics.</article-title>
                    <source>

                        <italic toggle="yes">J Lang Model.</italic>
</source>
                    <year>2020</year>;<volume>8</volume>(<issue>1</issue>):<fpage>103</fpage>&#x2013;<lpage>177</lpage>.
                    <pub-id pub-id-type="doi">10.15398/jlm.v8i1.224</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>B&#x00f8;hn</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hansen</surname>
                            <given-names>T</given-names>
                        </name>
</person-group>:
                    <article-title>Assessing pronunciation in an EFL context: Teachers&#x2019; orientations towards nativeness and intelligibility.</article-title>
                    <source>

                        <italic toggle="yes">Lang. Assess. Q.</italic>
</source>
                    <year>2017</year>;<volume>14</volume>(<issue>1</issue>):<fpage>54</fpage>&#x2013;<lpage>68</lpage>.
                    <pub-id pub-id-type="doi">10.1080/15434303.2016.1256407</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref7">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <source>

                        <italic toggle="yes">The C4 Model for Visualising Software Architecture.</italic>
</source>
                    <publisher-name>O'Reilly Media, Inc</publisher-name>;<year>2023</year>.</mixed-citation>
            </ref>
            <ref id="ref8">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <article-title>Computer-aided feedback on the pronunciation of Mandarin Chinese tones: Using Praat to promote multimedia foreign language learning.</article-title>
                    <source>

                        <italic toggle="yes">Comput. Assist. Lang. Learn.</italic>
</source>
                    <year>2022</year>;<volume>37</volume>(<issue>3</issue>):<fpage>363</fpage>&#x2013;<lpage>388</lpage>.
                    <pub-id pub-id-type="doi">10.1080/09588221.2022.2037652</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <mixed-citation publication-type="book">
                    <collab>Council of Europe</collab>:
                    <source>

                        <italic toggle="yes">Common European Framework of Reference for Languages: Learning, teaching, assessment &#x2013; Companion volume.</italic>
</source>
                    <publisher-name>Council of Europe Publishing</publisher-name>;<year>2020</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.coe.int/lang-cefr">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Couper</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <article-title>Pronunciation teaching issues: Answering teachers&#x2019; questions.</article-title>
                    <source>

                        <italic toggle="yes">RELC J.</italic>
</source>
                    <year>2021</year>;<volume>52</volume>(<issue>1</issue>):<fpage>128</fpage>&#x2013;<lpage>143</lpage>.
                    <pub-id pub-id-type="doi">10.1177/0033688220964041</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jong</surname>
                            <given-names>NH</given-names>
                            <prefix>de</prefix>
                        </name>

                        <name name-style="western">
                            <surname>Pacilly</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Heeren</surname>
                            <given-names>W</given-names>
                        </name>
</person-group>:
                    <article-title>PRAAT scripts to measure speed fluency and breakdown fluency in speech automatically.</article-title>
                    <source>

                        <italic toggle="yes">Assessment in Education: Principles, Policy &amp; Practice.</italic>
</source>
                    <year>2021</year>;<volume>28</volume>(<issue>4</issue>):<fpage>456</fpage>&#x2013;<lpage>476</lpage>.
                    <pub-id pub-id-type="doi">10.1080/0969594X.2021.1951162</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Dennis</surname>
                            <given-names>NK</given-names>
                        </name>
</person-group>:
                    <article-title>Using AI-Powered Speech Recognition Technology to Improve English Pronunciation and Speaking Skills.</article-title>
                    <source>

                        <italic toggle="yes">IAFOR J Educ.</italic>
</source>
                    <year>2024</year>;<volume>12</volume>(<issue>2</issue>):<fpage>107</fpage>&#x2013;<lpage>126</lpage>.
                    <pub-id pub-id-type="doi">10.22492/ije.12.2.05</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Dillon</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wells</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <article-title>Effects of pronunciation training using automatic speech recognition on pronunciation accuracy of Korean English language learners.</article-title>
                    <source>

                        <italic toggle="yes">Engl Teach.</italic>
</source>
                    <year>2023</year>;<volume>78</volume>(<issue>1</issue>):<fpage>3</fpage>&#x2013;<lpage>23</lpage>.
                    <pub-id pub-id-type="doi">10.15858/engtea.78.1.202303.3</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>El-Garawany</surname>
                            <given-names>MSM</given-names>
                        </name>
</person-group>:
                    <article-title>Using Praat to develop English majors' EFL intonation production.</article-title>
                    <source>

                        <italic toggle="yes">&#x0627;&#x0644;&#x0645;&#x062c;&#x0644;&#x0629; &#x0627;&#x0644;&#x062a;&#x0631;&#x0628;&#x0648;&#x064a;&#x0629; &#x0644;&#x06a9;&#x0644;&#x064a;&#x0629; &#x0627;&#x0644;&#x062a;&#x0631;&#x0628;&#x064a;&#x0629; &#x0628;&#x0633;&#x0648;&#x0647;&#x0627;&#x062c;.</italic>
</source>
                    <year>2021</year>;<volume>92</volume>:<fpage>91</fpage>&#x2013;<lpage>125</lpage>.
                    <pub-id pub-id-type="doi">10.21608/edusohag.2021.208708</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Elnagar</surname>
                            <given-names>BMA</given-names>
                        </name>
</person-group>:
                    <article-title>An Investigation of Instructors' Approaches in Teaching Pronunciation: A Case Study.</article-title>
                    <source>

                        <italic toggle="yes">Engl. Lang. Teach.</italic>
</source>
                    <year>2020</year>;<volume>13</volume>(<issue>8</issue>):<fpage>185</fpage>&#x2013;<lpage>199</lpage>.
                    <pub-id pub-id-type="doi">10.5539/elt.v13n8p185</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <mixed-citation publication-type="other">
                    <collab>ELSA</collab>:
                    <article-title>Elsa Speak. [Speech Analyzer].</article-title>
                    <year>2025</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://speechanalyzer.elsaspeak.com/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Foote</surname>
                            <given-names>JA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Trofimovich</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Collins</surname>
                            <given-names>L</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Pronunciation teaching practices in communicative second language classes.</article-title>
                    <source>

                        <italic toggle="yes">Lang. Learn. J.</italic>
</source>
                    <year>2016</year>;<volume>44</volume>(<issue>2</issue>):<fpage>181</fpage>&#x2013;<lpage>196</lpage>.
                    <pub-id pub-id-type="doi">10.1080/09571736.2013.784345</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>G&#x00f6;la&#x00e7;</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>G&#x00fc;la&#x00e7;t&#x0131;</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Atal&#x0131;k</surname>
                            <given-names>G</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>What do the voice-related parameters tell us? The multiparametric index scores, cepstral-based methods, patient-reported outcomes, and durational measurements.</article-title>
                    <source>

                        <italic toggle="yes">Eur. Arch. Otorhinolaryngol.</italic>
</source>
                    <year>2025</year>;<volume>282</volume>:<fpage>1355</fpage>&#x2013;<lpage>1365</lpage>.
                    <pub-id pub-id-type="pmid">39828788</pub-id>
                    <pub-id pub-id-type="doi">10.1007/s00405-024-09192-w</pub-id>
                    <pub-id pub-id-type="pmcid">PMC11890346</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gordon</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>Implementing explicit pronunciation instruction: The case of a nonnative English-speaking teacher.</article-title>
                    <source>

                        <italic toggle="yes">Lang. Teach. Res.</italic>
</source>
                    <year>2023</year>;<volume>27</volume>(<issue>3</issue>):<fpage>718</fpage>&#x2013;<lpage>745</lpage>.
                    <pub-id pub-id-type="doi">10.1177/1362168820941991</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref20">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Guo</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>An empirical study on English phonetic teaching reform based on Praat phonetic software.</article-title>
                    <source>

                        <italic toggle="yes">Int J Cogn Inf Nat Intell.</italic>
</source>
                    <year>2025</year>;<volume>19</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>16</lpage>.
                    <pub-id pub-id-type="doi">10.4018/IJCINI.368243</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref21">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jadoul</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>De Boer</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ravignani</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>Parselmouth for bioacoustics: automated acoustic analysis in Python.</article-title>
                    <source>

                        <italic toggle="yes">Bioacoustics.</italic>
</source>
                    <year>2024</year>;<volume>33</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>19</lpage>.
                    <pub-id pub-id-type="doi">10.1080/09524622.2023.2259327</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref22">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jeong</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lindemann</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <article-title>Beyond ideologies of nativeness in the intelligibility principle for L2 English pronunciation: A corpus-supported review.</article-title>
                    <source>

                        <italic toggle="yes">System.</italic>
</source>
                    <year>2025</year>;<volume>129</volume>:<fpage>103599</fpage>.
                    <pub-id pub-id-type="doi">10.1016/j.system.2025.103599</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref23">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Karia</surname>
                            <given-names>MK</given-names>
                        </name>
</person-group>:
                    <chapter-title>Using acoustic phonetics in the assessment and treatment of speech disorders.</chapter-title>
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>L&#x00fc;dtke</surname>
                            <given-names>UM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kija</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Karia</surname>
                            <given-names>MK</given-names>
                        </name>
</person-group>, editors.
                    <source>

                        <italic toggle="yes">Handbook of Speech-Language Therapy in Sub-Saharan Africa.</italic>
</source>
                    <publisher-loc>Cham</publisher-loc>:
                    <publisher-name>Springer</publisher-name>;<year>2023</year>.
                    <pub-id pub-id-type="doi">10.1007/978-3-031-04504-2_20</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref24">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Larassati</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Setyaningsih</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Suryaningtyas</surname>
                            <given-names>VW</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Using Praat for EFL English pronunciation class: Defining the errors of question tags intonation.</article-title>
                    <source>

                        <italic toggle="yes">Lang Circle J Lang Lit.</italic>
</source>
                    <year>2022</year>;<volume>16</volume>(<issue>2</issue>):<fpage>245</fpage>&#x2013;<lpage>254</lpage>.
                    <pub-id pub-id-type="doi">10.15294/lc.v16i2.34393</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref25">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Levis</surname>
                            <given-names>JM</given-names>
                        </name>
</person-group>:
                    <chapter-title>Teaching pronunciation: Truths and lies.</chapter-title>
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>Bardel</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hedman</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rejman</surname>
                            <given-names>K</given-names>
                        </name>

                        <etal/>
</person-group>, editors.
                    <source>

                        <italic toggle="yes">Exploring Language Education: Global and local perspectives.</italic>
</source>
                    <publisher-name>Stockholm University Press</publisher-name>;<year>2022</year>; pp.<fpage>39</fpage>&#x2013;<lpage>72</lpage>.
                    <pub-id pub-id-type="doi">10.16993/bbz.c</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref26">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Lobato</surname>
                            <given-names>THG</given-names>
                        </name>
</person-group>:
                    <article-title>AI Pronunciation Trainer. [Speech analyzer].</article-title>
                    <year>2025</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://aipronunciationtr.com/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref27">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Low</surname>
                            <given-names>EL</given-names>
                        </name>
</person-group>:
                    <article-title>EIL pronunciation research and practice: Issues, challenges, and future directions.</article-title>
                    <source>

                        <italic toggle="yes">RELC J.</italic>
</source>
                    <year>2021</year>;<volume>52</volume>(<issue>1</issue>):<fpage>22</fpage>&#x2013;<lpage>34</lpage>.
                    <pub-id pub-id-type="doi">10.1177/0033688220987318</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref28">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ma</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Henrichsen</surname>
                            <given-names>LE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cox</surname>
                            <given-names>TL</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Pronunciation&#x2019;s role in English speaking-proficiency ratings.</article-title>
                    <source>

                        <italic toggle="yes">Journal of Second Language Pronunciation.</italic>
</source>
                    <year>2018</year>;<volume>4</volume>(<issue>1</issue>):<fpage>73</fpage>&#x2013;<lpage>102</lpage>.
                    <pub-id pub-id-type="doi">10.1075/jslp.00004.ma</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref29">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Mitrofanova</surname>
                            <given-names>Y</given-names>
                        </name>
</person-group>:
                    <article-title>Raising EFL students&#x2019; awareness of English intonation functioning.</article-title>
                    <source>

                        <italic toggle="yes">Lang. Aware.</italic>
</source>
                    <year>2012</year>;<volume>21</volume>(<issue>3</issue>):<fpage>279</fpage>&#x2013;<lpage>291</lpage>.
                    <pub-id pub-id-type="doi">10.1080/09658416.2011.609621</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref30">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Mompean</surname>
                            <given-names>JA</given-names>
                        </name>
</person-group>:
                    <article-title>ChatGPT for L2 pronunciation teaching and learning.</article-title>
                    <source>

                        <italic toggle="yes">ELT J.</italic>
</source>
                    <year>2024</year>;<volume>78</volume>(<issue>4</issue>):<fpage>423</fpage>&#x2013;<lpage>434</lpage>.
                    <pub-id pub-id-type="doi">10.1093/elt/ccae050</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref31">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Newman</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <source>

                        <italic toggle="yes">Building microservices: Designing fine-grained systems.</italic>
</source>
                    <publisher-name>O'Reilly Media, Inc.</publisher-name>;<year>2021</year>.</mixed-citation>
            </ref>
            <ref id="ref32">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Osatananda</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Thinchan</surname>
                            <given-names>W</given-names>
                        </name>
</person-group>:
                    <article-title>Using Praat for English pronunciation self-practice outside the classroom: Strengths, weaknesses, and its application.</article-title>
                    <source>

                        <italic toggle="yes">Learn J Lang Educ Acquis Res Netw.</italic>
</source>
                    <year>2021</year>;<volume>14</volume>(<issue>2</issue>):<fpage>372</fpage>&#x2013;<lpage>396</lpage>.</mixed-citation>
            </ref>
            <ref id="ref33">
                <mixed-citation publication-type="other">
                    <collab>Pearson Education Inc</collab>:
                    <article-title>Versant professional English test.</article-title>
                    <source>

                        <italic toggle="yes">Test description and validation summary.</italic>
</source>
                    <year>2022</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://www.pearson.com/content/dam/one-dot-com/one-dot-com/pearson-languages/en-gb/pdfs/versant-resources/versant-professional-english-test-description-validation-summary.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref34">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pennington</surname>
                            <given-names>MC</given-names>
                        </name>
</person-group>:
                    <article-title>Teaching Pronunciation: The State of the Art 2021.</article-title>
                    <source>

                        <italic toggle="yes">RELC J.</italic>
</source>
                    <year>2021</year>;<volume>52</volume>(<issue>1</issue>):<fpage>3</fpage>&#x2013;<lpage>21</lpage>.
                    <pub-id pub-id-type="doi">10.1177/00336882211002283</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref35">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pitychoutis</surname>
                            <given-names>KM</given-names>
                        </name>
</person-group>:
                    <article-title>Pronunciation pedagogy revisited: Voices from Omani B. Ed. students.</article-title>
                    <source>

                        <italic toggle="yes">J Lang Teach Res.</italic>
</source>
                    <year>2024</year>;<volume>15</volume>(<issue>4</issue>):<fpage>1039</fpage>&#x2013;<lpage>1050</lpage>.
                    <pub-id pub-id-type="doi">10.17507/jltr.1504.02</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref36">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rogerson-Revell</surname>
                            <given-names>PM</given-names>
                        </name>
</person-group>:
                    <article-title>Computer-assisted pronunciation training (CAPT): Current issues and future directions.</article-title>
                    <source>

                        <italic toggle="yes">RELC J.</italic>
</source>
                    <year>2021</year>;<volume>52</volume>(<issue>1</issue>):<fpage>189</fpage>&#x2013;<lpage>205</lpage>.
                    <pub-id pub-id-type="doi">10.1177/0033688220977406</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref37">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sonkaya</surname>
                            <given-names>ZZ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>&#x00d6;zturk</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sonkaya</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Using objective speech analysis techniques for the clinical diagnosis and assessment of speech disorders in patients with multiple sclerosis.</article-title>
                    <source>

                        <italic toggle="yes">Brain Sci.</italic>
</source>
                    <year>2024</year>;<volume>14</volume>(<issue>4</issue>):<fpage>384</fpage>.
                    <pub-id pub-id-type="doi">10.3390/brainsci14040384</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref38">
                <mixed-citation publication-type="other">
                    <collab>Speechace</collab>:
                    <article-title>Speaking Assessments. [Speech recognition platform].</article-title>
                    <year>n.d</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://speak.speechace.co/placement/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref39">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Stratton</surname>
                            <given-names>JM</given-names>
                        </name>
</person-group>:
                    <article-title>Explicit pronunciation instruction in the second language classroom: An acoustic analysis of German final devoicing.</article-title>
                    <source>

                        <italic toggle="yes">J Second Lang Pronunc.</italic>
</source>
                    <year>2023</year>;<volume>9</volume>(<issue>1</issue>):<fpage>71</fpage>&#x2013;<lpage>102</lpage>.
                    <pub-id pub-id-type="doi">10.1075/jslp.22038.str</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref40">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sonsaat-Hegelheimer</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kurt</surname>
                            <given-names>&#x015e;</given-names>
                        </name>
</person-group>:
                    <article-title>The impact of generative AI-powered chatbots on L2 comprehensibility.</article-title>
                    <source>

                        <italic toggle="yes">J Second Lang Pronunc.</italic>
</source>
                    <year>2024</year>;<volume>10</volume>(<issue>3</issue>):<fpage>339</fpage>&#x2013;<lpage>374</lpage>.
                    <pub-id pub-id-type="doi">10.1075/jslp.24053.son</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref41">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Topal</surname>
                            <given-names>IH</given-names>
                        </name>
</person-group>:
                    <article-title>An edusemiotic approach to teaching intonation in the context of English language teacher education.</article-title>
                    <source>

                        <italic toggle="yes">Semiotica.</italic>
</source>
                    <year>2024</year>;<volume>2024</volume>(<issue>259</issue>):<fpage>185</fpage>&#x2013;<lpage>216</lpage>.
                    <pub-id pub-id-type="doi">10.1515/sem-2023-0203</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref42">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Vereninova</surname>
                            <given-names>ZB</given-names>
                        </name>
</person-group>:
                    <article-title>Shaping phonetic competencies in the first year of a linguistic university: instructional techniques and recommendations.</article-title>
                    <source>

                        <italic toggle="yes">Vestnik of Moscow State Linguistic University.</italic>
</source>
                    <year>2011</year>;<volume>1</volume>(<issue>607</issue>):<fpage>64</fpage>&#x2013;<lpage>75</lpage>. (In Russian.)</mixed-citation>
            </ref>
            <ref id="ref43">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>X</given-names>
                        </name>
</person-group>:
                    <article-title>Experimental research on using Praat software to assist English phonetic teaching.</article-title>
                    <source>

                        <italic toggle="yes">Int J Mechatron Appl Mech.</italic>
</source>
                    <year>2024</year>;<volume>18</volume>:<fpage>113</fpage>&#x2013;<lpage>117</lpage>.
                    <pub-id pub-id-type="doi">10.17683/ijomam/issue18.13</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref44">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>Y</given-names>
                        </name>
</person-group>:
                    <chapter-title>Intelligent acquisition method of English online voice teaching information based on Praat system.</chapter-title>
                    <source>

                        <italic toggle="yes">2021 Global Reliability and Prognostics and Health Management.</italic>
</source>
                    <publisher-name>IEEE</publisher-name>;<year>2021</year>; pp.<fpage>1</fpage>&#x2013;<lpage>6</lpage>.
                    <pub-id pub-id-type="doi">10.1109/PHM-Nanjing52125.2021.9613066</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref45">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Yang</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zhao</surname>
                            <given-names>X</given-names>
                        </name>
</person-group>:
                    <article-title>Research on the function of visual phonetic software Praat in vocational English phonetics teaching.</article-title>
                    <source>

                        <italic toggle="yes">J. Phys. Conf. Ser.</italic>
</source>
                    <year>2021</year>;<volume>1856</volume>(<issue>1</issue>):<fpage>012057</fpage>.
                    <pub-id pub-id-type="doi">10.1088/1742-6596/1856/1/012057</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref46">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Zeng</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Huang</surname>
                            <given-names>Y</given-names>
                        </name>
</person-group>:
                    <chapter-title>On the effectiveness of visual speech software Praat in English pronunciation teaching research.</chapter-title>
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>Atiquzzaman</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yen</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Xu</surname>
                            <given-names>Z</given-names>
                        </name>
</person-group>, editors.
                    <source>

                        <italic toggle="yes">Lecture Notes on Data Engineering and Communications Technologies.</italic>
</source>
                    <publisher-name>Springer</publisher-name>;<year>2025</year>;<volume>235</volume>.
                    <pub-id pub-id-type="doi">10.1007/978-981-96-0211-7_1</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report488319">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.197356.r488319</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Panhwar</surname>
                        <given-names>Abdul Hameed</given-names>
                    </name>
                    <xref ref-type="aff" rid="r488319a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-8528-7335</uri>
                </contrib>
                <aff id="r488319a1">
                    <label>1</label>University of Sindh, Hyderabad, Pakistan</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>12</day>
                <month>6</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Panhwar AH</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport488319" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.178913.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The paper appears to be well organised and potentially research-oriented. However, in some places more evidence or references are required. For example, paragraph 2 of the 'Introduction' section offers no evidence or references for its claims. I suggest that the authors go through the paper and substantiate it with references or evidence where required.</p>
            <p>The paper also lacks clearly stated objectives and research questions or hypotheses. I suggest that the authors include these and build the study on them.</p>
            <p>Moreover, since I am not a qualified statistician, I cannot competently confirm the reliability of the data. I therefore suggest that the statistical analysis and findings be rechecked by a qualified statistician.</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Yes</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>I cannot comment. A qualified statistician is required.</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Yes</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>TESOL, Action Research, collaborative learning</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report488321">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.197356.r488321</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Marzuki</surname>
                        <given-names>Dony</given-names>
                    </name>
                    <xref ref-type="aff" rid="r488321a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r488321a1">
                    <label>1</label>Politeknik Negeri Padang, Padang, Indonesia</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>3</day>
                <month>6</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Marzuki D</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport488321" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.178913.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The paper describes a prototype machine-learning system for automatic prosody recognition as a tool for advanced phonetics teaching. The paper is relevant, organized and cites up-to-date research; however, its scientific quality is compromised by several methodological problems. The dataset is relatively small for a 19-class intonation classification task: it comprises only 47 recordings and 882 clips, and key information on participant characteristics, class distribution, train-test splits and validation procedures is missing. Annotation reliability is not reported. The statistical analysis relies largely on accuracy, without precision, recall, F1-scores or significance tests. Furthermore, the underlying dataset and annotations are not accessible, which prevents full replication. For the article to be published, the dataset should be fully documented, and measures of annotation reliability, comprehensive evaluation metrics, robust validation procedures and public access to the underlying data should be provided.</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Partly</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Partly</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>No</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Yes</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>EFL/ESL instruction, speech fluency, strategy training, autonomous learning</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
</article>
