<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.175198.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Study Protocol</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 2 approved with reservations, 1 not approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>&#x00c7;ali&#x015f;kan</surname>
                        <given-names>S. Ayhan</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-9714-6249</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>Bello Abubakar</surname>
                        <given-names>Firdaus</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes" equal-contrib="yes">
                    <name>
                        <surname>Magzoub</surname>
                        <given-names>Mohi Eldin</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-6721-4500</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Department of Medical Education, United Arab Emirates University College of Medicine and Health Sciences, Al Ain, Abu Dhabi, United Arab Emirates</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:mmagzoub@uaeu.ac.ae">mmagzoub@uaeu.ac.ae</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>13</day>
                <month>3</month>
                <year>2026</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2026</year>
            </pub-date>
            <volume>15</volume>
            <elocation-id>395</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>10</day>
                    <month>2</month>
                    <year>2026</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 &#x00c7;ali&#x015f;kan SA et al.</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/15-395/pdf"/>
            <abstract>
                <p>Assessment plays a central role in medical education by evaluating learners&#x2019; knowledge, skills, and professional competencies. While multiple-choice questions (MCQs) are widely used due to their efficiency and broad content coverage, they primarily assess recall and recognition, limiting their ability to measure higher-order reasoning. Short-answer questions (SAQs), in contrast, promote deeper cognitive processing and provide better discrimination between levels of student performance. However, SAQs are resource-intensive to grade and susceptible to scorer inconsistency and rater bias, highlighting a need for more efficient and reliable assessment solutions.</p>
                <p>Artificial Intelligence (AI) has emerged as a transformative tool in medical education, enhancing learning, supporting adaptive instruction, and automating assessment processes. AI-driven systems using machine learning and natural language processing have been increasingly applied to automated scoring of SAQs. These systems offer potential benefits, including reduced grading burden, greater scoring consistency, and timely feedback to learners. Despite promising developments, concerns persist regarding algorithmic transparency, data privacy, and the reliability and validity of automated scoring compared with human graders. Existing studies report mixed results, underscoring the need for a comprehensive examination of current approaches.</p>
                <p>This scoping review aims to systematically map the literature on AI-based models used for automated scoring of SAQs in medical education. Specifically, it seeks to identify the types of AI models employed, evaluate their accuracy and reliability relative to human graders, describe reported advantages and challenges, and assess fairness and feasibility within educational settings. Following the Joanna Briggs Institute methodology and the Population&#x2013;Concept&#x2013;Context framework, the review will include empirical studies published since 2015 involving medical students and AI-driven SAQ scoring. Findings will provide an evidence-based overview of current practices, highlight gaps in the literature, and inform future research and implementation strategies for AI-assisted assessment in medical education.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Artificial intelligence</kwd>
                <kwd>Automated scoring</kwd>
                <kwd>Auto grading</kwd>
                <kwd>Short-Answer Questions</kwd>
                <kwd>SAQs</kwd>
                <kwd>Assessment</kwd>
                <kwd>Medical education</kwd>
                <kwd>Machine learning</kwd>
                <kwd>Natural language processing</kwd>
                <kwd>Assessment reliability</kwd>
                <kwd>Assessment validity</kwd>
                <kwd>Autoscoring tools</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec1" sec-type="intro">
            <title>Introduction</title>
            <sec id="sec2">
                <title>Assessment in medical education</title>
                <p>Assessment is a central component of medical education, serving to evaluate learners&#x2019; knowledge, skills, and attitudes through both formative and summative assessment methods (
                    <xref ref-type="bibr" rid="ref16">Schuwirth &amp; van der Vleuten, 2020</xref>; 
                    <xref ref-type="bibr" rid="ref19">Yudkowsky et al., 2019</xref>). Effective assessment not only guides and motivates students&#x2019; learning but also provides an essential mechanism for determining whether they have attained the competencies required of medical professionals. Medical schools employ a range of assessment strategies to monitor students&#x2019; progress in acquiring core knowledge and clinical competencies (
                    <xref ref-type="bibr" rid="ref10">Norcini et al., 2018</xref>; 
                    <xref ref-type="bibr" rid="ref18">Shumway &amp; Harden, 2009</xref>). Common assessment methods used in medical education include Multiple-Choice Questions (MCQs), Modified Essay Questions (MEQs), Short Answer Questions (SAQs), Objective Structured Clinical Examinations (OSCEs), and Key Feature Problems (KFPs) (
                    <xref ref-type="bibr" rid="ref2">Boursicot et al., 2018</xref>; 
                    <xref ref-type="bibr" rid="ref8">Jolly &amp; Dalton, 2018</xref>). Assessing students&#x2019; learning is therefore fundamental to ensuring high-quality medical training, and the choice of assessment method directly influences the breadth and depth of knowledge or skills that can be evaluated (
                    <xref ref-type="bibr" rid="ref12">Preston et al., 2020</xref>). Crucially, each assessment method must be aligned with the targeted competencies, the instructional approaches used, and the desired impact on student learning (
                    <xref ref-type="bibr" rid="ref19">Yudkowsky et al., 2019</xref>).</p>
            </sec>
            <sec id="sec3">
                <title>Short answer questions vs MCQs</title>
                <p>Multiple-choice questions (MCQs) and SAQs are among the commonly used written assessment formats in medical education (
                    <xref ref-type="bibr" rid="ref8">Jolly &amp; Dalton, 2018</xref>; 
                    <xref ref-type="bibr" rid="ref12">Preston et al., 2020</xref>; 
                    <xref ref-type="bibr" rid="ref18">Shumway &amp; Harden, 2009</xref>). MCQs, in which students select a response from predetermined options, are widely favored because they allow broad sampling of content, assess a wide range of knowledge areas efficiently, and facilitate rapid, objective grading (
                    <xref ref-type="bibr" rid="ref8">Jolly &amp; Dalton, 2018</xref>; 
                    <xref ref-type="bibr" rid="ref15">Schuwirth &amp; Van Der Vleuten, 2004</xref>). Although MCQs effectively evaluate recall, recognition, and factual knowledge, they have been criticized for their limited ability to promote critical thinking or accurately assess deeper mastery of subject matter (
                    <xref ref-type="bibr" rid="ref15">Schuwirth &amp; Van Der Vleuten, 2004</xref>). Their reliance on recognition-based answering may discourage deep learning and higher-order reasoning, making it challenging to determine whether students have truly internalized the material (
                    <xref ref-type="bibr" rid="ref15">Schuwirth &amp; Van Der Vleuten, 2004</xref>; 
                    <xref ref-type="bibr" rid="ref18">Shumway &amp; Harden, 2009</xref>).</p>
                <p>SAQs, by contrast, require students to generate concise written responses, thereby encouraging active retrieval, deeper cognitive processing, and higher-order reasoning (
                    <xref ref-type="bibr" rid="ref6">Gr&#x00e9;visse, 2024</xref>; 
                    <xref ref-type="bibr" rid="ref8">Jolly &amp; Dalton, 2018</xref>; 
                    <xref ref-type="bibr" rid="ref11">Potter &amp; McLachlan, 2025</xref>). SAQs enable assessment of a broad range of competencies and cognitive skills and often demonstrate higher reliability and better discrimination (
                    <italic toggle="yes">in other words, a stronger ability to differentiate between high- and low-performing students</italic>) than MCQs, making them a valuable component of medical assessment systems (
                    <xref ref-type="bibr" rid="ref8">Jolly &amp; Dalton, 2018</xref>; 
                    <xref ref-type="bibr" rid="ref15">Schuwirth &amp; Van Der Vleuten, 2004</xref>). However, SAQs are more resource-intensive to grade, and concerns about scorer inconsistency and rater bias can pose challenges to their validity (
                    <xref ref-type="bibr" rid="ref6">Gr&#x00e9;visse, 2024</xref>; 
                    <xref ref-type="bibr" rid="ref11">Potter &amp; McLachlan, 2025</xref>). The emergence of AI-based autoscoring tools offers a promising solution, with early evidence suggesting improved efficiency, reduced bias, and enhanced scoring consistency (
                    <xref ref-type="bibr" rid="ref6">Gr&#x00e9;visse, 2024</xref>).</p>
            </sec>
            <sec id="sec4">
                <title>Emergence of AI in medical education assessment</title>
                <p>Artificial Intelligence (AI) is emerging as a powerful and transformative tool in medical education. AI technologies are increasingly integrated into educational systems to enhance students&#x2019; learning experiences, prepare them for an AI-driven healthcare environment, personalize learning, and improve assessment processes (
                    <xref ref-type="bibr" rid="ref7">Hallquist et al., 2025</xref>; 
                    <xref ref-type="bibr" rid="ref14">Rinc&#x00f3;n et al., 2025</xref>). AI tools employ Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP) techniques to support adaptive learning platforms, automate question generation, evaluate student responses, and deliver timely, personalized feedback (
                    <xref ref-type="bibr" rid="ref7">Hallquist et al., 2025</xref>; 
                    <xref ref-type="bibr" rid="ref14">Rinc&#x00f3;n et al., 2025</xref>).</p>
                <p>The incorporation of AI has enhanced students&#x2019; learning through AI-assisted assessment platforms that enable learners to practice applying their knowledge while receiving immediate feedback (
                    <xref ref-type="bibr" rid="ref5">Gordon et al., 2024</xref>). AI-driven simulated case presentations, in which AI functions as a virtual physician or simulated patient, have been shown to improve students&#x2019; communication and clinical skills (
                    <xref ref-type="bibr" rid="ref9">Merritt et al., 2022</xref>; 
                    <xref ref-type="bibr" rid="ref14">Rinc&#x00f3;n et al., 2025</xref>). For medical educators, AI has reduced workload by automating assessment processes and generating examination questions (
                    <xref ref-type="bibr" rid="ref7">Hallquist et al., 2025</xref>; 
                    <xref ref-type="bibr" rid="ref17">Seneviratne &amp; Manathunga, 2025</xref>). Educators can use AI to develop assessment items, evaluate item reliability, and automatically score student responses. Despite these advantages, concerns remain regarding data privacy, ethical use of learner information, and, notably, the transparency of AI algorithms employed in automated assessment and feedback systems.</p>
            </sec>
            <sec id="sec5">
                <title>Automatic scoring of short answer questions</title>
                <p>Automated assessment systems use Machine Learning (ML) and Natural Language Processing (NLP) techniques to automate the scoring of SAQs and essays (
                    <xref ref-type="bibr" rid="ref6">Gr&#x00e9;visse, 2024</xref>; 
                    <xref ref-type="bibr" rid="ref17">Seneviratne &amp; Manathunga, 2025</xref>). These systems can process large numbers of student responses and provide timely feedback, thereby reducing educator workload, minimizing grader bias, and supporting improved student learning and performance (
                    <xref ref-type="bibr" rid="ref4">Clauser et al., 2024</xref>; 
                    <xref ref-type="bibr" rid="ref6">Gr&#x00e9;visse, 2024</xref>; 
                    <xref ref-type="bibr" rid="ref17">Seneviratne &amp; Manathunga, 2025</xref>).</p>
                <p>Automated scoring of short-answer questions (ASAQ) was first introduced in the 1960s and has since undergone substantial development, incorporating increasingly sophisticated statistical, ML, and NLP approaches to improve scoring accuracy and reliability. Recently, the application of ASAQ has gained considerable attention in medical education, where grading large volumes of SAQs is both time-consuming and susceptible to rater bias (
                    <xref ref-type="bibr" rid="ref1">Bolgova et al., 2025</xref>; 
                    <xref ref-type="bibr" rid="ref4">Clauser et al., 2024</xref>; 
                    <xref ref-type="bibr" rid="ref6">Gr&#x00e9;visse, 2024</xref>; 
                    <xref ref-type="bibr" rid="ref13">Rajan et al., 2025</xref>; 
                    <xref ref-type="bibr" rid="ref17">Seneviratne &amp; Manathunga, 2025</xref>). Research on ASAQ has reported mixed findings: while many studies demonstrate that automated scoring can achieve results comparable to human graders, some highlight concerns about the reliability and validity of these systems and the opacity of their algorithms (
                    <xref ref-type="bibr" rid="ref1">Bolgova et al., 2025</xref>; 
                    <xref ref-type="bibr" rid="ref4">Clauser et al., 2024</xref>; 
                    <xref ref-type="bibr" rid="ref6">Gr&#x00e9;visse, 2024</xref>; 
                    <xref ref-type="bibr" rid="ref17">Seneviratne &amp; Manathunga, 2025</xref>). Consequently, although ASAQ holds great potential to enhance assessment practices in medical education, further empirical research is needed to ensure fairness, robustness, and broad acceptance within educational settings.</p>
            </sec>
            <sec id="sec6">
                <title>Rationale</title>
                <p>Although the use of AI models for automated scoring of short-answer questions has gained increasing attention in medical education research, the breadth and depth of the existing literature remain unclear. A notable gap exists in understanding the range and types of AI models employed by medical educators to evaluate short-answer responses in student assessments. Consequently, a scoping review is warranted to systematically explore the extent and nature of current research and to map the available evidence on this topic.</p>
                <p>Beyond identifying the AI models used, the review will examine the reported validity, reliability, and feasibility of these systems in comparison with traditional human grading methods. Such a review will help clarify the current landscape of AI-driven automated scoring, highlight research trends and gaps, and provide a comprehensive overview of how these technologies are being implemented in medical education assessment.</p>
            </sec>
        </sec>
        <sec id="sec7" sec-type="methods">
            <title>Methods</title>
            <sec id="sec8">
                <title>Research questions</title>
                <p>The aim of the review is to systematically map the existing literature on automated scoring of SAQs in medical education, with a focus on the tools used and their accuracy, reliability, and fairness. The review will address the following research questions:
                    <list list-type="alpha-lower">
                        <list-item>
                            <label>a.</label>
                            <p>What AI-based models have been used for automated scoring of SAQs in medical education?</p>
                        </list-item>
                        <list-item>
                            <label>b.</label>
                            <p>How accurately does automated scoring of SAQs reflect the performance of human graders?</p>
                        </list-item>
                        <list-item>
                            <label>c.</label>
                            <p>What advantages and challenges of using AI for automated scoring of SAQs have been reported in medical education?</p>
                        </list-item>
                        <list-item>
                            <label>d.</label>
                            <p>Are these models more effective than human graders in terms of reliability, accuracy, and fairness?</p>
                        </list-item>
                    </list>
                </p>
            </sec>
            <sec id="sec9">
                <title>Search strategy</title>
                <p>This scoping review will be conducted in accordance with the methodology outlined in the Joanna Briggs Institute (JBI) Manual for Evidence Synthesis, with a focus on AI models used for automated scoring of SAQs in medical school assessments. The Population, Concept, and Context (PCC) framework will guide the search strategy, eligibility criteria, and data extraction processes.
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Population: Medical students</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Concept: Automated scoring of short-answer questions using AI-based models</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Context: Medical education and medical school settings</p>
                        </list-item>
                    </list>
                </p>
                <p>Four electronic databases will be searched: 
                    <italic toggle="yes">PubMed</italic>, 
                    <italic toggle="yes">Scopus</italic>, 
                    <italic toggle="yes">Medline</italic>, and 
                    <italic toggle="yes">Web of Science</italic>, to comprehensively capture research related to AI applications in medical education assessment. Peer-reviewed quantitative, qualitative, and mixed-methods studies will be included.</p>
                <p>A detailed search strategy will be developed in consultation with a medical librarian. Keywords such as 
                    <italic toggle="yes">&#x201c;medical education,&#x201d; &#x201c;automated scoring,&#x201d; &#x201c;short answer questions,&#x201d;</italic> and 
                    <italic toggle="yes">&#x201c;medical school&#x201d;</italic> will be used, combined with Boolean operators &#x201c;AND&#x201d; and &#x201c;OR&#x201d; to refine and optimize search results.</p>
            </sec>
            <sec id="sec10">
                <title>Screening process</title>
                <p>The Covidence systematic review software will be used to import references, conduct title and abstract screening, review full texts, and manage data extraction. Two independent reviewers will screen all retrieved studies based on the predefined eligibility criteria. Discrepancies will be resolved by a third reviewer.</p>
                <p>Screening will occur in two stages:
                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>Title and abstract screening conducted independently by two reviewers.</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>Full-text review of studies deemed potentially relevant.</p>
                        </list-item>
                    </list>
                </p>
                <p>A secondary search will involve screening the reference lists of included studies to identify additional relevant literature.</p>
            </sec>
            <sec id="sec11">
                <title>Eligibility criteria</title>
                <p>

                    <bold>Inclusion criteria</bold>

                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>Topic: Studies must focus on automated scoring of short-answer questions.</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>Methodology: Original empirical research, including quantitative, qualitative, and mixed-methods studies.</p>
                        </list-item>
                        <list-item>
                            <label>3.</label>
                            <p>Study Population: Medical students completing assessments containing SAQs.</p>
                        </list-item>
                        <list-item>
                            <label>4.</label>
                            <p>Assessment Type: Any assessment format using SAQs (e.g., low-stakes, high-stakes, summative examinations).</p>
                        </list-item>
                        <list-item>
                            <label>5.</label>
                            <p>Publication Date: Studies published from 2015 onward.</p>
                        </list-item>
                        <list-item>
                            <label>6.</label>
                            <p>Language: English-language publications only.</p>
                        </list-item>
                        <list-item>
                            <label>7.</label>
                            <p>Computational Approach: Studies using AI-based autoscoring models (machine learning, large language models, deep learning).</p>
                        </list-item>
                        <list-item>
                            <label>8.</label>
                            <p>Setting: Medical education or medical school settings.</p>
                        </list-item>
                    </list>
                </p>
                <p>

                    <bold>Exclusion criteria</bold>

                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>Studies not focusing on SAQs (e.g., MCQs, essays).</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>Non-English publications.</p>
                        </list-item>
                        <list-item>
                            <label>3.</label>
                            <p>Studies not conducted within a medical education context or not involving medical students.</p>
                        </list-item>
                        <list-item>
                            <label>4.</label>
                            <p>Studies using non-AI-based approaches to automated scoring.</p>
                        </list-item>
                        <list-item>
                            <label>5.</label>
                            <p>Systematic reviews, scoping reviews, and grey literature.</p>
                        </list-item>
                        <list-item>
                            <label>6.</label>
                            <p>Studies published before 2015.</p>
                        </list-item>
                    </list>
                </p>
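                <p>The exclusion criteria above could be applied consistently during screening with a small checklist function. The following Python sketch is illustrative only: the Record fields, the study-type labels, and the function name exclusion_reasons are hypothetical and are not part of Covidence or of the review workflow.</p>
                <code language="python"><![CDATA[
```python
from dataclasses import dataclass

# Study types excluded by the protocol (reviews and grey literature).
EXCLUDED_TYPES = {"systematic review", "scoping review", "grey literature"}


@dataclass
class Record:
    """Hypothetical screening record; the field names are illustrative assumptions."""
    year: int
    language: str
    study_type: str
    focuses_on_saqs: bool
    uses_ai_model: bool
    medical_students: bool


def exclusion_reasons(record):
    """Return every exclusion criterion the record meets (empty list if eligible)."""
    reasons = []
    if not record.focuses_on_saqs:
        reasons.append("not focused on SAQs")
    if record.language.lower() != "english":
        reasons.append("non-English publication")
    if not record.medical_students:
        reasons.append("not medical education or medical students")
    if not record.uses_ai_model:
        reasons.append("non-AI scoring approach")
    if record.study_type.lower() in EXCLUDED_TYPES:
        reasons.append("review or grey literature")
    if record.year < 2015:
        reasons.append("published before 2015")
    return reasons


# Placeholder records, not real studies.
eligible = Record(2023, "English", "empirical", True, True, True)
too_old = Record(2012, "English", "empirical", True, True, True)
print(exclusion_reasons(eligible))
print(exclusion_reasons(too_old))
```
]]></code>
                <p>Recording every reason met, rather than stopping at the first, makes it easier to report exclusion counts by reason in a flow diagram.</p>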
            </sec>
            <sec id="sec12">
                <title>Data extraction and charting</title>
                <p>A standardized data extraction form will be used to collect key information from all included studies. Two reviewers will independently extract data, with disagreements resolved by a third reviewer. Extracted data will include:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Study methodology</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Type of AI-based model used</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Participant characteristics</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Assessment type</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Outcomes relating to the performance of the autoscoring system</p>
                        </list-item>
                    </list>
                </p>
                <p>Extracted data will be charted and summarized in tables and figures. Tables will outline study characteristics (e.g., author, publication year, AI model used), while figures will illustrate the frequency of AI models applied and the accuracy of their automated scoring.</p>
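                <p>The frequency figure described above could be produced from a simple tally of the charted rows. The Python sketch below assumes a hypothetical row layout with an ai_model field. The three rows are placeholders that show the shape of the data and are not findings.</p>
                <code language="python"><![CDATA[
```python
from collections import Counter

# Placeholder rows standing in for charted data; not real studies or results.
charted_rows = [
    {"study": "Study A", "ai_model": "large language model"},
    {"study": "Study B", "ai_model": "machine learning"},
    {"study": "Study C", "ai_model": "large language model"},
]


def model_frequency(rows):
    """Count how many charted studies used each type of AI model."""
    return Counter(row["ai_model"] for row in rows)


frequency = model_frequency(charted_rows)
for model, count in frequency.most_common():
    print(model, count)
```
]]></code>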
            </sec>
            <sec id="sec13">
                <title>Data analysis</title>
                <p>A thematic analysis approach will be used to identify recurring patterns and themes within the included studies. Key concepts related to the types of AI models used, their scoring accuracy, associated challenges and advantages, and their reported validity and reliability will be systematically coded and synthesized into overarching themes.</p>
                <p>This analytical approach ensures alignment with the research questions and provides a comprehensive overview of the literature. The review will evaluate whether automated scoring of SAQs should be integrated more widely into medical education and identify existing research gaps and future directions.</p>
                <p>Results will be summarized narratively and supported with tables and figures where appropriate. Both the protocol and final review will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines.</p>
            </sec>
            <sec id="sec14">
                <title>Expected outcomes</title>
                <p>

                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>A comprehensive mapping of current literature on the use of AI models for automated scoring of SAQs in medical education.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Identification of the benefits and challenges associated with integrating automated scoring into medical assessments.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Evaluation of the validity, reliability, and feasibility of automated scoring systems compared with human graders.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Recommendations for future research on the validity, reliability, and implementation of AI-based autoscoring systems in medical education.</p>
                        </list-item>
                    </list>
                </p>
                <sec id="sec15">
                    <title>Dissemination</title>
                    <p>The findings from this scoping review will be presented at academic conferences and submitted for publication in peer-reviewed journals.</p>
                </sec>
                <sec id="sec16">
                    <title>Ethical considerations</title>
                    <p>No ethical concerns arise because this review will use only published peer-reviewed articles and grey literature. Therefore, ethical approval and consent are not required.</p>
                </sec>
            </sec>
        </sec>
    </body>
    <back>
        <sec id="sec19" sec-type="data-availability">
            <title>Data availability</title>
            <p>No data is associated with this article.</p>
            <sec id="sec20">
                <title>Reporting guidelines</title>
                <p>Repository: PRISMA-P checklist for &#x201c;Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education&#x201d;.</p>
                <p>DOI: 10.6084/m9.figshare.30815456 (
                    <xref ref-type="bibr" rid="ref3">&#x00c7;al&#x0131;&#x015f;kan et al., 2025</xref>).</p>
                <p>Data are available under the terms of the 
                    <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</ext-link> license.
                </p>
            </sec>
        </sec>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bolgova</surname>
                            <given-names>O</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ganguly</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ikram</surname>
                            <given-names>MF</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.</article-title>
                    <source>

                        <italic toggle="yes">Med. Educ. Online.</italic>
</source>
                    <year>2025</year>;<volume>30</volume>(<issue>1</issue>).
                    <pub-id pub-id-type="pmid">40849930</pub-id>
                    <pub-id pub-id-type="doi">10.1080/10872981.2025.2550751</pub-id>
                    <pub-id pub-id-type="pmcid">PMC12377152</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Boursicot</surname>
                            <given-names>KAM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Roberts</surname>
                            <given-names>TE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Burdick</surname>
                            <given-names>WP</given-names>
                        </name>
</person-group>:
                    <chapter-title>Structured assessments of clinical competence.</chapter-title>
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>Swanwick</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Forrest</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>O&#x2019;Brien</surname>
                            <given-names>BC</given-names>
                        </name>
</person-group>, editors.
                    <source>

                        <italic toggle="yes">Understanding Medical Education: Evidence, Theory, and Practice.</italic>
</source>
                    <publisher-name>Wiley</publisher-name>;
                    <edition>3rd ed.</edition>
                    <year>2018</year>; pp.<fpage>335</fpage>&#x2013;<lpage>345</lpage>.
                    <pub-id pub-id-type="doi">10.1002/9781119373780.CH23</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>&#x00c7;al&#x0131;&#x015f;kan</surname>
                            <given-names>SA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Abubakar Bello</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Magzoub</surname>
                            <given-names>ME</given-names>
                        </name>
</person-group>:
                    <article-title>PRISMA-P checklist for Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education.</article-title>
                    <year>2025, December 7</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://figshare.com/articles/dataset/_b_PRISMA-P_checklist_for_b_b_Protocol_for_Conducting_a_Scoping_Review_on_The_Use_of_AI_in_Automated_Scoring_of_Short-Answer_Questions_in_Medical_Education_b_/30815456?file=60171386">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Clauser</surname>
                            <given-names>BE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yaneva</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baldwin</surname>
                            <given-names>P</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Automated Scoring of Short-Answer Questions: A Progress Report.</article-title>
                    <source>

                        <italic toggle="yes">Appl. Meas. Educ.</italic>
</source>
                    <year>2024</year>;<volume>37</volume>(<issue>3</issue>):<fpage>209</fpage>&#x2013;<lpage>224</lpage>.
                    <pub-id pub-id-type="doi">10.1080/08957347.2024.2386945</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gordon</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Daniel</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ajiboye</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A scoping review of artificial intelligence in medical education: BEME Guide No. 84.</article-title>
                    <source>

                        <italic toggle="yes">Med. Teach.</italic>
</source>
                    <year>2024</year>;<volume>46</volume>(<issue>4</issue>):<fpage>446</fpage>&#x2013;<lpage>470</lpage>.
                    <pub-id pub-id-type="pmid">38423127</pub-id>
                    <pub-id pub-id-type="doi">10.1080/0142159X.2024.2314198</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gr&#x00e9;visse</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <article-title>LLM-based automatic short answer grading in undergraduate medical education.</article-title>
                    <source>

                        <italic toggle="yes">BMC Med. Educ.</italic>
</source>
                    <year>2024</year>;<volume>24</volume>(<issue>1</issue>):<fpage>1060</fpage>.
                    <pub-id pub-id-type="pmid">39334087</pub-id>
                    <pub-id pub-id-type="doi">10.1186/S12909-024-06026-5</pub-id>
                    <pub-id pub-id-type="pmcid">PMC11429088</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref7">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hallquist</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gupta</surname>
                            <given-names>I</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Montalbano</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Applications of Artificial Intelligence in Medical Education: A Systematic Review.</article-title>
                    <source>

                        <italic toggle="yes">Cureus.</italic>
</source>
                    <year>2025</year>;<volume>17</volume>(<issue>3</issue>).
                    <pub-id pub-id-type="doi">10.7759/CUREUS.79878</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jolly</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dalton</surname>
                            <given-names>MJ</given-names>
                        </name>
</person-group>:
                    <chapter-title>Written assessment.</chapter-title>
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>Swanwick</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Forrest</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>O&#x2019;Brien</surname>
                            <given-names>BC</given-names>
                        </name>
</person-group>, editors.
                    <source>

                        <italic toggle="yes">Understanding Medical Education: Evidence, Theory, and Practice.</italic>
</source>
                    <publisher-name>Wiley</publisher-name>;
                    <edition>3rd ed.</edition>
                    <year>2018</year>; pp.<fpage>291</fpage>&#x2013;<lpage>317</lpage>.
                    <pub-id pub-id-type="doi">10.1002/9781119373780.CH21</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Merritt</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Glisson</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dewan</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Implementation and Evaluation of an Artificial Intelligence Driven Simulation to Improve Resident Communication With Primary Care Providers.</article-title>
                    <source>

                        <italic toggle="yes">Acad. Pediatr.</italic>
</source>
                    <year>2022</year>;<volume>22</volume>(<issue>3</issue>):<fpage>503</fpage>&#x2013;<lpage>505</lpage>.
                    <pub-id pub-id-type="pmid">34923145</pub-id>
                    <pub-id pub-id-type="doi">10.1016/J.ACAP.2021.12.013</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Norcini</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Anderson</surname>
                            <given-names>MB</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bollela</surname>
                            <given-names>V</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>2018 Consensus framework for good assessment.</article-title>
                    <source>

                        <italic toggle="yes">Med. Teach.</italic>
</source>
                    <year>2018</year>;<volume>40</volume>(<issue>11</issue>):<fpage>1102</fpage>&#x2013;<lpage>1109</lpage>.
                    <pub-id pub-id-type="pmid">30299187</pub-id>
                    <pub-id pub-id-type="doi">10.1080/0142159X.2018.1500016</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Potter</surname>
                            <given-names>HG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>McLachlan</surname>
                            <given-names>JC</given-names>
                        </name>
</person-group>:
                    <article-title>Assessing medical knowledge: A 3-year comparative study of very short answer vs. multiple choice questions.</article-title>
                    <source>

                        <italic toggle="yes">Med. Teach.</italic>
</source>
                    <year>2025</year>;<volume>47</volume>(<issue>10</issue>):<fpage>1669</fpage>&#x2013;<lpage>1677</lpage>.
                    <pub-id pub-id-type="pmid">40293799</pub-id>
                    <pub-id pub-id-type="doi">10.1080/0142159X.2025.2496382</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Preston</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gratani</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Owens</surname>
                            <given-names>K</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Exploring the Impact of Assessment on Medical Students&#x2019; Learning.</article-title>
                    <source>

                        <italic toggle="yes">Assess. Eval. High. Educ.</italic>
</source>
                    <year>2020</year>;<volume>45</volume>(<issue>1</issue>):<fpage>109</fpage>&#x2013;<lpage>124</lpage>.
                    <pub-id pub-id-type="doi">10.1080/02602938.2019.1614145</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rajan</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Alexander</surname>
                            <given-names>SMK</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Shenvi</surname>
                            <given-names>CL</given-names>
                        </name>
</person-group>:
                    <article-title>Can AI grade like a professor? comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams.</article-title>
                    <source>

                        <italic toggle="yes">Adv. Health Sci. Educ.</italic>
</source>
                    <year>2025</year>;<fpage>1</fpage>&#x2013;<lpage>11</lpage>.
                    <pub-id pub-id-type="doi">10.1007/S10459-025-10462-3</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rinc&#x00f3;n</surname>
                            <given-names>EHH</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jimenez</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Aguilar</surname>
                            <given-names>LAC</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Mapping the use of artificial intelligence in medical education: a scoping review.</article-title>
                    <source>

                        <italic toggle="yes">BMC Med. Educ.</italic>
</source>
                    <year>2025</year>;<volume>25</volume>(<issue>1</issue>):<fpage>526</fpage>.
                    <pub-id pub-id-type="doi">10.1186/S12909-025-07089-8</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Schuwirth</surname>
                            <given-names>LWT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Van Der Vleuten</surname>
                            <given-names>CPM</given-names>
                        </name>
</person-group>:
                    <article-title>Different written assessment methods: what can be said about their strengths and weaknesses?</article-title>
                    <source>

                        <italic toggle="yes">Med. Educ.</italic>
</source>
                    <year>2004</year>;<volume>38</volume>(<issue>9</issue>):<fpage>974</fpage>&#x2013;<lpage>979</lpage>.
                    <pub-id pub-id-type="pmid">15327679</pub-id>
                    <pub-id pub-id-type="doi">10.1111/J.1365-2929.2004.01916.X</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Schuwirth</surname>
                            <given-names>LWT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Vleuten</surname>
                            <given-names>CPM</given-names>
                            <prefix>van der</prefix>
                        </name>
</person-group>:
                    <article-title>A history of assessment in medical education.</article-title>
                    <source>

                        <italic toggle="yes">Adv. Health Sci. Educ. Theory Pract.</italic>
</source>
                    <year>2020</year>;<volume>25</volume>(<issue>5</issue>):<fpage>1045</fpage>&#x2013;<lpage>1056</lpage>.
                    <pub-id pub-id-type="doi">10.1007/S10459-020-10003-0</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Seneviratne</surname>
                            <given-names>HMTW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Manathunga</surname>
                            <given-names>SS</given-names>
                        </name>
</person-group>:
                    <article-title>Artificial intelligence assisted automated short answer question scoring tool shows high correlation with human examiner markings.</article-title>
                    <source>

                        <italic toggle="yes">BMC Med. Educ.</italic>
</source>
                    <year>2025</year>;<volume>25</volume>(<issue>1</issue>):<fpage>1146</fpage>.
                    <pub-id pub-id-type="pmid">40764994</pub-id>
                    <pub-id pub-id-type="doi">10.1186/S12909-025-07718-2</pub-id>
                    <pub-id pub-id-type="pmcid">PMC12323204</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Shumway</surname>
                            <given-names>JM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Harden</surname>
                            <given-names>RM</given-names>
                        </name>
</person-group>:
                    <article-title>Medical Teacher AMEE Guide No. 25: The assessment of learning outcomes for the competent and reflective physician.</article-title>
                    <year>2009</year>.
                    <pub-id pub-id-type="doi">10.1080/0142159032000151907</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Yudkowsky</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Park</surname>
                            <given-names>YS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Downing</surname>
                            <given-names>SM</given-names>
                        </name>
</person-group>:
                    <chapter-title>Introduction to Assessment in the Health Professions.</chapter-title>
                    <source>

                        <italic toggle="yes">Assessment in Health Professions Education.</italic>
</source>
                    <publisher-name>Routledge</publisher-name>;<year>2019</year>; pp.<fpage>3</fpage>&#x2013;<lpage>16</lpage>.
                    <pub-id pub-id-type="doi">10.4324/9781138054394-1</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report474984">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.193161.r474984</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Afzal</surname>
                        <given-names>Azam</given-names>
                    </name>
                    <xref ref-type="aff" rid="r474984a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-1643-8261</uri>
                </contrib>
                <aff id="r474984a1">
                    <label>1</label>Aga Khan University, Karachi, Pakistan</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>9</day>
                <month>6</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Afzal A</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport474984" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.175198.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This protocol addresses a timely and highly relevant topic in health professions education. The increasing adoption of AI-based assessment systems, particularly Large Language Models (LLMs), makes a scoping review of automated scoring of short-answer questions (SAQs) both necessary and potentially impactful. The manuscript demonstrates a clear rationale, appropriate use of the Joanna Briggs Institute (JBI) methodology, and alignment with PRISMA-ScR reporting standards.</p>
            <p> However, several methodological and conceptual issues limit the rigor, transparency, and reproducibility of the proposed review.</p>
            <p> </p>
            <p> Major comments</p>
            <p> 1. The manuscript repeatedly refers to "AI-based models" but does not operationally define the term. Current automated scoring systems include rule-based systems, machine learning algorithms, deep learning models, transformer-based architectures, large language models (GPT, Claude, Gemini, Llama, etc.), and hybrid NLP approaches. These categories differ substantially in methodology and performance. My suggestion would be to add a conceptual framework that helps readers distinguish between these categories.</p>
            <p> </p>
            <p> 2. Research question number 4 is problematic: "Are these models more effective than human graders in terms of reliability, accuracy, and fairness?" This question implies a comparative effectiveness judgment that may not be appropriate as it is beyond the aims of a scoping review. Scoping reviews aim to map evidence, describe characteristics and identify gaps. They generally do not evaluate superiority. A suggestion would be to replace it with: "How do studies compare the reliability, validity, fairness, and scoring performance of AI-based systems and human graders?"</p>
            <p> </p>
            <p> 3. The protocol states that databases will be searched and keywords will be combined using Boolean operators, but no search string is provided. As reported, the search cannot be replicated. The authors should include the complete search strategy.</p>
            <p> </p>
            <p> 4. The review includes studies from 2015 onward. This cutoff appears arbitrary. Authors should consider justifying the decision to exclude publications before 2015.</p>
            <p> </p>
            <p> 5. The data analysis plan is too generic. The manuscript states that thematic analysis will be used. For a scoping review of AI systems, more detail is needed. The authors could describe how themes will be developed around types of AI systems, accuracy metrics, fairness concerns, implementation barriers, and educational outcomes.</p>
            <p> </p>
            <p> 6. The proposed extraction variables are too limited. Important variables are missing, such as AI characteristics, assessment characteristics, and performance metrics. It is recommended to include a mock extraction table as supplementary material.</p>
            <p> </p>
            <p> 7. The manuscript repeatedly mentions fairness in assessment, and the review aims to examine fairness, yet fairness is a subjective term and no operational definition or framework is provided.</p>
            <p> </p>
            <p> Minor comments</p>
            <p> The statement: "No ethical concerns..." should be revised. Perhaps a better wording could be: "Ethics approval was not required because the review utilizes publicly available published literature and does not involve human participants."</p>
            <p> </p>
            <p> The manuscript cites PRISMA-P, which is designed for systematic review protocols. Since this is a scoping review protocol, the authors should clarify why PRISMA-P was selected rather than the PRISMA-ScR extension and JBI scoping review guidance.</p>
            <p> </p>
            <p> Conclusion: The protocol addresses an important gap in medical education assessment research and has strong potential to contribute meaningfully to the field. However, several methodological details require clarification before indexing. In particular, the authors should strengthen the search strategy, provide a detailed data extraction framework, clarify definitions and outcome measures, justify eligibility restrictions, and elaborate on how fairness, validity, and reliability will be synthesized. These revisions would substantially improve the transparency, rigor, and reproducibility of the proposed review.</p>
            <p>Is the study design appropriate for the research question?</p>
            <p>Partly</p>
            <p>Is the rationale for, and objectives of, the study clearly described?</p>
            <p>Yes</p>
            <p>Are sufficient details of the methods provided to allow replication by others?</p>
            <p>Partly</p>
            <p>Are the datasets clearly presented in a useable and accessible format?</p>
            <p>Not applicable</p>
            <p>Reviewer Expertise:</p>
            <p>Health Professions Education; Educational Development; Teaching / Learning theory and pedagogies; Technology and Simulation based learning; Authentic Assessment.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report486185">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.193161.r486185</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Bafor</surname>
                        <given-names>Anirejuoritse</given-names>
                    </name>
                    <xref ref-type="aff" rid="r486185a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-9278-5324</uri>
                </contrib>
                <aff id="r486185a1">
                    <label>1</label>Nationwide Children's Hospital, Columbus, OH, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>2</day>
                <month>6</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Bafor A</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport486185" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.175198.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors have presented a clear and well-articulated background and rationale for this scoping review protocol. The proposed study seeks to map the existing evidence on the use of artificial intelligence in the automated scoring of short-answer questions in medical education. The stated aim is to systematically map the literature with particular attention to the AI tools used, as well as their accuracy, reliability, and fairness. This is appropriate and relevant to current developments in medical education assessment.</p>
            <p> The review is structured around the following research questions: 
                <list list-type="order">
                    <list-item>
                        <p>What AI-based models have been used for the automated scoring of short-answer questions in medical education?</p>
                    </list-item>
                    <list-item>
                        <p>How accurately does automated scoring of short-answer questions reflect the performance of human graders?</p>
                    </list-item>
                    <list-item>
                        <p>What advantages and challenges have been reported regarding the use of AI for automated scoring of short-answer questions in medical education?</p>
                    </list-item>
                    <list-item>
                        <p>Are these AI-based models more effective than human graders in terms of reliability, accuracy, and fairness?</p>
                    </list-item>
                </list> The use of the Population, Concept, and Context framework to guide the search strategy, eligibility criteria, and data extraction process is appropriate for a scoping review. The involvement of a librarian in the database search is also a strength and supports the methodological rigor of the protocol.</p>
            <p> However, the following points require clarification or revision: 
                <list list-type="order">
                    <list-item>
                        <p>
                            <bold>Justification for the search date restriction -&#x00a0;</bold>The authors should provide a clear rationale for excluding studies published before 2015. If this date restriction is based on the emergence or maturation of relevant AI technologies, this should be explicitly stated and justified.</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Assessment of accuracy -&#x00a0;</bold>The authors should clarify how the accuracy of automated scoring of short-answer questions will be determined from the included studies. For example, will accuracy be assessed based on correlation with human graders, agreement statistics, sensitivity/specificity, mean score differences, reliability coefficients, or other reported performance metrics?</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Critical appraisal</bold> - What is the plan for assessing the quality, reliability, and potential bias of the studies included in this review?</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Data extraction form -&#x00a0;</bold>The authors should include a copy of the proposed data extraction form as an appendix or supplementary material. This would allow reviewers to better assess whether the planned extraction process adequately captures key variables such as AI model type, dataset characteristics, scoring method, comparator, accuracy metrics, reliability, fairness considerations, and reported limitations.</p>
                    </list-item>
                </list> Overall, this is a timely and relevant scoping review protocol. Addressing the points above would improve the clarity, transparency, and reproducibility of the proposed methodology.</p>
            <p>Is the study design appropriate for the research question?</p>
            <p>Partly</p>
            <p>Is the rationale for, and objectives of, the study clearly described?</p>
            <p>Yes</p>
            <p>Are sufficient details of the methods provided to allow replication by others?</p>
            <p>Partly</p>
            <p>Are the datasets clearly presented in a useable and accessible format?</p>
            <p>Not applicable</p>
            <p>Reviewer Expertise:</p>
            <p>Orthopedic Surgery</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report474986">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.193161.r474986</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Mitra</surname>
                        <given-names>Nilesh Kumar</given-names>
                    </name>
                    <xref ref-type="aff" rid="r474986a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r474986a1">
                    <label>1</label>IMU University, Kuala Lumpur, Malaysia</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>7</day>
                <month>5</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Mitra NK</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport474986" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.175198.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors have attempted to develop a protocol for conducting a scoping review on the use of AI in automated scoring of short-answer questions in medical education. A protocol without any visible data is unlikely to be useful to prospective researchers.</p>
            <p> The following are the elements of a scoping review protocol:</p>
            <p> 1. Review Questions</p>
            <p> 2. Eligibility criteria</p>
            <p> 3. Search Strategy</p>
            <p> 4. Data charting and mock extraction form</p>
            <p> 5. Template to be used for the final report of the review</p>
            <p> The authors should improve the review questions by grounding them in an evidence-based analysis of the literature on the topic; otherwise, the questions stand alone without supporting evidence. The eligibility criteria should be more descriptive. Each component of the search strategy (Population, Concept, and Context) should be described in detail. The present description is too general.</p>
            <p> The mechanism of data charting, the technology used, and the extraction form should be added.</p>
            <p> Please also consult the JBI Manual for Evidence Synthesis.</p>
            <p>Is the study design appropriate for the research question?</p>
            <p>Partly</p>
            <p>Is the rationale for, and objectives of, the study clearly described?</p>
            <p>Yes</p>
            <p>Are sufficient details of the methods provided to allow replication by others?</p>
            <p>No</p>
            <p>Are the datasets clearly presented in a useable and accessible format?</p>
            <p>No</p>
            <p>Reviewer Expertise:</p>
            <p>Technology-enhanced learning, Artificial intelligence, Online assessment</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
    </sub-article>
</article>
