Introduction

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.175198.1

Study Protocol

Articles

Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education

[version 1; peer review: 2 approved with reservations, 1 not approved]

Çalişkan, S. Ayhan (1). Roles: Conceptualization, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing. ORCID: https://orcid.org/0000-0001-9714-6249

Bello Abubakar, Firdaus (1). Roles: Conceptualization, Methodology, Project Administration, Writing – Original Draft Preparation, Writing – Review & Editing.

Magzoub, Mohi Eldin (1, a). Roles: Conceptualization, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing. ORCID: https://orcid.org/0000-0002-6721-4500

1 Department of Medical Education, United Arab Emirates University College of Medicine and Health Sciences, Al Ain, Abu Dhabi, United Arab Emirates

a mmagzoub@uaeu.ac.ae

No competing interests were disclosed.

13 3 2026

2026

395

10 2 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Assessment plays a central role in medical education by evaluating learners’ knowledge, skills, and professional competencies. While multiple-choice questions (MCQs) are widely used due to their efficiency and broad content coverage, they primarily assess recall and recognition, limiting their ability to measure higher-order reasoning. Short-answer questions (SAQs), in contrast, promote deeper cognitive processing and provide better discrimination between levels of student performance. However, SAQs are resource-intensive to grade and susceptible to scorer inconsistency and rater bias, highlighting a need for more efficient and reliable assessment solutions.

Artificial Intelligence (AI) has emerged as a transformative tool in medical education, enhancing learning, supporting adaptive instruction, and automating assessment processes. AI-driven systems using machine learning and natural language processing have been increasingly applied to automated scoring of SAQs. These systems offer potential benefits, including reduced grading burden, greater scoring consistency, and timely feedback to learners. Despite promising developments, concerns persist regarding algorithmic transparency, data privacy, and the reliability and validity of automated scoring compared with human graders. Existing studies report mixed results, underscoring the need for a comprehensive examination of current approaches.

This scoping review aims to systematically map the literature on AI-based models used for automated scoring of SAQs in medical education. Specifically, it seeks to identify the types of AI models employed, evaluate their accuracy and reliability relative to human graders, describe reported advantages and challenges, and assess fairness and feasibility within educational settings. Following the Joanna Briggs Institute methodology and the Population–Concept–Context framework, the review will include empirical studies published since 2015 involving medical students and AI-driven SAQ scoring. Findings will provide an evidence-based overview of current practices, highlight gaps in the literature, and inform future research and implementation strategies for AI-assisted assessment in medical education.

Keywords: Artificial intelligence, Automated scoring, Auto grading, Short-Answer Questions, SAQs, Assessment, Medical education, Machine learning, Natural language processing, Assessment reliability, Assessment validity, Autoscoring tools

The author(s) declared that no grants were involved in supporting this work.

Introduction

Assessment in medical education

Assessment is a central component of medical education, serving to evaluate learners’ knowledge, skills, and attitudes through both formative and summative methods (Schuwirth & van der Vleuten, 2020; Yudkowsky et al., 2019). Effective assessment not only guides and motivates students’ learning but also provides an essential mechanism for determining whether they have attained the competencies required of medical professionals. Medical schools employ a range of assessment strategies to monitor students’ progress in acquiring core knowledge and clinical competencies (Norcini et al., 2018; Shumway & Harden, 2009). Common assessment methods in medical education include Multiple-Choice Questions (MCQs), Modified Essay Questions (MEQs), Short-Answer Questions (SAQs), Objective Structured Clinical Examinations (OSCEs), and Key Feature Problems (KFPs) (Boursicot et al., 2018; Jolly & Dalton, 2018). Assessing students’ learning is therefore fundamental to ensuring high-quality medical training, and the choice of assessment method directly influences the breadth and depth of knowledge or skills that can be evaluated (Preston et al., 2020). Crucially, each assessment method must be aligned with the targeted competencies, the instructional approaches used, and the desired impact on student learning (Yudkowsky et al., 2019).

Short-answer questions vs MCQs

MCQs and SAQs are among the most commonly used written assessment formats in medical education (Jolly & Dalton, 2018; Preston et al., 2020; Shumway & Harden, 2009). MCQs, in which students select a response from predetermined options, are widely favored because they allow broad sampling of content, assess a wide range of knowledge areas efficiently, and facilitate rapid, objective grading (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). Although MCQs effectively evaluate recall, recognition, and factual knowledge, they have been criticized for their limited ability to promote critical thinking or accurately assess deeper mastery of subject matter (Schuwirth & Van Der Vleuten, 2004). Their reliance on recognition-based answering may discourage deep learning and higher-order reasoning, making it challenging to determine whether students have truly internalized the material (Schuwirth & Van Der Vleuten, 2004; Shumway & Harden, 2009).

SAQs, by contrast, require students to generate concise written responses, thereby encouraging active retrieval, deeper cognitive processing, and higher-order reasoning (Grévisse, 2024; Jolly & Dalton, 2018; Potter & McLachlan, 2025). SAQs enable assessment of a broad range of competencies and cognitive skills and often demonstrate higher reliability and better discrimination (that is, a stronger ability to differentiate between high- and low-performing students) than MCQs, making them a valuable component of medical assessment systems (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). However, SAQs are more resource-intensive to grade, and concerns about scorer inconsistency and rater bias can pose challenges to their validity (Grévisse, 2024; Potter & McLachlan, 2025). The emergence of AI-based autoscoring tools offers a promising solution, with early evidence suggesting improved efficiency, reduced bias, and enhanced scoring consistency (Grévisse, 2024).

Emergence of AI in medical education assessment

Artificial Intelligence (AI) is emerging as a powerful and transformative tool in medical education. AI technologies are increasingly integrated into educational systems to enhance students’ learning experiences, prepare them for an AI-driven healthcare environment, personalize learning, and improve assessment processes (Hallquist et al., 2025; Rincón et al., 2025). AI tools employ Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP) techniques to support adaptive learning platforms, automate question generation, evaluate student responses, and deliver timely, personalized feedback (Hallquist et al., 2025; Rincón et al., 2025).

The incorporation of AI has enhanced students’ learning through AI-assisted assessment platforms that enable learners to practice applying their knowledge while receiving immediate feedback (Gordon et al., 2024). AI-driven simulated case presentations, in which AI functions as a virtual physician or simulated patient, have been shown to improve students’ communication and clinical skills (Merritt et al., 2022; Rincón et al., 2025). For medical educators, AI has reduced workload by automating assessment processes and generating examination questions (Hallquist et al., 2025; Seneviratne & Manathunga, 2025). Educators can use AI to develop assessment items, evaluate item reliability, and automatically score student responses. Despite these advantages, concerns remain regarding data privacy, ethical use of learner information, and, notably, the transparency of AI algorithms employed in automated assessment and feedback systems.

Automatic scoring of short answer questions

Automated assessment systems use ML and NLP techniques to automate the scoring of SAQs and essays (Grévisse, 2024; Seneviratne & Manathunga, 2025). These systems can process large numbers of student responses and provide timely feedback, thereby reducing educator workload, minimizing grader bias, and supporting improved student learning and performance (Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025).

Automated scoring of short-answer questions (ASAQ) was first introduced in the 1960s and has since undergone substantial development, incorporating increasingly sophisticated statistical, ML, and NLP approaches to improve scoring accuracy and reliability. Recently, the application of ASAQ has gained considerable attention in medical education, where grading large volumes of SAQs is both time-consuming and susceptible to rater bias (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Rajan et al., 2025; Seneviratne & Manathunga, 2025). Research on ASAQ has reported mixed findings: while many studies demonstrate that automated scoring can achieve results comparable to human graders, some highlight concerns about the reliability and validity of these systems and the opacity of their algorithms (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025). Consequently, although ASAQ holds great potential to enhance assessment practices in medical education, further empirical research is needed to ensure fairness, robustness, and broad acceptance within educational settings.

Rationale

Although the use of AI models for automated scoring of short-answer questions has gained increasing attention in medical education research, the breadth and depth of the existing literature remain unclear. A notable gap exists in understanding the range and types of AI models employed by medical educators to evaluate short-answer responses in student assessments. Consequently, a scoping review is warranted to systematically explore the extent and nature of current research and to map the available evidence on this topic.

Beyond identifying the AI models used, the review will examine the reported validity, reliability, and feasibility of these systems in comparison with traditional human grading methods. Such a review will help clarify the current landscape of AI-driven automated scoring, highlight research trends and gaps, and provide a comprehensive overview of how these technologies are being implemented in medical education assessment.

Methods

Research questions

The aim of the review is to systematically map the existing literature on automated scoring of SAQs in medical education, with a focus on the tools used and their accuracy, reliability, and fairness. The review will address the following research questions:

a. What AI-based models have been used for the automated scoring of SAQs in medical education?

b. How accurately does automated scoring of SAQs reflect the performance of human graders?

c. What advantages and challenges of using AI for automated scoring of SAQs have been reported in medical education?

d. Are these models more effective than human graders in terms of reliability, accuracy, and fairness?
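
The second research question concerns how closely automated scores track human scores. The protocol does not name a specific agreement metric, so the following is only an illustrative sketch of one statistic that included studies might report, unweighted Cohen's kappa, computed on hypothetical rubric scores. The function name `cohen_kappa` and the example score lists are assumptions introduced here for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of responses")
    n = len(rater_a)
    # Observed agreement: share of responses given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions per category.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores (0, 1, or 2 marks) for eight student responses.
human_scores = [2, 1, 0, 2, 1, 0, 2, 1]
ai_scores = [2, 1, 0, 2, 0, 0, 2, 2]

kappa = cohen_kappa(human_scores, ai_scores)
print(round(kappa, 3))  # 0.628
```

In this made-up example the two raters agree on six of eight responses (0.75), but kappa is lower because it discounts agreement expected by chance. Studies may instead report correlations, weighted kappa, or mean score differences, so a review would chart whichever metric each study uses.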

Search strategy

This scoping review will be conducted in accordance with the methodology outlined in the Joanna Briggs Institute (JBI) Manual for Evidence Synthesis, with a focus on AI models used for automated scoring of SAQs in medical school assessments. The Population, Concept, and Context (PCC) framework will guide the search strategy, eligibility criteria, and data extraction processes.

• Population: Medical students

• Concept: Automated scoring of short-answer questions using AI-based models

• Context: Medical education and medical school settings

Four electronic databases (PubMed, Scopus, Medline, and Web of Science) will be searched to comprehensively capture research related to AI applications in medical education assessment. Peer-reviewed quantitative, qualitative, and mixed-methods studies will be included.

A detailed search strategy will be developed in consultation with a medical librarian. Keywords such as “medical education,” “automated scoring,” “short answer questions,” and “medical school” will be used, combined with Boolean operators “AND” and “OR” to refine and optimize search results.
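
As an illustration of how keywords and Boolean operators combine, the sketch below assembles a generic query string: synonyms within a concept are joined with OR, and the concepts are joined with AND. The grouping into three concept blocks and the helper `build_query` are assumptions made for this example; the terms are drawn from the keywords listed in this protocol, and the final strategy developed with the librarian may differ and will need database-specific syntax.

```python
# Hypothetical concept blocks; the grouping is an illustrative assumption,
# not the authors' final search strategy.
CONCEPT_BLOCKS = {
    "scoring": ["automated scoring", "auto grading"],
    "item_format": ["short answer questions", "SAQs"],
    "setting": ["medical education", "medical school"],
}


def build_query(blocks):
    """Join synonyms with OR inside each block, then join the blocks with AND."""
    groups = []
    for terms in blocks.values():
        quoted = " OR ".join(f'"{term}"' for term in terms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


query = build_query(CONCEPT_BLOCKS)
print(query)
```

Keeping each concept in its own block makes it easy to add or remove synonyms and to report the exact string that was run in each database.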

Screening process

The Covidence systematic review software will be used to import references, conduct title and abstract screening, review full texts, and manage data extraction. Two independent reviewers will screen all retrieved studies against the predefined eligibility criteria. Discrepancies will be resolved by a third reviewer.

Screening will occur in two stages:

1. Title and abstract screening conducted independently by two reviewers.

2. Full-text review of studies deemed potentially relevant.

A secondary search will involve screening the reference lists of included studies to identify additional relevant literature.

Eligibility criteria

Inclusion criteria

1. Topic: Studies must focus on automated scoring of short-answer questions.

2. Methodology: Original empirical research, including quantitative, qualitative, and mixed-methods studies.

3. Study Population: Medical students completing assessments containing SAQs.

4. Assessment Type: Any assessment format using SAQs (e.g., low-stakes, high-stakes, summative examinations).

5. Publication Date: Studies published from 2015 onward.

6. Language: English-language publications only.

7. Computational Approach: Studies using AI-based autoscoring models (machine learning, large language models, deep learning).

8. Setting: Medical education or medical school settings.

Exclusion criteria

1. Studies not focusing on SAQs (e.g., MCQs, essays).

2. Non-English publications.

3. Studies not conducted within a medical education context or not involving medical students.

4. Studies using non-AI-based approaches to automated scoring.

5. Systematic reviews, scoping reviews, and grey literature.

6. Studies published before 2015.
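
The criteria above can be read as a checklist applied to each retrieved record. The following sketch shows one way such a checklist might be encoded so that every exclusion decision carries an explicit reason. The `StudyRecord` fields and the `exclusion_reasons` helper are hypothetical and are not part of the Covidence workflow; in practice reviewers make these judgments manually.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudyRecord:
    """Hypothetical fields a reviewer might code for each retrieved study."""
    year: int
    language: str
    publication_type: str   # e.g. "original empirical research", "scoping review"
    item_format: str        # e.g. "SAQ", "MCQ", "essay"
    ai_based_scoring: bool
    medical_students: bool


def exclusion_reasons(study: StudyRecord, earliest_year: int = 2015) -> List[str]:
    """Return every exclusion criterion the study meets (empty list = eligible)."""
    reasons = []
    if study.item_format != "SAQ":
        reasons.append("does not focus on SAQs")
    if study.language != "English":
        reasons.append("non-English publication")
    if not study.medical_students:
        reasons.append("not medical education or medical students")
    if not study.ai_based_scoring:
        reasons.append("non-AI-based scoring approach")
    if study.publication_type != "original empirical research":
        reasons.append("review or grey literature")
    if study.year < earliest_year:
        reasons.append(f"published before {earliest_year}")
    return reasons


eligible = StudyRecord(2024, "English", "original empirical research", "SAQ", True, True)
old_mcq = StudyRecord(2012, "English", "original empirical research", "MCQ", True, True)

print(exclusion_reasons(eligible))  # []
print(exclusion_reasons(old_mcq))
```

Recording all applicable reasons, rather than stopping at the first, makes it simpler to populate the exclusion counts in a PRISMA-ScR flow diagram.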

Data extraction and charting

A standardized data extraction form will be used to collect key information from all included studies. Two reviewers will independently extract data, with disagreements resolved by a third reviewer. Extracted data will include:

• Study methodology

• Type of AI-based model used

• Participant characteristics

• Assessment type

• Outcomes relating to the performance of the autoscoring system

Extracted data will be charted and summarized in tables and figures. Tables will outline study characteristics (e.g., author, publication year, AI model used), while figures will illustrate the frequency of AI models applied and the accuracy of their automated scoring.
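
To make the charting step concrete, the sketch below tallies how often each type of AI model appears across charted studies, which is the kind of count a frequency figure would display. The rows are placeholders invented for illustration, not extracted data, and the dictionary keys are assumed names for the extraction fields listed above.

```python
from collections import Counter

# Hypothetical charted rows; values are placeholders, not real findings.
charted_studies = [
    {"study": "Study A", "year": 2023, "methodology": "quantitative",
     "model_type": "large language model", "assessment_type": "summative"},
    {"study": "Study B", "year": 2021, "methodology": "quantitative",
     "model_type": "machine learning", "assessment_type": "low-stakes"},
    {"study": "Study C", "year": 2024, "methodology": "mixed methods",
     "model_type": "large language model", "assessment_type": "high-stakes"},
]

# Count how many studies used each type of AI model.
model_counts = Counter(row["model_type"] for row in charted_studies)

for model_type, count in model_counts.most_common():
    print(f"{model_type}: {count}")
```

The same pattern could be applied to any categorical extraction field, such as assessment type or study methodology.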

Data analysis

A thematic analysis approach will be used to identify recurring patterns and themes within the included studies. Key concepts related to the types of AI models used, their scoring accuracy, associated challenges and advantages, and their reported validity and reliability will be systematically coded and synthesized into overarching themes.

This analytical approach ensures alignment with the research questions and provides a comprehensive overview of the literature. The review will evaluate whether automated scoring of SAQs should be integrated more widely into medical education and identify existing research gaps and future directions.

Results will be summarized narratively and supported with tables and figures where appropriate. Both the protocol and final review will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines.

Expected outcomes

• A comprehensive mapping of current literature on the use of AI models for automated scoring of SAQs in medical education.

• Identification of the benefits and challenges associated with integrating automated scoring into medical assessments.

• Evaluation of the validity, reliability, and feasibility of automated scoring systems compared with human graders.

• Recommendations for future research on the validity, reliability, and implementation of AI-based autoscoring systems in medical education.

Dissemination

The findings from this scoping review will be presented at academic conferences and submitted for publication in peer-reviewed journals.

Ethical considerations

This review raises no ethical concerns because it will draw only on published peer-reviewed articles and grey literature. Therefore, ethical approval and consent are not required.

Data availability

No data is associated with this article.

Reporting guidelines

Repository: PRISMA-P checklist for “Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education”.

DOI: 10.6084/m9.figshare.30815456 (Çalışkan et al., 2025).

Data are available under the terms of the CC BY 4.0 licence.

References

Bolgova, Ganguly, Ikram: Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders. Med. Educ. Online. 2025;30(1). PMID: 40849930; DOI: 10.1080/10872981.2025.2550751; PMCID: PMC12377152

Boursicot KAM, Roberts, Burdick: Structured assessments of clinical competence. In: Swanwick, Forrest, O’Brien, editors. Understanding Medical Education: Evidence, Theory, and Practice. Wiley; 3rd ed. 2018; pp.335–345. DOI: 10.1002/9781119373780.CH23

Çalışkan, Abubakar Bello, Magzoub: PRISMA-P checklist for Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education. 2025, December 7. DOI: 10.6084/m9.figshare.30815456

Clauser, Yaneva, Baldwin: Automated Scoring of Short-Answer Questions: A Progress Report. Appl. Meas. Educ. 2024;37(3):209–224. DOI: 10.1080/08957347.2024.2386945

Gordon, Daniel, Ajiboye: A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Med. Teach. 2024;46(4):446–470. PMID: 38423127; DOI: 10.1080/0142159X.2024.2314198

Grévisse: LLM-based automatic short answer grading in undergraduate medical education. BMC Med. Educ. 2024;24(1):1060. PMID: 39334087; DOI: 10.1186/S12909-024-06026-5; PMCID: PMC11429088

Hallquist, Gupta, Montalbano: Applications of Artificial Intelligence in Medical Education: A Systematic Review. Cureus. 2025;17(3). DOI: 10.7759/CUREUS.79878

Jolly, Dalton: Written assessment. In: Swanwick, Forrest, O’Brien, editors. Understanding Medical Education: Evidence, Theory, and Practice. Wiley; 3rd ed. 2018; pp.291–317. DOI: 10.1002/9781119373780.CH21

Merritt, Glisson, Dewan: Implementation and Evaluation of an Artificial Intelligence Driven Simulation to Improve Resident Communication With Primary Care Providers. Acad. Pediatr. 2022;22(3):503–505. PMID: 34923145; DOI: 10.1016/J.ACAP.2021.12.013

Norcini, Anderson, Bollela: 2018 Consensus framework for good assessment. Med. Teach. 2018;40(11):1102–1109. PMID: 30299187; DOI: 10.1080/0142159X.2018.1500016

Potter, McLachlan: Assessing medical knowledge: A 3-year comparative study of very short answer vs. multiple choice questions. Med. Teach. 2025;47(10):1669–1677. PMID: 40293799; DOI: 10.1080/0142159X.2025.2496382

Preston, Gratani, Owens: Exploring the Impact of Assessment on Medical Students’ Learning. Assess. Eval. High. Educ. 2020;45(1):109–124. DOI: 10.1080/02602938.2019.1614145

Rajan, Alexander SMK, Shenvi: Can AI grade like a professor? comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams. Adv. Health Sci. Educ. 2025;1–11. DOI: 10.1007/S10459-025-10462-3

Rincón EHH, Jimenez, Aguilar LAC: Mapping the use of artificial intelligence in medical education: a scoping review. BMC Med. Educ. 2025;25(1):526. DOI: 10.1186/S12909-025-07089-8

Schuwirth LWT, Van Der Vleuten CPM: Different written assessment methods: what can be said about their strengths and weaknesses? Med. Educ. 2004;38(9):974–979. PMID: 15327679; DOI: 10.1111/J.1365-2929.2004.01916.X

Schuwirth LWT, van der Vleuten CPM: A history of assessment in medical education. Adv. Health Sci. Educ. Theory Pract. 2020;25(5):1045–1056. DOI: 10.1007/S10459-020-10003-0

Seneviratne HMTW, Manathunga: Artificial intelligence assisted automated short answer question scoring tool shows high correlation with human examiner markings. BMC Med. Educ. 2025;25(1):1146. PMID: 40764994; DOI: 10.1186/S12909-025-07718-2; PMCID: PMC12323204

Shumway, Harden: AMEE Guide No. 25: The assessment of learning outcomes for the competent and reflective physician. Medical Teacher. 2009. DOI: 10.1080/0142159032000151907

Yudkowsky, Park, Downing: Introduction to Assessment in the Health Professions. Assessment in Health Professions Education. Routledge; 2019; pp.3–16. DOI: 10.4324/9781138054394-1

10.5256/f1000research.193161.r474984

Reviewer response for version 1

Afzal

Azam

1 Referee https://orcid.org/0000-0003-1643-8261 1Aga Khan University, Karachi, Pakistan

Competing interests: No competing interests were disclosed.

9 6 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve-with-reservations

This protocol addresses a timely and highly relevant topic in health professions education. The increasing adoption of AI-based assessment systems, particularly Large Language Models (LLMs), makes a scoping review of automated scoring of short-answer questions (SAQs) both necessary and potentially impactful. The manuscript demonstrates a clear rationale, appropriate use of the Joanna Briggs Institute (JBI) methodology, and alignment with PRISMA-ScR reporting standards.

However, several methodological and conceptual issues limit the rigor, transparency, and reproducibility of the proposed review.

Major comments

1. The manuscript repeatedly refers to "AI-based models" but does not operationally define the term. Current automated scoring systems include rule-based systems, machine learning algorithms, deep learning models, transformer-based architectures, Large Language Models (GPT, Claude, Gemini, Llama, etc.), and hybrid NLP approaches. These categories differ substantially in methodology and performance. My suggestion would be to add a conceptual framework that helps readers distinguish between these terms.

2. Research question number 4 is problematic: "Are these models more effective than human graders in terms of reliability, accuracy, and fairness?" This question implies a comparative effectiveness judgment that may not be appropriate as it is beyond the aims of a scoping review. Scoping reviews aim to map evidence, describe characteristics and identify gaps. They generally do not evaluate superiority. A suggestion would be to replace it with: "How do studies compare the reliability, validity, fairness, and scoring performance of AI-based systems and human graders?"

3. The protocol states that databases will be searched and keywords will be combined using Boolean operators, but no search string is provided. As currently reported, the search cannot be replicated. The authors should include a complete search strategy.

4. The review includes studies from 2015 onward. This cutoff appears arbitrary. Authors should consider justifying the decision to exclude publications before 2015.

5. Data Analysis Plan Is Too Generic. The manuscript states that thematic analysis will be used. For a scoping review of AI systems, more detail is needed. A suggestion could be to describe how themes will be developed around: Types of AI systems, Accuracy metrics, Fairness concerns, Implementation barriers, educational outcomes.

6. The proposed extraction variables are too limited. Important variables are missing, such as: AI characteristics, assessment characteristics, performance metrics. It is recommended to include a mock extraction table as supplementary material.

7. The manuscript repeatedly mentions fairness in assessment, and the review aims to examine fairness, yet fairness is a subjective term and no operational definition or framework is provided.

Minor comments

The statement: "No ethical concerns..." should be revised. Perhaps a better wording could be: "Ethics approval was not required because the review utilizes publicly available published literature and does not involve human participants."

The manuscript cites PRISMA-P, which is designed for systematic review protocols. Since this is a scoping review protocol, the authors should clarify why PRISMA-P was selected rather than PRISMA-ScR extensions and JBI scoping review guidance.

Conclusion: The protocol addresses an important gap in medical education assessment research and has strong potential to contribute meaningfully to the field. However, several methodological details require clarification before indexing. In particular, the authors should strengthen the search strategy, provide a detailed data extraction framework, clarify definitions and outcome measures, justify eligibility restrictions, and elaborate on how fairness, validity, and reliability will be synthesized. These revisions would substantially improve the transparency, rigor, and reproducibility of the proposed review.

Is the study design appropriate for the research question?

Partly

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Partly

Are the datasets clearly presented in a useable and accessible format?

Not applicable

Reviewer Expertise:

Health Professions Education; Educational Development; Teaching / Learning theory and pedagogies; Technology and Simulation based learning; Authentic Assessment.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

10.5256/f1000research.193161.r486185

Reviewer response for version 1

Bafor

Anirejuoritse

1 Referee https://orcid.org/0000-0001-9278-5324 1Nationwide Children's Hospital, Columbus, OH, USA

Competing interests: No competing interests were disclosed.

2 6 2026

2026

recommendation

approve-with-reservations

The authors have presented a clear and well-articulated background and rationale for this scoping review protocol. The proposed study seeks to map the existing evidence on the use of artificial intelligence in the automated scoring of short-answer questions in medical education. The stated aim is to systematically map the literature with particular attention to the AI tools used, as well as their accuracy, reliability, and fairness. This is appropriate and relevant to current developments in medical education assessment.

The review is structured around the following research questions:

What AI-based models have been used for the automated scoring of short-answer questions in medical education?

How accurately does automated scoring of short-answer questions reflect the performance of human graders?

What advantages and challenges have been reported regarding the use of AI for automated scoring of short-answer questions in medical education?

Are these AI-based models more effective than human graders in terms of reliability, accuracy, and fairness?

The use of the Population, Concept, and Context framework to guide the search strategy, eligibility criteria, and data extraction process is appropriate for a scoping review. The involvement of a librarian in the database search is also a strength and supports the methodological rigor of the protocol.

However, the following points require clarification or revision:

Justification for the search date restriction - The authors should provide a clear rationale for excluding studies published before 2015. If this date restriction is based on the emergence or maturation of relevant AI technologies, this should be explicitly stated and justified.

Assessment of accuracy - The authors should clarify how the accuracy of automated scoring of short-answer questions will be determined from the included studies. For example, will accuracy be assessed based on correlation with human graders, agreement statistics, sensitivity/specificity, mean score differences, reliability coefficients, or other reported performance metrics?

Critical appraisal - What is the plan for assessing the quality, reliability, and potential bias of the studies included in this review?

Data extraction form - The authors should include a copy of the proposed data extraction form as an appendix or supplementary material. This would allow reviewers to better assess whether the planned extraction process adequately captures key variables such as AI model type, dataset characteristics, scoring method, comparator, accuracy metrics, reliability, fairness considerations, and reported limitations.

Overall, this is a timely and relevant scoping review protocol. Addressing the points above would improve the clarity, transparency, and reproducibility of the proposed methodology.

Is the study design appropriate for the research question?

Partly

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Partly

Are the datasets clearly presented in a useable and accessible format?

Not applicable

Reviewer Expertise:

Orthopedic Surgery

10.5256/f1000research.193161.r474986

Reviewer response for version 1

Mitra

Nilesh Kumar

1 Referee 1IMU University, Kuala Lumpur, Malaysia

Competing interests: No competing interests were disclosed.

7 5 2026

2026

recommendation

reject

The authors have attempted to develop a protocol for conducting a scoping review on the use of AI in automated scoring of short answer questions in medical education. Such a protocol without any visible data will probably be of no use to the prospective researcher.

The following are the elements of a protocol for a scoping review:

1. Review Questions

2. Eligibility criteria

3. Search Strategy

4. Data charting and mock extraction form

5. Template to be used for final report of review

The authors should improve the review questions through an evidence-based analysis of the literature on the topic; otherwise, the questions stand without supporting evidence. The eligibility criteria should be more descriptive. Each PCC component (Population, Concept, and Context) under the search strategy should be described in sufficient detail; the present description is too general.

The mechanism of data charting, the technology used, and the extraction form should be added.

Please also consult the JBI Manual for Evidence Synthesis.

Is the study design appropriate for the research question?

Partly

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Are the datasets clearly presented in a useable and accessible format?

Reviewer Expertise:

Technology-enhanced learning, Artificial intelligence, Online assessment

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.