Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education

S. Ayhan Çalişkan; Firdaus Bello Abubakar; Mohi Eldin Magzoub

doi:10.12688/f1000research.175198.1

Study Protocol

[version 1; peer review: 2 approved with reservations, 1 not approved]

S. Ayhan Çalişkan¹*, Firdaus Bello Abubakar¹*, Mohi Eldin Magzoub¹*

* Equal contributors

PUBLISHED 13 Mar 2026

Author details

¹ Department of Medical Education, United Arab Emirates University College of Medicine and Health Sciences, Al Ain, Abu Dhabi, United Arab Emirates

S. Ayhan Çalişkan
Roles: Conceptualization, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Firdaus Bello Abubakar
Roles: Conceptualization, Methodology, Project Administration, Writing – Original Draft Preparation, Writing – Review & Editing

Mohi Eldin Magzoub
Roles: Conceptualization, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Abstract

Assessment plays a central role in medical education by evaluating learners’ knowledge, skills, and professional competencies. While multiple-choice questions (MCQs) are widely used due to their efficiency and broad content coverage, they primarily assess recall and recognition, limiting their ability to measure higher-order reasoning. Short-answer questions (SAQs), in contrast, promote deeper cognitive processing and provide better discrimination between levels of student performance. However, SAQs are resource-intensive to grade and susceptible to scorer inconsistency and rater bias, highlighting a need for more efficient and reliable assessment solutions.

Artificial Intelligence (AI) has emerged as a transformative tool in medical education, enhancing learning, supporting adaptive instruction, and automating assessment processes. AI-driven systems using machine learning and natural language processing have been increasingly applied to automated scoring of SAQs. These systems offer potential benefits, including reduced grading burden, greater scoring consistency, and timely feedback to learners. Despite promising developments, concerns persist regarding algorithmic transparency, data privacy, and the reliability and validity of automated scoring compared with human graders. Existing studies report mixed results, underscoring the need for a comprehensive examination of current approaches.

This scoping review aims to systematically map the literature on AI-based models used for automated scoring of SAQs in medical education. Specifically, it seeks to identify the types of AI models employed, evaluate their accuracy and reliability relative to human graders, describe reported advantages and challenges, and assess fairness and feasibility within educational settings. Following the Joanna Briggs Institute methodology and the Population–Concept–Context framework, the review will include empirical studies published since 2015 involving medical students and AI-driven SAQ scoring. Findings will provide an evidence-based overview of current practices, highlight gaps in the literature, and inform future research and implementation strategies for AI-assisted assessment in medical education.

Keywords

Artificial intelligence, Automated scoring, Auto grading, Short-Answer Questions, SAQs, Assessment, Medical education, Machine learning, Natural language processing, Assessment reliability, Assessment validity, Autoscoring tools

Corresponding author: Mohi Eldin Magzoub

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2026 Çalişkan SA et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Çalişkan SA, Bello Abubakar F and Magzoub ME. Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2026, 15:395 (https://doi.org/10.12688/f1000research.175198.1) First published: 13 Mar 2026, 15:395 (https://doi.org/10.12688/f1000research.175198.1) Latest published: 13 Mar 2026, 15:395 (https://doi.org/10.12688/f1000research.175198.1)

Introduction

Assessment in medical education

Assessment is a central component of medical education, serving to evaluate learners’ knowledge, skills, and attitudes through both formative and summative assessment methods (Schuwirth & van der Vleuten, 2020; Yudkowsky et al., 2019). Effective assessment not only guides and motivates students’ learning but also provides an essential mechanism for determining whether they have attained the competencies required of medical professionals. Medical schools employ a range of assessment strategies to monitor students’ progress in acquiring core knowledge and clinical competencies (Norcini et al., 2018; Shumway & Harden, 2009). Common assessment methods used in medical education include Multiple-Choice Questions (MCQs), Modified Essay Questions (MEQs), Short-Answer Questions (SAQs), Objective Structured Clinical Examinations (OSCEs), and Key Feature Problems (KFPs) (Boursicot et al., 2018; Jolly & Dalton, 2018). Assessing students’ learning is therefore fundamental to ensuring high-quality medical training, and the choice of assessment method directly influences the breadth and depth of knowledge or skills that can be evaluated (Preston et al., 2020). Crucially, each assessment method must be aligned with the targeted competencies, the instructional approaches used, and the desired impact on student learning (Yudkowsky et al., 2019).

Short-answer questions vs MCQs

Multiple-choice questions (MCQs) and SAQs are among the commonly used written assessment formats in medical education (Jolly & Dalton, 2018; Preston et al., 2020; Shumway & Harden, 2009). MCQs, in which students select a response from predetermined options, are widely favored because they allow broad sampling of content, assess a wide range of knowledge areas efficiently, and facilitate rapid, objective grading (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). Although MCQs effectively evaluate recall, recognition, and factual knowledge, they have been criticized for their limited ability to promote critical thinking or accurately assess deeper mastery of subject matter (Schuwirth & Van Der Vleuten, 2004). Their reliance on recognition-based answering may discourage deep learning and higher-order reasoning, making it challenging to determine whether students have truly internalized the material (Schuwirth & Van Der Vleuten, 2004; Shumway & Harden, 2009).

SAQs, by contrast, require students to generate concise written responses, thereby encouraging active retrieval, deeper cognitive processing, and higher-order reasoning (Grévisse, 2024; Jolly & Dalton, 2018; Potter & McLachlan, 2025). SAQs enable assessment of a broad range of competencies and cognitive skills and often demonstrate higher reliability and better discrimination (that is, a stronger ability to differentiate between high- and low-performing students) than MCQs, making them a valuable component of medical assessment systems (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). However, SAQs are more resource-intensive to grade, and concerns about scorer inconsistency and rater bias can pose challenges to their validity (Grévisse, 2024; Potter & McLachlan, 2025). The emergence of AI-based autoscoring tools offers a promising solution, with early evidence suggesting improved efficiency, reduced bias, and enhanced scoring consistency (Grévisse, 2024).

Emergence of AI in medical education assessment

Artificial Intelligence (AI) is emerging as a powerful and transformative tool in medical education. AI technologies are increasingly integrated into educational systems to enhance students’ learning experiences, prepare them for an AI-driven healthcare environment, personalize learning, and improve assessment processes (Hallquist et al., 2025; Rincón et al., 2025). AI tools employ Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP) techniques to support adaptive learning platforms, automate question generation, evaluate student responses, and deliver timely, personalized feedback (Hallquist et al., 2025; Rincón et al., 2025).

The incorporation of AI has enhanced students’ learning through AI-assisted assessment platforms that enable learners to practice applying their knowledge while receiving immediate feedback (Gordon et al., 2024). AI-driven simulated case presentations, in which AI functions as a virtual physician or simulated patient, have been shown to improve students’ communication and clinical skills (Merritt et al., 2022; Rincón et al., 2025). For medical educators, AI has reduced workload by automating assessment processes and generating examination questions (Hallquist et al., 2025; Seneviratne & Manathunga, 2025). Educators can use AI to develop assessment items, evaluate item reliability, and automatically score student responses. Despite these advantages, concerns remain regarding data privacy, ethical use of learner information, and, notably, the transparency of AI algorithms employed in automated assessment and feedback systems.

Automatic scoring of short answer questions

Automated assessment systems use Machine Learning (ML) and Natural Language Processing (NLP) techniques to automate the scoring of SAQs and essays (Grévisse, 2024; Seneviratne & Manathunga, 2025). These systems can process large numbers of student responses and provide timely feedback, thereby reducing educator workload, minimizing grader bias, and supporting improved student learning and performance (Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025).

Automated scoring of short-answer questions (ASAQ) was first introduced in the 1960s and has since undergone substantial development, incorporating increasingly sophisticated statistical, ML, and NLP approaches to improve scoring accuracy and reliability. Recently, the application of ASAQ has gained considerable attention in medical education, where grading large volumes of SAQs is both time-consuming and susceptible to rater bias (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Rajan et al., 2025; Seneviratne & Manathunga, 2025). Research on ASAQ has reported mixed findings: while many studies demonstrate that automated scoring can achieve results comparable to human graders, some highlight concerns related to the system’s reliability, validity, and the opacity of its algorithms (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025). Consequently, although ASAQ holds great potential to enhance assessment practices in medical education, further empirical research is needed to ensure fairness, robustness, and broad acceptance within educational settings.

Rationale

Although the use of AI models for automated scoring of short-answer questions has gained increasing attention in medical education research, the breadth and depth of the existing literature remain unclear. A notable gap exists in understanding the range and types of AI models employed by medical educators to evaluate short-answer responses in student assessments. Consequently, a scoping review is warranted to systematically explore the extent and nature of current research and to map the available evidence on this topic.

Beyond identifying the AI models used, the review will examine the reported validity, reliability, and feasibility of these systems in comparison with traditional human grading methods. Such a review will help clarify the current landscape of AI-driven automated scoring, highlight research trends and gaps, and provide a comprehensive overview of how these technologies are being implemented in medical education assessment.

Methods

Research questions

The aim of the review is to systematically map the existing literature on automated scoring of SAQs in medical education, with a focus on utilized tools, accuracy, reliability, and fairness. The review will address the following research questions:

a. What AI-based models have been used for automated scoring of SAQs in medical education?
b. How accurately does automated scoring of SAQs reflect the performance of human graders?
c. What advantages and challenges of using AI for automated scoring of SAQs have been reported in medical education?
d. Are these models more effective than human graders in terms of reliability, accuracy, and fairness?
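
Research question (b) presupposes some way of quantifying how closely automated scores track human scores. The protocol does not specify a metric, so the following Python sketch is only an illustration of measures commonly reported in scoring studies: exact agreement, mean score difference, Pearson correlation, and unweighted Cohen's kappa. The score lists and function names are hypothetical and are not drawn from any included study.

```python
from collections import Counter
from math import sqrt


def exact_agreement(a, b):
    """Proportion of responses given the identical score by both raters."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_difference(a, b):
    """Mean of (a - b); a positive value means rater a scores higher on average."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


def pearson_r(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance.

    Undefined (division by zero) when both raters give every response
    the same single score.
    """
    n = len(a)
    p_observed = exact_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    categories = count_a.keys() | count_b.keys()
    p_expected = sum(count_a[k] * count_b[k] for k in categories) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical marks (0 to 2) awarded to eight SAQ responses.
human_scores = [2, 1, 0, 2, 1, 1, 0, 2]
ai_scores = [2, 1, 0, 1, 1, 2, 0, 2]

print(exact_agreement(human_scores, ai_scores))
print(mean_difference(ai_scores, human_scores))
print(round(pearson_r(human_scores, ai_scores), 3))
print(round(cohen_kappa(human_scores, ai_scores), 3))
```

Which of these measures the included studies actually report is itself one of the things the review would need to chart; a high correlation, for example, can coexist with a systematic difference in mean scores.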

Search strategy

This scoping review will be conducted in accordance with the methodology outlined in the Joanna Briggs Institute (JBI) Manual for Evidence Synthesis, with a focus on AI models used for automated scoring of SAQs in medical school assessments. The Population, Concept, and Context (PCC) framework will guide the search strategy, eligibility criteria, and data extraction processes.

• Population: Medical students
• Concept: Automated scoring of short-answer questions using AI-based models
• Context: Medical education and medical school settings

Four electronic databases will be searched: PubMed, Scopus, Medline, and Web of Science, to comprehensively capture research related to AI applications in medical education assessment. Peer-reviewed quantitative, qualitative, and mixed-methods studies will be included.

A detailed search strategy will be developed in consultation with a medical librarian. Keywords such as “medical education,” “automated scoring,” “short answer questions,” and “medical school” will be used, combined with Boolean operators “AND” and “OR” to refine and optimize search results.
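
The protocol does not publish a final search string. As a minimal sketch of how the Boolean combination described above might be assembled, the Python snippet below joins synonyms within a concept block using OR and joins the blocks using AND. The terms are the keyword examples given in the text; the way they are grouped into blocks, and the function name `build_query`, are assumptions made for illustration, and real database syntax (field tags, truncation, controlled vocabulary) would still need to be added per database.

```python
def build_query(concept_blocks):
    """OR the synonyms inside each concept block, then AND the blocks together."""
    groups = []
    for terms in concept_blocks:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


# Terms come from the protocol's keyword examples; the grouping is assumed.
blocks = [
    ["medical education", "medical school"],  # context
    ["automated scoring"],                    # concept: scoring approach
    ["short answer questions"],               # concept: item format
]

query = build_query(blocks)
print(query)
```

Keeping each concept in its own block makes it straightforward to add synonyms later without changing the overall logic of the search.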

Screening process

The Covidence systematic review software will be used to import references, conduct title and abstract screening, review full texts, and manage data extraction. Two independent reviewers will screen all retrieved studies based on the predefined eligibility criteria. Discrepancies will be resolved by a third reviewer.

Screening will occur in two stages:

1. Title and abstract screening conducted independently by two reviewers.
2. Full-text review of studies deemed potentially relevant.

A secondary search will involve screening the reference lists of included studies to identify additional relevant literature.
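
The dual-reviewer logic described above can be sketched in a few lines of Python. This is not Covidence's interface, which handles conflicts internally; the function name, record identifiers, and decisions below are hypothetical and serve only to make the decision rule explicit: agreement between the two reviewers stands, and any disagreement is settled by the third reviewer.

```python
def resolve_screening(reviewer_a, reviewer_b, third_reviewer):
    """Combine two independent screening passes into final decisions.

    reviewer_a, reviewer_b: dicts mapping record id -> bool (True = include).
    third_reviewer: dict giving a decision for each conflicting record.
    Returns (final decisions, list of record ids that were in conflict).
    """
    final, conflicts = {}, []
    for record_id, decision_a in reviewer_a.items():
        decision_b = reviewer_b[record_id]
        if decision_a == decision_b:
            final[record_id] = decision_a
        else:
            conflicts.append(record_id)
            final[record_id] = third_reviewer[record_id]
    return final, conflicts


# Hypothetical title-and-abstract decisions for four records.
reviewer_a = {"rec1": True, "rec2": False, "rec3": True, "rec4": False}
reviewer_b = {"rec1": True, "rec2": True, "rec3": True, "rec4": False}
third_reviewer = {"rec2": False}

final, conflicts = resolve_screening(reviewer_a, reviewer_b, third_reviewer)
print(final)
print(conflicts)
```

The list of conflicts is worth retaining, since the number of disagreements at each stage is commonly reported alongside the screening flow.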

Eligibility criteria

Inclusion criteria

1. Topic: Studies must focus on automated scoring of short-answer questions.
2. Methodology: Original empirical research, including quantitative, qualitative, and mixed-methods studies.
3. Study Population: Medical students completing assessments containing SAQs.
4. Assessment Type: Any assessment format using SAQs (e.g., low-stakes, high-stakes, summative examinations).
5. Publication Date: Studies published from 2015 onward.
6. Language: English-language publications only.
7. Computational Approach: Studies using AI-based autoscoring models (machine learning, large language models, deep learning).
8. Setting: Medical education or medical school settings.

Exclusion criteria

1. Studies not focusing on SAQs (e.g., MCQs, essays).
2. Non-English publications.
3. Studies not conducted within a medical education context or not involving medical students.
4. Studies using non-AI-based approaches to automated scoring.
5. Systematic reviews, scoping reviews, and grey literature.
6. Studies published before 2015.
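
To illustrate how the criteria above could be applied consistently during screening, the following Python sketch encodes the six exclusion criteria as checks on a record. The `Record` fields and their permitted values are assumptions introduced for this example; in practice these judgments are made by reviewers reading the study, not by automated matching.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical screening record; field names are illustrative only."""
    year: int
    language: str
    study_type: str       # e.g. "empirical", "systematic review", "grey literature"
    question_format: str  # e.g. "SAQ", "MCQ", "essay"
    population: str       # e.g. "medical students"
    uses_ai_scoring: bool


def exclusion_reasons(record):
    """Return the exclusion criteria a record meets (empty list = eligible)."""
    reasons = []
    if record.question_format != "SAQ":
        reasons.append("not focused on SAQs")
    if record.language != "English":
        reasons.append("non-English publication")
    if record.population != "medical students":
        reasons.append("not medical students or medical education")
    if not record.uses_ai_scoring:
        reasons.append("non-AI-based scoring approach")
    if record.study_type != "empirical":
        reasons.append("review or grey literature")
    if record.year < 2015:
        reasons.append("published before 2015")
    return reasons


eligible = Record(2024, "English", "empirical", "SAQ", "medical students", True)
ineligible = Record(2012, "English", "empirical", "MCQ", "medical students", True)

print(exclusion_reasons(eligible))
print(exclusion_reasons(ineligible))
```

Returning every applicable reason, not just the first, makes it easier to tabulate why records were excluded at the full-text stage.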

Data extraction and charting

A standardized data extraction form will be used to collect key information from all included studies. Two reviewers will independently extract data, with disagreements resolved by a third reviewer. Extracted data will include:

• Study methodology
• Type of AI-based model used
• Participant characteristics
• Assessment type
• Outcomes relating to the performance of the autoscoring system

Extracted data will be charted and summarized in tables and figures. Tables will outline study characteristics (e.g., author, publication year, AI model used), while figures will illustrate the frequency of AI models applied and the accuracy of their automated scoring.
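
A minimal Python sketch of such a charting step is given below, assuming one row per included study with the fields listed above. The field names, the placeholder studies ("Study A" and so on), and the category labels are hypothetical; they do not represent the authors' extraction form or any real study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ExtractionRow:
    """One row of a hypothetical extraction form."""
    author: str
    year: int
    methodology: str
    ai_model_type: str
    participants: str
    assessment_type: str
    performance_outcome: str


# Placeholder rows for illustration only; not real studies or results.
rows = [
    ExtractionRow("Study A", 2023, "quantitative", "large language model",
                  "undergraduate medical students", "summative", "agreement with human graders"),
    ExtractionRow("Study B", 2024, "quantitative", "large language model",
                  "undergraduate medical students", "formative", "correlation with human graders"),
    ExtractionRow("Study C", 2019, "mixed methods", "machine learning",
                  "undergraduate medical students", "summative", "agreement with human graders"),
]

# Frequency of AI model types across included studies, as input for a figure.
model_frequency = Counter(row.ai_model_type for row in rows)

for model_type, count in model_frequency.most_common():
    print(f"{model_type}: {count}")
```

A fixed set of category labels for `ai_model_type` would need to be agreed before extraction, otherwise near-synonyms would be counted as separate model types.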

Data analysis

A thematic analysis approach will be used to identify recurring patterns and themes within the included studies. Key concepts related to the types of AI models used, their scoring accuracy, associated challenges and advantages, and their reported validity and reliability will be systematically coded and synthesized into overarching themes.

This analytical approach ensures alignment with the research questions and provides a comprehensive overview of the literature. The review will evaluate whether automated scoring of SAQs should be integrated more widely into medical education and identify existing research gaps and future directions.

Results will be summarized narratively and supported with tables and figures where appropriate. Both the protocol and final review will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines.

Expected outcomes

• A comprehensive mapping of current literature on the use of AI models for automated scoring of SAQs in medical education.
• Identification of the benefits and challenges associated with integrating automated scoring into medical assessments.
• Evaluation of the validity, reliability, and feasibility of automated scoring systems compared with human graders.
• Recommendations for future research on the validity, reliability, and implementation of AI-based autoscoring systems in medical education.

Dissemination

The findings from this scoping review will be presented at academic conferences and submitted for publication in peer-reviewed journals.

Ethical considerations

No ethical concerns arise, as this review will use published peer-reviewed articles and grey literature. Therefore, ethical approval and consent are not required.

Data availability

No data is associated with this article.

Reporting guidelines

Repository: PRISMA-P checklist for “Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education”.

DOI: 10.6084/m9.figshare.30815456 (Çalışkan et al., 2025).

Data are available under the terms of the CC BY 4.0 license.

References

Bolgova O, Ganguly P, Ikram MF, et al.: Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders. Med. Educ. Online. 2025; 30(1).
Boursicot KAM, Roberts TE, Burdick WP: Structured assessments of clinical competence. Swanwick T, Forrest K, O’Brien BC, editors. Understanding Medical Education: Evidence, Theory, and Practice. Wiley; 3rd ed. 2018; pp. 335–345.
Çalışkan SA, Abubakar Bello F, Magzoub ME: PRISMA-P checklist for Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education. 2025, December 7.
Clauser BE, Yaneva V, Baldwin P, et al.: Automated Scoring of Short-Answer Questions: A Progress Report. Appl. Meas. Educ. 2024; 37(3): 209–224.
Gordon M, Daniel M, Ajiboye A, et al.: A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Med. Teach. 2024; 46(4): 446–470.
Grévisse C: LLM-based automatic short answer grading in undergraduate medical education. BMC Med. Educ. 2024; 24(1): 1060.
Hallquist E, Gupta I, Montalbano M, et al.: Applications of Artificial Intelligence in Medical Education: A Systematic Review. Cureus. 2025; 17(3).
Jolly B, Dalton MJ: Written assessment. Swanwick T, Forrest K, O’Brien BC, editors. Understanding Medical Education: Evidence, Theory, and Practice. Wiley; 3rd ed. 2018; pp. 291–317.
Merritt C, Glisson M, Dewan M, et al.: Implementation and Evaluation of an Artificial Intelligence Driven Simulation to Improve Resident Communication With Primary Care Providers. Acad. Pediatr. 2022; 22(3): 503–505.
Norcini J, Anderson MB, Bollela V, et al.: 2018 Consensus framework for good assessment. Med. Teach. 2018; 40(11): 1102–1109.
Potter HG, McLachlan JC: Assessing medical knowledge: A 3-year comparative study of very short answer vs. multiple choice questions. Med. Teach. 2025; 47(10): 1669–1677.
Preston R, Gratani M, Owens K, et al.: Exploring the Impact of Assessment on Medical Students’ Learning. Assess. Eval. High. Educ. 2020; 45(1): 109–124.
Rajan A, Alexander SMK, Shenvi CL: Can AI grade like a professor? Comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams. Adv. Health Sci. Educ. 2025; 1–11.
Rincón EHH, Jimenez D, Aguilar LAC, et al.: Mapping the use of artificial intelligence in medical education: a scoping review. BMC Med. Educ. 2025; 25(1): 526.
Schuwirth LWT, Van Der Vleuten CPM: Different written assessment methods: what can be said about their strengths and weaknesses? Med. Educ. 2004; 38(9): 974–979.
Schuwirth LWT, van der Vleuten CPM: A history of assessment in medical education. Adv. Health Sci. Educ. Theory Pract. 2020; 25(5): 1045–1056.
Seneviratne HMTW, Manathunga SS: Artificial intelligence assisted automated short answer question scoring tool shows high correlation with human examiner markings. BMC Med. Educ. 2025; 25(1): 1146.
Shumway JM, Harden RM: Medical Teacher AMEE Guide No. 25: The assessment of learning outcomes for the competent and reflective physician. 2009.
Yudkowsky R, Park YS, Downing SM: Introduction to Assessment in the Health Professions. Assessment in Health Professions Education. Routledge; 2019; pp. 3–16.

Open Peer Review

Reviewer Report 09 Jun 2026

Azam Afzal, Aga Khan University, Karachi, Pakistan

Approved with Reservations

https://doi.org/10.5256/f1000research.193161.r474984

This protocol addresses a timely and highly relevant topic in health professions education. The increasing adoption of AI-based assessment systems, particularly Large Language Models (LLMs), makes a scoping review of automated scoring of short-answer questions (SAQs) both necessary and potentially impactful. The manuscript demonstrates a clear rationale, appropriate use of the Joanna Briggs Institute (JBI) methodology, and alignment with PRISMA-ScR reporting standards.
However, several methodological and conceptual issues limit the rigor, transparency, and reproducibility of the proposed review.

Major comments
1. The manuscript repeatedly refers to "AI-based models" but does not operationally define the term. Current automated scoring systems include rule-based systems, machine learning algorithms, deep learning models, transformer-based architectures, large language models (GPT, Claude, Gemini, Llama, etc.), and hybrid NLP approaches. These categories differ substantially in methodology and performance. My suggestion would be to add a conceptual framework that helps readers distinguish between these terms.

2. Research question number 4 is problematic: "Are these models more effective than human graders in terms of reliability, accuracy, and fairness?" This question implies a comparative effectiveness judgment that may not be appropriate as it is beyond the aims of a scoping review. Scoping reviews aim to map evidence, describe characteristics and identify gaps. They generally do not evaluate superiority. A suggestion would be to replace it with: "How do studies compare the reliability, validity, fairness, and scoring performance of AI-based systems and human graders?"

3. The protocol states that databases will be searched and keywords will be combined using Boolean operators, but no search string is provided. As currently reported, the search cannot be replicated. The authors should include the complete search strategy.

4. The review includes studies from 2015 onward. This cutoff appears arbitrary. Authors should consider justifying the decision to exclude publications before 2015.

5. Data Analysis Plan Is Too Generic. The manuscript states that thematic analysis will be used. For a scoping review of AI systems, more detail is needed. A suggestion could be to describe how themes will be developed around: Types of AI systems, Accuracy metrics, Fairness concerns, Implementation barriers, educational outcomes.

6. The proposed extraction variables are too limited. Important variables are missing, such as: AI characteristics, assessment characteristics, performance metrics. It is recommended to include a mock extraction table as supplementary material.

7. The manuscript repeatedly mentions fairness in assessment, and the review aims to examine fairness, yet fairness is a subjective term and no operational definition or framework is provided.

Minor comments
The statement: "No ethical concerns..." should be revised. Perhaps a better wording could be: "Ethics approval was not required because the review utilizes publicly available published literature and does not involve human participants."

The manuscript cites PRISMA-P, which is designed for systematic review protocols. Since this is a scoping review protocol, the authors should clarify why PRISMA-P was selected rather than PRISMA-ScR extensions and JBI scoping review guidance.

Conclusion: The protocol addresses an important gap in medical education assessment research and has strong potential to contribute meaningfully to the field. However, several methodological details require clarification before indexing. In particular, the authors should strengthen the search strategy, provide a detailed data extraction framework, clarify definitions and outcome measures, justify eligibility restrictions, and elaborate on how fairness, validity, and reliability will be synthesized. These revisions would substantially improve the transparency, rigor, and reproducibility of the proposed review.

Is the rationale for, and objectives of, the study clearly described?
Yes

Is the study design appropriate for the research question?
Partly

Are sufficient details of the methods provided to allow replication by others?
Partly

Are the datasets clearly presented in a useable and accessible format?
Not applicable

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Health Professions Education; Educational Development; Teaching / Learning theory and pedagogies; Technology and Simulation based learning; Authentic Assessment.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 02 Jun 2026

Anirejuoritse Bafor, Nationwide Children's Hospital, Columbus, OH, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.193161.r486185

The authors have presented a clear and well-articulated background and rationale for this scoping review protocol. The proposed study seeks to map the existing evidence on the use of artificial intelligence in the automated scoring of short-answer questions in medical education. The stated aim is to systematically map the literature with particular attention to the AI tools used, as well as their accuracy, reliability, and fairness. This is appropriate and relevant to current developments in medical education assessment.
The review is structured around the following research questions:

What AI-based models have been used for the automated scoring of short-answer questions in medical education?
How accurately does automated scoring of short-answer questions reflect the performance of human graders?
What advantages and challenges have been reported regarding the use of AI for automated scoring of short-answer questions in medical education?
Are these AI-based models more effective than human graders in terms of reliability, accuracy, and fairness?

The use of the Population, Concept, and Context framework to guide the search strategy, eligibility criteria, and data extraction process is appropriate for a scoping review. The involvement of a librarian in the database search is also a strength and supports the methodological rigor of the protocol.
However, the following points require clarification or revision:

Justification for the search date restriction - The authors should provide a clear rationale for excluding studies published before 2015. If this date restriction is based on the emergence or maturation of relevant AI technologies, this should be explicitly stated and justified.
Assessment of accuracy - The authors should clarify how the accuracy of automated scoring of short-answer questions will be determined from the included studies. For example, will accuracy be assessed based on correlation with human graders, agreement statistics, sensitivity/specificity, mean score differences, reliability coefficients, or other reported performance metrics?
Critical appraisal - What is the plan for assessing the quality, reliability, and potential bias of the studies included in this review?
Data extraction form - The authors should include a copy of the proposed data extraction form as an appendix or supplementary material. This would allow reviewers to better assess whether the planned extraction process adequately captures key variables such as AI model type, dataset characteristics, scoring method, comparator, accuracy metrics, reliability, fairness considerations, and reported limitations.

Overall, this is a timely and relevant scoping review protocol. Addressing the points above would improve the clarity, transparency, and reproducibility of the proposed methodology.

Is the rationale for, and objectives of, the study clearly described?

Yes
Is the study design appropriate for the research question?

Partly
Are sufficient details of the methods provided to allow replication by others?

Partly
Are the datasets clearly presented in a useable and accessible format?

Not applicable

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Orthopedic Surgery

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 07 May 2026

Nilesh Kumar Mitra, IMU University, Kuala Lumpur, Malaysia

Not Approved

https://doi.org/10.5256/f1000research.193161.r474986

The authors have attempted to develop a protocol for conducting a scoping review on the use of AI in automated scoring of short answer questions in medical education. Such a protocol without any visible data will probably be of no use to the prospective researcher.
The following are the elements of a scoping review protocol:
1. Review questions
2. Eligibility criteria
3. Search strategy
4. Data charting and mock extraction form
5. Template to be used for the final report of the review
The authors should improve the review questions through an evidence-based analysis of the literature related to the topic; otherwise, the questions stand alone without supporting evidence. The eligibility criteria should be more descriptive. Each component (P, C, and C) under the search strategy should be described in detail. The present description is too general.
The mechanism of data charting, the technology used, and the extraction form should be added.
Please also consult the JBI Manual for Evidence Synthesis.

Is the rationale for, and objectives of, the study clearly described?

Yes
Is the study design appropriate for the research question?

Partly
Are sufficient details of the methods provided to allow replication by others?

No
Are the datasets clearly presented in a useable and accessible format?

No

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Technology-enhanced learning, Artificial intelligence, Online assessment

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

[1] Bolgova O, Ganguly P, Ikram MF, et al.: Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders. Med. Educ. Online. 2025; 30(1).

[2] Boursicot KAM, Roberts TE, Burdick WP: Structured assessments of clinical competence. In: Swanwick T, Forrest K, O’Brien BC, editors. Understanding Medical Education: Evidence, Theory, and Practice. 3rd ed. Wiley; 2018; pp. 335–345.

[3] Çalışkan SA, Abubakar Bello F, Magzoub ME: PRISMA-P checklist for Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education. 2025, December 7.

[4] Clauser BE, Yaneva V, Baldwin P, et al.: Automated Scoring of Short-Answer Questions: A Progress Report. Appl. Meas. Educ. 2024; 37(3): 209–224.

[5] Gordon M, Daniel M, Ajiboye A, et al.: A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Med. Teach. 2024; 46(4): 446–470.

[6] Grévisse C: LLM-based automatic short answer grading in undergraduate medical education. BMC Med. Educ. 2024; 24(1): 1060.

[7] Hallquist E, Gupta I, Montalbano M, et al.: Applications of Artificial Intelligence in Medical Education: A Systematic Review. Cureus. 2025; 17(3).

[8] Jolly B, Dalton MJ: Written assessment. In: Swanwick T, Forrest K, O’Brien BC, editors. Understanding Medical Education: Evidence, Theory, and Practice. 3rd ed. Wiley; 2018; pp. 291–317.

[9] Merritt C, Glisson M, Dewan M, et al.: Implementation and Evaluation of an Artificial Intelligence Driven Simulation to Improve Resident Communication With Primary Care Providers. Acad. Pediatr. 2022; 22(3): 503–505.

[10] Norcini J, Anderson MB, Bollela V, et al.: 2018 Consensus framework for good assessment. Med. Teach. 2018; 40(11): 1102–1109.

[11] Potter HG, McLachlan JC: Assessing medical knowledge: A 3-year comparative study of very short answer vs. multiple choice questions. Med. Teach. 2025; 47(10): 1669–1677.

[12] Preston R, Gratani M, Owens K, et al.: Exploring the Impact of Assessment on Medical Students’ Learning. Assess. Eval. High. Educ. 2020; 45(1): 109–124.

[13] Rajan A, Alexander SMK, Shenvi CL: Can AI grade like a professor? comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams. Adv. Health Sci. Educ. 2025; 1–11.

[14] Rincón EHH, Jimenez D, Aguilar LAC, et al.: Mapping the use of artificial intelligence in medical education: a scoping review. BMC Med. Educ. 2025; 25(1): 526.

[15] Schuwirth LWT, van der Vleuten CPM: Different written assessment methods: what can be said about their strengths and weaknesses? Med. Educ. 2004; 38(9): 974–979.

[16] Schuwirth LWT, van der Vleuten CPM: A history of assessment in medical education. Adv. Health Sci. Educ. Theory Pract. 2020; 25(5): 1045–1056.

[17] Seneviratne HMTW, Manathunga SS: Artificial intelligence assisted automated short answer question scoring tool shows high correlation with human examiner markings. BMC Med. Educ. 2025; 25(1): 1146.

[18] Shumway JM, Harden RM: Medical Teacher AMEE Guide No. 25: The assessment of learning outcomes for the competent and reflective physician. 2009.

[19] Yudkowsky R, Park YS, Downing SM: Introduction to Assessment in the Health Professions. Assessment in Health Professions Education. Routledge; 2019; pp. 3–16.
