Study Protocol

Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education

[version 1; peer review: 2 approved with reservations, 1 not approved]
* Equal contributors
PUBLISHED 13 Mar 2026

Abstract

Assessment plays a central role in medical education by evaluating learners’ knowledge, skills, and professional competencies. While multiple-choice questions (MCQs) are widely used due to their efficiency and broad content coverage, they primarily assess recall and recognition, limiting their ability to measure higher-order reasoning. Short-answer questions (SAQs), in contrast, promote deeper cognitive processing and provide better discrimination between levels of student performance. However, SAQs are resource-intensive to grade and susceptible to scorer inconsistency and rater bias, highlighting a need for more efficient and reliable assessment solutions.

Artificial Intelligence (AI) has emerged as a transformative tool in medical education, enhancing learning, supporting adaptive instruction, and automating assessment processes. AI-driven systems using machine learning and natural language processing have been increasingly applied to automated scoring of SAQs. These systems offer potential benefits, including reduced grading burden, greater scoring consistency, and timely feedback to learners. Despite promising developments, concerns persist regarding algorithmic transparency, data privacy, and the reliability and validity of automated scoring compared with human graders. Existing studies report mixed results, underscoring the need for a comprehensive examination of current approaches.

This scoping review aims to systematically map the literature on AI-based models used for automated scoring of SAQs in medical education. Specifically, it seeks to identify the types of AI models employed, evaluate their accuracy and reliability relative to human graders, describe reported advantages and challenges, and assess fairness and feasibility within educational settings. Following the Joanna Briggs Institute methodology and the Population–Concept–Context framework, the review will include empirical studies published since 2015 involving medical students and AI-driven SAQ scoring. Findings will provide an evidence-based overview of current practices, highlight gaps in the literature, and inform future research and implementation strategies for AI-assisted assessment in medical education.

Keywords

Artificial intelligence, Automated scoring, Auto grading, Short-Answer Questions, SAQs, Assessment, Medical education, Machine learning, Natural language processing, Assessment reliability, Assessment validity, Autoscoring tools

Introduction

Assessment in medical education

Assessment is a central component of medical education, serving to evaluate learners’ knowledge, skills, and attitudes through both formative and summative assessment methods (Schuwirth & van der Vleuten, 2020; Yudkowsky et al., 2019). Effective assessment not only guides and motivates students’ learning but also provides an essential mechanism for determining whether they have attained the competencies required of medical professionals. Medical schools employ a range of assessment strategies to monitor students’ progress in acquiring core knowledge and clinical competencies (Norcini et al., 2018; Shumway & Harden, 2009). Common assessment methods used in medical education include Multiple-Choice Questions (MCQs), Modified Essay Questions (MEQs), Short-Answer Questions (SAQs), Objective Structured Clinical Examinations (OSCEs), and Key Feature Problems (KFPs) (Boursicot et al., 2018; Jolly & Dalton, 2018). Assessing students’ learning is therefore fundamental to ensuring high-quality medical training, and the choice of assessment method directly influences the breadth and depth of knowledge or skills that can be evaluated (Preston et al., 2020). Crucially, each assessment method must be aligned with the targeted competencies, the instructional approaches used, and the desired impact on student learning (Yudkowsky et al., 2019).

Short-answer questions vs MCQs

Multiple-choice questions (MCQs) and SAQs are among the commonly used written assessment formats in medical education (Jolly & Dalton, 2018; Preston et al., 2020; Shumway & Harden, 2009). MCQs, in which students select a response from predetermined options, are widely favored because they allow broad sampling of content, assess a wide range of knowledge areas efficiently, and facilitate rapid, objective grading (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). Although MCQs effectively evaluate recall, recognition, and factual knowledge, they have been criticized for their limited ability to promote critical thinking or accurately assess deeper mastery of subject matter (Schuwirth & Van Der Vleuten, 2004). Their reliance on recognition-based answering may discourage deep learning and higher-order reasoning, making it challenging to determine whether students have truly internalized the material (Schuwirth & Van Der Vleuten, 2004; Shumway & Harden, 2009).

SAQs, by contrast, require students to generate concise written responses, thereby encouraging active retrieval, deeper cognitive processing, and higher-order reasoning (Grévisse, 2024; Jolly & Dalton, 2018; Potter & McLachlan, 2025). SAQs enable assessment of a broad range of competencies and cognitive skills and often demonstrate higher reliability and better discrimination (that is, a stronger ability to differentiate between high- and low-performing students) than MCQs, making them a valuable component of medical assessment systems (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). However, SAQs are more resource-intensive to grade, and concerns about scorer inconsistency and rater bias can pose challenges to their validity (Grévisse, 2024; Potter & McLachlan, 2025). The emergence of AI-based autoscoring tools offers a promising solution, with early evidence suggesting improved efficiency, reduced bias, and enhanced scoring consistency (Grévisse, 2024).

Emergence of AI in medical education assessment

Artificial Intelligence (AI) is emerging as a powerful and transformative tool in medical education. AI technologies are increasingly integrated into educational systems to enhance students’ learning experiences, prepare them for an AI-driven healthcare environment, personalize learning, and improve assessment processes (Hallquist et al., 2025; Rincón et al., 2025). AI tools employ Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP) techniques to support adaptive learning platforms, automate question generation, evaluate student responses, and deliver timely, personalized feedback (Hallquist et al., 2025; Rincón et al., 2025).

The incorporation of AI has enhanced students’ learning through AI-assisted assessment platforms that enable learners to practice applying their knowledge while receiving immediate feedback (Gordon et al., 2024). AI-driven simulated case presentations, in which AI functions as a virtual physician or simulated patient, have been shown to improve students’ communication and clinical skills (Merritt et al., 2022; Rincón et al., 2025). For medical educators, AI has reduced workload by automating assessment processes and generating examination questions (Hallquist et al., 2025; Seneviratne & Manathunga, 2025). Educators can use AI to develop assessment items, evaluate item reliability, and automatically score student responses. Despite these advantages, concerns remain regarding data privacy, ethical use of learner information, and, notably, the transparency of AI algorithms employed in automated assessment and feedback systems.

Automatic scoring of short answer questions

Automated assessment systems use Machine Learning (ML) and Natural Language Processing (NLP) techniques to automate the scoring of SAQs and essays (Grévisse, 2024; Seneviratne & Manathunga, 2025). These systems can process large numbers of student responses and provide timely feedback, thereby reducing educator workload, minimizing grader bias, and supporting improved student learning and performance (Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025).

Automated scoring of short-answer questions (ASAQ) was first introduced in the 1960s and has since undergone substantial development, incorporating increasingly sophisticated statistical, ML, and NLP approaches to improve scoring accuracy and reliability. Recently, the application of ASAQ has gained considerable attention in medical education, where grading large volumes of SAQs is both time-consuming and susceptible to rater bias (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Rajan et al., 2025; Seneviratne & Manathunga, 2025). Research on ASAQ has reported mixed findings: while many studies demonstrate that automated scoring can achieve results comparable to human graders, some highlight concerns related to the system’s reliability, validity, and the opacity of its algorithms (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025). Consequently, although ASAQ holds great potential to enhance assessment practices in medical education, further empirical research is needed to ensure fairness, robustness, and broad acceptance within educational settings.

Rationale

Although the use of AI models for automated scoring of short-answer questions has gained increasing attention in medical education research, the breadth and depth of the existing literature remain unclear. A notable gap exists in understanding the range and types of AI models employed by medical educators to evaluate short-answer responses in student assessments. Consequently, a scoping review is warranted to systematically explore the extent and nature of current research and to map the available evidence on this topic.

Beyond identifying the AI models used, the review will examine the reported validity, reliability, and feasibility of these systems in comparison with traditional human grading methods. Such a review will help clarify the current landscape of AI-driven automated scoring, highlight research trends and gaps, and provide a comprehensive overview of how these technologies are being implemented in medical education assessment.

Methods

Research questions

The aim of the review is to systematically map the existing literature on automated scoring of SAQs in medical education, with a focus on utilized tools, accuracy, reliability, and fairness. The review will address the following research questions:

  • a. What AI-based models have been used for automated scoring of SAQs in medical education?

  • b. How accurately does automated scoring of SAQs reflect the performance of human graders?

  • c. What advantages and challenges of using AI for automated scoring of SAQs have been reported in medical education?

  • d. Are these models more effective than human graders in terms of reliability, accuracy, and fairness?
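
Questions b and d turn on how closely automated scores agree with human scores. The protocol does not specify an agreement statistic, so the sketch below is only an illustration of one common choice, Cohen's kappa, which corrects raw agreement for agreement expected by chance. The function name `cohen_kappa` and the score lists are assumptions made for this example and are not data from any study.

```python
from collections import Counter


def cohen_kappa(human, machine):
    """Chance-corrected agreement between two raters on categorical scores."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    # Proportion of responses on which both raters gave the same score.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    h_counts, m_counts = Counter(human), Counter(machine)
    categories = set(human) | set(machine)
    expected = sum(h_counts[c] * m_counts[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Made-up scores (0-2 points) for six responses; placeholders only.
human_scores = [0, 1, 1, 2, 2, 2]
machine_scores = [0, 1, 2, 2, 2, 1]
kappa = cohen_kappa(human_scores, machine_scores)
print(round(kappa, 3))
```

Included studies may report other measures (for example, weighted kappa or correlation coefficients), so a statistic of this kind would be charted as reported rather than recomputed.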

Search strategy

This scoping review will be conducted in accordance with the methodology outlined in the Joanna Briggs Institute (JBI) Manual for Evidence Synthesis, with a focus on AI models used for automated scoring of SAQs in medical school assessments. The Population, Concept, and Context (PCC) framework will guide the search strategy, eligibility criteria, and data extraction processes.

  • Population: Medical students

  • Concept: Automated scoring of short-answer questions using AI-based models

  • Context: Medical education and medical school settings

Four electronic databases (PubMed, Scopus, Medline, and Web of Science) will be searched to comprehensively capture research related to AI applications in medical education assessment. Peer-reviewed quantitative, qualitative, and mixed-methods studies will be included.

A detailed search strategy will be developed in consultation with a medical librarian. Keywords such as “medical education,” “automated scoring,” “short answer questions,” and “medical school” will be used, combined with Boolean operators “AND” and “OR” to refine and optimize search results.
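
As a rough illustration of how such a search string might be assembled, the sketch below joins synonyms with OR inside each concept block and joins the blocks with AND. Only the four quoted keywords come from this protocol. The grouping into blocks and the helper name `build_query` are assumptions for illustration, and the final strategy will be the one agreed with the librarian.

```python
def build_query(blocks):
    """Join terms with OR inside each block, then join blocks with AND."""
    groups = []
    for terms in blocks:
        quoted = [f'"{term}"' for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical grouping of the protocol's example keywords by PCC element.
pcc_blocks = [
    ["medical education", "medical school"],  # population / context
    ["automated scoring"],                    # concept: scoring approach
    ["short answer questions"],               # concept: item format
]

query = build_query(pcc_blocks)
print(query)
```

Database-specific syntax (field tags, controlled vocabulary, truncation) is deliberately left out, since it differs between the four databases.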

Screening process

The Covidence systematic review software will be used to import references, conduct title and abstract screening, review full texts, and manage data extraction. Two independent reviewers will screen all retrieved studies based on the predefined eligibility criteria. Discrepancies will be resolved by a third reviewer.

Screening will occur in two stages:

  • 1. Title and abstract screening conducted independently by two reviewers.

  • 2. Full-text review of studies deemed potentially relevant.

A secondary search will involve screening the reference lists of included studies to identify additional relevant literature.
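
The dual-screening rule described above can be sketched as follows. This is a minimal illustration and not a description of how Covidence works internally. The function name `resolve_screening`, the record IDs, and the decisions are all hypothetical.

```python
def resolve_screening(reviewer_a, reviewer_b, reviewer_c):
    """Combine two independent screening decisions per record.

    reviewer_a, reviewer_b: dicts mapping record ID -> bool (True = include).
    reviewer_c: dict holding the third reviewer's decision for disputed records.
    Returns (final decisions, list of record IDs that needed arbitration).
    """
    final, conflicts = {}, []
    for record_id, decision_a in reviewer_a.items():
        decision_b = reviewer_b[record_id]
        if decision_a == decision_b:
            final[record_id] = decision_a
        else:
            # Disagreement: defer to the third reviewer.
            conflicts.append(record_id)
            final[record_id] = reviewer_c[record_id]
    return final, conflicts


# Placeholder record IDs and decisions, for illustration only.
a = {"rec1": True, "rec2": False, "rec3": True}
b = {"rec1": True, "rec2": True, "rec3": False}
c = {"rec2": False, "rec3": True}

final, conflicts = resolve_screening(a, b, c)
print(final, conflicts)
```

The same rule would apply at both stages, title and abstract screening and full-text review.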

Eligibility criteria

Inclusion criteria

  • 1. Topic: Studies must focus on automated scoring of short-answer questions.

  • 2. Methodology: Original empirical research, including quantitative, qualitative, and mixed-methods studies.

  • 3. Study Population: Medical students completing assessments containing SAQs.

  • 4. Assessment Type: Any assessment format using SAQs (e.g., low-stakes, high-stakes, summative examinations).

  • 5. Publication Date: Studies published from 2015 onward.

  • 6. Language: English-language publications only.

  • 7. Computational Approach: Studies using AI-based autoscoring models (machine learning, large language models, deep learning).

  • 8. Setting: Medical education or medical school settings.

Exclusion criteria

  • 1. Studies not focusing on SAQs (e.g., MCQs, essays).

  • 2. Non-English publications.

  • 3. Studies not conducted within a medical education context or not involving medical students.

  • 4. Studies using non-AI-based approaches to automated scoring.

  • 5. Systematic reviews, scoping reviews, and grey literature.

  • 6. Studies published before 2015.
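
The criteria above can be read as a checklist applied to each record. The sketch below is one hypothetical way to encode them. The `Record` fields and the reason labels are assumptions chosen for this example, and the example record is a placeholder, not a real study.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Screening fields for one study; all field names are illustrative."""
    year: int
    language: str
    focuses_on_saq_scoring: bool
    involves_medical_students: bool
    uses_ai_model: bool
    is_original_empirical: bool  # False for reviews and grey literature


def exclusion_reasons(record):
    """Return the exclusion criteria a record meets (empty list = eligible)."""
    reasons = []
    if not record.focuses_on_saq_scoring:
        reasons.append("not focused on SAQs")
    if record.language.lower() != "english":
        reasons.append("non-English publication")
    if not record.involves_medical_students:
        reasons.append("not medical education / medical students")
    if not record.uses_ai_model:
        reasons.append("non-AI scoring approach")
    if not record.is_original_empirical:
        reasons.append("review or grey literature")
    if record.year < 2015:
        reasons.append("published before 2015")
    return reasons


# Placeholder record: a pre-2015 study using a non-AI scoring approach.
example = Record(2013, "English", True, True, False, True)
print(exclusion_reasons(example))
```

Recording every reason, rather than stopping at the first, makes it easier to report exclusion counts in a PRISMA-ScR flow diagram.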

Data extraction and charting

A standardized data extraction form will be used to collect key information from all included studies. Two reviewers will independently extract data, with disagreements resolved by a third reviewer. Extracted data will include:

  • Study methodology

  • Type of AI-based model used

  • Participant characteristics

  • Assessment type

  • Outcomes relating to the performance of the autoscoring system

Extracted data will be charted and summarized in tables and figures. Tables will outline study characteristics (e.g., author, publication year, AI model used), while figures will illustrate the frequency of AI models applied and the accuracy of their automated scoring.
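
A minimal sketch of how the extraction form and the model-frequency tally might be represented is given below. The class `ExtractionRow`, its field names, and the rows themselves are placeholders invented for illustration. They mirror the extraction items listed above but contain no real study data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ExtractionRow:
    """One row of the charting table; fields mirror the extraction items."""
    study_label: str
    methodology: str
    ai_model_type: str
    participants: str
    assessment_type: str
    performance_outcome: str


def model_frequencies(rows):
    """Count how often each type of AI model appears across studies."""
    return Counter(row.ai_model_type for row in rows)


# Placeholder rows only; no real studies or results are represented here.
rows = [
    ExtractionRow("Study A", "quantitative", "large language model",
                  "placeholder", "formative", "placeholder"),
    ExtractionRow("Study B", "mixed methods", "machine learning classifier",
                  "placeholder", "summative", "placeholder"),
    ExtractionRow("Study C", "quantitative", "large language model",
                  "placeholder", "summative", "placeholder"),
]

freq = model_frequencies(rows)
for model, count in freq.most_common():
    print(f"{model}: {count}")
```

A tally of this kind could feed directly into the planned figure on the frequency of AI models applied.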

Data analysis

A thematic analysis approach will be used to identify recurring patterns and themes within the included studies. Key concepts related to the types of AI models used, their scoring accuracy, associated challenges and advantages, and their reported validity and reliability will be systematically coded and synthesized into overarching themes.

This analytical approach ensures alignment with the research questions and provides a comprehensive overview of the literature. The review will evaluate whether automated scoring of SAQs should be integrated more widely into medical education and identify existing research gaps and future directions.

Results will be summarized narratively and supported with tables and figures where appropriate. Both the protocol and final review will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines.

Expected outcomes

  • A comprehensive mapping of current literature on the use of AI models for automated scoring of SAQs in medical education.

  • Identification of the benefits and challenges associated with integrating automated scoring into medical assessments.

  • Evaluation of the validity, reliability, and feasibility of automated scoring systems compared with human graders.

  • Recommendations for future research on the validity, reliability, and implementation of AI-based autoscoring systems in medical education.

Dissemination

The findings from this scoping review will be presented at academic conferences and submitted for publication in peer-reviewed journals.

Ethical considerations

No ethical concerns arise, as this review will use only published, peer-reviewed articles. Therefore, ethical approval and consent are not required.

how to cite this article
Çalişkan SA, Bello Abubakar F and Magzoub ME. Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2026, 15:395 (https://doi.org/10.12688/f1000research.175198.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 09 Jun 2026
Azam Afzal, Aga Khan University, Karachi, Pakistan 
Approved with Reservations
This protocol addresses a timely and highly relevant topic in health professions education. The increasing adoption of AI-based assessment systems, particularly Large Language Models (LLMs), makes a scoping review of automated scoring of short-answer questions (SAQs) both necessary and potentially ...
HOW TO CITE THIS REPORT
Afzal A. Reviewer Report For: Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2026, 15:395 (https://doi.org/10.5256/f1000research.193161.r474984)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 02 Jun 2026
Anirejuoritse Bafor, Nationwide Children's Hospital, Columbus, OH, USA 
Approved with Reservations
The authors have presented a clear and well-articulated background and rationale for this scoping review protocol. The proposed study seeks to map the existing evidence on the use of artificial intelligence in the automated scoring of short-answer questions in ...
HOW TO CITE THIS REPORT
Bafor A. Reviewer Report For: Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2026, 15:395 (https://doi.org/10.5256/f1000research.193161.r486185)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 07 May 2026
Nilesh Kumar Mitra, IMU University, Kuala Lumpur, Malaysia 
Not Approved
The authors have attempted to develop a protocol for conducting a scoping review on the use of AI in automated scoring of short answer questions in medical education. Such a protocol without any visible data will probably be of no ...
HOW TO CITE THIS REPORT
Mitra NK. Reviewer Report For: Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2026, 15:395 (https://doi.org/10.5256/f1000research.193161.r474986)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
