Keywords
Artificial intelligence, Automated scoring, Auto grading, Short-Answer Questions, SAQs, Assessment, Medical education, Machine learning, Natural language processing, Assessment reliability, Assessment validity, Autoscoring tools
Assessment plays a central role in medical education by evaluating learners’ knowledge, skills, and professional competencies. While multiple-choice questions (MCQs) are widely used due to their efficiency and broad content coverage, they primarily assess recall and recognition, limiting their ability to measure higher-order reasoning. Short-answer questions (SAQs), in contrast, promote deeper cognitive processing and provide better discrimination between levels of student performance. However, SAQs are resource-intensive to grade and susceptible to scorer inconsistency and rater bias, highlighting a need for more efficient and reliable assessment solutions.
Artificial Intelligence (AI) has emerged as a transformative tool in medical education, enhancing learning, supporting adaptive instruction, and automating assessment processes. AI-driven systems using machine learning and natural language processing have been increasingly applied to automated scoring of SAQs. These systems offer potential benefits, including reduced grading burden, greater scoring consistency, and timely feedback to learners. Despite promising developments, concerns persist regarding algorithmic transparency, data privacy, and the reliability and validity of automated scoring compared with human graders. Existing studies report mixed results, underscoring the need for a comprehensive examination of current approaches.
This scoping review aims to systematically map the literature on AI-based models used for automated scoring of SAQs in medical education. Specifically, it seeks to identify the types of AI models employed, evaluate their accuracy and reliability relative to human graders, describe reported advantages and challenges, and assess fairness and feasibility within educational settings. Following the Joanna Briggs Institute methodology and the Population–Concept–Context framework, the review will include empirical studies published since 2015 involving medical students and AI-driven SAQ scoring. Findings will provide an evidence-based overview of current practices, highlight gaps in the literature, and inform future research and implementation strategies for AI-assisted assessment in medical education.
Assessment is a central component of medical education, serving to evaluate learners’ knowledge, skills, and attitudes through both formative and summative assessment methods (Schuwirth & van der Vleuten, 2020; Yudkowsky et al., 2019). Effective assessment not only guides and motivates students’ learning but also provides an essential mechanism for determining whether they have attained the competencies required of medical professionals. Medical schools employ a range of assessment strategies to monitor students’ progress in acquiring core knowledge and clinical competencies (Norcini et al., 2018; Shumway & Harden, 2009). Common assessment methods used in medical education include Multiple-Choice Questions (MCQs), Modified Essay Questions (MEQs), Short Answer Questions (SAQs), Objective Structured Clinical Examinations (OSCEs), and Key Feature Problems (KFPs) (Boursicot et al., 2018; Jolly & Dalton, 2018). Assessing students’ learning is therefore fundamental to ensuring high-quality medical training, and the choice of assessment method directly influences the breadth and depth of knowledge or skills that can be evaluated (Preston et al., 2020). Crucially, each assessment method must be aligned with the targeted competencies, the instructional approaches used, and the desired impact on student learning (Yudkowsky et al., 2019).
Multiple-choice questions (MCQs) and SAQs are among the commonly used written assessment formats in medical education (Jolly & Dalton, 2018; Preston et al., 2020; Shumway & Harden, 2009). MCQs, in which students select a response from predetermined options, are widely favored because they allow broad sampling of content, assess a wide range of knowledge areas efficiently, and facilitate rapid, objective grading (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). Although MCQs effectively evaluate recall, recognition, and factual knowledge, they have been criticized for their limited ability to promote critical thinking or accurately assess deeper mastery of subject matter (Schuwirth & Van Der Vleuten, 2004). Their reliance on recognition-based answering may discourage deep learning and higher-order reasoning, making it challenging to determine whether students have truly internalized the material (Schuwirth & Van Der Vleuten, 2004; Shumway & Harden, 2009).
SAQs, by contrast, require students to generate concise written responses, thereby encouraging active retrieval, deeper cognitive processing, and higher-order reasoning (Grévisse, 2024; Jolly & Dalton, 2018; Potter & McLachlan, 2025). SAQs enable assessment of a broad range of competencies and cognitive skills and often demonstrate higher reliability and better discrimination (that is, a stronger ability to differentiate between high- and low-performing students) than MCQs, making them a valuable component of medical assessment systems (Jolly & Dalton, 2018; Schuwirth & Van Der Vleuten, 2004). However, SAQs are more resource-intensive to grade, and concerns about scorer inconsistency and rater bias can pose challenges to their validity (Grévisse, 2024; Potter & McLachlan, 2025). The emergence of AI-based autoscoring tools offers a promising solution, with early evidence suggesting improved efficiency, reduced bias, and enhanced scoring consistency (Grévisse, 2024).
Artificial Intelligence (AI) is emerging as a powerful and transformative tool in medical education. AI technologies are increasingly integrated into educational systems to enhance students’ learning experiences, prepare them for an AI-driven healthcare environment, personalize learning, and improve assessment processes (Hallquist et al., 2025; Rincón et al., 2025). AI tools employ Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP) techniques to support adaptive learning platforms, automate question generation, evaluate student responses, and deliver timely, personalized feedback (Hallquist et al., 2025; Rincón et al., 2025).
The incorporation of AI has enhanced students’ learning through AI-assisted assessment platforms that enable learners to practice applying their knowledge while receiving immediate feedback (Gordon et al., 2024). AI-driven simulated case presentations, in which AI functions as a virtual physician or simulated patient, have been shown to improve students’ communication and clinical skills (Merritt et al., 2022; Rincón et al., 2025). For medical educators, AI has reduced workload by automating assessment processes and generating examination questions (Hallquist et al., 2025; Seneviratne & Manathunga, 2025). Educators can use AI to develop assessment items, evaluate item reliability, and automatically score student responses. Despite these advantages, concerns remain regarding data privacy, ethical use of learner information, and, notably, the transparency of AI algorithms employed in automated assessment and feedback systems.
Automated assessment systems use Machine Learning (ML) and Natural Language Processing (NLP) techniques to automate the scoring of SAQs and essays (Grévisse, 2024; Seneviratne & Manathunga, 2025). These systems can process large numbers of student responses and provide timely feedback, thereby reducing educator workload, minimizing grader bias, and supporting improved student learning and performance (Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025).
Automated scoring of short-answer questions (ASAQ) was first introduced in the 1960s and has since undergone substantial development, incorporating increasingly sophisticated statistical, ML, and NLP approaches to improve scoring accuracy and reliability. Recently, the application of ASAQ has gained considerable attention in medical education, where grading large volumes of SAQs is both time-consuming and susceptible to rater bias (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Rajan et al., 2025; Seneviratne & Manathunga, 2025). Research on ASAQ has reported mixed findings: while many studies demonstrate that automated scoring can achieve results comparable to human graders, some highlight concerns related to the system’s reliability, validity, and the opacity of its algorithms (Bolgova et al., 2025; Clauser et al., 2024; Grévisse, 2024; Seneviratne & Manathunga, 2025). Consequently, although ASAQ holds great potential to enhance assessment practices in medical education, further empirical research is needed to ensure fairness, robustness, and broad acceptance within educational settings.
Although the use of AI models for automated scoring of short-answer questions has gained increasing attention in medical education research, the breadth and depth of the existing literature remain unclear. A notable gap exists in understanding the range and types of AI models employed by medical educators to evaluate short-answer responses in student assessments. Consequently, a scoping review is warranted to systematically explore the extent and nature of current research and to map the available evidence on this topic.
Beyond identifying the AI models used, the review will examine the reported validity, reliability, and feasibility of these systems in comparison with traditional human grading methods. Such a review will help clarify the current landscape of AI-driven automated scoring, highlight research trends and gaps, and provide a comprehensive overview of how these technologies are being implemented in medical education assessment.
The aim of the review is to systematically map the existing literature on automated scoring of SAQs in medical education, with a focus on the tools used and their accuracy, reliability, and fairness. The review will address the following research questions:
a. What AI-based models have been used for automated scoring of SAQs in medical education?
b. How accurately does automated scoring of SAQs reflect the performance of human graders?
c. What advantages and challenges of using AI for automated scoring of SAQs have been reported in medical education?
d. Are these models more effective than human graders in terms of reliability, accuracy, and fairness?
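Question b concerns agreement between automated and human scores. The protocol does not prescribe a particular statistic. As a minimal sketch, assuming scores are categorical (for example 0, 1, or 2 marks per response), Cohen's kappa is one commonly reported chance-corrected agreement measure. The function name and the example marks below are hypothetical and are not data from any study.

```python
from collections import Counter


def cohens_kappa(human_scores, ai_scores):
    """Chance-corrected agreement between two raters on categorical scores."""
    if len(human_scores) != len(ai_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human_scores)
    # Observed agreement: share of responses given the same score by both raters.
    observed = sum(h == a for h, a in zip(human_scores, ai_scores)) / n
    # Expected agreement if each rater scored independently at their own base rates.
    human_counts = Counter(human_scores)
    ai_counts = Counter(ai_scores)
    expected = sum(human_counts[c] * ai_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical marks (0 to 2) for six responses, for illustration only.
human = [2, 1, 0, 2, 1, 0]
ai = [2, 1, 0, 2, 0, 0]
kappa = cohens_kappa(human, ai)
print(round(kappa, 2))
```

Included studies may instead report other measures, such as percentage agreement, correlation, or weighted kappa, so the extraction form should record whichever statistic each study used.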
This scoping review will be conducted in accordance with the methodology outlined in the Joanna Briggs Institute (JBI) Manual for Evidence Synthesis, with a focus on AI models used for automated scoring of SAQs in medical school assessments. The Population, Concept, and Context (PCC) framework will guide the search strategy, eligibility criteria, and data extraction processes.
• Population: Medical students
• Concept: Automated scoring of short-answer questions using AI-based models
• Context: Medical education and medical school settings
Four electronic databases (PubMed, Scopus, Medline, and Web of Science) will be searched to comprehensively capture research related to AI applications in medical education assessment. Peer-reviewed quantitative, qualitative, and mixed-methods studies will be included.
A detailed search strategy will be developed in consultation with a medical librarian. Keywords such as “medical education,” “automated scoring,” “short answer questions,” and “medical school” will be used, combined with Boolean operators “AND” and “OR” to refine and optimize search results.
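The way the Boolean operators combine can be sketched in code. This is an illustration only: the final strategy will be developed with the librarian, the grouping of the four quoted keywords into concept blocks is an assumption, and the helper name `build_query` is hypothetical. Synonyms within a concept are joined with OR, and the concept blocks are joined with AND.

```python
def build_query(concept_blocks):
    """Join synonyms with OR inside each block, then join the blocks with AND."""
    clauses = []
    for terms in concept_blocks:
        quoted = [f'"{term}"' for term in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Keywords named in the protocol; how they are grouped is assumed.
blocks = [
    ["medical education", "medical school"],
    ["automated scoring"],
    ["short answer questions"],
]
query = build_query(blocks)
print(query)
```

Each database has its own field tags and syntax, so a string like this would still need adapting per database.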
The Covidence systematic review software will be used to import references, conduct title and abstract screening, review full texts, and manage data extraction. Two independent reviewers will screen all retrieved studies based on the predefined eligibility criteria. Discrepancies will be resolved by a third reviewer.
Screening will occur in two stages:
1. Title and abstract screening conducted independently by two reviewers.
2. Full-text review of studies deemed potentially relevant.
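The decision rule behind dual independent screening with third-reviewer adjudication can be written down compactly. The sketch below is a hypothetical illustration of that rule, not Covidence's interface or API; the function name and the "unresolved" label are assumptions.

```python
def screening_decision(reviewer_a, reviewer_b, adjudicator=None):
    """Return the final decision for one record at a given screening stage.

    reviewer_a, reviewer_b: independent votes, each "include" or "exclude".
    adjudicator: the third reviewer's vote, consulted only on disagreement.
    """
    valid = {"include", "exclude"}
    if reviewer_a not in valid or reviewer_b not in valid:
        raise ValueError("votes must be 'include' or 'exclude'")
    if reviewer_a == reviewer_b:
        return reviewer_a  # agreement stands without adjudication
    if adjudicator not in valid:
        return "unresolved"  # conflict still awaiting the third reviewer
    return adjudicator


print(screening_decision("include", "exclude", adjudicator="exclude"))
```

The same rule applies at both stages: records that pass title and abstract screening move on to full-text review, where the two votes are cast again.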
A secondary search will involve screening the reference lists of included studies to identify additional relevant literature.
Inclusion criteria
1. Topic: Studies must focus on automated scoring of short-answer questions.
2. Methodology: Original empirical research, including quantitative, qualitative, and mixed-methods studies.
3. Study Population: Medical students completing assessments containing SAQs.
4. Assessment Type: Any assessment format using SAQs (e.g., low-stakes, high-stakes, summative examinations).
5. Publication Date: Studies published from 2015 onward.
6. Language: English-language publications only.
7. Computational Approach: Studies using AI-based autoscoring models (machine learning, large language models, deep learning).
8. Setting: Medical education or medical school settings.
Exclusion criteria
1. Studies focusing on assessment formats other than SAQs (e.g., MCQs, essays).
2. Non-English publications.
3. Studies not conducted within a medical education context or not involving medical students.
4. Studies using non-AI-based approaches to automated scoring.
5. Systematic reviews, scoping reviews, and grey literature.
6. Studies published before 2015.
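To make the criteria easier to apply consistently, they can be expressed as explicit checks. The following is a minimal sketch, assuming each candidate study has already been coded by a reviewer into the fields shown; the class, field names, and reason labels are hypothetical and simplify the wording of the criteria above.

```python
from dataclasses import dataclass


@dataclass
class CandidateStudy:
    year: int
    language: str
    is_original_empirical: bool       # False for reviews and grey literature
    focuses_on_saq_autoscoring: bool
    uses_ai_model: bool               # ML, deep learning, or LLM-based scoring
    involves_medical_students: bool


def exclusion_reasons(study):
    """Return the exclusion criteria a study triggers; an empty list means eligible."""
    checks = [
        (not study.focuses_on_saq_autoscoring, "not focused on automated SAQ scoring"),
        (study.language.lower() != "english", "non-English publication"),
        (not study.involves_medical_students, "outside medical education or no medical students"),
        (not study.uses_ai_model, "non-AI-based scoring approach"),
        (not study.is_original_empirical, "review or grey literature"),
        (study.year < 2015, "published before 2015"),
    ]
    return [reason for failed, reason in checks if failed]


# Hypothetical candidate, for illustration only.
example = CandidateStudy(2014, "German", True, True, True, True)
print(exclusion_reasons(example))
```

Recording every triggered reason, rather than stopping at the first, makes it straightforward to report exclusion counts in a PRISMA-ScR flow diagram.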
A standardized data extraction form will be used to collect key information from all included studies. Two reviewers will independently extract data, with disagreements resolved by a third reviewer. Extracted data will include:
• Study methodology
• Type of AI-based model used
• Participant characteristics
• Assessment type
• Outcomes relating to the performance of the autoscoring system
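One way to keep the extraction form uniform across reviewers is to fix its fields in a small data structure. This is a sketch under stated assumptions: the field names mirror the items listed above, `study_id` is an added hypothetical identifier, and the CSV export is only one possible storage format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ExtractionRecord:
    study_id: str                     # hypothetical identifier, e.g. first author and year
    methodology: str
    ai_model_type: str
    participant_characteristics: str
    assessment_type: str
    autoscoring_outcomes: str


def to_csv(records):
    """Serialize extraction records to CSV text with one row per study."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ExtractionRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values only; not drawn from any included study.
placeholder = ExtractionRecord(
    study_id="example-001",
    methodology="quantitative",
    ai_model_type="large language model",
    participant_characteristics="medical students",
    assessment_type="summative examination",
    autoscoring_outcomes="agreement with human graders",
)
csv_text = to_csv([placeholder])
```

Because two reviewers extract independently, identical field names make it simple to compare their rows and flag disagreements for the third reviewer.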
Extracted data will be charted and summarized in tables and figures. Tables will outline study characteristics (e.g., author, publication year, AI model used), while figures will illustrate the frequency of AI models applied and the accuracy of their automated scoring.
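The frequency figure can be produced from a simple tally of the model type recorded for each included study. The sketch below assumes model types are stored as free-text labels; the function name and the example labels are hypothetical placeholders, not findings.

```python
from collections import Counter


def model_frequencies(model_types):
    """Count studies per AI model type, most frequent first."""
    # Normalize case and surrounding whitespace so label variants are merged.
    counts = Counter(label.strip().lower() for label in model_types)
    return counts.most_common()


# Placeholder labels for illustration only.
labels = ["Large language model", "machine learning", "large language model ", "Deep learning"]
frequencies = model_frequencies(labels)
for model, count in frequencies:
    print(f"{model}: {count}")
```

Agreeing on a fixed set of model categories before charting avoids relying on text normalization to merge labels.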
A thematic analysis approach will be used to identify recurring patterns and themes within the included studies. Key concepts related to the types of AI models used, their scoring accuracy, associated challenges and advantages, and their reported validity and reliability will be systematically coded and synthesized into overarching themes.
This analytical approach ensures alignment with the research questions and provides a comprehensive overview of the literature. The review will evaluate whether automated scoring of SAQs should be integrated more widely into medical education and identify existing research gaps and future directions.
Results will be summarized narratively and supported with tables and figures where appropriate. Both the protocol and final review will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines.
• A comprehensive mapping of current literature on the use of AI models for automated scoring of SAQs in medical education.
• Identification of the benefits and challenges associated with integrating automated scoring into medical assessments.
• Evaluation of the validity, reliability, and feasibility of automated scoring systems compared with human graders.
• Recommendations for future research on the validity, reliability, and implementation of AI-based autoscoring systems in medical education.
No data is associated with this article.
Repository: PRISMA-P checklist for “Protocol for Conducting a Scoping Review on The Use of AI in Automated Scoring of Short-Answer Questions in Medical Education”.
DOI: 10.6084/m9.figshare.30815456 (Çalışkan et al., 2025).
Data are available under the terms of the CC BY 4.0 license.
Is the rationale for, and objectives of, the study clearly described?
Yes
Is the study design appropriate for the research question?
Partly
Are sufficient details of the methods provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Not applicable
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Health Professions Education; Educational Development; Teaching / Learning theory and pedagogies; Technology and Simulation based learning; Authentic Assessment.
Is the rationale for, and objectives of, the study clearly described?
Yes
Is the study design appropriate for the research question?
Partly
Are sufficient details of the methods provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Not applicable
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Orthopedic Surgery
Is the rationale for, and objectives of, the study clearly described?
Yes
Is the study design appropriate for the research question?
Partly
Are sufficient details of the methods provided to allow replication by others?
No
Are the datasets clearly presented in a useable and accessible format?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Technology-enhanced learning, Artificial intelligence, Online assessment
Alongside their report, reviewers assign a status to the article:
| Version | Invited Reviewer 1 | Invited Reviewer 2 | Invited Reviewer 3 |
|---|---|---|---|
| Version 1 (13 Mar 26) | read | read | read |