Systematic Review

Evaluating artificial intelligence for accurate detection of hand and wrist fractures: a systematic review and meta-analysis

[version 1; peer review: awaiting peer review]
PUBLISHED 10 Oct 2025

Abstract

Background and Objectives

Hand and wrist fractures are among the most frequently encountered injuries in emergency departments and are often misdiagnosed, particularly when interpreted by non-specialist clinicians. These diagnostic errors can lead to treatment delays and long-term complications. Artificial intelligence (AI), particularly deep learning algorithms, is emerging as a promising adjunct for improving diagnostic accuracy in radiographic fracture detection. This study aims to evaluate the effectiveness of AI in detecting hand and wrist fractures compared with manual radiographic interpretation by clinicians.

Materials and Methods

A systematic review and meta-analysis were conducted to assess the diagnostic performance of AI models in detecting hand and wrist fractures compared to conventional radiographic interpretation by clinicians. A comprehensive search of PubMed, Google Scholar, Science Direct, and Wiley Online Library was performed. Eligible studies included those utilizing AI for fracture detection with sensitivity and specificity data. Pooled estimates were calculated using fixed- and random-effects models. Heterogeneity was assessed via I2 statistics, and publication bias was examined using funnel plots and Egger’s test.

Results

Eighteen studies met inclusion criteria. The pooled sensitivity and specificity under the random-effects model were 0.910 and 0.912, respectively, indicating high diagnostic accuracy of AI models. However, substantial heterogeneity (I2 = 99.09% for sensitivity; 96.43% for specificity) and publication bias were observed, likely due to variations in AI algorithms, sample sizes, and study designs.

Conclusions

Most AI models demonstrated good diagnostic accuracy, with high sensitivity and specificity scores (≥90%). However, some models fell short in sensitivity and specificity (<90%), indicating performance variations across different AI models and algorithms.

From a clinical perspective, AI models with lower sensitivity scores may fail to detect hand and wrist fractures, potentially delaying treatment, while those with lower specificity scores could lead to unnecessary interventions—treating hands and wrists that are not fractured.

Keywords

Artificial Intelligence; Hand Fractures; Wrist Fractures; Deep Learning; Machine Learning; Neural Network.

1. Introduction

Bone fractures are a common public health problem worldwide,1 with wrist fractures accounting for the largest share of fractures both in general and in paediatric patients.2,3 Their negative health outcomes include absenteeism from work, disability, reduced quality of life, health-related complications, and high healthcare costs that drain individuals, families, and societies financially, emotionally, and mentally.4,5

X-ray, Computed Tomography (CT), and Magnetic Resonance Imaging (MRI) are the imaging techniques most commonly used to diagnose fractures. X-ray is the most widely used because of its cost-effectiveness, although its diagnostic value can be limited by suboptimal positioning technique and poor patient cooperation.6,7

In their study, Gäbler et al.8 reported that radiographs in emergency departments were mainly evaluated by non-specialized physicians or junior orthopedic residents, increasing the likelihood of missed fracture diagnoses. Likewise, studies by Donald and Barnard9 and Berlin10 reported that interpretational errors resulting from missed fractures were common among physicians interpreting musculoskeletal radiographs. Guly11 and Mattijssen-Horstink et al.12 reported that four out of five diagnostic errors in emergency departments involved missed fractures, with wrists accounting for 13–17% of these cases. Artificial intelligence could therefore help physicians detect wrist fractures more accurately.

On the same note, missed detections can delay treatment for patients with false-negative readings, resulting in malunion or pseudoarthrosis with attendant morbidity. These complications could be avoided through the adoption of artificial intelligence in wrist fracture detection.13

Clinical inexperience, fatigue, distractions, and poor eyesight all contribute to interpretation errors on radiographs.14 The author of that study further recommends automated analysis by consistent and indefatigable computers to complement the diagnostic skills of physicians, orthopedists, and radiologists in the emergency department.

In recent years, artificial intelligence, machine learning, and deep learning have been used for fracture detection, classification, and prediction. Powerful computers and algorithms have paved the way for rapid and consistent analysis, which is valuable to the healthcare industry globally.

The present systematic review evaluates the effectiveness of artificial intelligence in detecting hand and wrist fractures compared to manual radiographic interpretation by clinicians. The review analyzes and evaluates various artificial intelligence algorithms, seeking to provide evidence-based insights for hospitals and healthcare institutions intending to integrate artificial intelligence models into their clinical systems.

2. Methods

The aim of the following systematic review is to determine the effectiveness of artificial intelligence (AI) in accurately detecting hand and wrist fractures compared to traditional diagnostic methods, such as a clinician’s manual reading of radiographs.

2.1 PICO framework

The PICO framework was used to investigate the effectiveness of artificial intelligence in the accurate detection of hand and wrist fractures compared with traditional diagnostic methods such as clinicians’ manual reading of radiographs. The population comprised patients of all age groups with suspected hand and wrist fractures. The intervention comprised artificial intelligence, including machine learning and deep learning algorithms, for detecting hand and wrist fractures. These techniques were compared with traditional diagnostic methods, such as manual reading of radiographs by clinicians. The outcomes sought were diagnostic accuracy metrics, including sensitivity, specificity, positive predictive value, and negative predictive value. The target study designs were prospective or retrospective cohort studies, randomized controlled trials, and observational studies.

2.2 Search strategy

2.2.1 Databases searched

The search was conducted across multiple electronic databases: PubMed, Google Scholar, Wiley Online Library, and Science Direct.

2.2.2 Search terms and keywords used

The search combined standard controlled-vocabulary terms and general keywords, which were refined into proper MeSH-based queries. Boolean operators (AND, OR) were used to combine key terms effectively and retrieve the desired literature from the searched databases.

The search strategy for Google Scholar, Wiley Online Library, and Science Direct relied on keyword-based queries to capture results across all platforms.

(“Hand fracture” OR “Hand fractures” OR “Hand injury” OR “Hand injuries” OR “Wrist fracture” OR “Wrist fractures” OR “Wrist injury” OR “Wrist injuries”)

AND

(“Artificial Intelligence” OR “Machine Learning” OR “Deep Learning” OR “Neural Networks” OR “AI in healthcare” OR “AI for injury detection” OR “Machine learning for orthopedic diagnosis”)

The MeSH terms were used for the PubMed database as indicated in the query below:

(“Hand Injuries”[MeSH] OR “Wrist Injuries”[MeSH] OR (“Fractures, Bone”[MeSH] AND (“Hand” OR “Wrist”)))

AND

(“Artificial Intelligence”[MeSH] OR “Machine Learning”[MeSH] OR “Neural Networks, Computer”[MeSH]).

2.2.3 Study selection process

The selection process started with a thorough search of the electronic databases (Google Scholar, Wiley Online Library, Science Direct, and PubMed). The results were then uploaded to Rayyan software to identify and remove duplicate entries, after which screening proceeded in two distinct phases carried out independently by three researchers. In phase one, the title and abstract of each study were reviewed against the eligibility criteria, and studies that did not meet the criteria were excluded. In phase two, the full texts of the remaining studies were assessed for eligibility.

2.3 Inclusion and exclusion process

The inclusion criteria encompassed prospective or retrospective cohort studies, randomized controlled trials, and observational studies published in English, with no time-frame limitations, involving patients of all ages with suspected hand and wrist fractures. The included studies focused on relevant outcomes and used artificial intelligence (AI), including machine learning and deep learning algorithms, to detect hand and wrist fractures, comparing their performance with traditional diagnostic methods such as the manual interpretation of radiographs by clinicians.

The exclusion criteria covered studies involving animals or cadavers, studies not published in English, studies not using artificial intelligence for hand and wrist fracture detection, studies lacking a comparator group or a comparison with conventional diagnostic methods, and studies lacking sufficient data to build a contingency table. Reviews, case studies, letters, editorials, and conference abstracts were also omitted. Studies with a high risk of bias or low quality, based on assessment of study design, sample size, data collection, and analysis, or lacking relevant factors, were excluded.

2.4 Data extraction

A standardized form was created to summarize the data relevant to the research questions. The variables in the extraction form included general information about the study, author, year, study design, sample size, population characteristics, type of AI algorithm used, imaging modality, sensitivity, specificity, positive predictive value, negative predictive value, area under the curve (AUC), type of fracture detected, comparison group, data preprocessing methods, handling of imbalanced data, external validation, risk of bias, and funding sources.

2.5 Quality assessment

The selected studies were screened for duplicates, which were dropped from the systematic review. The risk of bias was assessed using the Methodological Index for Non-Randomized Studies (MINORS) for observational and non-randomized designs, as well as ROBINS-I for non-randomized comparative studies. Studies identified as having a serious risk of bias were excluded from the review.

2.6 Data synthesis

2.6.1 Treatment of missing data

Missing data were handled by checking the completeness of reported outcomes in the included studies. The final included articles were identified based on the Methodological Index for Non-Randomized Studies (MINORS), a tool used to screen prospective, retrospective, or case-control studies for inclusion in the systematic review. A total of 23 studies were assessed for bias; only one used a prospective study design, while the rest used a retrospective study design.

2.6.2 Assessment of bias

Two reviewers independently used the Methodological Index for Non-Randomized Studies (MINORS) to assess the risk of bias in retrospective and prospective non-randomized studies. MINORS is a validated 12-item tool designed to assess the quality of non-randomized surgical studies. Each included article was assessed for risk of bias with this tool by the two reviewers, and disagreements were resolved through discussion or consultation with a third author.

3. Results

3.1 Study selection

The PRISMA 2020 flow diagram was generated in R15 to summarize study selection. As shown in Figure 1, we identified 526 records; after removing duplicates, 447 remained. We assessed 22 full-text articles and included 18 studies in the review.


Figure 1. PRISMA flow diagram showing the process of study selection.

Abbreviations: PRISMA = Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

3.2 Characteristics of included studies

A total of twenty studies were included: nine were retrospective cohort studies,16–24 two were retrospective diagnostic studies,25,26 two were retrospective experimental studies,27,28 one was a retrospective validation study,29 four were retrospective studies,30–33 and two were prospective diagnostic studies34,35 (Table 1). A retrospective cohort study identifies risk factors and associations, with follow-up of at-risk groups over time as a key feature. A retrospective diagnostic study evaluates test accuracy and compares diagnostic results to a gold standard. A retrospective experimental study assesses past interventions using historical intervention data. A retrospective validation study tests models or methods by validating previous findings on new past data. A prospective diagnostic study assesses the effectiveness of a diagnostic test in detecting conditions.

Table 1. The main characteristics of the included studies.

Author | Outcomes measured | Type of intervention (AI method/algorithm) | Comparison group (manual reading)
Zech et al.33 | AUC, Accuracy, Sensitivity, and Specificity | Faster R-CNN | PGY-2 and PGY-4 Pediatrics Resident/fellow, and a PGY-2 and PGY-4 Radiology Resident
Raisuddin et al.32 | AUROC, AUPR, Sensitivity, Recall, TPR, Specificity, Selectivity, TNR, Precision, PPV, and F1 score | Deep Wrist pipeline | Two Board-Certified Radiologists, and Two primary care physicians
Cohen et al.30 | Sensitivity, Specificity, PPV, and NPV | BoneView (Gleamer) Deep CNN algorithm | 41 Radiologists
Hardalaç et al.18 | Average precision (AP50) | Deep-learning-based object detection models | 1 Radiologist, 2 Orthopedists
Anttila et al.16 | Sensitivity, Specificity, Accuracy, NPV, PPV, ROC, AUC, Inter-observer reliability (kappa coefficient) | Segmentation-based U-Net architecture with 25 layers | 1 Hand Surgery Resident, and 3 Consultant Hand Surgeons
Üreten et al.24 | Accuracy, Sensitivity, Specificity, Precision | VGG-16, ResNet-50, and GoogLeNet | 1 Orthopedist, and 1 Radiologist
Oka et al.23 | Accuracy, Sensitivity, Specificity, AUC | VGG-16 (16-layer CNN model) | Specialized Orthopedic surgeons
Zhang et al.35 | Sensitivity, Specificity, PPV, NPV, AUC, Interrater reliability (Cohen’s Kappa) | 3D ultrasound, using a Philips IU22 machine | 1 Radiologist, 1 Medical Student, and 1 Fellow
Blüthgen et al.17 | AUC, Sensitivity, and Specificity | Generic image analysis software (ViDi Suite Version 2.0) | 2 Consultant Radiologists, and 1 Radiology Resident
Min et al.26 | AUC, Accuracy, TPR, FPR, and Specificity | YOLOv5, and EfficientNet-B3 | 3 Orthopedic Training Registrars, and an Orthopedic Consultant
Ju and Cai28 | Mean average precision (mAP 50) | YOLOv8 algorithm | Radiologists
Gan et al.25 | Accuracy, Sensitivity, Specificity, Youden Index, and AUC | CNN-Inception-v4 | Radiologists, and Orthopedists
Hendrix et al.27 | Sensitivity, Specificity, PPV, AUC, Cohen’s kappa coefficient, and fracture localization precision | YOLOv5s, and InceptionV3 | 5 Radiologists
Lee et al.29 | Sensitivity, Specificity, Accuracy, and AUC | CNN-RetinaNet, DeepLab v3, NasNet | 2 Radiologists, and 1 Radiology Resident
Knight et al.34 | Sensitivity, Specificity, PPV, NPV, Accuracy, and AUROC | CNN-ResNet34, and DenseNet121 | 3 novice, 2 intermediate, and 2 expert readers
Lee et al.20 | Accuracy, Sensitivity, Specificity, Correlation coefficient and DSC (Dice similarity coefficient) | U-Net, and detection and classification model based on RetinaNet | 1 Orthopedic surgeon
Li et al.21 | Sensitivity, Specificity, AUROC, Fleiss’ Kappa, Cohen’s Kappa | CNN-YOLOv3 and MobileNetV3 | 4 Hand Surgeons
Jacques et al.19 | Sensitivity, Specificity, PPV, NPV, and AUROC | BoneView (Gleamer) | 23 Radiologists
Mert et al.22 | Sensitivity, Specificity, and AUC | ChatGPT 4 | 1 Radiologist, 1 Hand Surgery Resident, 1 Medical Student and Gleamer BoneView™
Kim and MacKinnon31 | ROC, AUC, Specificity, and Sensitivity | Deep CNNs | 1 Radiology Registrar

3.3 Quality assessment of included studies

The MINORS quality appraisal results for each study appear in Table 2. Individual item scores (0–2) and total scores are reported for each study; items 9–12 apply only to comparative designs.

Table 2. Quality assessment of included studies using the MINORS tool.

Study ID | Study design | MINORS items 1–12 | Total
Zech et al.33 | Retrospective study | 2 2 1 2 1 2 2 1 2 2 2 2 | 21
Raisuddin et al.32 | Retrospective study | 2 2 1 2 0 2 2 1 2 2 2 2 | 20
Cohen et al.30 | Retrospective study | 2 1 1 2 2 2 2 1 2 2 2 2 | 21
Hardalaç et al.18 | Retrospective cohort study | 2 2 1 2 1 0 2 0 2 2 2 2 | 18
Anttila et al.16 | Retrospective cohort study | 2 2 1 2 1 1 2 1 2 2 2 2 | 20
Üreten et al.24 | Retrospective cohort study | 2 1 1 2 1 2 2 2 2 2 2 2 | 21
Oka et al.23 | Retrospective cohort study | 2 2 1 2 1 2 2 1 2 2 2 2 | 21
Zhang et al.35 | Prospective diagnostic study | 2 2 2 2 2 2 2 2 2 2 2 2 | 24
Blüthgen et al.17 | Retrospective cohort study | 2 2 1 2 1 2 2 1 2 2 2 2 | 21
Min et al.26 | Retrospective diagnostic study | 2 2 1 2 2 1 2 2 2 2 2 2 | 22
Ju and Cai28 | Technical/Methodological study | 2 1 0 2 1 2 2 2 2 2 2 2 | 20
Gan et al.25 | Retrospective study | 2 2 1 2 1 2 2 2 2 2 2 2 | 22
Hendrix et al.27 | Retrospective study | 2 2 1 2 2 2 2 1 2 2 0 2 | 20
Lee et al.29 | Retrospective study | 2 1 1 2 1 2 2 0 1 1 1 2 | 16
Knight et al.34 | Prospective diagnostic study | 2 2 2 2 2 2 2 1 1 2 2 2 | 22
Lee et al.20 | Retrospective study | 2 1 1 2 1 2 2 0 1 2 2 2 | 18
Li et al.21 | Retrospective study | 2 1 1 2 1 2 2 0 1 2 2 2 | 18
Jacques et al.19 | Retrospective study | 2 2 1 2 2 2 2 0 1 2 2 2 | 20
Mert et al.22 | Retrospective study | 2 1 1 2 2 2 2 0 2 2 2 2 | 20
Kim and MacKinnon31 | Retrospective study | 2 1 1 2 2 2 2 0 1 2 2 2 | 19

3.4 Pooled analysis of sensitivity and specificity

3.4.1 Forest plot of sensitivity and specificity

The forest plots display sensitivity and specificity for individual studies along with the pooled estimates: Figure 2 shows sensitivity and Figure 3 shows specificity. These plots illustrate the variation among studies and how each contributed to the overall results. Although 18 studies were selected for final inclusion, 22 entries appear in the forest plots because some studies evaluated more than one algorithm, and each algorithm was reported individually to avoid overlap.


Figure 2. Forest plot of sensitivity for AI models in detecting hand and wrist fractures.


Figure 3. Forest plot of specificity for AI models in detecting hand and wrist fractures.

Figure 4 shows the forest plot of logit sensitivity estimates. Logit sensitivity was estimated rather than logit specificity because sensitivity was the primary target of the systematic review—the detection of hand and wrist fractures using AI. The true positive rate (sensitivity) was important because the AI models were designed to detect hand and wrist fractures, and missing a fracture could have detrimental consequences owing to delayed intervention.


Figure 4. Forest plot of logit sensitivity for AI models in detecting hand and wrist fractures.

Abbreviations: CI = Confidence Interval; AI = Artificial Intelligence.

Most of the AI models, as shown in Figure 4, cluster around a logit sensitivity of 3–5, a positive sign that most models had strong diagnostic performance. Three AI models—3D ultrasound (Philips), CNN-ResNet34, and Deep CNN (Gleamer)—had logit sensitivity values closer to 12, suggesting exceptional sensitivity. Models such as VGG-16 and the radius segmentation U-Net & RetinaNet showed competitive sensitivity, with estimates of around 4–5. These variations illustrate the different capabilities of deep learning techniques in detecting hand and wrist fractures. The blue points in Figure 4 also carry lines representing confidence intervals, highlighting variations in performance across the different datasets.

The high sensitivity scores across most models indicate that the AI models used for detecting hand and wrist fractures were effective at identifying fractures. This is particularly important in a clinical setting, where missing a fracture (a false negative) could have serious consequences, as patients might not receive urgent intervention.
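As an illustration of how such forest plots can be produced, the sketch below draws a simple sensitivity forest plot with matplotlib. The study labels and counts are hypothetical placeholders, not values from the included studies, and the Wald intervals are only one of several possible interval choices.

```python
# Minimal forest-plot sketch for per-study sensitivity (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical entries: (label, true positives, false negatives).
entries = [("Model A", 95, 5), ("Model B", 180, 20), ("Model C", 45, 9)]

labels, sens, lower, upper = [], [], [], []
for label, tp, fn in entries:
    n = tp + fn
    p = tp / n
    se = np.sqrt(p * (1 - p) / n)            # Wald standard error of the proportion
    labels.append(label)
    sens.append(p)
    lower.append(max(p - 1.96 * se, 0.0))
    upper.append(min(p + 1.96 * se, 1.0))

y = np.arange(len(entries))[::-1]             # first study at the top
xerr = [np.array(sens) - np.array(lower), np.array(upper) - np.array(sens)]
plt.errorbar(sens, y, xerr=xerr, fmt="o", capsize=3)
plt.yticks(y, labels)
plt.xlabel("Sensitivity")
plt.title("Forest plot of per-study sensitivity (sketch)")
plt.tight_layout()
plt.show()
```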

3.4.2 Fixed-effects meta-analysis for sensitivity and specificity

The sensitivity and specificity values from multiple studies were combined using a fixed-effects model. The assumption was that all 18 studies were estimating the same true effect, with any differences arising from chance. Consequently, the pooled sensitivity and specificity were estimated as weighted averages, with studies having lower variance receiving more weight.
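A minimal sketch of this inverse-variance, fixed-effects pooling on the logit scale is shown below. The per-study sensitivities and case counts are hypothetical, and the variance approximation 1/(n p (1 - p)) for a logit-transformed proportion is an assumption of the sketch rather than the exact procedure used in this review.

```python
# Fixed-effects (inverse-variance) pooling of sensitivity on the logit scale (sketch).
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-study sensitivities and numbers of fracture cases.
sens = np.array([0.92, 0.88, 0.95, 0.81])
n_pos = np.array([120, 300, 80, 150])

var_logit = 1.0 / (n_pos * sens * (1 - sens))   # approximate variance of logit(p)
weights = 1.0 / var_logit                        # studies with lower variance get more weight

pooled_logit = np.sum(weights * logit(sens)) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print("Pooled sensitivity:", round(expit(pooled_logit), 3))
print("95% CI:", round(expit(pooled_logit - 1.96 * pooled_se), 3),
      "-", round(expit(pooled_logit + 1.96 * pooled_se), 3))
```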

3.4.3 Logit transformation

The logit transformation normalizes the data, increasing the reliability of the calculation. Since sensitivity and specificity values are bounded between 0 and 1, transforming them to the logit scale was necessary; the pooled estimates were then back-transformed to the probability scale for easier interpretation.
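The transformation itself is simple; the sketch below applies the logit and inverse-logit (expit) functions to a sensitivity computed from a hypothetical 2x2 table, with an assumed 0.5 continuity correction to guard against proportions of exactly 0 or 1.

```python
# Logit transformation of a sensitivity value and back-transformation (sketch).
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

tp, fn = 58, 2                              # hypothetical true positives / false negatives
p = (tp + 0.5) / (tp + fn + 1.0)            # 0.5 continuity correction avoids p = 0 or 1

x = logit(p)                                # unbounded scale used for pooling
print("sensitivity:", round(p, 3))
print("logit sensitivity:", round(x, 3))
print("back-transformed:", round(expit(x), 3))
```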

3.4.4 Heterogeneity analysis and random effect

Cochran’s Q test and the I2 statistic were used to measure the variability among the 18 studies. Higher values indicated larger variation, suggesting that the studies were not all estimating the same underlying effect. To investigate this high variation further, a random-effects model was also fitted, as it assumes that each study analyzed in the present systematic review has its own true effect rather than a single common effect.
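For illustration, the heterogeneity statistics and the random-effects pooled estimate can be computed as sketched below, using the standard Cochran's Q, I2, and DerSimonian-Laird formulas; the logit-scale effects and variances are hypothetical placeholders, not the review's data.

```python
# Cochran's Q, I^2, and a DerSimonian-Laird random-effects estimate (sketch).
import numpy as np

def dersimonian_laird(effects, variances):
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances
    fixed = np.sum(w * effects) / np.sum(w)          # fixed-effects estimate
    q = np.sum(w * (effects - fixed) ** 2)           # Cochran's Q
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                    # between-study variance
    w_star = 1.0 / (variances + tau2)                # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    return q, i2, tau2, pooled

# Hypothetical logit-sensitivity effects and their within-study variances.
q, i2, tau2, pooled = dersimonian_laird([2.4, 3.1, 1.9, 2.8], [0.04, 0.09, 0.02, 0.06])
print(f"Q = {q:.2f}, I2 = {i2:.1f}%, tau2 = {tau2:.3f}, pooled logit = {pooled:.3f}")
```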

3.5 Assessment of publication bias

Figure 5 shows the funnel plot of the included studies, displaying the relationship between study precision (standard error) and effect size. The points were annotated with numbers, along with their specific labels and the AI models used. The shape of the plot is asymmetrical, indicating publication bias, specifically a small-study effect. In the absence of publication bias, the study points would have been evenly scattered around the red vertical line. The vertical red line represents the overall mean log(DOR)—the log of the Diagnostic Odds Ratio, which measures the effectiveness of a diagnostic test and is calculated as the ratio of the odds of a true positive to the odds of a false positive. A closer look at Figure 5 shows that 16 AI models fall on the left side of the vertical red line and 7 on the right. The studies on the left side suggest that the corresponding AI models had a lower DOR, which translates into potentially lower test accuracy or effect size. The seven AI models on the right side suggest a higher DOR, which translates into potentially higher test accuracy or effect size.


Figure 5. Funnel plot assessing publication bias in included studies.

Abbreviations: DOR = Diagnostic Odds Ratio; SE = Standard Error.

The imbalance, i.e., the asymmetrical funnel plot, suggests potential publication bias, which likely resulted from the studies using different AI models and algorithms with varying sample sizes. The AI models on the left side were clustered closer together, indicating that their results were more consistent with each other, while the AI models on the right side were more spread out, indicating greater variability and uncertainty in their results. Precision was approximated by the standard error, as reflected on the y-axis: studies with higher precision (smaller error) appear at the top, closer to 0, while studies with lower precision (larger errors) appear at the bottom of the funnel plot. The studies16,25,27,31–33 had standard errors below 0.025, close to 0, suggesting that their results are highly precise, probably owing to their large sample sizes, and indicating that they carried more weight in the overall conclusion. The study by Zhang et al.35 had a standard error placed at the far bottom of the funnel plot, indicating higher uncertainty in its estimate and thus lower precision and reliability. Its placement on the right side of the funnel indicates a higher diagnostic odds ratio, i.e., better diagnostic performance of the AI model. Nevertheless, the large standard error makes the model less trustworthy than studies with lower SE clustered near the top on the right side.
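For illustration, a funnel plot of this kind can be constructed as sketched below, computing log(DOR) and its standard error from hypothetical 2x2 tables; the counts are placeholders and a 0.5 continuity correction is assumed.

```python
# Funnel-plot sketch: log(DOR) against its standard error (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 2x2 counts per study: (TP, FP, FN, TN).
tables = [(90, 8, 10, 92), (45, 3, 5, 47), (200, 30, 15, 180), (30, 1, 12, 29)]

log_dor, se = [], []
for tp, fp, fn, tn in tables:
    tp, fp, fn, tn = tp + 0.5, fp + 0.5, fn + 0.5, tn + 0.5   # continuity correction
    log_dor.append(np.log((tp * tn) / (fp * fn)))
    se.append(np.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn))     # SE of log(DOR)

plt.scatter(log_dor, se)
plt.axvline(np.mean(log_dor), color="red")    # overall mean log(DOR)
plt.gca().invert_yaxis()                      # more precise studies (small SE) at the top
plt.xlabel("log(DOR)")
plt.ylabel("Standard error")
plt.title("Funnel plot (sketch)")
plt.show()
```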

Egger’s test, shown in Table 4, indicates that the precision coefficient was statistically significant (p < 0.001), supporting the presence of publication bias among the published AI models. The R-squared value indicated that precision (the inverse of SE) explained 83.4% of the variability in the standardized effect (log DOR/SE), suggesting that the model accounted for most of the variance; precision was therefore an important predictor of the standardized effect. The adjusted R-squared of 82.6% likewise indicated that most of the variability in the standardized effect was explained by precision.
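An Egger-style regression of the standardized effect (log DOR divided by SE) on precision (1/SE) can be run as sketched below; the output has the same layout as the OLS summary in Table 4, although the input values here are hypothetical.

```python
# Egger-style regression sketch: standardized effect on precision (hypothetical data).
import numpy as np
import statsmodels.api as sm

log_dor = np.array([3.1, 2.4, 4.0, 1.8, 2.9, 3.6])    # hypothetical per-study log(DOR)
se = np.array([0.30, 0.45, 0.25, 0.60, 0.35, 0.28])    # hypothetical standard errors

standardized_effect = log_dor / se                      # dependent variable
precision = 1.0 / se                                    # predictor

X = sm.add_constant(precision)                          # adds the intercept term
model = sm.OLS(standardized_effect, X).fit()
print(model.summary())                                  # OLS summary in the style of Table 4
```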

Table 3. Measures of variability in the studies.

Heterogeneity & Random effect | Sensitivity | Specificity
Fixed-Effects Pooled | 0.796 | 0.903
Random-Effects Pooled | 0.910 | 0.912
Heterogeneity (I2) | 99.09% | 96.43%

Table 4. Egger’s Test Summary: OLS Regression Results.

Dep. Variable: Standardized_Effect | R-squared: 0.834
Model: OLS | Adj. R-squared: 0.826
Method: Least Squares | F-statistic: 105.7
No. Observations: 23 | Prob (F-statistic): 1.19e-09
Df Residuals: 21 | Log-Likelihood: -121.5
Df Model: 1 | AIC: 247
Covariance Type: nonrobust | BIC: 249.3

| coef | std err | t | P>|t| | [0.025 | 0.975]
const | 51.267 | 14.714 | 3.484 | 0.002 | 20.668 | 81.866
Precision | 3.466 | 0.337 | 10.282 | 0.000 | 2.765 | 4.167

Omnibus: 10.832 | Durbin-Watson: 1.84
Prob (Omnibus): 0.004 | Jarque-Bera (JB): 8.853
Skew: 1.246 | Prob (JB): 0.012
Kurtosis: 4.741 | Cond. No.: 61.8

4. Discussion

In recent years, artificial intelligence has been spreading into various aspects of life, such as finance, education, manufacturing and Industry 4.0, retail and e-commerce, transport and logistics, agriculture, cybersecurity, media and entertainment, energy and environment, human resources and recruitment, legal and compliance, and healthcare. In healthcare, AI has driven innovations in medical imaging—cancer,36,37 fractures,38,39 and brain disorders40,41—personalized treatment plans,42,43 drug discovery and development,44,45 AI-assisted surgeries,46,47 and predictive analytics for patient outcomes.48,49 Therefore, this systematic review investigates the accuracy of artificial intelligence (AI) in detecting hand and wrist fractures.

4.1 Summary of findings

A substantial variability was observed across the studies in terms of sensitivity and specificity. The sensitivity (0.796) and specificity (0.903) in the fixed-effects pooled estimates indicate that the fixed-effects model had higher specificity compared to sensitivity. However, the random-effects model showed higher values for both sensitivity (0.910) and specificity (0.912), suggesting that the random-effects model demonstrated better diagnostic performance for AI in detecting hand and wrist fractures.

The heterogeneity scores for sensitivity (99.09%) and specificity (96.43%) were very high, as shown in Table 3. This indicates substantial inconsistencies across the studies, likely due to variations in the devices and algorithms used across the 18 reviewed studies. As a result, caution is necessary when generalizing these findings across different clinical settings.

In terms of sensitivity of the devices and their respective algorithms, the 3D Ultrasound-Philips IU22 machine,35 CNN-ResNet34 & 3DU,34 and Deep CNN-Gleamer22 reported the highest scores, as shown in Figure 2. The mean sensitivity was approximately 90%.

Studies by Hendrix et al.27 (YOLOv5 & Inception-v3), Jacques et al.19 (Deep CNN-Gleamer), Li et al.21 (YOLOv3 & MobileNetV3), Knight et al.34 (CNN-ResNet34 & 2DU), Cohen et al.30 (Deep CNN-Gleamer), Min et al.26 (YOLOv5), Blüthgen et al.17 (ViDi Suite), Zech et al.33 (Region-Based CNN), and Mert et al.22 (ChatGPT4) all had sensitivity values below 90%. This indicates that the algorithms used in these studies were less effective in identifying positive cases (true positives) than those in studies with sensitivity values above 90%.

The lower sensitivity in these studies could be attributed to a higher percentage of missed true positive cases (false negatives) when detecting hand and wrist fractures. Therefore, studies with sensitivity values below 90% signal that the AI models used may not be fully reliable for diagnosing hand and wrist fractures, increasing the risk of missed diagnoses.

The specificity identified algorithms that can distinguish individuals without hand and wrist fractures (true negatives) from those incorrectly identified as having fractures (false positives). The mean specificity was approximately 90%. Therefore, studies with higher specificities (greater than 90%) demonstrated that the algorithms correctly identified individuals without hand or wrist fractures, indicating a minimized risk of false positives, and vice versa.

Studies by Min et al.26 (YOLOv5), Lee et al.29 (CNN-RetinaNet, DeepLab v3 & NasNet), Üreten et al.24 (ResNet-50), Üreten et al.24 (GoogLeNet), Raisuddin et al.32 (Deep Wrist pipeline), Kim and MacKinnon31 (Deep CNN), Anttila et al.16 (segmentation-based U-Net architecture), Zhang et al.35 (3D ultrasound, Philips IU22 machine), Zech et al.33 (Region-Based CNN), and Jacques et al.19 all reported specificity values below 90%. Conversely, studies with higher specificity scores indicated that their AI models performed better in avoiding false alarms.

4.2 Strengths and limitations

Like any other study, the present systematic review had its own strengths and limitations. Its strengths included the fact that most of the studies reported high scores for both sensitivity (14 out of 23 AI models) and specificity (12 out of 23 AI models), with values ≥90%. Higher sensitivity scores indicated fewer missed fractures, suggesting that radiologists could potentially rely on these AI models to detect hand and wrist fractures.

The systematic review indicated that deep learning models, particularly those based on CNNs, dominated the performance of the AI models reviewed. This pattern was reflected in the forest plot of both sensitivity and specificity, where the top quarter was largely occupied by CNN-based AI models. However, this review does not advocate that CNN models are inherently superior to other models; rather, it highlights opportunities for further improvements and modifications to develop better algorithms or models. Enhancements could include training the models on larger sample sizes or fine-tuning hyperparameters to improve predictive performance.

One of the limitations of this review was that some studies had smaller sample sizes than others. However, a logit transformation was applied to convert the proportions to an unbounded scale in preparation for statistical modelling and meta-analysis, which helped stabilize the variance arising from different sample sizes. The pooled logit estimates were then back-transformed with the inverse logit function for easier interpretation of the sensitivity and specificity scores. Additionally, sample weighting was performed to ensure that the final pooled estimates of sensitivity and specificity were reliable. Future AI studies can enhance their models by training on larger datasets and continuously reviewing and improving their performance.

Another limitation arose from the interpretation of the confidence intervals presented in the forest plots. The results indicated that 9 out of the 18 studies had sensitivity values below 90%, which was concerning as it suggested a higher risk of missing hand and wrist fractures.

Lastly, the systematic review aimed to evaluate sensitivity and specificity and ensure that the meta-analysis provided robust evidence for the clinical superiority of one AI model over other comparative AI models. The approach involved assessing bias and robustness in terms of publication bias and sensitivity analysis. Most studies failed to report the AUC, NPV, PPV, and even confidence intervals. As a result, the study relied on sensitivity, specificity, and sample sizes to determine publication bias.

5. Conclusion

Most AI models demonstrated good diagnostic accuracy, with high sensitivity and specificity scores (≥90%). However, some models fell short in sensitivity and specificity (<90%), indicating performance variations across different AI models and algorithms.

From a clinical perspective, AI models with lower sensitivity scores may fail to detect hand and wrist fractures, potentially delaying treatment, while those with lower specificity scores could lead to unnecessary interventions—treating hands and wrists that are not fractured. The AI models were trained on datasets with varying sample sizes, using different devices and algorithms. Therefore, it is essential to standardize training datasets and algorithms and strive for greater consistency in AI models.

Ethical considerations

Not applicable. This study is a systematic review of published literature and did not involve human or animal subjects.

Reporting guidelines

This article follows the PRISMA 2020 reporting guideline for systematic reviews.50

The completed PRISMA checklist and flowchart are available at: https://zenodo.org/records/16749232.

Data are available under the terms of the Creative Commons Zero v1.0 Universal licence (CC0).
