Software Tool Article

AIr - Artificial Intelligence Risk of bias tool (AIr)

[version 1; peer review: awaiting peer review]
PUBLISHED 03 Nov 2025

Abstract

Background

The use of artificial intelligence and machine learning in the development of prediction models is increasing exponentially, but published models show a high degree of heterogeneity and associated bias. The existing risk of bias assessment tool has a steep learning curve, and we aimed to develop a succinct and effective tool for evaluating the risk of bias in cardiology research.

Methods

Our tool (AIr) consists of 10 questions and can be used to assess the risk of bias in model development, external validation, or a combination of the two, in machine learning or artificial intelligence models.

Results

AIr was as effective as the current risk of bias tool, PROBAST, but was significantly more succinct and showed greater inter-rater reliability.

Conclusion

We propose that our tool maintains validity regarding the assessment of the risk of bias in cardiology publications whilst increasing reliability when compared with PROBAST.

Keywords

artificial intelligence, cardiovascular

Background

What is machine learning?

Machine learning is a broad term that encompasses both ‘shallow learning’ and ‘deep learning’ methods. Machine learning began in 1957 with the development of the Perceptron.1 Since then, the use of artificial intelligence through machine learning has grown exponentially, with an estimated compound annual growth rate of approximately 40% until 2029.2

When machine learning techniques were initially developed, a feed-forward method was utilised whereby a set of predictors was analysed with the goal of identifying a single outcome. However, as machine learning has developed, so has the complexity of its algorithms. A deep learning model utilises the principles of shallow learning but with tens, if not hundreds, of pathways that interact to generate a prediction model; these are known as neural networks. The aim of a deep learning model is for the first layer of the neural network to recognise the pattern of the data presented; thereafter, further layers interact to form a prediction of the outcome. There are variations in how these models work, with both feed-forward and feed-backward methods used to address complex problems,3 whether in healthcare or the rest of society.4
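For illustration only, the minimal R sketch below shows the basic idea of a feed-forward pass: a vector of predictors is transformed through successive layers of weights and non-linear activations before a final function produces a predicted probability. The layer sizes, weights and predictor values are arbitrary placeholders and do not correspond to any of the cited models.

# Minimal feed-forward pass in R (illustrative only; arbitrary sizes and weights).
set.seed(1)
relu    <- function(x) pmax(x, 0)          # hidden-layer activation
sigmoid <- function(x) 1 / (1 + exp(-x))   # converts the final score into a probability

x  <- c(0.7, 1.2, -0.3, 0.5)                    # four hypothetical predictors for one patient
W1 <- matrix(rnorm(8 * 4), nrow = 8, ncol = 4)  # layer 1: 4 inputs -> 8 hidden units
W2 <- matrix(rnorm(8 * 8), nrow = 8, ncol = 8)  # layer 2: 8 -> 8 hidden units
w3 <- rnorm(8)                                  # output layer: 8 -> 1

h1 <- relu(W1 %*% x)         # first layer recognises patterns in the input
h2 <- relu(W2 %*% h1)        # deeper layer combines those patterns
p  <- sigmoid(sum(w3 * h2))  # predicted probability of the outcome
p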

Deep learning employs complex and extensive computer algorithms with varying methods of development and validation, as shown in a recent meta-analysis by Siontis et al. The heterogeneity of these data and of the development and validation methods highlights concerns regarding the risk of bias in these studies.5

There is an increasing drive to utilise prognostic machine-learned models as a predictive mechanism for medical conditions. A prognostic model is based on numerous variables selected at a given time point, with the aim that these variables can detect the risk of developing a specific complication or condition.

What is the clinical significance of this?

As discussed, one of the many potential goals of a machine-learned prediction model might be to identify those at greatest risk of developing a medical condition, or those at increased risk of a complication of treatment. Having the ability to focus resources on those with the greatest chance of developing a medical condition means that limited resources can be reallocated. Moreover, if we can accurately predict who, for example, will develop ventricular fibrillation and therefore has an increased risk of cardiac arrest, this could improve outcomes for patients and their families.

Following the significant rise in machine-learned models being developed, there are disparities in the quality of the research generated.6 Therefore, when systematic reviews and meta-analyses are conducted, there is often a high risk of bias in the included studies, with poor methodological quality.7 At present, there is only one widely used tool, PROBAST,8 for assessing the risk of bias in prediction model studies. Although it has many benefits, PROBAST involves twenty-seven questions and is time-consuming.

We aimed to generate a tool with a specific focus on artificial intelligence and machine learning algorithms that was succinct and efficient whilst maintaining reliability. Our null hypothesis was that our tool was as effective as PROBAST in assessing the risk of bias in machine learning and artificial intelligence models in cardiology. To validate our tool, we assessed it against cardiology papers that either developed or validated a machine-learned model. Cardiology in particular is generating a significant volume of machine learning research, and this made it a suitable field in which to validate the tool.

Methods

Tool development

To commence, we reviewed and extracted data from multiple risk-of-bias tools. We assessed the benefits and limitations of each tool and designed a tool analysing the risk of bias with a specific focus on machine learning for prediction models. The tools analysed were as follows, with their intended study type: JBI checklist (prognostic prevalence studies),9 QUADAS (diagnostic accuracy studies),10 ROBINS-E (observational studies of exposures),11 QUIPS (prognostic factors),12 ROBINS-I (non-randomised trials of interventions),13 Cochrane RoB 2.0 (randomised trials),14 and, with a particular focus, PROBAST (prognostic tool for prediction model studies).8 These tools were chosen because they are well validated.

We utilised this data alongside external feedback from three experts in this field to refine our tool. We separated the tool into three domains:

  • 1. Domain One – Participants/Data Selection

  • 2. Domain Two – Predictors and Outcomes

  • 3. Domain Three – Analysis

Research on artificial intelligence/machine-learned models often includes a section in which the model is developed and then, in some cases, externally validated to further strengthen the design. At other times, research will seek to externally validate a previously produced model. We therefore developed two versions of the tool, depending on the aims of the paper being analysed. Tool one is for studies that address either the development or the external validation of a model. Tool two is used when the two are combined, i.e. development together with external validation. Each question is answered with the same signalling codes as proposed by Cochrane: Yes (Y), Probably Yes (PY), Probably No (PN), No (N) and No Information (NI).15 Guidance was given for answering each question, and the risk of bias is assessed at the end of each domain and overall. This was validated with pilot data before the final tool was developed. See Figure 1.


Figure 1. Depiction of how to utilise the tool.
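To make the structure of the tool concrete, the sketch below shows one way the signalling codes and domains could be represented in R; the object and field names are our own illustrative choices and are not part of the published tool (the definitive versions are provided in Appendices 2 and 3).

# Illustrative representation of the AIr signalling codes and domain structure.
signalling_codes <- c("Y", "PY", "PN", "N", "NI")   # Cochrane-style response options

air_domains <- list(
  "Domain One - Participants/Data Selection" = 1,     # question 1
  "Domain Two - Predictors and Outcomes"     = 2:4,   # questions 2-4
  "Domain Three - Analysis"                  = 5:10   # questions 5-10 (Q10 is development only)
)

# One reviewer's answers for a single paper, stored as a named character vector.
answers <- setNames(rep("NI", 10), paste0("Q", 1:10))
answers["Q1"] <- "Y"                 # e.g. appropriate data selection and inclusions/exclusions
all(answers %in% signalling_codes)   # sanity check: should be TRUE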

Validation of the tool

To validate our risk of bias tool, we utilised the pre-analysed ‘example papers’ shown on PROBAST’s website. We selected three papers, two of which solely focused on development or validation and one which focused on both.16–18 We conducted a blinded assessment of the risk of bias using these papers and subsequently compared this with the existing PROBAST analyses of the same papers. Following a successful preliminary analysis, we compared our tool with PROBAST on recent machine learning studies in cardiology. This was then used to refine the tool further.

Search strategy

We used the search terms ‘artificial intelligence’, ‘machine learning’ and ‘original research articles’, restricted to 2022 and 2023, in both JACC and Circulation on 12 May 2023. This generated 58 papers. Twenty-four papers were excluded based on the abstract. On reviewing the full-text papers, a further two were excluded because they did not use a prognostic model. The list of included papers can be seen in Appendix 1. See Figure 2 for further information regarding paper selection.


Figure 2. Flow chart showing the selection of papers for our research.

Analysis of the studies

We conducted two blinded rounds of data extraction regarding the risk of bias, with four authors in round one (MHS, SR, AA, KS) and two in round two (GS, HG). Each paper was allocated to an individual and analysed using both PROBAST and our new tool (AIr). Throughout the process, we discussed any issues and adjusted the wording accordingly. All disputes were settled by an arbitrator (HG, SM, KM). Data were collected using Google Forms and analysed using R (version 4.3.2).

Results

We developed a tool in two formats both with three domains. The domains are Participants/Data Selection, Predictors and Outcomes, and Analysis. There are 10 questions, one in Domain One (Participants/Data Selection), three in Domain Two (Predictors and Outcomes) and six in Domain Three (Analysis). There is variation in questions five and nine depending on whether a model development or external validation study is being analysed. Similarly, question ten is solely for a model development study. Our questions are demonstrated in Table 1.

Table 1. Questions utilised in our tool.

Each question is followed by an explanation to assist the decision.

Domain One – Participants/Data Selection

Question One: Was data selection defined and of an appropriate nature, with appropriate inclusions and exclusions?
Explanation to assist decision: Appropriate data selection would include, but not be limited to, RCT, cohort or nested case-control study data. A higher ROB exists when data are extracted from already developed databases rather than from longitudinal studies; this higher ROB can be reduced if the authors allow for the potential bias in the data set. Furthermore, if the inclusion or exclusion criteria generate a cohort that is not representative of the target population, this increases the ROB.
Y/PY – both data selection and inclusions/exclusions are appropriate.
N/PN – either data selection or the inclusion/exclusion criteria are not appropriate.
NI – no information.

Domain One – Risk of bias
Low risk of bias if:

  • - Y/PY to appropriate data selection and inclusion/exclusion criteria.

  • - If N/PN relates to only one aspect of inappropriate data selection or inclusion/exclusion criteria, this could still be a low risk of bias but it should be justified.

High risk of bias – N/PN without justification for a low risk of bias.
Unclear – missing information that could generate a high risk of bias.
Domain Two – Predictors and Outcomes

Question Two: Prior to starting the analysis, were predictors defined with a method of assessment?
Explanation to assist decision: Predictors must be specified, together with their method of assessment, before the analysis is run so that this does not cause biased selection.
Y/PY – predictors were defined before the analysis commenced.
N/PN – predictors were not defined.
NI – no information.

Question Three: Prior to starting the analysis, were outcomes and their assessment defined consistently and independently of predictors?
Explanation to assist decision: Outcomes are defined appropriately if they can be objectively measured. If they are selected afterwards, this increases the risk of bias. If the definition of the outcome is based on predictors, this will likely overestimate the correlation between these variables.
Y/PY – outcomes were defined appropriately, prior to the analysis and independently of predictors.
N/PN – outcomes were not defined prior to the analysis or independently of predictors, or the definitions were not appropriate.
NI – no information.

Question Four: At the time point when the authors propose that their prediction model is implemented, were appropriate predictors available, and was the time period between the information gathered (by predictors) and deciding the outcome achievable?
Explanation to assist decision: Are the predictors feasible to ascertain at the point at which this prognostic model is intended to be used?
Y/PY – at the time of intended use, these predictors are available and it would be possible to assess the outcome in line with the paper.
N/PN – these predictors are not available at the intended time of use.
NI – no information.

Domain Two – Risk of bias
Low risk – all Y/PY, or one N/PN with justification that reduces the bias.
Unclear risk of bias – there is no information for a question but this is unlikely to affect the results, or there is an N/PN which is partially justified.
High risk – there are questions with multiple NI, the NI would risk altering the bias, or there is an N/PN in any question without justification.
Domain Three – Analysis

Question Five: Was the outcome represented adequately in the participant selection?
Explanation to assist decision: It is essential for the outcome to be represented sufficiently in the sample to ensure that the result is valid, without a risk of insufficient power to achieve a significant result or of overestimation.
Y/PY –
For model development studies, the events per variable (EPV) should be over 20. An EPV between 10 and 20 could be considered, although there is limited evidence for this.
For model validation studies, the number of participants with the outcome should be over 100.
N/PN – EPV under 10, or fewer than 100 participants with the outcome, respectively.
NI – no information.

Question Six: Were valid methods used for identification of the outcome which would correctly classify the condition without oversimplification?
Explanation to assist decision: Valid methods would include continuous predictors that remained as continuous variables without categorisation, and categorical predictors sorted according to pre-defined groups.
Y/PY – continuous predictors remained as continuous variables without categorisation, and categorical predictors use pre-defined groups.
N/PN – no specified categorical groups, or continuous variables were dichotomised.
NI – no information.

Question Seven: Were all participants included in the analysis, including how missing data were handled and information for those who dropped out?
Explanation to assist decision: There is a risk of bias associated with missing data because the missing data may be correlated with an outcome.
Y/PY – all participants were enrolled and none were removed from the analysis due to missing data; if any were, there is an adequate explanation and the number is small enough that it would be unlikely to alter the outcome of the analysis.
N/PN – participants were removed from the analysis due to missing data without explanation/justification, or a subgroup was excluded from the analysis.
NI – no information.

Question Eight: Were both confounding factors and other complexities accounted for in the analysis?
Explanation to assist decision: Examples of complexities in the analysis include censoring, competing risks and the sampling of control participants.
Y/PY – complexities and confounding have been accounted for.
N/PN – they have not been accounted for.
NI – no information.

Question Nine: Have model calibration and discrimination been assessed (with consideration for overfitting in model development studies)?
Explanation to assist decision: Calibration and discrimination should be assessed with a calibration plot, an adequate amount of data and suitable statistical tests.
In model development studies, internal validation should be assessed to mitigate overfitting. Overfitting of a model suggests that its predictions might not be applicable to new data. To mitigate this, internal validation must be assessed, e.g. through cross-validation or bootstrapping.
Y/PY – calibration and discrimination have been assessed, with an internal validation assessment in model development studies.
N/PN – this has not been considered.
NI – no information.
For model development studies only

Question Ten: Was the final model produced clearly linked to a singular multivariable analysis which is, in turn, independent of univariable analyses?
Explanation to assist decision: There is an increased risk of bias if multiple univariable analyses are computed before the statistically significant variables are grouped together to form a multivariable analysis.
Y/PY – predictors were not based on univariable analysis and the final model directly links to the multivariable analysis.
N/PN – predictors are based on univariable analysis prior to multivariable modelling.
NI – no information.

Domain Three – Risk of bias
Low risk – all Y/PY, there is no information for a question but this is unlikely to affect the results, or there is an N/PN which is justified.
Unclear risk of bias – missing information which is unlikely to affect the results, but we cannot be sure.
High risk – there are questions with multiple NI, the NI would risk altering the bias, or there is an N/PN in any question without justification.

Overall risk of bias
Low risk – low in all domains, or unclear in one domain as long as this is justified.
Unclear risk of bias – unclear in one domain but not justified, or a high risk that is justified.
High risk – any high risk that is not justified, or multiple unclear risks.
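As a rough illustration of how the per-domain judgements in Table 1 combine into an overall rating, the R sketch below encodes a simplified version of the aggregation rules together with the events-per-variable check from Question Five. It deliberately omits the justification-based exceptions described above, and the function names are our own.

# Simplified sketch of the overall risk-of-bias roll-up described in Table 1.
# It ignores the "with justification" exceptions, so it is an approximation only.
overall_rob <- function(domains) {
  # domains: per-domain ratings, e.g. c("Low", "Unclear", "Low")
  if (any(domains == "High"))        return("High")
  if (sum(domains == "Unclear") > 1) return("High")      # multiple unclear domains
  if (any(domains == "Unclear"))     return("Unclear")
  "Low"
}

# Events-per-variable check used in Question Five for model development studies.
epv <- function(n_events, n_candidate_predictors) n_events / n_candidate_predictors

overall_rob(c("Low", "Low", "Unclear"))              # "Unclear"
epv(n_events = 450, n_candidate_predictors = 20)     # 22.5, above the suggested threshold of 20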

As noted above, machine-learned models may be developed, externally validated, or both developed and externally validated. In a joint analysis, Domains One and Three are duplicated with some variation, whereas Domain Two is combined given that the same machine-learned model is used. Tool One is used when analysing the risk of bias of a paper that solely develops or solely validates a model, and Tool Two when these are combined. The risk of bias is assessed at the end of each domain and for the tool overall. Please see Appendices 2 and 3 for the completed tools for use in research/clinical work.

We utilised these questions alongside PROBAST to blindly assess the risk of bias in 32 cardiology papers from JACC and Circulation. We hypothesised that our tool would be as effective as PROBAST in assessing the risk of bias. Our analysis demonstrated no difference between AIr and PROBAST in either Round 1 or Round 2 (X-squared = 0.234, df = 2, p = 0.890 and X-squared = 0.541, df = 2, p = 0.763, respectively); AIr was therefore as effective as PROBAST in assessing the risk of bias. The distributions for each round and the arbitrated results can be seen in Figures 3 and 4.
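For readers wishing to reproduce this type of comparison, the sketch below shows how such a chi-squared test over the distribution of overall ratings (low/unclear/high) can be run in R; the counts are placeholders rather than our data.

# Illustrative chi-squared comparison of risk-of-bias ratings between the two tools.
# The counts below are placeholders, not the study data.
ratings <- matrix(
  c(10, 6, 16,    # AIr:     low, unclear, high
    11, 7, 14),   # PROBAST: low, unclear, high
  nrow = 2, byrow = TRUE,
  dimnames = list(tool = c("AIr", "PROBAST"),
                  rob  = c("Low", "Unclear", "High"))
)

chisq.test(ratings)   # a 2 x 3 table gives df = 2, as in the tests reported above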


Figure 3. Bar chart demonstrating the Round One and Round Two results for AIr and PROBAST.

(V1 and V2 stand for Round One and Round Two respectively. ROB is the risk of bias).


Figure 4. Demonstrating the risk of bias with the arbitrated outcomes.

(ROB is the risk of bias).

Analysis of inter-rater reliability via weighted Cohen's kappa showed ‘moderate agreement’ for AIr, with a value of 0.47. In comparison, there was lower inter-rater reliability for PROBAST, with ‘fair agreement’ between analyses and a weighted Cohen's kappa of 0.35.
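A weighted Cohen's kappa of this kind can be computed in R using, for example, the irr package; the sketch below uses made-up ratings purely to illustrate the call and the choice of weighting, not our data.

# Illustrative weighted Cohen's kappa between two assessment rounds (requires the 'irr' package).
# Ratings are coded on an ordered scale: 1 = low, 2 = unclear, 3 = high risk of bias.
library(irr)

round1 <- c(1, 3, 2, 3, 1, 3, 2, 1)   # made-up ratings from the first round
round2 <- c(1, 3, 3, 3, 1, 2, 2, 1)   # made-up ratings from the second round

# Squared weights penalise disagreements two categories apart more than those one apart.
kappa2(cbind(round1, round2), weight = "squared")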

Discussion

At present, PROBAST is the sole tool for assessing the risk of bias in artificial intelligence/machine-learned models. As mentioned above, AI models are frequently developed but often carry a high risk of bias, which therefore needs to be accurately assessed.

Our results suggest that, despite having significantly fewer questions (ten versus twenty-seven), AIr was as effective as PROBAST in assessing the risk of bias whilst improving the inter-rater reliability of the results. This suggests that our tool is effective in assessing the risk of bias and can be applied to further artificial intelligence/machine learning studies.

Limitations

This is an area in which there is only one widely used tool, which makes it the only possible comparison. One potential limitation of our analysis is that we compared our tool with another tool rather than directly assessing its validity. This could be an issue if there were errors within PROBAST, as these would be likely to bias our results. This is mitigated by the publications supporting PROBAST's validity, which make it a suitable comparator for our tool.

The AIr tool has only ten questions, which makes it shorter than the PROBAST tool but also easier to use. We concentrated only on papers in high-impact cardiology journals, which makes it difficult to predict how the tool will perform in fields outside cardiology and in papers outside these journals.

Conclusion

AIr is a tool for analysing the risk of bias in studies that use artificial intelligence or machine learning to predict an outcome or a complication. It is as effective as the current leading tool, PROBAST, has a greater degree of inter-rater reliability and is much easier to use. It is a promising means of measuring the risk of bias in artificial intelligence papers, although it needs further validation outside of cardiology.

Extended data

Repository name: AIr - Artificial Intelligence Risk of bias tool appendices. https://doi.org/10.5281/zenodo.17130713.19

The project contains the following underlying data:

  • - Appendix 1 – Table of included studies

  • - Appendix 2 - AIr - Tool 1 – for studies that either develop or externally validate an artificial intelligence/machine learning model

  • - Appendix 3 - AIr - Tool 2 – for studies that develop AND externally validate an artificial intelligence/machine-learned model

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
