Software Tool Article

AIr - Artificial Intelligence Risk of bias tool (AIr)

[version 1; peer review: awaiting peer review]
PUBLISHED 03 Nov 2025

Abstract

Background

The use of artificial intelligence and machine learning in the development of prediction models is increasing exponentially, but published models show a high degree of heterogeneity and associated bias. The existing risk of bias assessment tool has a steep learning curve, and we aimed to develop a succinct and effective tool for evaluating the risk of bias in cardiology research.

Methods

Our tool (AIr) consists of 10 questions and can be used to assess the risk of bias in model development, external validation, or a combination of the two, in machine learning or artificial intelligence models.

Results

AIr was as effective as the current risk of bias tool, PROBAST, but was significantly more succinct and showed greater inter-rater reliability.

Conclusion

We propose that our tool maintains validity regarding the assessment of the risk of bias in cardiology publications whilst increasing reliability when compared with PROBAST.

Keywords

artificial intelligence, cardiovascular

Background

What is machine learning?

Machine learning is a broad term that encompasses both ‘shallow learning’ and ‘deep learning’ methods. Machine learning began in 1957 with the development of the Perceptron.1 Since then, the use of artificial intelligence through machine learning has grown exponentially, with an estimated compound annual growth rate of approximately 40% until 2029.2

When machine learning techniques were initially developed, a feed-forward method was utilised whereby a set of predictors was analysed with the goal of identifying a single outcome. However, as machine learning has developed, so has the complexity of its algorithms. A deep learning model utilises the principles of shallow learning but with tens, if not hundreds, of pathways that interact to generate a prediction model; these are known as neural networks. The aim of a deep learning model is for the first layer of the neural network to recognise the pattern of the data presented; thereafter, further layers interact to form a prediction of the outcome. There are variations in how these models work, with both feed-forward and feed-backward methods used to address complex problems,3 whether in healthcare or the rest of society.4
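For illustration only, the minimal R sketch below shows the basic idea of a feed-forward pass: a vector of predictors is transformed through successive layers of weights and non-linear activations before a final function produces a predicted probability. The layer sizes, weights and predictor values are arbitrary placeholders and do not correspond to any of the cited models.

# Minimal feed-forward pass in R (illustrative only; arbitrary sizes and weights).
set.seed(1)
relu    <- function(x) pmax(x, 0)          # hidden-layer activation
sigmoid <- function(x) 1 / (1 + exp(-x))   # converts the final score into a probability

x  <- c(0.7, 1.2, -0.3, 0.5)                    # four hypothetical predictors for one patient
W1 <- matrix(rnorm(8 * 4), nrow = 8, ncol = 4)  # layer 1: 4 inputs -> 8 hidden units
W2 <- matrix(rnorm(8 * 8), nrow = 8, ncol = 8)  # layer 2: 8 -> 8 hidden units
w3 <- rnorm(8)                                  # output layer: 8 -> 1

h1 <- relu(W1 %*% x)         # first layer recognises patterns in the input
h2 <- relu(W2 %*% h1)        # deeper layer combines those patterns
p  <- sigmoid(sum(w3 * h2))  # predicted probability of the outcome
p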

Deep learning employs complex and extensive computer algorithms with varying methods of development and validation, as shown in a recent meta-analysis by Siontis et al. The heterogeneity of these data and of the development and validation methods highlights concerns regarding the risk of bias in these studies.5

There is an increasing drive to utilise prognostic machine-learned models as a predictive mechanism for medical conditions. A prognostic model is based on numerous variables selected at a given time point, with the aim that these variables can detect the risk of developing a specific complication or condition.

What is the clinical significance of this?

As discussed, one of the many potential goals of a machine-learned prediction model might be to identify those at greatest risk of developing a medical condition, or those at increased risk of a complication of treatment. Having the ability to focus resources on those with the greatest chance of developing a medical condition means that limited resources can be reallocated. Moreover, if we can accurately predict who, for example, will develop ventricular fibrillation and therefore has an increased risk of cardiac arrest, this could improve outcomes for patients and their families.

Following the significant rise in machine-learned models being developed, there are disparities in the quality of the research generated.6 Therefore, when systematic reviews and meta-analyses are conducted, there is often a high risk of bias in the included studies, with poor methodological quality.7 At present, there is only one widely used tool, PROBAST,8 for assessing the risk of bias in prediction model studies. Although it has many benefits, PROBAST involves twenty-seven questions and is time-consuming.

We aimed to generate a tool with a specific focus on artificial intelligence and machine learning algorithms that was succinct and efficient whilst maintaining reliability. Our null hypothesis was that our tool was as effective as PROBAST in assessing the risk of bias in machine learning and artificial intelligence models in cardiology. To validate our tool, we assessed it against cardiology papers that either developed or validated a machine-learned model. Cardiology in particular is generating a significant volume of machine learning research, and this made it a suitable field in which to validate the tool.

Methods

Tool development

To commence, we reviewed and extracted data from multiple risk-of-bias tools. We assessed the benefits and limitations of each tool and designed a tool analysing the risk of bias with a specific focus on machine learning for prediction models. The tools analysed were as follows, with their intended study type: JBI checklist (prognostic prevalence studies),9 QUADAS (diagnostic accuracy studies),10 ROBINS-E (observational studies of exposures),11 QUIPS (prognostic factors),12 ROBINS-I (non-randomised trials of interventions),13 Cochrane RoB 2.0 (randomised trials),14 and, with a particular focus, PROBAST (prognostic tool for prediction model studies).8 These tools were chosen because they are well validated.

We utilised this data alongside external feedback from three experts in this field to refine our tool. We separated the tool into three domains:

  • 1. Domain One – Participants/Data Selection

  • 2. Domain Two – Predictors and Outcomes

  • 3. Domain Three – Analysis

Research on artificial intelligence/machine-learned models often includes a section in which the model is developed and then, in some cases, externally validated to further strengthen the design. At other times, research will seek to externally validate a previously produced model. We therefore developed two versions of the tool, depending on the aims of the paper being analysed. Tool one is for studies that address either the development or the external validation of a model. Tool two is used when the two are combined, i.e. development together with external validation. Each question is answered with the same signalling codes as proposed by Cochrane: Yes (Y), Probably Yes (PY), Probably No (PN), No (N) and No Information (NI).15 Guidance was given for answering each question, and the risk of bias is assessed at the end of each domain and overall. This was validated with pilot data before the final tool was developed. See Figure 1.


Figure 1. Depiction of how to utilise the tool.
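To make the structure of the tool concrete, the sketch below shows one way the signalling codes and domains could be represented in R; the object and field names are our own illustrative choices and are not part of the published tool (the definitive versions are provided in Appendices 2 and 3).

# Illustrative representation of the AIr signalling codes and domain structure.
signalling_codes <- c("Y", "PY", "PN", "N", "NI")   # Cochrane-style response options

air_domains <- list(
  "Domain One - Participants/Data Selection" = 1,     # question 1
  "Domain Two - Predictors and Outcomes"     = 2:4,   # questions 2-4
  "Domain Three - Analysis"                  = 5:10   # questions 5-10 (Q10 is development only)
)

# One reviewer's answers for a single paper, stored as a named character vector.
answers <- setNames(rep("NI", 10), paste0("Q", 1:10))
answers["Q1"] <- "Y"                 # e.g. appropriate data selection and inclusions/exclusions
all(answers %in% signalling_codes)   # sanity check: should be TRUE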

Validation of the tool

To validate our risk of bias tool, we utilised the pre-analysed ‘example papers’ shown on PROBAST’s website. We selected three papers, two of which solely focused on development or validation and one which focused on both.16–18 We conducted a blinded assessment of the risk of bias using these papers and subsequently compared this with the existing PROBAST analyses of the same papers. Following a successful preliminary analysis, we compared our tool with PROBAST on recent machine learning studies in cardiology. This was then used to refine the tool further.

Search strategy

We used the search terms ‘artificial intelligence’, ‘machine learning’ and ‘original research articles’, restricted to 2022 and 2023, in both JACC and Circulation on 12 May 2023. This generated 58 papers. Twenty-four papers were excluded based on the abstract. On reviewing the full-text papers, a further two were excluded because they did not use a prognostic model. The list of included papers can be seen in Appendix 1. See Figure 2 for further information regarding paper selection.


Figure 2. Flow chart showing the selection of papers for our research.

Analysis of the studies

We conducted two blinded rounds of data extraction regarding the risk of bias, with four authors in round one (MHS, SR, AA, KS) and two in round two (GS, HG). Each paper was allocated to an individual and analysed using both PROBAST and our new tool (AIr). Throughout the process, we discussed any issues and adjusted the wording accordingly. All disputes were settled by an arbitrator (HG, SM, KM). Data were collected using Google Forms and analysed using R (version 4.3.2).

Results

We developed a tool in two formats both with three domains. The domains are Participants/Data Selection, Predictors and Outcomes, and Analysis. There are 10 questions, one in Domain One (Participants/Data Selection), three in Domain Two (Predictors and Outcomes) and six in Domain Three (Analysis). There is variation in questions five and nine depending on whether a model development or external validation study is being analysed. Similarly, question ten is solely for a model development study. Our questions are demonstrated in Table 1.

Table 1. Questions utilised in our tool.

Each question is followed by an explanation to assist the decision.

Domain One – Participants/Data Selection

Question One: Was data selection defined and of an appropriate nature, with appropriate inclusions and exclusions?
Explanation to assist decision: Appropriate data selection would include, but not be limited to, RCT, cohort or nested case-control study data. A higher ROB exists when data are extracted from already developed databases rather than from longitudinal studies; this higher ROB can be reduced if the authors allow for the potential bias in the data set. Furthermore, if the inclusion or exclusion criteria generate a cohort that is not representative of the target population, this increases the ROB.
Y/PY – both data selection and inclusions/exclusions are appropriate.
N/PN – either data selection or the inclusion/exclusion criteria are not appropriate.
NI – no information.

Domain One – Risk of bias
Low risk of bias if:

  • - Y/PY to appropriate data selection and inclusion/exclusion criteria.

  • - If N/PN relates to only one aspect of inappropriate data selection or inclusion/exclusion criteria, this could still be a low risk of bias but it should be justified.

High risk of bias – N/PN without justification for a low risk of bias.
Unclear – missing information that could generate a high risk of bias.
Domain Two – Predictors and Outcomes

Question Two: Prior to starting the analysis, were predictors defined with a method of assessment?
Explanation to assist decision: Predictors must be specified, together with their method of assessment, before the analysis is run so that this does not cause biased selection.
Y/PY – predictors were defined before the analysis commenced.
N/PN – predictors were not defined.
NI – no information.

Question Three: Prior to starting the analysis, were outcomes and their assessment defined consistently and independently of predictors?
Explanation to assist decision: Outcomes are defined appropriately if they can be objectively measured. If they are selected afterwards, this increases the risk of bias. If the definition of the outcome is based on predictors, this will likely overestimate the correlation between these variables.
Y/PY – outcomes were defined appropriately, prior to the analysis and independently of predictors.
N/PN – outcomes were not defined prior to the analysis or independently of predictors, or the definitions were not appropriate.
NI – no information.

Question Four: At the time point when the authors propose that their prediction model is implemented, were appropriate predictors available, and was the time period between the information gathered (by predictors) and deciding the outcome achievable?
Explanation to assist decision: Are the predictors feasible to ascertain at the point at which this prognostic model is intended to be used?
Y/PY – at the time of intended use, these predictors are available and it would be possible to assess the outcome in line with the paper.
N/PN – these predictors are not available at the intended time of use.
NI – no information.

Domain Two – Risk of bias
Low risk – all Y/PY, or one N/PN with justification that reduces the bias.
Unclear risk of bias – there is no information for a question but this is unlikely to affect the results, or there is an N/PN which is partially justified.
High risk – there are questions with multiple NI, the NI would risk altering the bias, or there is an N/PN in any question without justification.
Domain Three – Analysis

Question Five: Was the outcome represented adequately in the participant selection?
Explanation to assist decision: It is essential for the outcome to be represented sufficiently in the sample to ensure that the result is valid, without a risk of insufficient power to achieve a significant result or of overestimation.
Y/PY –
For model development studies, the events per variable (EPV) should be over 20. An EPV between 10 and 20 could be considered, although there is limited evidence for this.
For model validation studies, the number of participants with the outcome should be over 100.
N/PN – EPV under 10, or fewer than 100 participants with the outcome, respectively.
NI – no information.

Question Six: Were valid methods used for identification of the outcome which would correctly classify the condition without oversimplification?
Explanation to assist decision: Valid methods would include continuous predictors that remained as continuous variables without categorisation, and categorical predictors sorted according to pre-defined groups.
Y/PY – continuous predictors remained as continuous variables without categorisation, and categorical predictors use pre-defined groups.
N/PN – no specified categorical groups, or continuous variables were dichotomised.
NI – no information.

Question Seven: Were all participants included in the analysis, including how missing data were handled and information for those who dropped out?
Explanation to assist decision: There is a risk of bias associated with missing data because the missing data may be correlated with an outcome.
Y/PY – all participants were enrolled and none were removed from the analysis due to missing data; if any were, there is an adequate explanation and the number is small enough that it would be unlikely to alter the outcome of the analysis.
N/PN – participants were removed from the analysis due to missing data without explanation/justification, or a subgroup was excluded from the analysis.
NI – no information.

Question Eight: Were both confounding factors and other complexities accounted for in the analysis?
Explanation to assist decision: Examples of complexities in the analysis include censoring, competing risks and the sampling of control participants.
Y/PY – complexities and confounding have been accounted for.
N/PN – they have not been accounted for.
NI – no information.

Question Nine: Have model calibration and discrimination been assessed (with consideration for overfitting in model development studies)?
Explanation to assist decision: Calibration and discrimination should be assessed with a calibration plot, an adequate amount of data and suitable statistical tests.
In model development studies, internal validation should be assessed to mitigate overfitting. Overfitting of a model suggests that its predictions might not be applicable to new data. To mitigate this, internal validation must be assessed, e.g. through cross-validation or bootstrapping.
Y/PY – calibration and discrimination have been assessed, with an internal validation assessment in model development studies.
N/PN – this has not been considered.
NI – no information.
For model development studies only

Question Ten: Was the final model produced clearly linked to a singular multivariable analysis which is, in turn, independent of univariable analyses?
Explanation to assist decision: There is an increased risk of bias if multiple univariable analyses are computed before the statistically significant variables are grouped together to form a multivariable analysis.
Y/PY – predictors were not based on univariable analysis and the final model directly links to the multivariable analysis.
N/PN – predictors are based on univariable analysis prior to multivariable modelling.
NI – no information.

Domain Three – Risk of bias
Low risk – all Y/PY, there is no information for a question but this is unlikely to affect the results, or there is an N/PN which is justified.
Unclear risk of bias – missing information which is unlikely to affect the results, but we cannot be sure.
High risk – there are questions with multiple NI, the NI would risk altering the bias, or there is an N/PN in any question without justification.

Overall risk of bias
Low risk – low in all domains, or unclear in one domain as long as this is justified.
Unclear risk of bias – unclear in one domain but not justified, or a high risk that is justified.
High risk – any high risk that is not justified, or multiple unclear risks.
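As a rough illustration of how the per-domain judgements in Table 1 combine into an overall rating, the R sketch below encodes a simplified version of the aggregation rules together with the events-per-variable check from Question Five. It deliberately omits the justification-based exceptions described above, and the function names are our own.

# Simplified sketch of the overall risk-of-bias roll-up described in Table 1.
# It ignores the "with justification" exceptions, so it is an approximation only.
overall_rob <- function(domains) {
  # domains: per-domain ratings, e.g. c("Low", "Unclear", "Low")
  if (any(domains == "High"))        return("High")
  if (sum(domains == "Unclear") > 1) return("High")      # multiple unclear domains
  if (any(domains == "Unclear"))     return("Unclear")
  "Low"
}

# Events-per-variable check used in Question Five for model development studies.
epv <- function(n_events, n_candidate_predictors) n_events / n_candidate_predictors

overall_rob(c("Low", "Low", "Unclear"))              # "Unclear"
epv(n_events = 450, n_candidate_predictors = 20)     # 22.5, above the suggested threshold of 20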

As noted above, machine-learned models may be developed, externally validated, or both developed and externally validated. In a joint analysis, Domains One and Three are duplicated with some variation, whereas Domain Two is combined given that the same machine-learned model is used. Tool One is used when analysing the risk of bias of a paper that solely develops or solely validates a model, and Tool Two when these are combined. The risk of bias is assessed at the end of each domain and for the tool overall. Please see Appendices 2 and 3 for the completed tools for use in research/clinical work.

We utilised these questions alongside PROBAST to blindly assess the risk of bias in 32 cardiology papers from JACC and Circulation. We hypothesised that our tool would be as effective as PROBAST in assessing the risk of bias. Our analysis demonstrated no difference between AIr and PROBAST in either Round 1 or Round 2 (X-squared = 0.234, df = 2, p = 0.890 and X-squared = 0.541, df = 2, p = 0.763, respectively); AIr was therefore as effective as PROBAST in assessing the risk of bias. The distributions for each round and the arbitrated results can be seen in Figures 3 and 4.
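For readers wishing to reproduce this type of comparison, the sketch below shows how such a chi-squared test over the distribution of overall ratings (low/unclear/high) can be run in R; the counts are placeholders rather than our data.

# Illustrative chi-squared comparison of risk-of-bias ratings between the two tools.
# The counts below are placeholders, not the study data.
ratings <- matrix(
  c(10, 6, 16,    # AIr:     low, unclear, high
    11, 7, 14),   # PROBAST: low, unclear, high
  nrow = 2, byrow = TRUE,
  dimnames = list(tool = c("AIr", "PROBAST"),
                  rob  = c("Low", "Unclear", "High"))
)

chisq.test(ratings)   # a 2 x 3 table gives df = 2, as in the tests reported above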


Figure 3. Bar chart demonstrating the Round One and Round Two results for AIr and PROBAST.

(V1 and V2 stand for Round One and Round Two respectively. ROB is the risk of bias).


Figure 4. Demonstrating the risk of bias with the arbitrated outcomes.

(ROB is the risk of bias).

Analysis of inter-rater reliability via weighted Cohen's kappa showed ‘moderate agreement’ for AIr, with a value of 0.47. In comparison, there was lower inter-rater reliability for PROBAST, with ‘fair agreement’ between analyses and a weighted Cohen's kappa of 0.35.
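A weighted Cohen's kappa of this kind can be computed in R using, for example, the irr package; the sketch below uses made-up ratings purely to illustrate the call and the choice of weighting, not our data.

# Illustrative weighted Cohen's kappa between two assessment rounds (requires the 'irr' package).
# Ratings are coded on an ordered scale: 1 = low, 2 = unclear, 3 = high risk of bias.
library(irr)

round1 <- c(1, 3, 2, 3, 1, 3, 2, 1)   # made-up ratings from the first round
round2 <- c(1, 3, 3, 3, 1, 2, 2, 1)   # made-up ratings from the second round

# Squared weights penalise disagreements two categories apart more than those one apart.
kappa2(cbind(round1, round2), weight = "squared")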

Discussion

At present, PROBAST is the sole tool for assessing the risk of bias in artificial intelligence/machine-learned models. As mentioned above, AI models are frequently developed but often carry a high risk of bias, which therefore needs to be accurately assessed.

Our results suggest that, despite having significantly fewer questions (ten versus twenty-seven), AIr was as effective as PROBAST in assessing the risk of bias whilst improving the inter-rater reliability of the results. This suggests that our tool is effective in assessing the risk of bias and can be applied to further artificial intelligence/machine learning studies.

Limitations

This is an area in which there is only one widely used tool, which makes it the only possible comparison. One potential limitation of our analysis is that we compared our tool with another tool rather than directly assessing its validity. This could be an issue if there were errors within PROBAST, as these would be likely to bias our results. This is mitigated by the publications supporting PROBAST's validity, which make it a suitable comparator for our tool.

The AIr tool has only ten questions, which makes it shorter than the PROBAST tool but also easier to use. We concentrated only on papers in high-impact cardiology journals, which makes it difficult to predict how the tool will perform in fields outside cardiology and in papers outside these journals.

Conclusion

AIr is a tool for analysing the risk of bias in studies that use artificial intelligence or machine learning to predict an outcome or a complication. It is as effective as the current leading tool, PROBAST, has a greater degree of inter-rater reliability and is much easier to use. It is a promising means of measuring the risk of bias in artificial intelligence papers, although it needs further validation outside of cardiology.

Extended data

Repository name: AIr - Artificial Intelligence Risk of bias tool appendices. https://doi.org/10.5281/zenodo.17130713.19

The project contains the following underlying data:

  • - Appendix 1 – Table of included studies

  • - Appendix 2 - AIr - Tool 1 – for studies that either develop or externally validate an artificial intelligence/machine learning model

  • - Appendix 3 - AIr - Tool 2 – for studies that develop AND externally validate an artificial intelligence/machine-learned model

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
