Opinion Article

Perspectives and guidance for developing artificial intelligence-based applications for healthcare using medical images

[version 1; peer review: 2 not approved]
PUBLISHED 23 Aug 2024
This article is included in the AI in Medicine and Healthcare collection.

Abstract

Artificial intelligence (AI) has significant potential to transform healthcare and improve patient care. However, successful development and integration of AI models require careful consideration of study designs and sample size calculations for model development and validation, publishing standards, prototype development for translation, and collaboration with stakeholders. As the field is relatively new and rapidly evolving, there is a lack of guidance and agreement on best practices for most of these steps. We engaged stakeholders, including clinicians, researchers from academia and industry, and data scientists, to discuss various aspects of the translational pipeline, and identified the challenges researchers in the field face and potential solutions to them. In this viewpoint, we present a summary of our discussions as a brief guide to the process of developing AI-based applications for healthcare using medical images. We organized the process into six major themes (i.e., the gaps AI can fill in healthcare; development of AI models for healthcare: practical and important things to consider; good practices for validation of AI models for healthcare: study designs and sample size calculation; points to consider when publishing AI models; translation towards products; and challenges and potential solutions from a technical perspective) and present the most important points as rules of thumb. We conclude that successful integration of AI in healthcare requires a collaborative approach, rigorous validation, adherence to the best practices described and cited here, and attention to technical aspects.

Keywords

Medical imaging, Artificial intelligence, AI in healthcare, Guide, Introduction, Rules of thumb

Introduction

Artificial intelligence (AI) in healthcare is an interdisciplinary field that requires expertise across several disciplines, including clinical medicine, engineering, computer science, and statistics.1,2 As this field evolves, workshops and discussions serve an important purpose in educating early career researchers and all those new to AI. We, an interdisciplinary and international research collaboration (The University of Oxford, UK, and the Translational Health Science and Technology Institute (THSTI), India), organized three workshops on AI in healthcare attended by clinicians, physician-scientists, biologists, computer vision scientists, and engineering and medical students. We summarize here the key messages from the thought-provoking discussions held across the six a priori identified themes of the three workshops (Figure 1). We hope that this article serves both as an introductory guide and as a rules-of-thumb checklist for researchers (the workshops are publicly available on our YouTube channel).


Figure 1. Different steps in the translational pipeline discussed in this viewpoint.

The first workshop focused on introducing this interdisciplinary field to the participants, with discussions on specific use cases of AI in clinical practice. This was followed by the identification of problem areas (research questions) in maternal and child health where AI may be used to find sustainable solutions. In the second workshop, we discussed in detail the process of development, validation, and reporting strategies for AI models. In the third and final workshop, we discussed the challenges of integrating AI-enabled solutions into public health, along with potential solutions and strategies for addressing bias in data, generalizability, and data sharing. We also explored the difficult task of collecting and managing multidimensional clinical data and how to address the associated challenges.

Summary of the discussions

Theme 1 - The gaps AI can fill in healthcare

Different classes of AI models support different types of healthcare tasks. Understanding the use case is key, as it influences the study designs for development and validation and ensures that appropriate evidence is generated to enable clinical application.

  • Assistive models help physicians automate non-trivial but repetitive tasks. One important example is models that automatically identify the anatomy of interest in an ultrasound scan and measure relevant parameters such as fetal biometry.3,4 These models may reduce the time healthcare practitioners take to complete their tasks and enable consistent quality.

  • Diagnostic models are intended to automatically diagnose a disease. Examples include models that screen large numbers of chest X-rays for tuberculosis5 or retinal images for diabetic retinopathy.6 These facilitate preliminary risk stratification (screening) before a medical expert confirms the diagnosis.

  • Prognostic models are designed to predict a patient’s prognosis, e.g., cancer recurrence7 or 5-year survival. It is prudent to note that, unlike the previous two examples, these models predict future events, so a physician cannot identify and correct a model’s mistake at the time of prediction.

Theme 2 - Development of AI models for healthcare: practical and important things to consider

At the outset of the research process there is an element of context: the research question. We strongly recommend considering questions such as “What are we endeavoring to predict?”, “How important is the problem at the level of application?”, and “How would clinicians use the tool in their workflow?”. We believe all stakeholders, including regulators, clinicians, patients, and the investigators who will conduct the study, should be engaged from the beginning, right from the point of study design.

From the outset, we recommend planning for validation and choosing appropriate metrics for evaluating the model. Relevant examples include discussions about ground truth, study design for validation, and which metrics (for example, sensitivity or specificity) to optimize for a given clinical context. There are specific challenges in finding benchmarks against which these AI models can be compared. The most intuitive approach is to compare against a panel of clinical experts. However, some caveats should be kept in mind when considering this mode of comparison. When clinicians make a diagnosis, they have additional contextual or corroborating information about patients, such as their symptoms, investigations, and clinical history. In contrast, AI models are typically limited to the images, so a direct comparison of models with clinicians may not be appropriate. We recommend incorporating such contextual information when developing AI models.

Quality assurance of the input data, manual annotations (ground truth), and the deidentification process are other pertinent factors. Questions like “How do you measure intra- and inter-annotator agreement?”, “What should be considered the gold standard?”, and “How much error is tolerable?” must be discussed while designing the study. In addition, we suggest that both automated data collection pipelines and quality assurance checks be in place before data collection starts. Automation of both quality assurance and annotation is essential because, in large-scale studies involving several thousand images or videos, investigators cannot manually check every data point. All these considerations emphasize the importance of an interdisciplinary team and constant engagement with all multidisciplinary stakeholders from the start.
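
As an illustration of the annotation-agreement question above, here is a minimal sketch of computing Cohen’s kappa between two annotators using scikit-learn. The labels are hypothetical placeholders; in practice, metrics such as Fleiss’ kappa (for more than two raters) or Dice overlap (for segmentations) may be more appropriate.

```python
# Minimal sketch: inter-annotator agreement via Cohen's kappa.
# The labels below are hypothetical placeholders for illustration.
from sklearn.metrics import cohen_kappa_score

# Binary labels ("abnormal" = 1) assigned to the same 10 images by two annotators.
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```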

Theme 3 - Good practices for validation of AI models for healthcare: study designs and sample size calculation

It is well known that models perform optimistically when evaluated or tested in the population in which they were developed. The scientific community has therefore increasingly demanded external validation alongside development studies. A recent review of studies that developed AI models for diagnostic analysis of medical images found that only 6% of 516 studies had conducted external validation.8 Another study reported that there were relatively few prospective studies on medical imaging, and that past randomized clinical trials were at high risk of bias and deviated from existing reporting standards.9,10 Hence, it is essential to carefully consider validation methods and appropriate practices for external validation studies of AI models. In the literature, internal and external validation are sometimes confused. During internal validation, the model is evaluated for accuracy and robustness; external validation, in contrast, measures the model’s generalizability and clinical effectiveness. We reiterate that testing the model on data collected alongside the training data but kept separate from the training process is not external validation. True external validation tests the model on data collected at a site (or setting) completely different from the one at which the development data were collected. External validation should involve a different setting, population, geographical location, or time, chosen according to the context in which the model is meant to be used. Ideally, external validation would be conducted with data from more than one setting.

For models that predict a future outcome or event, the ideal study design for validation is a cohort study. A cross-sectional study is appropriate if the model is intended to diagnose a disease. We suggest that an interventional study, such as a randomized controlled trial, be considered only when the clinical effectiveness of a model is being evaluated. Validation on retrospective data is simpler and quicker, but investigators may not have had control over how the data were collected, so these validations carry a high risk of bias. The aim of prediction model development studies is to estimate the model coefficients robustly; in external validation studies the focus shifts to estimating model performance. Since the two steps address different but equally important aspects, validation studies should not be an afterthought. Critical to good validation study design is sample size estimation, and we recommend that investigators follow recently developed guidelines and best practices for sample size estimation for prediction model validation.11,12 A worked example of one such criterion is sketched below.
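
The following is a hedged sketch of just one criterion from the cited sample size guidance11,12: the precision of the observed/expected (O/E) events ratio for a binary outcome, assuming the standard error takes the form SE(ln O/E) ≈ sqrt((1 − φ)/(nφ)) under a true O/E near 1. The prevalence and confidence interval width are illustrative assumptions, and the full guidance includes further criteria (e.g., for the calibration slope and C-statistic) that should also be checked.

```python
# Hedged sketch: minimum validation sample size for precise estimation of
# the O/E ratio. Numbers are illustrative assumptions, not recommendations.
import math

phi = 0.10             # assumed outcome prevalence in the validation setting
oe_lo, oe_hi = 0.8, 1.2  # desired 95% CI for O/E, assuming true O/E ~ 1

# SE(ln O/E) needed so the 95% CI spans the desired interval.
target_se = (math.log(oe_hi) - math.log(oe_lo)) / (2 * 1.96)

# Assumed form: SE(ln O/E) ~ sqrt((1 - phi) / (n * phi)); solve for n.
n = (1 - phi) / (phi * target_se ** 2)
print(f"Minimum n for the O/E criterion: {math.ceil(n)}")
```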

Model performance should be measured on external validation data sets in terms of discrimination, calibration, and clinical utility. Discrimination is the ability of the model to distinguish between the groups and, for classification models, should be measured as the area under the receiver operating characteristic curve (AUROC). Calibration measures the alignment between predicted probabilities and the actual frequency of events, ensuring that the model’s confidence scores are reliable; the calibration slope and intercept should be reported to assess model bias. Finally, to assess clinical utility, a decision curve analysis is recommended.13
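
The sketch below illustrates these three measures for a binary-outcome classification model; `y_prob` and `y_true` are synthetic placeholders standing in for predicted risks and observed outcomes from a hypothetical external validation set.

```python
# Minimal sketch: discrimination (AUROC), calibration (slope and intercept),
# and clinical utility (net benefit, as used in decision curve analysis).
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.05, 0.95, 500)  # placeholder predicted probabilities
y_true = rng.binomial(1, y_prob)       # placeholder observed outcomes

# Discrimination: AUROC.
auroc = roc_auc_score(y_true, y_prob)

# Calibration slope: logistic regression of outcome on logit(predicted risk).
logit_p = np.log(y_prob / (1 - y_prob))
slope = sm.Logit(y_true, sm.add_constant(logit_p)).fit(disp=0).params[1]

# Calibration intercept (calibration-in-the-large): intercept-only model
# with logit(predicted risk) as an offset.
glm = sm.GLM(y_true, np.ones((len(y_true), 1)),
             family=sm.families.Binomial(), offset=logit_p).fit()
intercept = glm.params[0]

# Clinical utility: net benefit at a chosen decision threshold t.
def net_benefit(y, p, t):
    treat = p >= t
    tp = np.sum(treat & (y == 1)) / len(y)
    fp = np.sum(treat & (y == 0)) / len(y)
    return tp - fp * t / (1 - t)

print(f"AUROC={auroc:.3f}, slope={slope:.2f}, intercept={intercept:.2f}, "
      f"net benefit@0.2={net_benefit(y_true, y_prob, 0.2):.3f}")
```

Repeating the net benefit calculation across a range of thresholds, alongside the “treat all” and “treat none” strategies, yields the decision curve.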

Theme 4 - Points to consider when publishing AI models

This theme identifies key considerations when writing a manuscript for peer review and publication. A recent systematic review of the diagnostic accuracy of deep learning in medical imaging found high heterogeneity between studies, owing to varying methods, terminology, and outcome measures, indicating a need for reporting and methodological guidelines addressing significant issues in this field. Reporting guidelines are not new in medical and epidemiological research (see the EQUATOR network). The domain of AI in medical imaging is multidisciplinary, which mandates reporting in a way that is easily comprehensible to the end users of these models. While a broad consensus on such reporting is still evolving, publishing guidelines and checklists such as CLAIM14 and TRIPOD15 already exist. Publication of the model is an important intermediary step between its development and validation. Authors should consider sharing all essential information required for reproducibility and transparent implementation of the model by other researchers. Reproducibility is greatly facilitated by releasing the model and implementation code. We recognize that this is not always possible in healthcare AI, especially for stakeholders in industry and where data governance restrictions forbid release of the data used to build a model. In such situations, releasing a web-based implementation of the model, which keeps intellectual property secure, is a valuable alternative wherever possible. We also recognize that grand challenges (open competitions with data and code) play an important role in model benchmarking, although they typically capture real-world scenarios only in a limited way and can encourage “data engineering gaming” to incrementally beat prior work. Overall, researchers must take responsibility for reporting their work in a scientifically rigorous way, with clear descriptions of the data set, eligibility criteria, the context in which the model is to be used, appropriate metrics, and an implementation plan to facilitate the next steps in the translation pipeline. Peer review also needs to become more rigorous to ensure that AI model reporting standards are elevated.

Theme 5 - Translation towards products

The final step in translating any AI model is producing a prototype that can be used by healthcare workers. Academic publication plays a crucial role in technology translation, and how authors present the information is of paramount importance. Authors should maintain a balance between sharing information and retaining the details that enable commercialization of their technology. We emphasize that authors should clearly define the technology’s performance level and report its limitations so that end users can utilize it effectively. A major obstacle to adoption is that the benefits of AI are not clear to clinicians.16 Rather than simply handing stakeholders a new AI-based technology, it is advisable to train them in its use in real-life situations, which also allows them to appreciate the improvement the technology brings to their daily workflow. There is an urgent need for low-cost (or zero extra cost) AI solutions requiring minimal clinical expertise from healthcare workers. In summary, we recommend four stages that any AI prototype should undergo before being accepted for clinical use: peer-reviewed publication of the model, external validation, regulatory approval, and recommendation by professional societies. A clear plan for the deployment scenario while developing the product will facilitate translation.

Theme 6 - Challenges and potential solutions from a technical perspective

To date, the black-box nature of AI models has led to slow adoption of this technology in medicine, where the cost of a mistake is high. Explainable AI (also called interpretable AI) has been suggested as a way to generate trust among healthcare professionals, bring transparency into AI decision-making, and mitigate biases. Several methods currently exist to probe the explainability of AI models. A heat map (also known as a saliency map) is a popular method that uses activations of convolutional layers to show the extent to which each region of the image contributed to the model’s decision. Heat maps are illustrative and easy to understand, but they do not always correspond to human intuition about what is important in decision-making. Class activation mapping (CAM)17 and its extension Grad-CAM18 are among the popular methods used to generate these explanations. Besides heat maps, local interpretable model-agnostic explanations (LIME)19 and Shapley values (SHAP)20 seek to explain decisions at the individual level by altering the input example and identifying the alterations that contributed to the decision; in image analysis, this is done by occluding parts of the image (see the sketch below). While the above approaches aid understanding from a clinical perspective, other approaches, such as feature visualization, are used by machine learning engineers. Feature visualization produces synthetic inputs that strongly activate specific parts of a model, so that each decision can be described as a combination of features detected in the input. Nevertheless, all these methods are only approximations and do not provide explanations in terms of medical findings, which makes the interpretation of AI models rather subjective. This area of research is still in its nascent stage, and more work is needed to develop truly explainable AI.21 In the absence of suitable explainability methods, we advocate rigorous internal and external validation of AI models as a more direct means of achieving these goals.
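
As a concrete illustration of the occlusion idea mentioned above, here is a minimal sketch that slides a masking patch over an image and records the drop in the model’s predicted probability; the `toy_model` is a hypothetical stand-in for a real classifier.

```python
# Minimal sketch: occlusion-based explanation for an image classifier.
# `model` is any callable returning one probability per image in a batch.
import numpy as np

def occlusion_map(model, image, patch=16, stride=8, fill=0.0):
    """Slide a patch over the image; the drop in predicted probability
    when a region is occluded indicates that region's importance."""
    h, w = image.shape[:2]
    baseline = model(image[None])[0]
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            heat[i, j] = baseline - model(occluded[None])[0]
    return heat  # larger values = region mattered more to the prediction

# Toy stand-in model: "probability" is the mean intensity of the image centre.
toy_model = lambda batch: np.array([img[24:40, 24:40].mean() for img in batch])
img = np.random.rand(64, 64)
print(occlusion_map(toy_model, img, patch=16, stride=16))
```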

Generalizability is the ability of a model to give accurate predictions on an external data set collected separately from the training data set. Despite a large amount of published work on AI applications in medicine, only a relatively small (though growing) number of these models are implemented in clinical practice, primarily due to a lack of generalizability.22,23 Site-to-site variation in image acquisition, different standards of implementation in clinical practice, differences in patient demographics across centers, genotypic and phenotypic characteristics of patients, and the tools and methods used to process medical data are just some of the factors that may affect generalizability. Another significant factor is training the model on biased data. Some of these factors can be addressed using recent methodological advances such as domain adaptation. In domain adaptation, the goal is to adapt models to target data sets using labeled samples from the source environment together with a limited set of unlabeled samples from the target environment; a simple example is sketched below.
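
As one possible example of domain adaptation (our choice for illustration, not a method prescribed by the cited literature), the following sketch implements CORAL (correlation alignment), which re-colours source-site features so their covariance matches that of unlabeled target-site features; a downstream classifier is then trained on the adapted features. The feature arrays are hypothetical placeholders.

```python
# Hedged sketch: CORAL (correlation alignment) domain adaptation.
import numpy as np
from scipy import linalg

def coral(source, target, eps=1e-5):
    """Align source feature covariance to target; both are (n, d) arrays."""
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])
    # Whiten the source features, then re-colour with the target covariance.
    whitened = source @ linalg.inv(linalg.sqrtm(cs)).real
    return whitened @ linalg.sqrtm(ct).real

# Hypothetical features, e.g. from a CNN's penultimate layer at two sites.
rng = np.random.default_rng(1)
src = rng.normal(0, 1.0, (200, 8))   # labeled source-site features
tgt = rng.normal(0, 2.5, (150, 8))   # unlabeled target-site features
src_adapted = coral(src, tgt)        # train the downstream classifier on this
```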

Probably the best solution is to collect training data from different centers so that it incorporates all the real-world variations. However, collecting a huge medical data set from multiple sites is time-consuming and expensive, and poses data sharing challenges. Federated learning,24 an emerging methodological advance, may help researchers overcome the data sharing challenge by training AI models while the data remain at the individual sites. In brief, separate copies of the same model architecture are trained at each site using that site’s data, and these partial models are then combined into a global model, as sketched below.
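
The following is a minimal sketch of the federated-averaging idea described above; the per-site “weights” are plain arrays standing in for real network parameters, and the size-weighted averaging follows the common FedAvg scheme.

```python
# Minimal sketch: federated averaging. Only model weights leave each site;
# the underlying patient data never do.
import numpy as np

def federated_average(site_weights, site_sizes):
    """Combine per-site model weights into a global model,
    weighted by the number of samples at each site."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Hypothetical weight vectors from three hospitals after local training.
site_weights = [np.array([0.9, 1.1]), np.array([1.0, 1.3]), np.array([1.2, 0.8])]
site_sizes = [1200, 400, 800]
global_weights = federated_average(site_weights, site_sizes)
print(global_weights)  # sent back to the sites for the next training round
```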

Conclusion

In this article, we present the discussions held during the three workshops as take-home messages. The shared insights offer a foundational guide for researchers aiming to embark on their journey in this rapidly advancing and transformative field. In summary, to ensure successful integration of AI models in clinical practice, researchers must engage with all stakeholders, including clinicians, regulators, and patients, from the outset and use robust study designs for validation. The crucial steps are validating AI models on external data sets with sample sizes adequate for estimating robust performance metrics. When publishing these models, transparency, reproducibility, and adherence to reporting guidelines are strongly emphasized. Training clinicians on AI technologies is vital so that they understand the benefits these models bring to their clinical workflows, enabling effective adoption. From a technical standpoint, explainability and generalizability are major challenges: overcoming variations in data collection and ensuring models’ applicability across different settings are essential for generalizability.

Contributions

BKD, RT, NW, AK, AP, AN & SB designed and conceptualised the scientific content of the workshops. BKD, RT, AN & SB wrote the viewpoint.

Ethics and consent

Ethical approval and consent were not required.
