Research Article

Evaluation of accuracy and potential harm of ChatGPT in medical nutrition therapy - a case-based approach

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 22 Feb 2024

This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Datta Meghe Institute of Higher Education and Research collection.

Abstract

Background

ChatGPT is a conversational large language model (LLM) based on artificial intelligence (AI). LLMs may be applied in health care education, research, and practice if relevant valid concerns are proactively addressed. The current study aimed to investigate ChatGPT’s ability to generate accurate and comprehensive responses to nutritional queries created by nutritionists/dieticians.

Methods

An in-depth case study approach was used to accomplish the research objectives. Functional testing was performed by creating test cases based on the functional requirements of the software application. ChatGPT responses to scenarios requiring medical nutrition therapy, created with varied complexity, were recorded and analyzed. The accuracy of each generated response was evaluated by a registered nutritionist, and a potential harm score was then assigned to each ChatGPT response.

Results

When eight case scenarios of varied complexity were evaluated, the risk potential was found to increase as the complexity of the scenario increased. Although the accuracy of the generated responses did not change much with the complexity of the case scenarios, the study suggests that ChatGPT should be avoided for generating responses to complex medical nutrition conditions or scenarios.

Conclusions

An initiative that engages all stakeholders involved in healthcare education, research, and practice is urgently needed to set up guidelines for the responsible use of ChatGPT by healthcare educators, researchers, and practitioners. The findings of the study are useful for healthcare professionals and health technology regulators.

Keywords

Medical Nutrition Therapy, Generative AI, Large Language Models, ChatGPT

Introduction

Noncommunicable diseases (NCDs), also called chronic diseases, are long-lasting and occur because of a combination of factors including genetics, physiology, environment, and behavior.1 The major categories of NCDs include cardiovascular diseases, which cause 17.9 million deaths every year across the globe; cancers, which cause 9 million deaths annually; chronic respiratory diseases, which result in 3.9 million deaths each year; and diabetes, which causes 1.6 million deaths per year.1

The rising incidence of chronic illnesses is having a significant financial impact on healthcare systems worldwide, and it has attracted the interest and attention of policymakers and researchers at all levels of government.2 Typically, the methods employed to manage chronic illnesses are multifaceted, and they revolve around dietary or nutritional interventions, consistent physical exercise, and lifestyle adjustments at their core.3

Studies have demonstrated that low-glycemic index (GI) and low-carbohydrate diets are successful in treating type 2 diabetes, and there has been extensive research into the use of unsaturated fatty acids, vitamins, and bioactive compounds in the management of chronic diseases. Although multidimensional approaches are crucial in managing these chronic illnesses, dietary interventions are of paramount importance and occupy a significant role in these strategies.2

A chatbot powered by artificial intelligence (AI), ChatGPT (Chat Generative Pre-Trained Transformer), was launched by OpenAI in November 2022. With both supervised and reinforcement learning techniques, it is built on top of OpenAI’s GPT-3.5 and GPT-4 large language models (LLMs).4 By using a two-stage training process, large language models learn from data more efficiently than traditional deep learning models, as they begin self-supervised learning on huge amounts of unannotated data, then fine-tune their performance on smaller, task-specific, annotated datasets based on user specifications.5

The original ChatGPT release was based on GPT-3.5 as the foundation, an LLM with over 175 billion parameters.6 The newest OpenAI model, GPT-4, was released on March 14, 2023. It is important to note that ChatGPT’s training data is derived from a wide range of online sources, including books, articles, and websites. Utilizing reinforcement learning from human feedback in conversational tasks,7 ChatGPT can consider the complexity of users’ intentions to respond effectively to a variety of end-user tasks, such as medical queries.

A growing amount of medical data and the complexity of clinical decision-making could theoretically benefit clinicians through NLP tools, allowing doctors to make timely, informed decisions. In addition, technological advancements have democratized knowledge, enabling patients to access medical information without relying solely on healthcare professionals. Instead, they are increasingly using search engines, and now artificial intelligence chatbots, to find medical information.8

By engaging in conversational interactions, ChatGPT and other recent chatbots provide authoritative-sounding responses to complicated medical queries. Even though ChatGPT is a promising technology, it often produces inaccurate results, meaning caution is warranted when applying it to medical practice and research.9-13 These engines have not been evaluated for accuracy and reliability, especially in terms of open-ended medical questions that doctors and patients might ask.10-12

Our study aims to assess ChatGPT’s ability to generate accurate and comprehensive responses to nutritional queries created by nutritionists/dieticians. In addition, this will provide an early indication of ChatGPT’s reliability as a provider of accurate and complete information. Furthermore, this study will highlight limitations and propose an approach for addressing them.

Methods

Ethical considerations

All participants gave written informed consent. Ethical approval was not required as the study had low risk to participants.

Study design

The study uses a case study approach to achieve the research objectives stated in the earlier section. This approach provides rich and detailed data that can be used to gain a deep understanding of a particular case, and it allows for the exploration of complex phenomena that cannot be easily studied through other research methods.14 Although there are limitations to the case study method, it is one of the most useful tools for the exploratory study of abstract and evolving phenomena. The type of case study used here is the illustrative case study.15 The approach is borrowed from functional testing and quality assurance practices in software development. Functional testing involves creating test cases based on the functional requirements of the software application; these test cases are designed to evaluate whether the software performs as expected. Functional testing is typically performed using black-box testing techniques, meaning the tester does not have access to the source code of the software application. In this case, ChatGPT acts as a black box for the researchers involved in this study.

To evaluate the performance of ChatGPT in medical nutrition therapy, a well-defined study protocol was used. The steps followed in the study were as follows:

Step 1: Creation of questions (scenarios) of varied complexity by public health professionals. The questions were selected by a licensed medical nutrition therapist working in the UAE. The selected scenarios ranged from a simple diet consultation to a patient with comorbid conditions.

Step 2: ChatGPT’s response to each scenario was recorded for further analysis.

Step 3: The responses from Step 2 were evaluated by a registered nutritionist for accuracy.

Step 4: Based on the accuracy, a potential-of-harm score for each response was created.

Step 5: Data was summarized and analyzed by the expert group used in Step 1.
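The five-step protocol above can be sketched as a simple data pipeline. This is an illustrative reconstruction only; the class, field, and function names below are the author of this sketch's assumptions, not part of the study's materials:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CaseScenario:
    """One test case in the study protocol (field names are illustrative)."""
    description: str                # Step 1: scenario created by the expert group
    complexity: int                 # 1-10, assigned by the expert group
    response: str = ""              # Step 2: ChatGPT output, recorded verbatim
    accuracy: Optional[int] = None  # Step 3: 1-10, rated by the registered nutritionist
    harm: Optional[int] = None      # Step 4: 1-10, Delphi consensus potential-of-harm score

def summarize(cases):
    """Step 5: tabulate the three scores for the expert group's analysis."""
    return [(c.complexity, c.accuracy, c.harm) for c in cases]

case1 = CaseScenario("35-year-old female to reduce 10 kg in a month",
                     complexity=2, response="...", accuracy=8, harm=1)
print(summarize([case1]))  # [(2, 8, 1)]
```

Each scenario thus carries its three scores through the pipeline, matching the rows later reported in Table 1.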

Sample

The expert group for deciding complexity comprised five public health professionals working in the United Arab Emirates. The experts were selected from Gulf Medical University, UAE, using nonrandom purposive sampling. The inclusion criteria for the experts were a master’s degree and more than five years of clinical experience. The researchers approached seven healthcare professionals, of whom five agreed to be part of the expert group. The researchers aimed to recruit five to nine experts, as a larger group is difficult to handle and a smaller one may result in bias.

For accuracy, one registered nutritionist’s rating was taken in Step 3. The nutritionist gave a score on an ordinal scale of one to ten, with one being least and ten being most accurate.

To ascertain the potential of harm in Step 4, all five experts described earlier worked together.

The method utilized for reaching consensus was the Delphi method, depicted in Figure 1.16 Using the steps mentioned above and the data provided in the supporting material, the reproducibility of the research can be established. Again, a scale of one to ten was used to ascertain the potential of harm, with one being lowest and ten being highest.


Figure 1. Approach for Delphi method used in the study.

Source: Author’s Compilation.

The Delphi method is a structured communication technique originally developed as a systematic, interactive forecasting method that relies on a panel of experts. The experts answered questionnaires in three rounds. After each round, a researcher (VM) provided an anonymous summary of the experts’ judgments from the previous round, together with the reasons they gave for those judgments. The experts were thus encouraged to revise their earlier answers in light of the replies of the other panel members. It was observed that the range of the answers decreased and the group converged. The process was stopped after a predefined stopping criterion, and the median scores of the final round determined the results (Figure 1).
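As a toy illustration of the convergence behaviour described above, the rounds can be simulated with expert scores revised toward the anonymous group median. The `pull` revision weight and fixed round count are assumptions for the sketch, not figures from the study:

```python
from statistics import median

def delphi(initial_scores, rounds=3, pull=0.5):
    """Toy simulation of Delphi consensus: each round, every expert sees
    the anonymous group median and revises partway toward it; the group
    median after the final round is taken as the consensus result."""
    scores = list(initial_scores)
    for _ in range(rounds):
        group_median = median(scores)
        scores = [s + pull * (group_median - s) for s in scores]
    # Return the consensus and the remaining spread of opinions.
    return median(scores), max(scores) - min(scores)

consensus, spread = delphi([3, 5, 8, 6, 4])
print(consensus, spread)  # the spread shrinks well below the initial range of 5
```

The shrinking spread mirrors the observation in the study that the range of answers decreased and the group converged.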

The conceptual definitions of the terms used in the study are as follows:

Clinical Accuracy: “A clinical accuracy is a qualitative approach that describes the clinical outcome of basing a treatment decision on the result of a measurement method being evaluated”.17

Complexity of the Clinical Problem: “Clinical complexity is a protean term encompassing multiple levels and domains. Illustratively, a prominent concern in health care involves a multiplicity of disorders and conditions experienced by a person along with their cross-sectional and longitudinal contexts”.18

Potential for Harm: “Harm means an injury to the rights, safety or welfare of a research participant that may include physical, psychological, social, financial or economic factors”.19

Results & discussion

This section discusses the results obtained from the illustrative case study method described in the earlier section.

Case 1: 35-year-old female to reduce 10 kgs in a month

The question is simple, with age, gender, and a weight-loss goal provided. The response emphasizes the importance of sustainable weight loss and the potential risks of rapid weight loss. The provided diet chart is low-carb and high in fiber and protein, suitable for the given condition. There is negligible risk for a user following this diet unless they have comorbid conditions. However, the statement that a diet chart need not contain caloric information is not true, as such information serves as a guideline for achieving the caloric deficit that aids weight loss. In terms of the evaluation criteria, the response receives a complexity score of 2, an accuracy score of 8, and a potential risk score of 1. This result suggests that when complexity is lower, the accuracy of the information is higher and the potential risk is lower.

Case 2: 35-year-old female with BMI 34 to reduce weight

The question is slightly more complex than the previous one, with the addition of BMI information. For a person with a BMI of 34, a calorie-deficit diet is required for weight loss. The diet, however, does not specify the amount of oil to be consumed, which can significantly increase the calorie count. The diet is not specific and portions are assumed, which may result in a diet of around 1400-1500 calories, which may not be enough to achieve the target weight loss. A layperson following this guide may not achieve their weight-loss target, as the diet provided is not guided. In terms of the evaluation criteria, the response receives a complexity score of 3, an accuracy score of 7, and a potential risk score of 3. This result again supports the finding of Case 1: an increase in the complexity score reduces the accuracy while increasing the potential to harm.

Case 3: 35-year-old female with BMI 34 also having PCOS to reduce weight

The complexity of the question increases with the addition of the condition of PCOS. The diet provided is similar to that of the previous question, with the addition of extra guidelines for PCOS, which are general information. However, the diet is not specific to the condition, and a user following it may not achieve their weight-loss target, though they are not at potential risk of harm. In terms of the evaluation criteria, the response receives a complexity score of 4, an accuracy score of 6, and a potential risk score of 4. The result of this case also concurs with the hypothesis that greater complexity of the question results in lower accuracy and a higher risk of harm.

Case 4: 40-year-old male with diabetes

The question is complex due to the mention of diabetes, which requires consideration of many factors before preparing a diet chart. A simple statement of diabetes does not provide enough information; the patient should be asked about the type of diabetes, medications, and recent blood reports. Calories, BMI, and current physical activity are critical considerations for a diabetic diet. The patient is at risk of developing hypoglycemia if they are on insulin and have a low BMI or high activity levels. A dietitian would consider all these factors while preparing a plan for a diabetic patient. In terms of the evaluation criteria, the response receives a complexity score of 4, an accuracy score of 5, and a potential risk score of 6. At the same complexity score of four, the response for this older patient has lower accuracy and a higher potential for harm. Does age contribute to the potential to harm? This question needs to be tested further empirically.

Case 5: 40-year-old male with diabetes and CKD

The complexity of the question increases with the addition of chronic kidney disease (CKD), which requires consideration of several factors while preparing a diet chart, such as the stage of CKD and the current levels of potassium and sodium in the blood. The response receives a low accuracy score of 4, as the diet generated does not mention limiting sodium intake to at most 1.5 g/day, which is essential for CKD patients. Additionally, the diet contains high sources of protein, 75-80 g, which is much higher than what is recommended for a CKD patient and is not calculated per patient weight and CKD stage. As a result, a layperson following this diet may be at high risk, reflected in a potential risk score of 8. In terms of the evaluation criteria, the response receives a complexity score of 5, an accuracy score of 4, and a potential risk score of 8. Increasing the complexity of the query from 4 to 5 increases the potential risk from 6 to 8. This suggests that, after a point, the potential harm increases exponentially with increasing complexity. This phenomenon needs to be tested further empirically.

Case 6: 40-year-old male with diabetes, hypertension, and CKD

The complexity of the question increases with the addition of hypertension as a comorbidity. However, the diet chart provided is the same as in the previous question, which does not pose much risk for diabetes and hypertension but poses all the risks previously mentioned for CKD. Therefore, the response receives a low accuracy score of 4. Additionally, patients need to be educated about sugar and salt sources, and general guidelines are not enough; measurements should be incorporated into the diet plan itself to avoid potential risks. As a result, the response receives a high potential risk score of 8. In terms of the evaluation criteria, the response receives a complexity score of 6, an accuracy score of 4, and a potential risk score of 8. This finding does not concur with that of Case 5: the increase in potential harm was not exponential, as the potential risk remained the same despite the increase in complexity.

Case 7: 40-year-old male with diabetes, hypertension, and CKD Indian with a gluten allergy

The complexity of the question increases with the addition of gluten sensitivity. As a result, a proper dietitian is required to prepare a diet plan that takes into consideration the patient’s multiple comorbidities and dietary restrictions. However, the response is lacking in accuracy, as it does not provide any specific information on how to prepare a diet plan for a person with these conditions; therefore, the accuracy score is low at 4. Additionally, without a specific diet plan, there is potential risk for a patient with so many comorbidities and dietary restrictions. As a result, the response receives a high potential risk score of 8. In terms of the evaluation criteria, the response receives a complexity score of 7, an accuracy score of 4, and a potential risk score of 8. The findings from this case concur with those from the earlier cases: the increase in complexity is inversely proportional to accuracy and directly proportional to the potential risk.

Case 8: 30-year-old female height 150 cm weight 80 kg, having PCOS, hypothyroidism, insulin resistance with gluten sensitivity, HbA1c 6% for weight loss

This question is extraordinarily complex, with multiple parameters given, including the condition of hypothyroidism. The ideal diet for this patient is a low-carb, high-protein, anti-inflammatory diet, with the need to avoid goitrogenic foods like soy products. However, the accuracy of the given information is relatively low, and there is still a potential risk for the patient if not properly guided by a qualified dietitian. The overall scores are as follows: Complexity - 8, Accuracy - 3, Potential for Risk - 6. This case has the highest complexity and hence the minimum accuracy. The risk score was expected to be the highest, but that is not the case. This finding does not concur with the findings from the earlier seven cases.

The summary of the analysis of eight cases is listed in Table 1.

Table 1. Summary of the illustrative case study analysis.

Case number | Complexity | Accuracy | Potential to harm
Case 1 | 2 | 8 | 1
Case 2 | 3 | 7 | 3
Case 3 | 4 | 6 | 4
Case 4 | 4 | 5 | 6
Case 5 | 5 | 4 | 8
Case 6 | 6 | 4 | 8
Case 7 | 7 | 4 | 8
Case 8 | 8 | 3 | 6
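As a quick check a reader can reproduce (this calculation is not part of the study's own analysis), the raw scores in Table 1 can be run through a plain Pearson correlation:

```python
# Scores transcribed from Table 1 (eight cases).
complexity = [2, 3, 4, 4, 5, 6, 7, 8]
accuracy   = [8, 7, 6, 5, 4, 4, 4, 3]
harm       = [1, 3, 4, 6, 8, 8, 8, 6]

def pearson(x, y):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

print(round(pearson(complexity, harm), 2))      # complexity vs potential to harm, ≈ 0.77
print(round(pearson(complexity, accuracy), 2))  # complexity vs accuracy, ≈ -0.93
```

With only eight cases this is descriptive, not inferential, but it quantifies the associations discussed case by case above.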

As depicted in Figure 2, as the complexity of the scenario increases, the risk potential also increases. This suggests that ChatGPT should be avoided for complex medical conditions/scenarios. The researchers believe that accuracy does not change much with an increase in complexity, and this needs to be evaluated further empirically.


Figure 2. Summary of the case analysis.

The findings of the study are supported by Johnson et al. (2023), who observed that ChatGPT can produce accurate information for diverse medical queries, as judged by academic physician specialists, although with important limitations; further research and model development are needed to correct inaccuracies and for validation.20 Another group of researchers found that ChatGPT provides medical information of comparable quality to available static internet information.21 Another recent study urges a cautious approach to the use of ChatGPT in clinical practice, lamenting that it does not provide references for its information and is therefore not reliable for clinical use. Thus, the findings of this study also suggest the cautious use of ChatGPT in medical nutrition therapy, as irresponsible use has the potential to harm the user.

This study, which assesses the accuracy and potential risks of nutrition therapy information provided by ChatGPT as evaluated by nutritionists and a group of experts, has several limitations that must be considered when interpreting its results. ChatGPT’s responses are based on the information available up to its last training data, which might not include the latest research or updated guidelines in nutrition therapy; this time lag can introduce a bias towards outdated practices or miss new evidence-based approaches. The accuracy and risk assessments made by the nutritionists and experts are subjective and can vary based on their individual experiences, knowledge, and biases, which can introduce both direction and magnitude biases in the evaluation process. The experts and nutritionists might also have preconceived notions about the reliability of AI-generated information, which could influence their assessment of ChatGPT’s responses, either positively or negatively. Finally, the range and type of nutrition therapy questions asked may not comprehensively cover the vast field of nutrition. Thus, the study’s findings might not be generalizable across all areas of nutrition therapy.

Conclusion

The primary objective of the present study was to assess the accuracy and comprehensiveness of ChatGPT’s responses to nutritional queries generated by nutritionists/dieticians. To achieve this, an in-depth case study approach was employed. Functional testing was conducted by creating test cases that aligned with the functional requirements of the software application. ChatGPT’s responses were evaluated and analyzed in different scenarios that involved medical nutritional therapy, varying in complexity. The accuracy of the generated data was assessed by a registered nutritionist, and a potential harm score was used to evaluate the responses provided by ChatGPT.

When several case scenarios with varying levels of complexity were evaluated for their risk potential, it was demonstrated that as the complexity increased, so did the potential risk. The study suggests that ChatGPT should not be used for complex medical nutrition situations and conditions, even though the accuracy of the generated response does not change much with the complexity of the case scenario.

The study’s findings have important clinical implications for practitioners, particularly nutritionists and dieticians, who may use ChatGPT or similar AI-powered tools in their practice. Practitioners should exercise caution and avoid relying solely on ChatGPT for complex cases that require specialized knowledge and expertise.

The study’s findings underscore the importance of using ChatGPT or similar AI-powered tools appropriately in clinical practice. It should not be used as a replacement for professional judgment or clinical decision-making, particularly in complex medical nutrition situations. Practitioners, especially nutritionists and dietitians, should consider ChatGPT as a complementary tool to support their clinical practice, and not solely rely on it for making critical nutrition-related decisions. This study emphasizes the importance of human verification and not solely relying on AI-generated information.

The findings of the study have important implications for policymakers. One key recommendation is to exercise caution when implementing generative AI, such as ChatGPT, in clinical practice. Rushing to adopt such tools without thorough evaluation and validation may not be advisable. While generative AI has the potential to improve efficiency in healthcare operations, it should be considered as a decision support system for registered practitioners, rather than a standalone tool for making clinical decisions.

It is important to note that patients should not rely solely on generative AI for self-medication or medical nutrition therapy, especially in situations where multiple health conditions (comorbidities) are involved. This is because generative AI tools like ChatGPT may not have the ability to fully assess and address the complexities of comorbid conditions, which could potentially result in harm to patients.22

In conclusion, a collaborative effort involving all stakeholders in healthcare education, research, and practice is urgently needed to establish guidelines for the responsible use of ChatGPT by educators, researchers, and practitioners.

Limitations of the study

The study used a small sample size, which could affect the accuracy of the results. Another limitation is the dynamic nature of the technology: since it is constantly evolving and improving, the results of the study may need to be reevaluated after a few days or weeks to account for any changes or updates. Additionally, the study’s reliance on only one nutritionist to assess accuracy introduces the possibility of bias and human error.

How to cite this article: Mishra V, Jafri F, Abdul Kareem N et al. Evaluation of accuracy and potential harm of ChatGPT in medical nutrition therapy - a case-based approach [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2024, 13:137 (https://doi.org/10.12688/f1000research.142428.1)

Open Peer Review

Podszun MC (University of Hohenheim, Stuttgart, Germany). Reviewer Report, 06 May 2024: Not Approved (https://doi.org/10.5256/f1000research.155982.r258226)

Kirk D (Wageningen University & Research, Wageningen, The Netherlands; Department of Twin Research & Genetic Epidemiology, King’s College London, UK). Reviewer Report, 25 Apr 2024: Approved with Reservations (https://doi.org/10.5256/f1000research.155982.r265582)
