Research Article

Evaluation of accuracy and potential harm of ChatGPT in medical nutrition therapy - a case-based approach

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 22 Feb 2024

This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Datta Meghe Institute of Higher Education and Research collection.

Abstract

Background

ChatGPT is a conversational large language model (LLM) based on artificial intelligence (AI). LLMs may be applied in health care education, research, and practice if relevant valid concerns are proactively addressed. The current study aimed to investigate ChatGPT’s ability to generate accurate and comprehensive responses to nutritional queries created by nutritionists/dieticians.

Methods

An in-depth case study approach was used to accomplish the research objectives. Functional testing was performed by creating test cases based on the functional requirements of the software application. ChatGPT responses to scenarios requiring medical nutrition therapy, created with varied complexity, were recorded and analyzed. The accuracy of each generated response was evaluated by a registered nutritionist, and a potential harm score was then assigned to each ChatGPT response.

Results

When eight case scenarios of varied complexity were evaluated, the risk potential was found to increase as the complexity of the scenario increased. Although the accuracy of the generated responses did not change much with the complexity of the case scenarios, the study suggests that ChatGPT should be avoided for generating responses to complex medical nutrition conditions or scenarios.

Conclusions

An initiative that engages all stakeholders involved in healthcare education, research, and practice is urgently needed to set up guidelines for the responsible use of ChatGPT by healthcare educators, researchers, and practitioners. The findings of the study are useful for healthcare professionals and health technology regulators.

Keywords

Medical Nutrition Therapy, Generative AI, Large Language Models, ChatGPT

Introduction

Noncommunicable diseases (NCDs), also called chronic diseases, are long-lasting and occur because of a combination of factors including genetics, physiology, environment, and behavior.1 The major categories of NCDs include cardiovascular diseases, which cause 17.9 million deaths every year across the globe; cancers, which cause 9 million deaths annually; chronic respiratory diseases, which result in 3.9 million deaths each year; and diabetes, which causes 1.6 million deaths per year.1

The rising incidence of chronic illnesses is having a significant financial impact on healthcare systems worldwide, and it has attracted the interest and attention of policymakers and researchers at all levels of government.2 Typically, the methods employed to manage chronic illnesses are multifaceted, and they revolve around dietary or nutritional interventions, consistent physical exercise, and lifestyle adjustments at their core.3

Studies have demonstrated that low-glycemic index (GI) and low-carbohydrate diets are successful in treating type 2 diabetes, and there has been extensive research into the use of unsaturated fatty acids, vitamins, and bioactive compounds in the management of chronic diseases. Although multidimensional approaches are crucial in managing these chronic illnesses, dietary interventions are of paramount importance and occupy a significant role in these strategies.2

A chatbot powered by artificial intelligence (AI), ChatGPT (Chat Generative Pre-Trained Transformer), was launched by OpenAI in November 2022. With both supervised and reinforcement learning techniques, it is built on top of OpenAI’s GPT-3.5 and GPT-4 large language models (LLMs).4 By using a two-stage training process, large language models learn from data more efficiently than traditional deep learning models, as they begin self-supervised learning on huge amounts of unannotated data, then fine-tune their performance on smaller, task-specific, annotated datasets based on user specifications.5

The original ChatGPT release was based on GPT-3.5 as the foundation, an LLM with over 175 billion parameters.6 The newest OpenAI model, GPT-4, was released on March 14, 2023. It is important to note that ChatGPT’s training data is derived from a wide range of online sources, including books, articles, and websites. Utilizing reinforcement learning from human feedback in conversational tasks,7 ChatGPT can consider the complexity of users’ intentions to respond effectively to a variety of end-user tasks, such as medical queries.

A growing amount of medical data and the complexity of clinical decision-making could theoretically benefit clinicians through NLP tools, allowing doctors to make timely, informed decisions. In addition, technological advancements have democratized knowledge, enabling patients to access medical information without relying solely on healthcare professionals. Instead, they are increasingly using search engines, and now artificial intelligence chatbots, to find medical information.8

By engaging in conversational interactions, ChatGPT and other recent chatbots provide authoritative-sounding responses to complicated medical queries. Even though ChatGPT is a promising technology, it often produces inaccurate results, meaning caution is warranted when applying it to medical practice and research.9-13 These engines have not been evaluated for accuracy and reliability, especially in terms of open-ended medical questions that doctors and patients might ask.10-12

Our study aims to assess ChatGPT’s ability to generate accurate and comprehensive responses to nutritional queries created by nutritionists/dieticians. In addition, this will provide an early indication of ChatGPT’s reliability as a provider of accurate and complete information. Furthermore, this study will highlight limitations and propose an approach for addressing them.

Methods

Ethical considerations

All participants gave written informed consent. Ethical approval was not required as the study had low risk to participants.

Study design

The study uses a case study approach to achieve the research objectives stated in the earlier section. This approach provides rich and detailed data that can be used to gain a deep understanding of a particular case, and it allows for the exploration of complex phenomena that cannot be easily studied through other research methods.14 Although there are limitations to the case study method, it is one of the most useful tools for the exploratory study of abstract and evolving phenomena. The type of case study used here is the illustrative case study.15 The approach is borrowed from functional testing and quality assurance practices in software development. Functional testing involves creating test cases based on the functional requirements of the software application; these test cases are designed to evaluate whether the software performs as expected. Functional testing is typically performed using black-box testing techniques, meaning the tester does not have access to the source code of the software application. In this case, ChatGPT acts as a black box for the researchers involved in this study.

To evaluate the performance of ChatGPT in medical nutrition therapy, a well-defined study protocol was used. The steps followed in the study were as follows:

Step 1: Creation of questions (scenarios) of varied complexity by public health professionals. The questions were selected by a licensed medical nutrition therapist working in the UAE. The selected scenarios ranged from a simple diet consultation to a patient with comorbid conditions.

Step 2: ChatGPT’s response to each scenario was recorded for further analysis.

Step 3: The responses from Step 2 were evaluated by a registered nutritionist for accuracy.

Step 4: Based on the accuracy, a potential-of-harm score for each response was created.

Step 5: Data was summarized and analyzed by the expert group used in Step 1.
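The five-step protocol above can be sketched as a simple data pipeline. This is an illustrative reconstruction only; the class, field, and function names below are the author of this sketch's assumptions, not part of the study's materials:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CaseScenario:
    """One test case in the study protocol (field names are illustrative)."""
    description: str                # Step 1: scenario created by the expert group
    complexity: int                 # 1-10, assigned by the expert group
    response: str = ""              # Step 2: ChatGPT output, recorded verbatim
    accuracy: Optional[int] = None  # Step 3: 1-10, rated by the registered nutritionist
    harm: Optional[int] = None      # Step 4: 1-10, Delphi consensus potential-of-harm score

def summarize(cases):
    """Step 5: tabulate the three scores for the expert group's analysis."""
    return [(c.complexity, c.accuracy, c.harm) for c in cases]

case1 = CaseScenario("35-year-old female to reduce 10 kg in a month",
                     complexity=2, response="...", accuracy=8, harm=1)
print(summarize([case1]))  # [(2, 8, 1)]
```

Each scenario thus carries its three scores through the pipeline, matching the rows later reported in Table 1.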

Sample

The expert group for deciding complexity comprised five public health professionals working in the United Arab Emirates. The experts were selected from Gulf Medical University, UAE, using nonrandom purposive sampling. The inclusion criteria for the experts were a master’s degree and more than five years of clinical experience. The researchers approached seven healthcare professionals, of whom five agreed to be part of the expert group. The researchers aimed to recruit five to nine experts, as a larger group is difficult to handle and a smaller one may result in bias.

For accuracy, one registered nutritionist’s rating was taken in Step 3. The nutritionist gave a score on an ordinal scale of one to ten, with one being least and ten being most accurate.

To ascertain the potential of harm in Step 4, all five experts described earlier worked together.

The method utilized for reaching consensus was the Delphi method, depicted in Figure 1.16 Using the steps mentioned above and the data provided in the supporting material, the reproducibility of the research can be established. Again, a scale of one to ten was used to ascertain the potential of harm, with one being lowest and ten being highest.


Figure 1. Approach for Delphi method used in the study.

Source: Author’s Compilation.

The Delphi method is a structured communication technique originally developed as a systematic, interactive forecasting method that relies on a panel of experts. The experts answered questionnaires in three rounds. After each round, a researcher (VM) provided an anonymous summary of the experts’ judgments from the previous round, together with the reasons they gave for those judgments. The experts were thus encouraged to revise their earlier answers in light of the replies of the other panel members. It was observed that the range of the answers decreased and the group converged. The process was stopped after a predefined stopping criterion, and the median scores of the final round determined the results (Figure 1).
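As a toy illustration of the convergence behaviour described above, the rounds can be simulated with expert scores revised toward the anonymous group median. The `pull` revision weight and fixed round count are assumptions for the sketch, not figures from the study:

```python
from statistics import median

def delphi(initial_scores, rounds=3, pull=0.5):
    """Toy simulation of Delphi consensus: each round, every expert sees
    the anonymous group median and revises partway toward it; the group
    median after the final round is taken as the consensus result."""
    scores = list(initial_scores)
    for _ in range(rounds):
        group_median = median(scores)
        scores = [s + pull * (group_median - s) for s in scores]
    # Return the consensus and the remaining spread of opinions.
    return median(scores), max(scores) - min(scores)

consensus, spread = delphi([3, 5, 8, 6, 4])
print(consensus, spread)  # the spread shrinks well below the initial range of 5
```

The shrinking spread mirrors the observation in the study that the range of answers decreased and the group converged.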

The conceptual definitions of the terms used in the study are as follows:

Clinical Accuracy: “A clinical accuracy is a qualitative approach that describes the clinical outcome of basing a treatment decision on the result of a measurement method being evaluated”.17

Complexity of the Clinical Problem: “Clinical complexity is a protean term encompassing multiple levels and domains. Illustratively, a prominent concern in health care involves a multiplicity of disorders and conditions experienced by a person along with their cross-sectional and longitudinal contexts”.18

Potential for Harm: “Harm means an injury to the rights, safety or welfare of a research participant that may include physical, psychological, social, financial or economic factors”.19

Results & discussion

This section discusses the results obtained from the illustrative case study method described in the earlier section.

Case 1: 35-year-old female to reduce 10 kgs in a month

The question is simple, with age, gender, and a weight-loss goal provided. The response emphasizes the importance of sustainable weight loss and the potential risks of rapid weight loss. The provided diet chart is low-carb and high in fiber and protein, suitable for the given condition. There is negligible risk for a user following this diet unless they have comorbid conditions. However, the statement that a diet chart need not contain caloric information is not true, as such information serves as a guideline for achieving the caloric deficit that aids weight loss. In terms of the evaluation criteria, the response receives a complexity score of 2, an accuracy score of 8, and a potential risk score of 1. This result suggests that when complexity is lower, the accuracy of the information is higher and the potential risk is lower.

Case 2: 35-year-old female with BMI 34 to reduce weight

The question is slightly more complex than the previous one, with the addition of BMI information. For a person with a BMI of 34, a calorie-deficit diet is required for weight loss. The diet, however, does not specify the amount of oil to be consumed, which can significantly increase the calorie count. The diet is not specific and portions are assumed, which may result in a diet of around 1400-1500 calories, which may not be enough to achieve the target weight loss. A layperson following this guide may not achieve their weight-loss target, as the diet provided is not guided. In terms of the evaluation criteria, the response receives a complexity score of 3, an accuracy score of 7, and a potential risk score of 3. This result again supports the finding of Case 1: an increase in the complexity score reduces the accuracy while increasing the potential to harm.

Case 3: 35-year-old female with BMI 34 also having PCOS to reduce weight

The complexity of the question increases with the addition of the condition of PCOS. The diet provided is similar to that of the previous question, with the addition of extra guidelines for PCOS, which are general information. However, the diet is not specific to the condition, and a user following it may not achieve their weight-loss target, though they are not at potential risk of harm. In terms of the evaluation criteria, the response receives a complexity score of 4, an accuracy score of 6, and a potential risk score of 4. The result of this case also concurs with the hypothesis that greater complexity of the question results in lower accuracy and a higher risk of harm.

Case 4: 40-year-old male with diabetes

The question is complex due to the mention of diabetes, which requires consideration of many factors before preparing a diet chart. A simple statement of diabetes does not provide enough information; the patient should be asked about the type of diabetes, medications, and recent blood reports. Calories, BMI, and current physical activity are critical considerations for a diabetic diet. The patient is at risk of developing hypoglycemia if they are on insulin and have a low BMI or high activity levels. A dietitian would consider all these factors while preparing a plan for a diabetic patient. In terms of the evaluation criteria, the response receives a complexity score of 4, an accuracy score of 5, and a potential risk score of 6. At the same complexity score of four, the response for this older patient has lower accuracy and a higher potential for harm. Does age contribute to the potential to harm? This question needs to be tested further empirically.

Case 5: 40-year-old male with diabetes and CKD

The complexity of the question increases with the addition of chronic kidney disease (CKD), which requires consideration of several factors while preparing a diet chart, such as the stage of CKD and the current levels of potassium and sodium in the blood. The response receives a low accuracy score of 4, as the diet generated does not mention limiting sodium intake to at most 1.5 g/day, which is essential for CKD patients. Additionally, the diet contains high sources of protein, 75-80 g, which is much higher than what is recommended for a CKD patient and is not calculated per patient weight and CKD stage. As a result, a layperson following this diet may be at high risk, reflected in a potential risk score of 8. In terms of the evaluation criteria, the response receives a complexity score of 5, an accuracy score of 4, and a potential risk score of 8. Increasing the complexity of the query from 4 to 5 increases the potential risk from 6 to 8. This suggests that, after a point, the potential harm increases exponentially with increasing complexity. This phenomenon needs to be tested further empirically.

Case 6: 40-year-old male with diabetes, hypertension, and CKD

The complexity of the question increases with the addition of hypertension as a comorbidity. However, the diet chart provided is the same as in the previous question, which does not pose much risk for diabetes and hypertension but poses all the risks previously mentioned for CKD. Therefore, the response receives a low accuracy score of 4. Additionally, patients need to be educated about sugar and salt sources, and general guidelines are not enough; measurements should be incorporated into the diet plan itself to avoid potential risks. As a result, the response receives a high potential risk score of 8. In terms of the evaluation criteria, the response receives a complexity score of 6, an accuracy score of 4, and a potential risk score of 8. This finding does not concur with that of Case 5: the increase in potential harm was not exponential, as the potential risk remained the same despite the increase in complexity.

Case 7: 40-year-old male with diabetes, hypertension, and CKD Indian with a gluten allergy

The complexity of the question increases with the addition of gluten sensitivity. As a result, a proper dietitian is required to prepare a diet plan that takes into consideration the patient’s multiple comorbidities and dietary restrictions. However, the response is lacking in accuracy, as it does not provide any specific information on how to prepare a diet plan for a person with these conditions; therefore, the accuracy score is low at 4. Additionally, without a specific diet plan, there is potential risk for a patient with so many comorbidities and dietary restrictions. As a result, the response receives a high potential risk score of 8. In terms of the evaluation criteria, the response receives a complexity score of 7, an accuracy score of 4, and a potential risk score of 8. The findings from this case concur with those from the earlier cases: the increase in complexity is inversely proportional to accuracy and directly proportional to the potential risk.

Case 8: 30-year-old female height 150 cm weight 80 kg, having PCOS, hypothyroidism, insulin resistance with gluten sensitivity, HbA1c 6% for weight loss

This question is extraordinarily complex, with multiple parameters given, including the condition of hypothyroidism. The ideal diet for this patient is a low-carb, high-protein, anti-inflammatory diet, with the need to avoid goitrogenic foods like soy products. However, the accuracy of the given information is relatively low, and there is still a potential risk for the patient if not properly guided by a qualified dietitian. The overall scores are as follows: Complexity - 8, Accuracy - 3, Potential for Risk - 6. This case has the highest complexity and hence the minimum accuracy. The risk score was expected to be the highest, but that is not the case. This finding does not concur with the findings from the earlier seven cases.

The summary of the analysis of eight cases is listed in Table 1.

Table 1. Summary of the illustrative case study analysis.

Case number | Complexity | Accuracy | Potential to harm
Case 1 | 2 | 8 | 1
Case 2 | 3 | 7 | 3
Case 3 | 4 | 6 | 4
Case 4 | 4 | 5 | 6
Case 5 | 5 | 4 | 8
Case 6 | 6 | 4 | 8
Case 7 | 7 | 4 | 8
Case 8 | 8 | 3 | 6
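As a quick check a reader can reproduce (this calculation is not part of the study's own analysis), the raw scores in Table 1 can be run through a plain Pearson correlation:

```python
# Scores transcribed from Table 1 (eight cases).
complexity = [2, 3, 4, 4, 5, 6, 7, 8]
accuracy   = [8, 7, 6, 5, 4, 4, 4, 3]
harm       = [1, 3, 4, 6, 8, 8, 8, 6]

def pearson(x, y):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

print(round(pearson(complexity, harm), 2))      # complexity vs potential to harm, ≈ 0.77
print(round(pearson(complexity, accuracy), 2))  # complexity vs accuracy, ≈ -0.93
```

With only eight cases this is descriptive, not inferential, but it quantifies the associations discussed case by case above.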

As depicted in Figure 2, as the complexity of the scenario increases, the risk potential also increases. This suggests that ChatGPT should be avoided for complex medical conditions/scenarios. The researchers believe that accuracy does not change much with an increase in complexity, and this needs to be evaluated further empirically.


Figure 2. Summary of the case analysis.

The findings of the study are supported by Johnson et al. (2023), who observed that ChatGPT can produce accurate information for diverse medical queries, as judged by academic physician specialists, although with important limitations; further research and model development are needed to correct inaccuracies and for validation.20 Another group of researchers found that ChatGPT provides medical information of comparable quality to available static internet information.21 Another recent study urges a cautious approach to the use of ChatGPT in clinical practice, lamenting that it does not provide references for its information and is therefore not reliable for clinical use. Thus, the findings of this study also suggest the cautious use of ChatGPT in medical nutrition therapy, as irresponsible use has the potential to harm the user.

This study, which assesses the accuracy and potential risks of nutrition therapy information provided by ChatGPT as evaluated by nutritionists and a group of experts, has several limitations that must be considered when interpreting its results. ChatGPT’s responses are based on the information available up to its last training data, which might not include the latest research or updated guidelines in nutrition therapy; this time lag can introduce a bias towards outdated practices or miss new evidence-based approaches. The accuracy and risk assessments made by the nutritionists and experts are subjective and can vary based on their individual experiences, knowledge, and biases, which can introduce both direction and magnitude biases in the evaluation process. The experts and nutritionists might also have preconceived notions about the reliability of AI-generated information, which could influence their assessment of ChatGPT’s responses, either positively or negatively. Finally, the range and type of nutrition therapy questions asked may not comprehensively cover the vast field of nutrition. Thus, the study’s findings might not be generalizable across all areas of nutrition therapy.

Conclusion

The primary objective of the present study was to assess the accuracy and comprehensiveness of ChatGPT’s responses to nutritional queries generated by nutritionists/dieticians. To achieve this, an in-depth case study approach was employed. Functional testing was conducted by creating test cases that aligned with the functional requirements of the software application. ChatGPT’s responses were evaluated and analyzed in different scenarios that involved medical nutritional therapy, varying in complexity. The accuracy of the generated data was assessed by a registered nutritionist, and a potential harm score was used to evaluate the responses provided by ChatGPT.

When several case scenarios with varying levels of complexity were evaluated for their risk potential, it was demonstrated that as the complexity increased, so did the potential risk. The study suggests that ChatGPT should not be used for complex medical nutrition situations and conditions, even though the accuracy of the generated response does not change much with the complexity of the case scenario.

The study’s findings have important clinical implications for practitioners, particularly nutritionists and dieticians, who may use ChatGPT or similar AI-powered tools in their practice. Practitioners should exercise caution and avoid relying solely on ChatGPT for complex cases that require specialized knowledge and expertise.

The study’s findings underscore the importance of using ChatGPT or similar AI-powered tools appropriately in clinical practice. It should not be used as a replacement for professional judgment or clinical decision-making, particularly in complex medical nutrition situations. Practitioners, especially nutritionists and dietitians, should consider ChatGPT as a complementary tool to support their clinical practice, and not solely rely on it for making critical nutrition-related decisions. This study emphasizes the importance of human verification and not solely relying on AI-generated information.

The findings of the study have important implications for policymakers. One key recommendation is to exercise caution when implementing generative AI, such as ChatGPT, in clinical practice. Rushing to adopt such tools without thorough evaluation and validation may not be advisable. While generative AI has the potential to improve efficiency in healthcare operations, it should be considered as a decision support system for registered practitioners, rather than a standalone tool for making clinical decisions.

It is important to note that patients should not rely solely on generative AI for self-medication or medical nutrition therapy, especially in situations where multiple health conditions (comorbidities) are involved. This is because generative AI tools like ChatGPT may not have the ability to fully assess and address the complexities of comorbid conditions, which could potentially result in harm to patients.22

In conclusion, a collaborative effort involving all stakeholders in healthcare education, research, and practice is urgently needed to establish guidelines for the responsible use of ChatGPT by educators, researchers, and practitioners.

Limitations of the study

The study used a small sample size, which could affect the accuracy of the results. Another limitation is the dynamic nature of the technology: since it is constantly evolving and improving, the results of the study may need to be reevaluated after a few days or weeks to account for any changes or updates. Additionally, the study’s reliance on only one nutritionist to assess accuracy introduces the possibility of bias and human error.

How to cite this article: Mishra V, Jafri F, Abdul Kareem N et al. Evaluation of accuracy and potential harm of ChatGPT in medical nutrition therapy - a case-based approach [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2024, 13:137 (https://doi.org/10.12688/f1000research.142428.1)

Open Peer Review

Podszun MC (University of Hohenheim, Stuttgart, Germany). Reviewer Report, 06 May 2024: Not Approved (https://doi.org/10.5256/f1000research.155982.r258226)

Kirk D (Wageningen University & Research, Wageningen, The Netherlands; Department of Twin Research & Genetic Epidemiology, King’s College London, UK). Reviewer Report, 25 Apr 2024: Approved with Reservations (https://doi.org/10.5256/f1000research.155982.r265582)
