Keywords
Artificial intelligence, Type 1 diabetes, Dietary recommendations, ChatGPT, Bard AI, Bing AI, Digital health
Effective dietary management is essential for individuals with type 1 diabetes (T1D). Artificial intelligence (AI) tools such as ChatGPT-4o, Bard AI, and Bing AI are increasingly being used to assist in healthcare tasks, including nutrition advice. This study evaluates the performance of these AI models in generating dietary recommendations when compared to input from human dietitians.
Sixty expert-written, hypothetical T1D patient cases were submitted to ChatGPT-4o, Bard AI, and Bing AI. Each model’s responses were assessed as either “Correct” or “Incomplete” relative to dietitian recommendations. Descriptive statistics and McNemar’s test were used to compare pairwise performance across models.
ChatGPT-4o provided correct recommendations in 60% of cases, followed by Bard AI (50%) and Bing AI (26.7%). McNemar’s test showed that ChatGPT-4o significantly outperformed Bing AI (p < 0.05), while performance differences between ChatGPT-4o and Bard AI were not statistically significant (p > 0.05). ChatGPT-4o demonstrated superior resilience across case complexity levels, showed the highest rate of unique correct answers, and exhibited only modest agreement with other models. This highlights ChatGPT-4o’s relative independence and robustness. An interactive version of the analysis can be accessed here: https://cnpdata.shinyapps.io/ai_diabetes/
ChatGPT-4o generated more accurate dietary suggestions than Bing AI and performed comparably to Bard AI. However, AI tools still lack the contextual nuance of human dietitians and should be used to supplement, rather than replace, professional guidance in diabetes care.
Diabetes mellitus represents a growing global public health concern. Although previously more prevalent in high-income countries, the condition now affects populations worldwide. This is largely due to increasingly sedentary lifestyles and the consumption of energy-dense, nutrient-poor diets (Saeedi et al., 2019). Type 1 diabetes (T1D), in particular, is characterised by the autoimmune destruction of pancreatic β-cells, leading to insufficient insulin production and the abnormal regulation of blood glucose levels. Maintaining blood glucose within the target range of 70–180 mg/dL is essential to avoid acute complications such as hypoglycaemia and hyperglycaemia (Ogurtsova et al., 2017). However, achieving this target remains a challenge for most individuals with T1D.
The global burden of diabetes has grown rapidly. The World Health Organization estimated 422 million cases as of 2014, a fourfold increase since 1980, with approximately two million deaths attributed to the disease by 2019 (Tyler & Jacobs, 2020). Challenges for conventional diabetes care include late diagnosis, limited access to multidisciplinary care, and the need for frequent patient follow-up. Diabetes management also requires active, constant self-regulation by patients, especially in their eating choices, which can considerably affect glycaemic control (Hermanns et al., 2022). There is increasing demand for scalable digital tools to support nutrition guidance, especially in resource-limited settings.
Recently, advanced AI chatbots built upon large language models (LLMs), such as ChatGPT, Google Bard, and Microsoft Bing Chat, have been investigated in the context of diabetes nutritional counselling. Such systems can digest a patient’s profile and generate personalised meal recommendations, potentially extending professional guidance to far more people. For example, a prototype “AI dietitian” leveraging ChatGPT yielded meal plans consistent with expert nutrition standards: registered dietitians rated 96% of its responses as appropriate (Sun et al., 2023). Others have noted that ChatGPT could support patient education and individualised dietary advice during diabetes care (Sridhar & Gumpeny, 2025). By embedding LLM chatbots in apps or web portals, patients can receive tailored, on-demand nutrition advice without waiting for clinic appointments, improving accessibility and scalability (Kassem et al., 2025).
These AI tools could also integrate with wearable health sensors, such as continuous glucose monitors, to adjust recommendations in real time (Wang et al., 2025). This can further enhance personalisation. Nevertheless, current LLMs are not diabetes-specific and can “hallucinate” errors if unchecked (Sridhar & Gumpeny, 2025). Any AI-generated diet plan must, therefore, be rigorously evaluated for medical accuracy and appropriateness before clinical use. Ultimately, robust human oversight and validation are essential to ensure that AI-driven guidance is safe, effective, and clinically aligned.
The goals of this study are to:
1. Assess whether the dialogue models ChatGPT-4o (OpenAI), Bard AI (Google), and Bing AI (Microsoft) can provide dietary recommendations for hypothetical individuals with T1D.
2. Compare those recommendations against reference recommendations from a licensed human dietitian.
The results may help establish the extent to which these tools can serve as reliable, inexpensive adjuncts to diabetes nutrition education and self-management.
This comparative analysis evaluated the performance of three major LLMs, ChatGPT-4o (OpenAI), Bard AI (Google), and Bing AI (Microsoft), in generating dietary recommendations for patients with T1D. A panel of clinical experts developed 60 hypothetical T1D cases, each incorporating pertinent clinical and lifestyle information such as insulin regimen, weight, activity level, and glucose targets.
Each case was submitted separately to all three AI models, which were prompted to perform a full dietary assessment and produce an individualised nutrition plan. The outputs were then compared against the standard of care provided by a registered clinical dietitian.
AI responses were independently reviewed by two clinical experts and classified as one of the following two options:
• Correct: Complete and clinically appropriate dietary guidance, comparable to that of a human dietitian.
• Incomplete: Missing essential components, overly generic, or lacking contextual relevance.
Discrepancies between reviewers were resolved by discussion.
The 60 hypothetical T1D cases were developed using a systematic approach to ensure clinical realism and scientific integrity.
All statistical analyses were conducted in R version 4.4.0 using the dplyr, ggplot2, irr, DescTools, and reshape2 packages. Initial analysis quantified the frequency and percentage of correct versus incomplete responses for each model, reported as counts and percentages.
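As a minimal sketch of this tabulation, the code below computes per-model counts and percentages, assuming the graded outcomes sit in a data frame with one binary column per model. The simulated values and the column names (chatgpt, bard, bing) are illustrative, not the study’s actual dataset or variable names.

```r
library(dplyr)

# Hypothetical data frame: one row per case, 1 = Correct, 0 = Incomplete.
# Simulated values stand in for the study's actual graded outcomes.
set.seed(1)
results <- data.frame(
  chatgpt = rbinom(60, 1, 0.60),
  bard    = rbinom(60, 1, 0.50),
  bing    = rbinom(60, 1, 0.27)
)

# Counts and percentages of correct responses per model
results %>%
  summarise(across(everything(),
                   list(n_correct = sum, pct_correct = ~ mean(.x) * 100)))
```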
To evaluate statistical differences in performance between models, McNemar’s test was applied to pairwise comparisons of model outputs. Cohen’s kappa coefficients were computed for each pair of models to assess the degree of agreement beyond chance. The magnitude of agreement was interpreted using standard benchmarks: slight (κ = 0.01–0.20), fair (κ = 0.21–0.40), moderate (κ = 0.41–0.60), substantial (κ = 0.61–0.80), and almost perfect (κ = 0.81–1.00). Each coefficient was reported with its corresponding p-value.
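The pairwise tests described here can be reproduced with base R’s mcnemar.test() and the irr package’s kappa2(). The sketch below applies them to one model pair using the illustrative results data frame from the previous example.

```r
library(irr)  # provides kappa2()

# Paired 2x2 contingency table and McNemar's test for one model pair
mcnemar.test(table(ChatGPT = results$chatgpt, Bing = results$bing))

# Cohen's kappa (agreement beyond chance) for the same pair,
# reported with its corresponding p-value
kappa2(results[, c("chatgpt", "bing")])
```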
A complementarity framework was applied to examine the unique and overlapping contributions of each model. Each case was categorised according to whether one, several, or none of the models provided a correct response. Frequencies for each pattern were calculated to reveal instances where specific models demonstrated unique value.
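A possible implementation of this categorisation, again on the illustrative results data frame:

```r
# Number of models correct per case
n_correct <- rowSums(results)

# Complementarity patterns: cases where exactly one model was correct
# (attributed to that model), all were correct, or none were correct
unique_correct <- colSums(results[n_correct == 1, ])
all_correct    <- sum(n_correct == 3)
none_correct   <- sum(n_correct == 0)
unique_correct; all_correct; none_correct
```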
To assess model sensitivity to task difficulty, each case was classified into one of three complexity categories based on the number of models that provided a correct response: Simple (all three correct), Moderate (two correct), and Complex (zero or one correct). Model accuracy was then computed within each complexity stratum to explore performance variation across difficulty levels. Model outputs were binarised (1 = Correct, 0 = Incomplete), and Pearson correlation coefficients were calculated among the models to assess pairwise linear relationships in decision patterns.
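Under the same assumptions, the stratification and correlation steps might look as follows:

```r
# Stratify cases by how many models answered correctly:
# 0-1 correct = Complex, 2 = Moderate, 3 = Simple
complexity <- cut(n_correct, breaks = c(-1, 1, 2, 3),
                  labels = c("Complex", "Moderate", "Simple"))

# Per-model accuracy within each complexity stratum
aggregate(results, by = list(complexity = complexity), FUN = mean)

# Pairwise Pearson correlations between the binary model outputs
round(cor(results), 2)
```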
A heatmap visualisation of the correlation matrix was used to highlight areas of redundancy or divergence in performance. All tests were two-tailed, and statistical significance was set at an alpha level of 0.05. Finally, the shiny R package was used to create an interactive version of the analysis.
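A sketch of such a heatmap using ggplot2 and reshape2, continuing from the illustrative correlation matrix above:

```r
library(ggplot2)
library(reshape2)

# Melt the correlation matrix to long form and draw a tile heatmap
cor_long <- melt(cor(results))
ggplot(cor_long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Pearson r")
```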
Descriptive analysis of response outcomes revealed clear differences in accuracy among the AI models. ChatGPT-4o demonstrated the highest correctness rate, answering 36 of 60 cases correctly (60%). Bard AI correctly solved 30 cases (50%), while Bing AI generated only 16 correct responses (26.7%) (Table 1, Figure 1).
Model | Correct | Incomplete
---|---|---
ChatGPT-4o | 36 (60%) | 24 (40%)
Bard AI | 30 (50%) | 30 (50%)
Bing AI | 16 (26.7%) | 44 (73.3%)
To statistically evaluate differences in model performance across paired comparisons, McNemar’s chi-square test was conducted for each model pair. The comparison between ChatGPT-4o and Bard AI did not yield a statistically significant difference in accuracy (χ² = 1.14, p = 0.29). However, significant differences were observed between ChatGPT-4o and Bing AI (χ² = 10.03, p = 0.002), as well as between Bard AI and Bing AI (χ² = 5.63, p = 0.02). These findings indicate that both ChatGPT-4o and Bard AI outperform Bing AI at a statistically meaningful level, while the gap between ChatGPT-4o and Bard AI is not significant (Table 2).
Although overall accuracy is informative, it is also important to examine how frequently the models agree with one another. Cohen’s kappa was calculated for each model pair to quantify agreement beyond chance. Agreement between ChatGPT-4o and Bard AI was modest (κ = 0.27, p = 0.035), which falls in the “fair agreement” range. Agreement between ChatGPT-4o and Bing AI was slightly negative (κ = -0.097, p = 0.34), indicating less-than-chance agreement. Bard AI and Bing AI yielded a κ of 0 (p = 1), reflecting no agreement beyond chance (Table 3).
Complementarity, or the ability of one model to independently provide correct answers when the others fail, was examined. In 10 cases, ChatGPT-4o uniquely answered correctly while Bard AI and Bing AI failed. Bard AI and Bing AI each had four uniquely correct responses. Only four cases were solved correctly by all three models, while 12 cases were missed by all. These results align with the overall accuracy findings and demonstrate that ChatGPT-4o not only outperforms the other two AI models but also provides the most correct responses in cases where Bard AI and Bing AI both fail (Figure 2).
To explore performance sensitivity across varying levels of question difficulty, the dataset was stratified into three categories: Simple (all three models correct), Moderate (two models correct), and Complex (zero or one model correct). Accuracy rates across these categories are visualised in Figure 3. In Simple cases, all models performed at 100% accuracy by definition. In Moderate cases, ChatGPT-4o continued to perform strongly (86.4% accuracy), while Bing AI dropped to 30.8%. The trend held in Complex cases, where ChatGPT-4o maintained 33.3% accuracy, outperforming Bard AI (13.3%) and Bing AI (13.3%). These findings suggest that ChatGPT-4o is more resilient to complexity, while the other models degrade more sharply under challenging conditions (Figure 3).
Finally, the degree of linear correlation between model outputs was examined using a Pearson correlation heatmap (Figure 4). The correlation matrix confirms limited agreement between model predictions, consistent with the previous findings. Notably, ChatGPT-4o and Bard AI showed a moderate positive correlation (r = 0.27), while ChatGPT-4o and Bing AI displayed a weak negative correlation (r = -0.12) (Figure 4).
An interactive version of the analysis can be accessed here: https://cnpdata.shinyapps.io/ai_diabetes/
The above comparative analysis demonstrates that LLM chatbots can often produce appropriate dietary guidance for T1D patients, with ChatGPT-4o performing best. In this study, ChatGPT-4o provided fully correct nutrition plans in 60% of cases, compared to 50% for Google Bard and 26.7% for Microsoft Bing. Both ChatGPT-4o and Bard significantly outperformed Bing AI (McNemar’s test, p < 0.05), while the difference between ChatGPT-4o and Bard was not statistically significant. Notably, in 10 cases, ChatGPT-4o uniquely answered correctly while both other models failed. Only four cases were solved correctly by all three models.
These results are consistent with emerging literature on AI-supported nutrition counselling. Recent studies have found that ChatGPT’s dietary advice often aligns with expert standards. Sun et al. (2023), for instance, showed that ChatGPT’s nutrition recommendations for diabetes were largely consistent with best practices, and professional dietitians rated most of its answers as appropriate. Another analysis likewise found that ChatGPT-generated meal plans for diabetes closely follow the American Diabetes Association’s plate method (Chatelan et al., 2023). In that evaluation, the meals included the recommended balance of non-starchy vegetables, proteins, and carbohydrates.
However, that evaluation also highlighted some of ChatGPT’s known limitations. Repeated queries could yield varying menus due to the model’s non-determinism (Chatelan et al., 2023), and some menus included suboptimal or inappropriate foods without warning (e.g. spinach and avocado in a renal diet plan). In the present study, even ChatGPT-4o left a substantial minority of scenarios unaddressed, indicating that current chatbots still miss important details in complex diabetic meal planning. These findings reinforce that although LLMs can simulate expert guidance, their outputs must be carefully reviewed, because the models can occasionally produce misleading or incomplete advice.
There are several reasons why ChatGPT-4o may outperform the other models in generating dietetic recommendations. First, its large model size and the breadth of its training data, spanning medical literature, nutritional science (Garcia, 2023), and conversational language (Sharma & Gaur, 2024), allow the model both to answer scientifically grounded questions and to generalise across varied dietary situations. This extensive pre-training enables ChatGPT-4o to emulate evidence-based clinical inference and personalised nutritional advice (Garcia, 2023; Sharma & Gaur, 2024). ChatGPT-4o also works in real time and tends to generate responses that are immediate, uniform, and standardised. In contrast, dietitians’ recommendations can vary according to individual experience and interpretation (Bayram & Ozturkcan, 2024; Papastratis et al., 2024). These characteristics likely enabled ChatGPT-4o to (a) exhibit higher performance than Bard AI and Bing AI and (b) perform similarly to, or better than, human dietitians in controlled settings (Papastratis et al., 2024).
From a practical standpoint, AI-based tools such as ChatGPT-4o might function as useful supplements for dietitians engaged in T1D care (Tyler & Jacobs, 2020; Guan et al., 2023). In particular, they have the potential to provide rapid, accessible, guideline-adherent recommendations that improve care delivery, particularly in areas with limited access to trained nutrition specialists (ElSayed & Aleppo, 2023). This is especially relevant in low-resource settings, where AI might function as a ‘gap filler’ in patient care (Hou et al., 2023). Such tools could also promote patient self-management by providing users with instant, individualised dietary advice (Garcia, 2023; Sharma & Gaur, 2024). As noted, they should not replace human insight and clinical (re)assessment, especially in complex cases that require nuanced judgement (Papastratis et al., 2024; Hieronimus et al., 2024). That said, they could play a valuable role in diabetes care by extending coverage, amplifying effectiveness, and empowering patients (Tyler & Jacobs, 2020; Guan et al., 2023).
The evolving literature favours extending AI’s reach into the broader realm of diabetes care. Recent evidence suggests that AI-enabled systems, including LLMs, have considerable potential to support decision-making, remote monitoring, and personalised treatment (Tyler & Jacobs, 2020; Guan et al., 2023; ElSayed & Aleppo, 2023). Technologies that enhance patient access and engagement while promoting glycaemic control, especially in underserved populations, are already in use (Hou et al., 2023; Daly & Hovorka, 2021). The present study therefore not only supports the utility of ChatGPT-4o as an adjunct dietary tool but also adds to the expanding range of roles AI technologies play in precision medicine and chronic condition management (Guan et al., 2023; Tyler & Jacobs, 2020).
The potential clinical impact of reliable AI dietary advice in diabetes is substantial. Sun et al. have, for instance, emphasised how diabetes nutrition management is often hindered by a “low supply of registered clinical dietitians” (2023, p. 1). In many health systems, patients face long waits or a lack of coverage for nutrition counselling. An AI chatbot such as ChatGPT can, in contrast, provide on-demand, personalised guidance 24 hours a day. ChatGPT is also available via smartphone or computer and can deliver immediate meal suggestions or educational explanations. This can help patients make healthier choices between clinic visits. Early evidence also suggests that people find such AI guidance motivating. A recent study reported that patients using ChatGPT described receiving informational support, personalised recommendations, reminders, and encouragement, with an overall “positive impact on [diabetes mellitus] self-management” (Alanzi et al., 2025, p. 1). Thus, if properly integrated (e.g. with glucose-monitoring apps or telehealth services), AI-driven advice could reinforce dietary adherence and thereby improve glycaemic control and reduce complications.
This study has several strengths that support its credibility and clinical value. First, the carefully constructed, contextualised hypothetical cases closely mirror the dietary scenarios clinicians face during daily T1D management (Joslin, 2021), enhancing the generalisability of the results to real healthcare conditions. Second, evaluating three industry-standard AI models (ChatGPT-4o, Bard AI, and Bing AI) side by side affords a holistic view of AI performance in dietary recommendation tasks (Papastratis et al., 2024; Sharma & Gaur, 2024). Finally, the use of well-established statistical approaches (e.g. McNemar’s test and Cohen’s kappa) provided robust evidence for comparing the models’ relative accuracy (Garcia, 2023), increasing the credibility of the results and the appropriateness of the conclusions regarding AI’s comparative effects in dietetic practice (Sharma et al., 2023).
Despite these strengths, the study has limitations. Although designed by experts, the hypothetical cases cannot fully reproduce the many facets of care for actual patients, which may limit the generalisability of the results to live clinical settings (ElSayed & Aleppo, 2023). The AI-generated dietary recommendations were also not tested in clinical trials or linked to patient outcomes, so no conclusions can be drawn about their clinical efficacy or safety (Papastratis et al., 2024).
Another limitation concerns the classification criteria: judging responses as “Correct” versus “Incomplete” may introduce bias, since such decisions entail some subjectivity even when grounded in expert feedback (Hieronimus et al., 2024). Moreover, as AI continues to develop, tools such as ChatGPT-4o, Bard AI, and Bing AI will improve rapidly, and the results of this study may become outdated as newer model versions are released (Sharma et al., 2023). Finally, each chatbot received a single prompt per case with no interactive follow-up. In real clinical use, patients or clinicians would likely refine their questions over multiple turns, which can improve accuracy.
Ethical and safety concerns must be considered when incorporating AI tools such as ChatGPT-4o into clinical nutrition and diabetes care. While it is possible that AI systems would provide convenient and timely dietary recommendations, rigorous validation and clinical oversight must be available to ensure that the advice is correct, safe, and consistent with evidence-based recommendations (Papastratis et al., 2024; Sharma et al., 2023). Unless thoroughly tested, AI-generated guidance runs the risk of recommending the wrong diet, particularly in complex cases (Hieronimus et al., 2024). It is also necessary to individualise recommendations according to each patient’s needs. AI systems might oversimplify or fail to consider the context of coexisting conditions, psychosocial influences, and cultural preferences. These are factors that a human dietitian tends to properly account for (Chatelan et al., 2023; Sharma et al., 2023).
In sum, the danger of excessive dependence on unsupervised AI is clear. Although AI can improve care, it should not substitute for the subtle clinical judgement and empathy of skilled professionals (Garcia, 2023; Sharma & Gaur, 2024). A balanced implementation of AI as a supportive, rather than equal, partner is critical to protect patient safety and maintain ethical standards during care provision (ElSayed & Aleppo, 2023).
Further investigations into AI-based dietary advice should validate its effectiveness using actual patient data and clinical outcomes; research based on real patient presentations and care would provide stronger evidence of AI’s utility and constraints in real-life scenarios. Longitudinal studies are also needed to determine the longer-range clinical effects of AI-guided dietary planning on glycaemic control, dietary adherence, and complication rates in people with T1D. Moreover, investigating how AI systems such as ChatGPT-4o can be integrated into the care ecosystem (e.g. electronic health records) could shorten the path to care. Such integration might also reduce the risk of recommendations based on a partial record by allowing the AI to draw on an individual’s full medical history.
Promising directions include hybrid models that combine AI-generated insights with clinician review, balancing efficiency and clinical safety. Such models could harness the AI’s speed and scale while retaining the human oversight that is critical for nuanced decision-making in patient care.
In conclusion, this comparative evaluation suggests that modern AI chatbots, particularly ChatGPT-4o, show real promise as supplementary tools in diabetes nutrition therapy. These models can rapidly generate individualised meal guidance and thereby potentially expand access to expert-level advice. This aligns with a growing consensus that AI can enhance chronic disease self-management; indeed, patients increasingly report benefits from using ChatGPT for health education and support.
Moving forward, clinical trials are needed to test whether integrating AI chatbots into diabetes care through mobile apps or telehealth platforms genuinely improves adherence, patient satisfaction, and metabolic outcomes. If successfully implemented, AI-driven dietary counselling could reduce the burden on healthcare providers and empower patients to more effectively manage their condition. This can ultimately improve both clinical outcomes and overall quality of life.
This study did not involve human participants, real patient data, or animal subjects; it used hypothetical clinical scenarios only. Ethical approval and informed consent were therefore not required. All AI-generated responses were assessed in a research context without interaction with actual individuals.
The AI models used in this study, including ChatGPT-4o, Bard AI, and Bing AI, were accessed via their respective web-based platforms. These tools are proprietary large language models (LLMs) developed by OpenAI, Google, and Microsoft, respectively. While the models themselves are proprietary and not available for direct download, access to these AI platforms can be obtained by visiting their respective websites:
• ChatGPT-4o: accessed via OpenAI’s platform.
• Bard AI: accessed via Google Bard.
• Bing AI: accessed via Microsoft Bing.
All datasets generated and analysed during this study are publicly available in an open-access data repository to facilitate transparency and replication. The complete data supporting the findings, including the values underlying the summary statistics (means, percentages), the numerical data used to construct all tables and figures, and the annotated responses from ChatGPT-4o, Bard AI, and Bing AI, can be accessed on Zenodo via the following persistent identifier: https://doi.org/10.5281/zenodo.16524383.
No restrictions or embargoes apply to the data, which are shared under a CC BY 4.0 International licence permitting reuse with attribution. This includes the hypothetical T1D patient case inputs, the AI-generated dietary recommendation outputs, and the expert evaluations used for scoring and statistical analyses.
If additional materials or clarifications are required to replicate the study, they can be obtained by contacting the corresponding author.