F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.178855.1

Brief Report

Articles

Understanding the Inner Workings of Large Language Models in Medicine

[version 1; peer review: 1 approved with reservations, 1 not approved]

Georg Fuellen (a, 1): Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Supervision, Validation, Writing – Original Draft Preparation, Writing – Review & Editing. https://orcid.org/0000-0002-4994-9829

Hans Jarchow (1): Data Curation, Formal Analysis, Investigation, Methodology, Writing – Original Draft Preparation, Writing – Review & Editing.

Johann-Christian Põder (2): Investigation, Methodology, Writing – Review & Editing.

1 Institute for Biostatistics and Informatics in Medicine and Ageing Research, Rostock University Medical Center, Rostock, Germany

2 Faculty of Theology, University of Rostock, Rostock, Germany

a fuellen@uni-rostock.de

No competing interests were disclosed.

5 5 2026

2026

669

6 3 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Large language models (LLMs) are increasingly influencing medical practice, education, and research. Their responsible integration into healthcare requires expertise in medical, ethical, practical, and theoretical domains.

Methods

We prompted GPT-o1 to generate examples illustrating how understanding transformer architecture can facilitate output interpretation. Key topics were extracted from its responses, and illustrative cases were validated using Consensus.app, an AI-based web-search tool.

Results

Five key topics were identified: (1) anticipating contextual focus in medical reasoning, (2) explaining “generic” or “textbook” responses, (3) understanding strengths and weaknesses in differential diagnosis, (4) explaining ambiguous or contradictory responses, and (5) identifying hallucinations in unfamiliar scenarios. Case examples highlight both benefits and limitations, including accurate attention to salient clinical details, reliance on generalized patterns, risks of base rate neglect in differential diagnosis, challenges of ambiguous prompts, and hallucinations in rare or underrepresented cases.

Conclusions

A theoretical understanding of LLMs is crucial for responsible clinical integration. Distinguishing between well-represented (short head) and underrepresented (long tail) knowledge, recognizing generic responses, and identifying hallucinations are essential competencies. Coupled with medical and ethical expertise, these skills will enable healthcare professionals to leverage LLMs effectively while mitigating risks.

Keywords: Natural Language Processing; Large Language Models; Medical Informatics; Ethics, Medical; Clinical Decision-Making

The author(s) declared that no grants were involved in supporting this work.

Introduction

Background: The transformative potential of generative AI and, specifically, of large language models (LLMs) is reshaping contemporary medical theory, practice and education. ^{2–5} It profoundly influences the future of healthcare as a system, and of its stakeholders. Alongside recognition of the strengths of AI, important questions about its accuracy and about its ethical implications in real-world clinical use have emerged. ⁶ The medical profession must shape its adoption, despite substantial uncertainties about future opportunities and risks. ^{7–9}

As LLM-generated content continues to improve, there is a growing risk that decreasing error rates could make us less cautious in vetting its outputs. This problematic development (“automation bias”) can be further reinforced by evidence that, in certain tasks, LLMs already deliver the best results without human intervention. ^{10–13} To harness the advantages of LLMs sensibly, ensuring a responsible integration of AI in healthcare, we must strive for profound understanding of these technologies, and it is thus crucial to develop and maintain expertise and skills in four key areas ^{14–18}:

1. Medical expertise: To be able to critically evaluate the validity of LLM-generated output.

2. Ethics expertise: To identify potential risks and to address instances of ethical dilemmas and violations of medical-ethical norms.

3. Practical knowledge: Familiarity with LLMs, by informed and critical use.

4. Theoretical knowledge: Knowledge of how LLMs operate, which allows for a more nuanced evaluation of LLM-generated content.

Our contribution: The focus of this paper is on the fourth skill, understanding the inner workings of LLMs to better interpret the contents they generate. One example of this skill is to discern whether the generated content reflects commonly available information that is well represented in the training data (short head knowledge), or whether it refers to underrepresented knowledge (long tail knowledge), potentially causing erroneous responses. These issues are at best mentioned in passing in recent reviews. ^{19–21} Despite growing interest in interpretability and transparency, there is limited empirical work exploring how theoretical understanding of LLM mechanisms can enhance clinicians’ ability to evaluate model outputs. This study addresses that gap. Using a self-referential design, we prompted OpenAI’s GPT-o1 to generate examples illustrating how theoretical insights into transformer architecture can facilitate interpretation of medical outputs. Key topics were extracted from its responses and validated using Consensus.app, an AI-assisted literature search platform. By examining these examples, we aim to clarify specific situations in which theoretical knowledge of LLMs provides tangible benefits for responsible and informed clinical use.

Objectives

Overall, we aim to explore how theoretical knowledge of LLMs, specifically their architecture and internal mechanisms, can support the interpretation of model-generated content in medical contexts. In addition, we delineate concrete examples and situational characteristics in which such theoretical understanding provides the most interpretive benefit.

Methods Overview

We conducted an exploratory analysis using AI-assisted tools, in a form of model introspection or self-reflection. An LLM was employed to generate key examples (key topics) that illustrate how theoretical knowledge of transformer architectures can support the interpretation and evaluation of AI-generated outputs in medical contexts. These key topics were then examined and substantiated using an AI-assisted literature search tool.

Large Language Models (LLMs)

LLMs are neural-network-based text processing tools composed of multi-head self-attention layers and multi-layer perceptrons (MLPs), trained on large volumes of text to predict the next word. During processing, multi-head attention allows the model to maintain several parallel foci on specific parts of the input, potentially connecting dispersed fragments of text, while the MLPs further integrate and transform these representations. ^{22,23} After a pipeline of attention and MLP layers, the next token is predicted. As of mid-December 2024, several LLMs were available; OpenAI’s GPT-o1 (professional mode) was chosen for this study because of its strong generative capabilities. Since this model lacked built-in web search features, its references were often hallucinated, which made an independent validation step necessary.
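The pipeline described above (parallel attention heads, followed by an MLP, followed by next-token prediction) can be sketched in a few lines of plain Python. This is a toy illustration, not the architecture of GPT-o1 or any production model: the weights are random rather than trained, positional information, layer normalization and the attention output projection are omitted, and all names (`attention_head`, `layer`, `make_params`, `next_token_distribution`) and sizes are assumptions made for this example.

```python
import math
import random


def softmax(xs):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Stand-in for trained weights: small random numbers."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def attention_head(tokens, wq, wk, wv):
    """One causal self-attention head: position i mixes values from positions 0..i."""
    queries = [matvec(wq, t) for t in tokens]
    keys = [matvec(wk, t) for t in tokens]
    values = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys[: i + 1]]
        weights = softmax(scores)  # the "focus" of this head at position i
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def mlp(vec, w1, w2):
    """Two-layer perceptron with a ReLU non-linearity."""
    hidden = [max(0.0, h) for h in matvec(w1, vec)]
    return matvec(w2, hidden)


def layer(tokens, heads, w1, w2):
    """Heads run in parallel; their concatenated output is then passed to the MLP."""
    per_head = [attention_head(tokens, *head) for head in heads]
    result = []
    for i, tok in enumerate(tokens):
        concat = [x for out in per_head for x in out[i]]
        attended = [a + b for a, b in zip(tok, concat)]  # residual connection
        result.append([a + b for a, b in zip(attended, mlp(attended, w1, w2))])
    return result


def make_params(vocab, d_model, n_heads, n_layers, seed=0):
    """Random (untrained) parameters; d_model must be divisible by n_heads."""
    rng = random.Random(seed)
    d_head = d_model // n_heads
    return {
        "embed": random_matrix(vocab, d_model, rng),
        "layers": [
            {
                "heads": [
                    tuple(random_matrix(d_head, d_model, rng) for _ in range(3))
                    for _ in range(n_heads)
                ],
                "w1": random_matrix(4 * d_model, d_model, rng),
                "w2": random_matrix(d_model, 4 * d_model, rng),
            }
            for _ in range(n_layers)
        ],
        "unembed": random_matrix(vocab, d_model, rng),
    }


def next_token_distribution(token_ids, params):
    """Embed the input, apply every layer, and score the vocabulary at the last position."""
    tokens = [params["embed"][t] for t in token_ids]
    for lyr in params["layers"]:
        tokens = layer(tokens, lyr["heads"], lyr["w1"], lyr["w2"])
    return softmax(matvec(params["unembed"], tokens[-1]))


params = make_params(vocab=6, d_model=4, n_heads=2, n_layers=2)
probs = next_token_distribution([0, 3, 1], params)
print([round(p, 3) for p in probs])  # one probability per vocabulary entry
```

The point of the sketch is structural: the output is always a probability distribution over the vocabulary, and the `weights` inside each head are the only place where the model decides which earlier tokens matter for the current position.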

Prompting of the generative AI tools

GPT-o1 in professional mode served as the primary tool for generating the key topics related to interpretability and clinical use. Consensus.app, an AI-assisted literature search engine, was used to retrieve peer-reviewed studies relevant to the model-generated key topics and to mitigate the risk of hallucinated citations. All prompts, GPT-o1 outputs, and Consensus.app search results are available in Supplementary Sections A and B.

Processing of the generative AI output

The exploratory analysis was conducted in three phases. First, we prompted GPT-o1 to propose key topics illustrating how theoretical insights into transformer architectures can support the interpretation of AI-generated text in medical scenarios and applications. We reviewed the output and extracted the relevant topics. To validate the output and literature references, we performed a literature search using Consensus.app, incorporating the GPT-o1–generated key topics into the search prompts. Two studies, one focusing on rehabilitation medicine (Zhang et al.) ²⁴ and another on clinical interview analysis using synthetic data (Wu et al.), ²⁵ were identified as especially illustrative and were included in the subsequent analysis. Finally, we compared the model-generated examples with the findings curated through Consensus.app and consolidated them into the final set of five key topics, presented in the Results section. All steps were documented to ensure transparency throughout the exploratory process.

Ethical considerations

This article does not involve research with human participants or animals. No ethical approval was required.

Results

Based on the initial prompt, GPT-o1 suggested the following key topics for which it provided example descriptions:

1. Anticipating contextual focus in medical reasoning.

2. Explaining “generic” or “textbook” responses.

3. Understanding strengths and weaknesses in differential diagnosis.

4. Explaining ambiguous or contradictory responses.

5. Identifying hallucinations in unfamiliar scenarios.

Based on the follow-up prompt, Consensus.app suggested examples for the key topics, and we thus investigated the work of Zhang et al. to address key topics (1), (2), and (5) (see Table 1). In their study, the authors employ a stroke case to evaluate the capability of LLMs in generating rehabilitation recommendations and ICF codes. We also investigated Wu et al., who describe how LLMs can facilitate clinical interview analysis through their “CALLM” framework, a method for AI-driven synthetic data augmentation. This investigation contributed to the analysis of key topics (3) and (4) (see Table 2).

Table 1. Key Topics (1), (2) and (5), described by GPT-o1, exemplified by Consensus.app.

GPT-o1 explanation for the key topic	Consensus.app explanation for the example	Quotes from the paper (Zhang et al.)	Synopsis
(1) Anticipating contextual focus in medical reasoning
“Transformers use attention mechanisms to “focus” on specific parts of the input sequence. In a medical dialogue, the model might concentrate heavily on a particular symptom, lab value, or part of a patient’s history. Understanding this can help clinicians and medical researchers appreciate why a model’s conclusion might hinge on early-mentioned or highly salient details rather than a broad synthesis of all information.”	“Understanding the architecture of transformer models, particularly the attention mechanism, can help anticipate how these models focus on specific parts of the input text when generating medical reasoning. For instance, in the context of rehabilitation medicine, ChatGPT-4 was able to generate comprehensive medical and rehabilitation prescriptions by focusing on relevant parts of the input case description, demonstrating its ability to prioritize contextually important information.”	“[the LLM] also demonstrated the ability to propose specified approaches for each rehabilitation therapy.”	LLMs can indeed prioritize specific characteristics of a case or patient, and complex scenarios can be addressed through the parallel operations of multiple attention heads. While this capability offers advantages in focusing on critical aspects, it may also have drawbacks, particularly when LLMs fail to correctly integrate and synthesize the resulting information.
(2) Explaining “generic” or “textbook” responses
“MLP layers in transformers combine information gathered by attention heads into higher-level abstractions. When operating on medical queries, these layers may rely on well-learned, “standardized” patterns from training data (e.g., common guidelines or textbook phrasing) instead of tailoring responses to unusual clinical nuances. Understanding the MLP’s integrating role explains why a model might revert to a generic standard-of-care response even when presented with a complex or unique patient scenario.”	“In the study on rehabilitation medicine, ChatGPT-4 produced broader and more general prescriptions that were consistent with textbook answers, indicating its reliance on learned generic medical knowledge.”	“Compared with standard answers, the large language model generated broader and more general prescriptions in terms of medical problems and management plans, rehabilitation problems and management plans, as well as rehabilitation goals.”	Referring to knowledge that is well-represented in the training data (short head knowledge) can result in “generic” or “textbook” responses, raising concerns, however, about their adequacy when addressing atypical cases and patients.
(5) Identifying hallucinations in unfamiliar scenarios
“Transformers are trained on patterns within a certain data distribution. When confronted with rare conditions, novel treatments, or unusual clinical contexts, the model’s learned patterns may not apply. Attention could be misdirected, and the MLP layers might produce “hallucinated” content because they have no solid internal representation for the out-of-distribution input.”	“[…] while ChatGPT-4 made an error in the ICF category, it accurately generated ICF codes, highlighting the model’s potential to hallucinate in less familiar contexts.”	“A thorough review of the standard clinical ICF code assigned by 2 PMR clinicians was then conducted, comparing it with the table produced by the GPT-4 model (Table II). The 3-digit codes generated by the LLM were accurate (…) However, an error was found when reviewing the case record in the body structures category (s730). The patient had had a stroke, and the original impairment should have been classified as affecting the right precentral gyrus (s110.1), as outlined in the case section. Instead, the table displayed the damage as being in “the upper extremity, left hand.””	LLM responses may exhibit hallucinations when referring to “long tail” knowledge that is not well-represented in the training data. This is hypothesized to be the case for the “body structures category”. Then again, the LLM-generated explanations in this table are not necessarily correct either. A simpler hypothesis regarding the LLM failure is that it did not know or did not consider that the “body structures category” is supposed to refer to the primary site of damage (the brain), not to the secondary site (the hand). Any lack of knowledge regarding the reporting of ICF categories may thus be attributed to insufficient training data regarding this meta-level information.

Table 2. Key Topics (3) and (4), described by GPT-o1, exemplified by Consensus.app.

GPT-o1 explanation for the key topic	Consensus.app explanation for the example	Quotes from the paper (Wu et al.)	Synopsis
(3) Understanding strengths and weaknesses in differential diagnosis
“Attention layers help identify connections between symptoms and conditions, while MLP layers synthesize these into coherent outputs. Knowing this pipeline is useful when the model suggests a differential diagnosis. If the model posits an unusual condition, it might be because it latched onto a distinctive symptom that strongly correlated with that condition in its training data—even if that condition is clinically improbable.”	“The strengths of transformer models in differential diagnosis can be attributed to their ability to synthesize information from diverse sources, while weaknesses may arise from their lack of real-world clinical experience. The CALLM framework, for example, enhances clinical interview analysis by generating synthetic data that can improve diagnostic accuracy, showcasing the model’s adaptability in learning from augmented datasets.”	“In automated mental health diagnosis, the scarcity and imbalance of clinical data pose considerable challenges for researchers, limiting the effectiveness of machine learning algorithms. To cope with this issue, this paper aims to introduce a novel clinical transcript data augmentation framework by leveraging large language models (CALLM). The framework follows a “patient-doctor role-playing” intuition to generate realistic synthetic data.”	A hypothesis about how LLMs handle differential diagnoses is that multi-head attention may be responsible for the matching of patient data to the sets of symptoms known for disease conditions, but this matching may ignore disease prevalence. Synthetic data may mitigate this weakness because researchers can generate examples following a data distribution they have under control, and provide these examples to the LLM.
(4) Explaining ambiguous or contradictory responses
“When the patient’s presentation is ambiguous or the prompts contain conflicting information, attention mechanisms may distribute focus across multiple, equally plausible interpretations. The MLP layers may fail to resolve these into a single, authoritative answer. Understanding this helps users interpret uncertain or oscillating responses as a reflection of the model’s internal struggle with ambiguity rather than mere randomness.”	“Ambiguities or contradictions in model outputs can often be traced back to the model’s training data or the inherent complexity of medical language. The CALLM framework’s use of a “Response-Reason” prompt engineering paradigm aims to generate diagnostically valuable transcripts, which can help mitigate such issues by providing clearer reasoning paths in the model’s responses.”	“Our “Response-Reason” prompting approach guides LLMs in generating highly authentic clinical interview transcripts for mental disorder diagnosis. This augmentation is tailored to enhance the training dataset, facilitating both FSL [Few-Shot-Learning] and, in certain cases, ZSL [Zero-Shot-Learning].” “This technique […] encouraged it to elucidate the rationale behind the responses, mirroring the profile and characteristics of a simulated patient.”	Contradictory responses can be attributed to ambiguous input, ambiguity within the training data, or ambiguity in the representation of knowledge by the trained model. Specialized prompting techniques may request that the reasoning path of the LLM is made more transparent, enhancing its reasoning capabilities along the way.

Key topic 1: Anticipating contextual focus in medical reasoning

Key topic (1) illustrates how the LLM’s attention mechanism enables it to identify and prioritize relevant information within a text, connecting specific parts of the text that may be far apart in the input stream. In fact, this “focus on specific parts” may be done multiple times in parallel by multiple attention heads and integrated across layers until the final output is generated. The attention mechanism thus allows the model to focus on critical elements – such as specific patient symptoms or lab values – while ignoring less pertinent data. GPT-o1 underscores that clinicians and medical researchers can benefit from understanding that the model might attend to highly salient details at the expense of a broader synthesis. This focused approach can be advantageous in ensuring the model’s output is closely aligned with key aspects of a query.

In the example retrieved by Consensus.app, Zhang et al. demonstrate how this capacity proves valuable in rehabilitation medicine, where ChatGPT-4 generated targeted intervention plans by focusing on the most relevant details of a patient’s presentation.

Key topic 2: Explaining “generic” or “textbook” responses

“Generic” or “textbook” responses arise from a model’s tendency to draw on widely represented knowledge learned from its training data (“short head knowledge”). GPT-o1 suggests that, when responding to medical queries, the model’s MLPs often rely on well-learned patterns, which can lead to standardized procedures being presented even in atypical clinical situations. This is echoed in the Consensus.app findings, indicating ChatGPT-4’s propensity to default to generalized medical knowledge.

Key topic 3: Understanding strengths and weaknesses in differential diagnosis

With regard to key topic (3), GPT-o1 suggests that models sometimes include improbable clinical differential diagnoses due to strong correlations detected between input and training data, irrespective of the prevalence of the diagnosed disease; such base rate neglect may or may not be helpful for an accurate diagnosis. ²⁶ Consensus.app mentions a lack of direct clinical experience of the models, which may refer to base rate neglect. In the example of Wu et al., ²⁵ accuracy was supposedly enhanced by data augmentation using synthetic data as part of the CALLM framework, allowing better differential diagnoses based on more balanced data. However, skeptics may argue that true clinical complexity is difficult to replicate through synthetic data, casting doubt on the broader applicability of AI-generated simulations for real-world clinical settings. Thus, there is an evident tension between synthetic data generation and the complexity of capturing clinical “real-world” scenarios. This also underscores the critical importance of robust validation requirements, ²⁷ particularly when significant decisions are to be made by an LLM.
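Base rate neglect can be made concrete with Bayes' rule. The sketch below compares two rankings of the same candidate diagnoses: one that considers only how well a distinctive symptom matches each diagnosis, and one that also weights the match by prevalence. All numbers and labels are hypothetical and chosen only to make the effect visible; they are not taken from Wu et al. or from any clinical source.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a closed set of candidate diagnoses."""
    joint = {dx: priors[dx] * likelihoods[dx] for dx in priors}
    total = sum(joint.values())
    return {dx: value / total for dx, value in joint.items()}


def rank(scores):
    """Candidates ordered from most to least probable."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical prevalence of each candidate in the relevant population.
prevalence = {"common condition": 0.10, "rare condition": 0.001, "neither": 0.899}
# Hypothetical P(distinctive symptom | diagnosis).
symptom_given_dx = {"common condition": 0.30, "rare condition": 0.90, "neither": 0.01}

# Base rate neglect: every diagnosis is treated as equally likely a priori,
# so the ranking reflects the symptom match alone.
match_only = posterior({dx: 1.0 for dx in prevalence}, symptom_given_dx)

# Base rates respected: the symptom match is weighted by prevalence.
with_base_rates = posterior(prevalence, symptom_given_dx)

print(rank(match_only)[0])                           # rare condition
print(rank(with_base_rates)[0])                      # common condition
print(round(with_base_rates["rare condition"], 3))   # 0.023
```

With these assumed numbers, the rare condition leads the list when only the symptom match is considered, yet its posterior probability is about 2% once prevalence enters the calculation. This is the pattern a reader should check for when a model proposes an unusual diagnosis on the strength of a single distinctive finding.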

Key topic 4: Explaining ambiguous or contradictory responses

In addressing key topic (4), GPT-o1 links ambiguous or contradictory outputs to unclear or insufficiently specific prompts, whereas Consensus.app posits that ambiguity within the model’s training data is a contributing factor. One solution involves carefully crafted prompts designed to elicit the model’s reasoning processes, thereby mitigating confusion. Here, the CALLM framework successfully employs a “Response-Reason” prompting strategy.
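One standard mechanism that can produce oscillating answers is sampling from the next-token distribution: when several continuations receive similar scores, repeated runs of the same prompt return different answers. The sketch below illustrates only this sampling step. It does not model attention heads or MLP layers, and the logits and answer labels are hypothetical values chosen to contrast a clear-cut case with an ambiguous one.

```python
import math
import random
from collections import Counter


def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy: 0 for a certain outcome, log2(n) for n equally likely ones."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sample_answers(answers, logits, n, seed=0, temperature=1.0):
    """Draw n answers, as if the same prompt were submitted n times."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return Counter(rng.choices(answers, weights=probs, k=n))


answers = ["answer A", "answer B", "answer C"]
clear_logits = [4.0, 0.5, 0.2]      # hypothetical: one interpretation dominates
ambiguous_logits = [1.0, 0.9, 1.1]  # hypothetical: three near-equal interpretations

for label, logits in [("clear", clear_logits), ("ambiguous", ambiguous_logits)]:
    probs = softmax(logits)
    counts = sample_answers(answers, logits, n=1000)
    print(label, round(entropy_bits(probs), 2), counts.most_common())
```

In the clear case, almost every run returns the same answer; in the ambiguous case, the three answers alternate with comparable frequency. Repeating a prompt several times and comparing the answers is therefore a cheap way to detect that the input leaves the model undecided.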

Key topic 5: Identifying hallucinations in unfamiliar scenarios

By contrast, “unfamiliar scenarios” engage a model’s “long tail” knowledge, where there is a heightened risk of hallucination because the queries may diverge significantly from what could be learned from the training set.

A case in point, discussed by Zhang et al., shows ChatGPT-4 successfully generating International Classification of Functioning (ICF) codes for a stroke patient but misreporting the lesion site. Specifically, the model accurately identified the motor dysfunction in the left hand but failed to report the lesion in the right precentral gyrus. Although the model recognizes that the patient’s motor function is impaired, it does not appear to understand that this impairment originates from disrupted motor signals in the brain. As a result, the model interprets the limitation as purely motor-related rather than addressing the underlying neurological cause. Alternatively, we suggest that it may lack the meta-knowledge that the lesion site to be reported here should refer to the underlying primary lesion, not its secondary consequences.

This distinction, however, is crucial: if a clinical decision-support system fails to report the specific neurological lesion, it may overlook critical rehabilitation strategies that are essential for effective patient care. Consequently, interventions might miss addressing the root cause, leading to slower or less effective patient recovery.

These examples highlight the importance of recognizing not only what a model can accomplish but also where gaps in its knowledge or reasoning may lead to clinically relevant inaccuracies.
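The short head versus long tail distinction behind key topics 2 and 5 can be illustrated with a deliberately simple frequency-based predictor. This is an analogy, not a description of how a transformer stores knowledge: the predictor counts answers per context and, for a context it has never seen, backs off to the globally most frequent answer. The training pairs, labels and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Count how often each answer follows each context, and how often overall."""
    by_context = defaultdict(Counter)
    overall = Counter()
    for context, answer in pairs:
        by_context[context][answer] += 1
        overall[answer] += 1
    return by_context, overall


def predict(context, by_context, overall):
    """Return (answer, support), where support is the number of matching training examples."""
    counts = by_context.get(context)
    if counts:
        return counts.most_common(1)[0]
    # Unseen context: back off to the globally most frequent answer.
    answer, _ = overall.most_common(1)[0]
    return answer, 0


# Hypothetical training data with a short head and a long tail.
training = (
    [("common presentation", "standard plan")] * 40
    + [("less common presentation", "alternative plan")] * 9
    + [("rare presentation", "specialized plan")] * 1
)
by_context, overall = train(training)

for query in ["common presentation", "rare presentation", "unseen presentation"]:
    print(query, "->", predict(query, by_context, overall))
```

The answer for the unseen presentation reads exactly like the well-supported answer for the common one, although nothing in the training data backs it. The output text alone does not reveal the difference; only the support count does, and this is information a user of an LLM typically has to estimate from their own knowledge of how well a topic is covered in the literature.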

Discussion

The implementation of AI, particularly LLMs, in healthcare will drive transformative changes in medical practice and theory while presenting significant challenges. A thorough understanding of the key ingredients of LLMs, based on their underlying architecture, including attention mechanisms and MLPs, can be particularly useful in situations where the model’s outputs are unexpected, ambiguous, or counterintuitive and therefore call for critical analysis, but also in routine cases where seemingly appropriate recommendations may invite automation bias.

Theoretical understanding of how LLMs process and generate information provides a conceptual framework for interpreting their outputs. ²⁸ It allows clinicians and researchers to anticipate how attention mechanisms determine contextual focus, how probabilistic prediction can lead to overly generic or “textbook” responses, and why models occasionally generate contradictory or fabricated information. This type of literacy enables users to distinguish between the model’s apparent confidence and fluency and the actual reliability of its reasoning. Understanding the model’s architecture thus directly supports critical and ethical evaluation of AI-generated content in clinical contexts.

The five key topics presented here contribute to the responsible use of large language models (LLMs) in medicine. Anticipating the contextual focus in medical reasoning (key topic 1) helps users understand why certain information is extracted and prioritized by the LLM, while other aspects are neglected. Awareness of the tendency of LLMs to generate generic responses (key topic 2) draws attention to the risk of overreliance on common patterns, which may not always apply to complex, rare, or atypical cases. LLMs can quickly generate potential differential diagnoses but often show weaknesses, for instance, in aligning these diagnoses with specific epidemiological contexts. Understanding such strengths and weaknesses (key topic 3) is crucial for assessing the validity and evidential value of LLM outputs. At times, LLMs tend to produce ambiguous or contradictory responses. Identifying these limitations (key topic 4) encourages critical prompt design and iterative clarification. Finally, recognizing hallucinations in unfamiliar scenarios (key topic 5) highlights the importance of verifying outputs when the model encounters data outside its training distribution. Together, these insights show that theoretical knowledge can serve as a safeguard, helping users interpret and evaluate LLM outputs more systematically.

This interpretive literacy also encompasses an ethical responsibility. Understanding and identifying where LLMs fail due to inherent limitations empowers more autonomous engagement with LLMs (avoiding so-called “computer paternalism”) and helps to safeguard patient safety. ^{29,30} It also helps to see where such failures may introduce or amplify biases, enabling clinicians to better judge when model outputs risk marginalizing minority groups or reinforcing existing prejudices. ³¹ The ethical use of LLMs should therefore not be defined solely by regulations and guidelines but is equally determined by the user’s competence in engaging with this technology. Such competence includes the ability to discern biases and misinformation that may be concealed within outputs that appear neutral and confident. ³²

Several limitations of this work should be acknowledged. The examples analyzed were generated and selected using AI-based tools and therefore represent only a limited sample of possible cases. Although the use of GPT-o1 and Consensus.app provided a consistent exploratory framework, it may also have introduced biases related to model behavior and source retrieval. Furthermore, this study did not include feedback by external clinicians, which may further limit the generalizability of the findings. Nevertheless, by instructing LLMs and AI-assisted literature search tools to reflect on the requirements of their use in medical contexts, we applied a methodology that can serve as a powerful approach to studying AI reasoning in medicine.

Future research should build on this methodology through collaborative designs that integrate theoretical analysis with empirical testing in clinical and educational settings. Combining model introspection with user evaluation may help to further study how theoretical understanding enhances AI-literacy. Moreover, developing educational frameworks and training modules on LLM interpretability could further strengthen the competencies of students and clinicians to use AI responsibly.

The rapid evolution of LLMs makes it increasingly challenging to continuously adapt our understanding of their complexity. Nevertheless, it is essential to meet this development with a comprehensive theoretical, practical, and ethical skill set. These competencies will enable healthcare professionals to approach model outputs not as unquestionable truths but as context-dependent and probabilistic responses that warrant critical examination, thereby remaining vigilant regarding their limitations. ^{7,33} Such a mindset fosters a reflexive awareness of how AI systems interact with clinical reasoning and decision-making, helping to preserve space for professional judgement, patient values, and situational nuance within emerging forms of human–AI collaboration. ^{34,35} This will help maintain human oversight and ethical accountability even as models become more powerful. Ultimately, bridging medical expertise with theoretical and ethical knowledge will be necessary to ensure that AI contributes to, rather than undermines, the integrity of clinical practice.

Conclusions

The examples analyzed in this study demonstrate that LLMs hold transformative potential in healthcare, and theoretical knowledge of their architecture and mechanisms is important for interpreting their outputs responsibly. Understanding the balance between short head and long tail knowledge, recognizing generic responses, and identifying hallucinations are important skills. Developing these competencies can support healthcare professionals in using LLMs more effectively and safely, enabling them to integrate AI technologies while maintaining appropriate oversight, mitigating risks and ensuring ethical standards.

Data availability

Figshare. Extended Data. https://doi.org/10.6084/m9.figshare.31493524. ³⁶ This project contains the following extended data: Extended Data (All prompts, GPT-o1 outputs, and Consensus.app search results, in Supplementary Sections A and B.) Data is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

A preliminary version of this work was made available as a preprint at https://www.preprints.org/manuscript/202510.0630/v1. ¹ The supplementary material is available at https://doi.org/10.6084/m9.figshare.31493524.

References

1. Fuellen, Jarchow, Põder J-C: Understanding the Inner Workings of Large Language Models in Medicine. Preprints. 2025. doi: 10.20944/preprints202510.0630.v1

2. Thirunavukarasu, Ting DSJ, Elangovan: Large language models in medicine. Nat. Med. 2023;29(8):1930–1940. PMID: 37460753. doi: 10.1038/s41591-023-02448-8

3. Clusmann, Kolbinger, Muti: The future landscape of large language models in medicine. Commun. Med. 2023;3. doi: 10.1038/s43856-023-00370-1

4. Liu, Wang, Liu: Utility of ChatGPT in Clinical Practice. J. Med. Internet Res. 2023;25:e48568. PMID: 37379067. doi: 10.2196/48568. PMCID: PMC10365580

5. Wang, Wan: Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J. Med. Internet Res. 2024;26:e22769. PMID: 39509695. doi: 10.2196/22769. PMCID: PMC11582494

6. Jung K-H: Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthc Inform Res. 2025;31(2):114–124. PMID: 40384063. doi: 10.4258/hir.2025.31.2.114. PMCID: PMC12086438

7. Klang, Tessler, Freeman: If Machines Exceed Us: Health Care at an Inflection Point. NEJM AI. 2024;1. doi: 10.1056/AIp2400559

8. Ong JCL, Chang SY-H, William: Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. 2024;6(6):e428–e432. doi: 10.1016/S2589-7500(24)00061-X

9. Bouderhem: Shaping the future of AI in healthcare through ethics and governance. Humanit Soc Sci Commun. 2024;11(1):416. doi: 10.1057/s41599-024-02894-w

10. Goddard, Roudsari, Wyatt: Automation bias: Empirical results assessing influencing factors. Int. J. Med. Inform. 2014;83(5):368–375. PMID: 24581700. doi: 10.1016/j.ijmedinf.2014.01.001

11. Vaccaro, Almaatouq, Malone: When combinations of humans and AI are useful: A systematic review and meta-analysis. Nat. Hum. Behav. 2024;8(12):2293–2303. PMID: 39468277. doi: 10.1038/s41562-024-02024-1. PMCID: PMC11659167

12. Abdelwanis, Alarafati, Tammam MMS: Exploring the risks of automation bias in healthcare artificial intelligence applications: A Bowtie analysis. J Saf Sci Resil. 2024;5(4):460–469. doi: 10.1016/j.jnlssr.2024.06.001

13. Ranji: Large Language Models—Misdiagnosing Diagnostic Excellence? JAMA Netw. Open. 2024;7(10):e2440901. doi: 10.1001/jamanetworkopen.2024.40901

14. Ang C-S: Developing AI literacy in healthcare education: bridging the gap in competency assessment. Discov. Educ. 2025;4(1):372. doi: 10.1007/s44217-025-00812-z

15. Gazquez-Garcia, Sánchez-Bocanegra, Sevillano: AI in the Health Sector: Systematic Review of Key Skills for Future Health Professionals. JMIR Med Educ. 2025;11:e58161. PMID: 39912237. doi: 10.2196/58161. PMCID: PMC11822726

16. Ahsan: Integrating artificial intelligence into medical education: a narrative systematic review of current applications, challenges, and future directions. BMC Med. Educ. 2025;25(1):1187. PMID: 40849650. doi: 10.1186/s12909-025-07744-0. PMCID: PMC12374307

17. Ong JCL, Chang SY-H, William: Medical Ethics of Large Language Models in Medicine. NEJM AI. 2024;1(7):AIra2400038. doi: 10.1056/AIra2400038

18. Põder J-C, Helgesson: Ethical Aspects of Generative AI in Medicine. In: Hoffmann, Bansal, editors. AI Ethics in Practice: Navigating Academic Insight, Managerial Expertise, and Philosophical Inquiry. Springer Nature Switzerland; 2025:139–162. doi: 10.1007/978-3-031-87023-1_12

19. McCoy, Ci Ng, Sauer: Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review. BMC Med. Educ. 2024;24(1):1096. doi: 10.1186/s12909-024-06048-z

20. Wang, Zhang: Large language models in medical and healthcare fields: applications, advances, and challenges. Artif. Intell. Rev. 2024;57(11):299. doi: 10.1007/s10462-024-10921-0

21. Xiao, Zhou, Liu: A comprehensive survey of large language models and multimodal large language models in medicine. Inf. Fusion. 2025;117:102888. doi: 10.1016/j.inffus.2024.102888

22. Vaswani, Shazeer, Parmar: Attention is all you need. In: Guyon, von Luxburg, Bengio, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017; Vol. 30.

23. Zheng, Wang, Huang: Attention heads of large language models. Patterns. 2025;6(2):101176. PMID: 40041856. doi: 10.1016/j.patter.2025.101176. PMCID: PMC11873009

24. Zhang, Tashiro, Mukaino: Use of artificial intelligence large language models as a clinical tool in rehabilitation medicine: a comparative test case. J. Rehabil. Med. 2023;55:jrm13373. PMID: 37691497. doi: 10.2340/jrm.v55.13373. PMCID: PMC10501385

25. Mao, Zhang: CALLM: Enhancing Clinical Interview Analysis Through Data Augmentation With Large Language Models. IEEE J. Biomed. Health Inform. 2024;28(12):7531–7542. doi: 10.1109/JBHI.2024.3435085

26. Hamm: Physicians neglect base rates, and it matters. Behav. Brain Sci. 1996;19(1):25–26. doi: 10.1017/S0140525X00041261

27. Fuellen, Kulaga, Lobentanzer: Validation requirements for AI-based intervention-evaluation in aging and longevity research and practice. Ageing Res. Rev. 2025;104:102617. PMID: 39643211. doi: 10.1016/j.arr.2024.102617

28. Mesinovic, Watkinson, Zhu: Explainability in the age of large language models for healthcare. Commun Eng. 2025;4(1):128. PMID: 40676176. doi: 10.1038/s44172-025-00453-y. PMCID: PMC12271443

29. Kühler: Exploring the phenomenon and ethical issues of AI paternalism in health apps. Bioethics. 2022;36(2):194–200. PMID: 34031908. doi: 10.1111/bioe.12886

30. Heyen, Salloch: The ethics of machine learning-based clinical decision support: an analysis through the lens of professionalisation theory. BMC Med. Ethics. 2021;22(1):112. PMID: 34412649. doi: 10.1186/s12910-021-00679-3. PMCID: PMC8375118

31. Mahajan, Obermeyer, Daneshjou: Cognitive bias in clinical large language models. NPJ Digit Med. 2025;8(1):428. doi: 10.1038/s41746-025-01790-0

32. Ning, Liu: Advancing ethical AI in healthcare through interpretability. Patterns. 2025;6(6):101290. doi: 10.1016/j.patter.2025.101290

33. Tun, Rahman, Naing: Trust in Artificial Intelligence–Based Clinical Decision Support Systems Among Health Care Workers: Systematic Review. J. Med. Internet Res. 2025;27:e69678. doi: 10.2196/69678

34. McDougall: Computer knows best? The need for value-flexibility in medical AI. J. Med. Ethics. 2019;45(3):156. doi: 10.1136/medethics-2018-105118

35. Sokol, Fackler, Vogt: Artificial intelligence should genuinely support clinical reasoning and decision making to bridge the translational gap. NPJ Digit Med. 2025;8(1):345. PMID: 40494886. doi: 10.1038/s41746-025-01725-9. PMCID: PMC12152152

36. Fuellen, Jarchow, Põder J-C: Extended Data for: Understanding the Inner Workings of Large Language Models in Medicine. Figshare. 2026. doi: 10.6084/m9.figshare.31493524

Reviewer response for version 1

Craig S Webster, Referee, The University of Auckland, Auckland, New Zealand. https://orcid.org/0000-0002-6997-4263

DOI: 10.5256/f1000research.197293.r487509

Competing interests: No competing interests were disclosed.

4 June 2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approve with reservations

This is an interesting paper that I enjoyed reading. It makes a number of key distinctions which I think are in fact important for the users of LLMs to better understand when they should be trusted and when users need to be more critical. This is an important and practical concern for clinicians using AI-based tools in healthcare, and so from this perspective I think the paper has merit. I like the short-head and long-tail distinction and how this relates to the risk of hallucination – this is of real interest to clinicians who are not technical experts in LLM technology, but need to know how best to use them. The five key topics will also be of interest to clinicians. However, these positives aside, there are a number of areas where the language in the paper needs to be tightened, as below.

Page 3, 2nd paragraph: you mention that LLMs may on occasion work best without human intervention. I think some unpacking of what you mean by this is needed, as most authorities believe that AI use in healthcare must always be supervised by humans. Humans are the ultimate sense makers and decision makers, since the AI has no actual understanding of the tasks it performs or their consequences.

Page 4: You mention synthetic data augmentation to fill gaps in training data sets. It was unclear to me what you meant by synthetic data – do you mean exemplar data made up by humans, or data generated by other AI systems? Synthetic data generated by other AI systems has been shown to reduce the performance of LLMs, even to the extent of so-called model collapse, and even with surprisingly small amounts of synthetic data in the training data set. Hence, I think you need to be clearer about what you mean by synthetic data, and also to explain how you might avoid performance decline if you are using such data, or at least mention the risks. See: Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian and Julia Kempe. Strong Model Collapse, arXiv, 2024. https://arxiv.org/abs/2410.04840

One of the central problems of asking an LLM to explain itself, or to “introspect”, is that all LLM responses are simply text outputs designed to have the highest probability of being correct given the training data and the prompt. Hence the LLM makes no distinction between an “introspective” prompt and one making a more general inquiry. LLMs have no sense of self, or awareness of what they are doing, and hence cannot actually introspect in the way we understand it. Often an LLM will give you the textbook answer to some task it just performed because that method was in its training data, but when you inspect the activity of the neural network itself using something like a mechanistic interpretability approach, you find that what it actually did was nothing like what it just claimed. It may claim some tidy, logical method, but what is actually going on inside the network is typically highly complicated, probabilistic, and essentially unintelligible – it just happens to get the right answer most of the time. See: On the Biology of a Large Language Model, https://transformer-circuits.pub/2025/attribution-graphs/biology.html

I think you need to mention this gap between what the LLM claims it is doing, and what is actually going on in the network.

Page 6, key topic 5: I think this is a very interesting discussion, and it underscores the point I was making in my previous comment – the LLM has no understanding of what it is doing. More critically for the use of LLMs in medicine, the rules that the LLM has extracted from the training data through statistical inference are not equivalent to evidence-based medicine or the causal theories of disease! And this is a key point that many clinicians do not appreciate. This is why it doesn’t make the connection between symptoms and underlying causes: it has only a probabilistic or correlational model, not a causal one. I think you need to be very careful about using words like “understand”, “knowledge”, “reasoning” and indeed “meta-knowledge” when describing what the LLM is doing – as technically, it has none of these things (although it may appear to have them). For a discussion of these key distinctions, which are highly relevant to medicine, see: Webster, C. S. (2025). Natural and artificial intelligence – the psychotechnical agenda of the 21st century. Journal of Psychology and AI, 1(1). https://doi.org/10.1080/29974100.2025.2491445

Page 8, 5th paragraph: You make a claim about the rapid evolution of LLMs making it hard to understand their inner workings. Actually, the underlying technology of deep learning models hasn’t changed much in years; what we have seen recently is a scaling up of this technology. Whether it continues to scale up is a matter of debate, although recent evidence suggests that the performance of these models is plateauing. However, given that the underlying technology of deep learning remains the same in all these models, the aim of your paper remains relevant despite the results of scaling up, and I think you should make this point in the discussion.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Neuropsychology, clinical education, system redesign, artificial intelligence

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer response for version 1

Yavuz Selim Kıyak, Referee, Gazi University, Ankara, Turkey. https://orcid.org/0000-0002-5026-3234

DOI: 10.5256/f1000research.197293.r482232

Competing interests: No competing interests were disclosed.

18 May 2026

Recommendation: reject

The manuscript focuses on an important possible use of LLMs: it asks whether theoretical knowledge of LLM architecture can help doctors better interpret AI-generated outputs. However, the current version raises several methodological concerns.

The main methodological concern is the self-referential design. The authors used GPT-o1 to generate examples of how understanding LLMs can help evaluate LLM outputs. This leads to a circularity problem: the tool being examined also becomes a source of evidence for the claims being developed. This might be seen as acceptable for exploratory idea generation, but it is not sufficiently strong for a paper.

Another concern is that the contribution is overclaimed. The introduction states that the study addresses a gap in empirical work, but the current approach in the manuscript does not include clinicians, users, performance outcomes, comparison groups, or observed decision-making. Therefore, the study cannot show that theoretical knowledge actually improves clinicians’ ability to evaluate LLM outputs.

The authors also used Consensus.app as a validation tool. This is another important concern. An AI-assisted literature search platform can help identify relevant papers, but it does not itself validate GPT-generated themes. Another concern is the limited and selective evidence base. Only a small number of “especially illustrative” studies are used to support the final themes. The language of the introduction also needs to be softened. Terms such as “validated” signal strong evidence.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Are the conclusions drawn adequately supported by the results?

Are sufficient details of methods and analysis provided to allow replication by others?

Reviewer Expertise:

medical education, large language models

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.