Understanding the Inner Workings of Large Language Models in Medicine

Georg Fuellen; Hans Jarchow; Johann-Christian Põder

doi:10.12688/f1000research.178855.2

Brief Report

Revised

[version 2; peer review: 1 approved with reservations, 1 not approved]

Georg Fuellen ¹, Hans Jarchow¹, Johann-Christian Põder²

PUBLISHED 01 Jul 2026

Author details

¹ Institute for Biostatistics and Informatics in Medicine and Ageing Research, Rostock University Medical Center, Rostock, Germany
² Faculty of Theology, University of Rostock, Rostock, Germany

Georg Fuellen
Roles: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Supervision, Validation, Writing – Original Draft Preparation, Writing – Review & Editing

Hans Jarchow
Roles: Data Curation, Formal Analysis, Investigation, Methodology, Writing – Original Draft Preparation, Writing – Review & Editing

Johann-Christian Põder
Roles: Investigation, Methodology, Writing – Review & Editing

This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Bioinformatics gateway.

This article is included in the AI in Medicine and Healthcare collection.

Abstract

Background

Large language models (LLMs) are increasingly influencing medical practice, education, and research. Their responsible integration into healthcare requires expertise in medical, ethical, practical, and theoretical domains.

Methods

We prompted GPT-o1 to generate examples illustrating how understanding transformer architecture can facilitate output interpretation. Key topics were extracted from its responses, and illustrative cases were identified and cross-checked against the peer-reviewed literature using Consensus.app, an AI-assisted literature-search tool.

Results

Five key topics were identified: (1) anticipating contextual focus in medical reasoning, (2) explaining “generic” or “textbook” responses, (3) understanding strengths and weaknesses in differential diagnosis, (4) explaining ambiguous or contradictory responses, and (5) identifying hallucinations in unfamiliar scenarios. Case examples highlight both benefits and limitations, including accurate attention to salient clinical details, reliance on generalized patterns, risks of base rate neglect in differential diagnosis, challenges of ambiguous prompts, and hallucinations in rare or underrepresented cases.

Conclusions

A theoretical understanding of LLMs is crucial for responsible clinical integration. Distinguishing between well-represented (short head) and underrepresented (long tail) knowledge, recognizing generic responses, and identifying hallucinations are essential competencies. Coupled with medical and ethical expertise, these skills will enable healthcare professionals to leverage LLMs effectively while mitigating risks.

Keywords

Natural Language Processing, Large Language Models, Medical Informatics, Ethics, Medical, Clinical Decision-Making

Corresponding author: Georg Fuellen

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2026 Fuellen G et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Fuellen G, Jarchow H and Põder JC. Understanding the Inner Workings of Large Language Models in Medicine [version 2; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2026, 15:669 (https://doi.org/10.12688/f1000research.178855.2) First published: 05 May 2026, 15:669 (https://doi.org/10.12688/f1000research.178855.1) Latest published: 01 Jul 2026, 15:669 (https://doi.org/10.12688/f1000research.178855.2)

Amendments from Version 1

This revised version reframes the Brief Report as an exploratory, hypothesis-generating study rather than as empirical evidence that theoretical knowledge improves clinicians' performance. The manuscript language has been revised throughout to ensure that interpretations and conclusions are appropriately qualified.
References to "validation" have been replaced with "cross-checking" or "illustration" where appropriate. The role of Consensus.app is now described more precisely as facilitating the retrieval of corroborating literature rather than validating model-generated themes.
Additional caveats have been added to clarify that large language models cannot genuinely introspect and that mentalistic terms (e.g., "knowledge," "reasoning," and "meta-knowledge"), where retained, are used in a qualified sense to describe the outputs of statistical and correlational processes. The manuscript now explicitly notes that learned correlations do not necessarily reflect causal relationships or evidence-based medicine.
The description of automated analysis has been clarified, including the meaning of "without human intervention" and the definition of "synthetic data." The potential risk of model collapse and the pace of model evolution are discussed in a more nuanced way.
The reference list has been updated to include additional supporting literature. No changes were made to the author list, figures, tables, supplementary materials, underlying data, or code availability statements.

Introduction

Background: Generative AI, and large language models (LLMs) in particular, are reshaping contemporary medical theory, practice, and education.^2–5 This development profoundly influences the future of healthcare as a system and of its stakeholders. Alongside recognition of the strengths of AI, important questions have emerged about its accuracy and about its ethical implications in real-world clinical use.⁶ The medical profession must shape the adoption of AI despite substantial uncertainties about future opportunities and risks.^7–9

As LLM-generated content continues to improve, there is a growing risk that decreasing error rates could make us less cautious in vetting LLM outputs. This problematic development (“automation bias”) can be further reinforced by evidence that, on certain benchmark tasks, unaided LLMs can match or even exceed the accuracy of human–AI teams.^10–15 We stress that this is a narrow, task-specific performance observation and not a normative claim about clinical autonomy: in healthcare, LLM outputs must remain under human supervision, because the model has no human-like understanding of the tasks it performs or of their consequences, and the clinician remains the responsible sense-maker and decision-maker. To harness the advantages of LLMs sensibly and to integrate AI responsibly into healthcare, we must strive for a profound understanding of these technologies. It is thus crucial to develop and maintain expertise and skills in four key areas^16–20:

1. Medical expertise: To be able to critically evaluate the validity of LLM-generated output.
2. Ethics expertise: To identify potential risks and to address instances of ethical dilemmas and violations of medical-ethical norms.
3. Practical knowledge: Familiarity with LLMs, gained through informed and critical use.
4. Theoretical knowledge: Knowledge of how LLMs operate, which allows for a more nuanced evaluation of LLM-generated content.

Our contribution: The focus of this paper is on the fourth skill: understanding the inner workings of LLMs in order to better interpret the content they generate. One example of this skill is discerning whether generated content reflects commonly available information that is well represented in the model’s training data (short head knowledge) or refers to underrepresented knowledge (long tail knowledge), which may cause erroneous responses. These issues are at best mentioned in passing in recent reviews.^21–23 Despite growing interest in interpretability and transparency, there is limited work exploring how a theoretical understanding of LLM mechanisms might enhance clinicians’ ability to evaluate model outputs. Instead of testing clinicians directly, this study uses a self-referential, hypothesis-generating design to identify and illustrate concrete situations in which such theoretical understanding is likely to be interpretively useful, thereby motivating the empirical, systematic studies that would be needed to demonstrate its benefits. Specifically, we prompted OpenAI’s GPT-o1 to generate examples illustrating how theoretical insights into transformer architecture can facilitate interpretation of medical outputs. Key topics were extracted from its responses and cross-checked against the peer-reviewed literature using Consensus.app, an AI-assisted literature search platform; importantly, this step served to locate corroborating or illustrative studies, not to confirm the validity of the model-generated topics. By examining these examples, we aim to clarify specific situations in which theoretical knowledge of LLMs may provide tangible benefits for responsible and informed clinical use.

Objectives

Overall, we aim to explore how theoretical knowledge of LLMs, specifically their architecture and internal mechanisms, can support the interpretation of model-generated content in medical contexts. In addition, we delineate concrete examples and situational characteristics in which such theoretical understanding provides the most interpretive benefit. We do not aim to measure clinician performance or to establish empirically that theoretical knowledge improves output evaluation; rather, we aim to generate and illustrate hypotheses that future user-based studies can test.

Methods

Overview

We conducted an exploratory analysis using AI-assisted tools in a form of model “introspection”. This term is used here as a procedural shorthand for prompting the model to generate explanations of its own outputs: an LLM does not have direct access to its own internal computations, and its explanations of its outputs should not be treated as genuine introspective reports. Mechanistic-interpretability work has shown that the procedure a model reports can diverge substantially from the computation it actually performs.²⁴ We therefore treat the model’s outputs throughout as generated hypotheses to be examined critically by the human authors. An LLM was employed to generate key examples (key topics) that illustrate how theoretical knowledge of transformer architectures can support the interpretation and evaluation of AI-generated outputs in medical contexts. These key topics were then examined and substantiated using an AI-assisted literature search tool.

Large Language Models (LLMs)

LLMs are neural-network-based text processing tools composed of multi-head self-attention layers and multi-layer perceptrons (MLPs), trained on large volumes of text to predict the next token (roughly, the next word or word fragment). During processing, multi-head attention allows the model to maintain several parallel foci on specific parts of the input, potentially connecting dispersed fragments of text, while the MLPs further integrate and transform these representations.^25,26 After a pipeline of attention and MLP layers, the next token is predicted. Throughout this paper, terms such as “attention”, “focus”, “knowledge”, “reasoning”, and “meta-knowledge” are used only in a careful and qualified sense. They do not attribute human-like mental states, beliefs, intentionality, or clinical responsibility to the model. Natural and artificial intelligence must be distinguished, however,²⁷ and here we refer to the functional role of statistically generated outputs within distributed, hybrid, and human-interpreted epistemic practices. We avoid attributing literal “understanding” to LLMs: the model does not possess understanding or causal models of disease in the human sense, and the regularities it extracts by statistical inference cannot be equated with practicing evidence-based medicine or applying causal theories of disease. As of mid-December 2024, several LLMs were available; OpenAI’s GPT-o1 (professional mode) was chosen for this study because of its strong generative capabilities. Since this model lacked built-in web search features, its references were often hallucinated, which made an independent cross-checking step necessary.
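The attention step described above can be made concrete with a minimal sketch of single-head scaled dot-product attention, the standard building block of transformer layers. This is an illustration only: the token list, the two-dimensional vectors, and the function names `softmax` and `attention` are hypothetical and are not taken from GPT-o1 or from any model discussed in this paper. Real models use learned, high-dimensional projections and many heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Hypothetical toy embeddings for three input tokens (assumption, not real data).
tokens = ["fever", "since", "travel"]
keys = [[1.0, 0.0], [0.1, 0.1], [0.0, 1.0]]
values = keys
query = [2.0, 0.0]  # a query vector that resembles the "fever" key

weights, output = attention(query, keys, values)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.2f}")
```

In this toy setting the largest weight falls on the token whose key is most similar to the query, which is the sense in which a head "focuses" on part of the input. Note that such weights describe where information is drawn from; they are not by themselves an explanation of the model's final output.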

Prompting of the generative AI tools

GPT-o1 in professional mode served as the primary tool for generating the key topics related to interpretability and clinical use. Consensus.app, an AI-assisted literature search engine, was used to retrieve peer-reviewed studies relevant to the model-generated key topics and to mitigate the risk of hallucinated citations. All prompts, GPT-o1 outputs, and Consensus.app search results are available in Supplementary Sections A and B.

Processing of the generative AI output

The exploratory analysis was conducted in three phases. First, we prompted GPT-o1 to propose key topics illustrating how theoretical insights into transformer architectures can support the interpretation of AI-generated text in medical scenarios and applications. We reviewed the output and extracted the relevant topics. Second, to cross-check the outputs and to replace potentially hallucinated references with verifiable ones, we performed a literature search using Consensus.app, incorporating the GPT-o1–generated key topics into the search prompts. From the retrieved results we selected two studies as especially illustrative: one focusing on rehabilitation medicine (Zhang et al.)²⁸ and another on clinical interview analysis using synthetic data (Wu et al.).²⁹ This selection was non-systematic and intended to exemplify, rather than to comprehensively or quantitatively test, the key topics. Finally, we compared the model-generated examples with the findings curated through Consensus.app and consolidated them into the final set of five key topics, presented in the Results section. All steps were documented to ensure transparency throughout the exploratory process.

Ethical considerations

This article does not involve research with human participants or animals. No ethical approval was required.

Results

Based on the initial prompt, GPT-o1 suggested the following key topics for which it provided example descriptions:

1. Anticipating contextual focus in medical reasoning.
2. Explaining “generic” or “textbook” responses.
3. Understanding strengths and weaknesses in differential diagnosis.
4. Explaining ambiguous or contradictory responses.
5. Identifying hallucinations in unfamiliar scenarios.

Based on the follow-up prompt, Consensus.app suggested examples for the key topics, and we thus investigated the work of Zhang et al. to address key topics (1), (2), and (5) (see Table 1). In their study, the authors employ a stroke case to evaluate the capability of LLMs in generating rehabilitation recommendations and ICF codes. We also investigated Wu et al., who describe how LLMs can facilitate clinical interview analysis through their “CALLM” framework, a method for AI-driven synthetic data augmentation. This investigation contributed to the analysis of key topics (3) and (4) (see Table 2).

Table 1. Key Topics (1), (2) and (5), described by GPT-o1, exemplified by Consensus.app.

GPT-o1 explanation for the key topic	Consensus.app explanation for the example	Quotes from the paper (Zhang et al.)	Synopsis
(1) Anticipating contextual focus in medical reasoning
“Transformers use attention mechanisms to “focus” on specific parts of the input sequence. In a medical dialogue, the model might concentrate heavily on a particular symptom, lab value, or part of a patient’s history. Understanding this can help clinicians and medical researchers appreciate why a model’s conclusion might hinge on early-mentioned or highly salient details rather than a broad synthesis of all information.”	“Understanding the architecture of transformer models, particularly the attention mechanism, can help anticipate how these models focus on specific parts of the input text when generating medical reasoning. For instance, in the context of rehabilitation medicine, ChatGPT-4 was able to generate comprehensive medical and rehabilitation prescriptions by focusing on relevant parts of the input case description, demonstrating its ability to prioritize contextually important information.”	“[the LLM] also demonstrated the ability to propose specified approaches for each rehabilitation therapy.”	LLMs can indeed prioritize specific characteristics of a case or patient, and complex scenarios can be addressed through the parallel operations of multiple attention heads. While this capability offers advantages in focusing on critical aspects, it may also have drawbacks, particularly when LLMs fail to correctly integrate and synthesize the resulting information.
(2) Explaining “generic” or “textbook” responses
“MLP layers in transformers combine information gathered by attention heads into higher-level abstractions. When operating on medical queries, these layers may rely on well-learned, “standardized” patterns from training data (e.g., common guidelines or textbook phrasing) instead of tailoring responses to unusual clinical nuances. Understanding the MLP’s integrating role explains why a model might revert to a generic standard-of-care response even when presented with a complex or unique patient scenario.”	“In the study on rehabilitation medicine, ChatGPT-4 produced broader and more general prescriptions that were consistent with textbook answers, indicating its reliance on learned generic medical knowledge.”	“Compared with standard answers, the large language model generated broader and more general prescriptions in terms of medical problems and management plans, rehabilitation problems and management plans, as well as rehabilitation goals.”	Referring to knowledge that is well-represented in the training data (short head knowledge) can result in “generic” or “textbook” responses, raising concerns, however, about their adequacy when addressing atypical cases and patients.
(5) Identifying hallucinations in unfamiliar scenarios
“Transformers are trained on patterns within a certain data distribution. When confronted with rare conditions, novel treatments, or unusual clinical contexts, the model’s learned patterns may not apply. Attention could be misdirected, and the MLP layers might produce “hallucinated” content because they have no solid internal representation for the out-of-distribution input.”	“[…] while ChatGPT-4 made an error in the ICF category, it accurately generated ICF codes, highlighting the model’s potential to hallucinate in less familiar contexts.”	“A thorough review of the standard clinical ICF code assigned by 2 PMR clinicians was then conducted, comparing it with the table produced by the GPT-4 model (Table II). The 3-digit codes generated by the LLM were accurate (…) However, an error was found when reviewing the case record in the body structures category (s730). The patient had had a stroke, and the original impairment should have been classified as affecting the right precentral gyrus (s110.1), as outlined in the case section. Instead, the table displayed the damage as being in “the upper extremity, left hand.””	LLM responses may exhibit hallucinations when referring to “long tail” knowledge that is not well-represented in the training data. This is hypothesized to be the case for the “body structures category”. Then again, the LLM-generated explanations in this table are not necessarily correct either. A simpler hypothesis regarding the LLM failure is that it did not know or did not consider that the “body structures category” is supposed to refer to the primary site of damage (the brain), not to the secondary site (the hand). Any lack of knowledge regarding the reporting of ICF categories may thus be attributed to insufficient training data regarding this meta-level information.

Table 2. Key Topics (3) and (4), described by GPT-o1, exemplified by Consensus.app.

GPT-o1 explanation for the key topic	Consensus.app explanation for the example	Quotes from the paper (Wu et al.)	Synopsis
(3) Understanding strengths and weaknesses in differential diagnosis
“Attention layers help identify connections between symptoms and conditions, while MLP layers synthesize these into coherent outputs. Knowing this pipeline is useful when the model suggests a differential diagnosis. If the model posits an unusual condition, it might be because it latched onto a distinctive symptom that strongly correlated with that condition in its training data—even if that condition is clinically improbable.”	“The strengths of transformer models in differential diagnosis can be attributed to their ability to synthesize information from diverse sources, while weaknesses may arise from their lack of real-world clinical experience. The CALLM framework, for example, enhances clinical interview analysis by generating synthetic data that can improve diagnostic accuracy, showcasing the model’s adaptability in learning from augmented datasets.”	“In automated mental health diagnosis, the scarcity and imbalance of clinical data pose considerable challenges for researchers, limiting the effectiveness of machine learning algorithms. To cope with this issue, this paper aims to introduce a novel clinical transcript data augmentation framework by leveraging large language models (CALLM). The framework follows a “patient-doctor role-playing” intuition to generate realistic synthetic data.”	A hypothesis about how LLMs handle differential diagnoses is that multi-head attention may be responsible for the matching of patient data to the sets of symptoms known for disease conditions, but this matching may ignore disease prevalence. Synthetic data may mitigate this weakness because researchers can generate examples following a data distribution they have under control, and provide these examples to the LLM.
(4) Explaining ambiguous or contradictory responses
“When the patient’s presentation is ambiguous or the prompts contain conflicting information, attention mechanisms may distribute focus across multiple, equally plausible interpretations. The MLP layers may fail to resolve these into a single, authoritative answer. Understanding this helps users interpret uncertain or oscillating responses as a reflection of the model’s internal struggle with ambiguity rather than mere randomness.”	“Ambiguities or contradictions in model outputs can often be traced back to the model’s training data or the inherent complexity of medical language. The CALLM framework’s use of a “Response-Reason” prompt engineering paradigm aims to generate diagnostically valuable transcripts, which can help mitigate such issues by providing clearer reasoning paths in the model’s responses.”	“Our “Response-Reason” prompting approach guides LLMs in generating highly authentic clinical interview transcripts for mental disorder diagnosis. This augmentation is tailored to enhance the training dataset, facilitating both FSL [Few-Shot-Learning] and, in certain cases, ZSL [Zero-Shot-Learning].” “This technique […] encouraged it to elucidate the rationale behind the responses, mirroring the profile and characteristics of a simulated patient.”	Contradictory responses can be attributed to ambiguous input, ambiguity within the training data, or ambiguity in the representation of knowledge by the trained model. Specialized prompting techniques may request that the reasoning path of the LLM is made more transparent, enhancing its reasoning capabilities along the way.

Key topic 1: Anticipating contextual focus in medical reasoning

Key topic (1) illustrates how the LLM’s attention mechanism enables it to identify and prioritize relevant information within a text, connecting specific parts of the text that may be far apart in the input stream. In fact, this “focus on specific parts” may be done multiple times in parallel by multiple attention heads and integrated across layers until the final output is generated. The attention mechanism thus allows the model to focus on critical elements – such as specific patient symptoms or lab values – while ignoring less pertinent data. GPT-o1 underscores that clinicians and medical researchers can benefit from understanding that the model might attend to highly salient details at the expense of a broader synthesis. This focused approach can be advantageous in ensuring the model’s output is closely aligned with key aspects of a query.

In the example by Consensus.app, Zhang et al. demonstrate how this capacity proves valuable in their example from rehabilitation medicine, where ChatGPT-4 generated targeted intervention plans by focusing on the most relevant details of a patient’s presentation.

Key topic 2: Explaining “generic” or “textbook” responses

“Generic” or “textbook” responses arise from a model’s tendency to draw on widely represented knowledge learned from its training data (“short head knowledge”). GPT-o1 suggests that, when responding to medical queries, the model’s MLPs often rely on well-learned patterns, which can lead to standardized procedures being presented even in atypical clinical situations. This is echoed in the Consensus.app findings, which indicate ChatGPT-4’s propensity to default to generalized medical knowledge.

Key topic 3: Understanding strengths and weaknesses in differential diagnosis

With regard to key topic (3), GPT-o1 suggests that models sometimes include clinically improbable differential diagnoses because of strong correlations detected between the input and the training data, irrespective of the prevalence of the diagnosed disease; such base rate neglect may or may not be helpful for an accurate diagnosis.³⁰ Consensus.app mentions the models’ lack of direct clinical experience, which may allude to base rate neglect. In the example of Wu et al.,²⁹ accuracy was reportedly enhanced by data augmentation using synthetic data as part of the CALLM framework, allowing better differential diagnoses based on more balanced data. Here, “synthetic data” refers to artificial training examples generated by an LLM itself – in CALLM, transcripts produced through a “patient–doctor role-playing” procedure – rather than exemplars hand-crafted by humans. This distinction matters: training models on AI-generated data carries its own risks, because recursive or large-scale use of synthetic data can degrade performance and, in the limit, lead to “model collapse”, an effect that can be triggered even by relatively small fractions of synthetic data in the training set.³¹ Any benefit from augmentation therefore depends on careful curation, validation, and an appropriate balance of real and synthetic examples. However, skeptics may argue that true clinical complexity is difficult to replicate through synthetic data, casting doubt on the broader applicability of AI-generated simulations for real-world clinical settings. Thus, there is an evident tension between synthetic data generation and the complexity of real-world clinical scenarios. This also underscores the critical importance of robust validation requirements,³² particularly when significant decisions are to be made by an LLM.
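Base rate neglect can be illustrated with a short worked calculation using Bayes' rule. The sketch below is a hedged illustration with entirely hypothetical numbers (prevalence, sensitivity, and false-positive rate are assumptions chosen for the example, not values from Wu et al. or any clinical source). It shows that a finding strongly associated with a rare disease can still leave that disease far less probable than a common one.

```python
def posterior(prevalence, sensitivity, false_positive_rate):
    """P(disease | finding) by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers, chosen only for illustration.
# Rare disease: the finding is very typical (90% of patients show it),
# but only 1 in 10,000 people has the disease.
rare = posterior(prevalence=0.0001, sensitivity=0.9, false_positive_rate=0.01)

# Common disease: the finding is less typical (30% of patients show it),
# but 5% of people have the disease.
common = posterior(prevalence=0.05, sensitivity=0.3, false_positive_rate=0.01)

print(f"rare disease given finding:   {rare:.4f}")
print(f"common disease given finding: {common:.4f}")
```

A system that ranks diagnoses only by how strongly a finding is associated with a condition (the sensitivity column here) would favor the rare disease, whereas taking prevalence into account reverses the ranking.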

Key topic 4: Explaining ambiguous or contradictory responses

In addressing key topic (4), GPT-o1 links ambiguous or contradictory outputs to unclear or insufficiently specific prompts, whereas Consensus.app posits that ambiguity within the model’s training data is a contributing factor. One solution involves carefully crafted prompts designed to elicit the model’s reasoning processes, thereby mitigating confusion. Here, the CALLM framework successfully employs a “Response-Reason” prompting strategy.
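One simple way to see how oscillating answers can arise is to look at the final sampling step rather than at attention or MLP layers. The sketch below is an assumption-laden toy, not the mechanism proposed by GPT-o1: it samples from a two-option next-token distribution and shows that near-tied scores yield changing answers across runs, while clearly separated scores do not. The labels "diagnosis A" and "diagnosis B", the logit values, and the function names are hypothetical.

```python
import math
import random

def next_token_probs(logits, temperature=1.0):
    """Softmax over a dict of token -> logit, with temperature scaling."""
    m = max(logits.values())
    exps = {tok: math.exp((l - m) / temperature) for tok, l in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_answers(logits, n, seed=0, temperature=1.0):
    """Draw n independent answers from the next-token distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    probs = next_token_probs(logits, temperature)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return [rng.choices(tokens, weights=weights)[0] for _ in range(n)]

# Hypothetical logits: an ambiguous prompt leaves both options tied,
# a specific prompt makes one option dominate.
ambiguous = {"diagnosis A": 0.0, "diagnosis B": 0.0}
specific = {"diagnosis A": 30.0, "diagnosis B": 0.0}

print(set(sample_answers(ambiguous, n=200)))
print(set(sample_answers(specific, n=200)))
```

Under these assumptions, repeated queries with the ambiguous input return both answers, which a user may experience as contradiction. Making the prompt more specific corresponds, in this toy picture, to widening the gap between the scores.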

Key topic 5: Identifying hallucinations in unfamiliar scenarios

By contrast, “unfamiliar scenarios” engage a model’s “long tail” knowledge, where there is a heightened risk of hallucination because the queries may diverge significantly from what could be learned from the training set.
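The short head versus long tail distinction can be sketched numerically. The example below assumes, purely for illustration, that topic frequency in a training corpus follows a Zipf-like distribution; this is a common modeling assumption for text data, not a measured property of any model discussed here. The number of topics, the exponent, and the function name `zipf_shares` are hypothetical.

```python
def zipf_shares(n_topics, exponent=1.0):
    """Share of training mentions per topic under a Zipf-like law.

    Topic at rank r receives weight 1 / r**exponent (assumption).
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_topics + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical corpus with 10,000 distinct clinical topics.
shares = zipf_shares(10_000)

head = sum(shares[:100])   # the 100 most frequent topics ("short head")
tail_topic = shares[-1]    # the single rarest topic ("long tail")

print(f"share of mentions covered by the top 100 topics: {head:.2f}")
print(f"ratio of most to least frequent topic: {shares[0] / tail_topic:.0f}")
```

Under this assumption, a small set of topics accounts for a large share of the training signal, while each rare topic is seen very seldom, which is consistent with the expectation that outputs on rare conditions deserve closer verification.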

A case in point, discussed by Zhang et al., shows ChatGPT-4 successfully generating International Classification of Functioning, Disability and Health (ICF) codes for a stroke patient but misreporting the lesion site. Specifically, the model accurately identified the motor dysfunction in the left hand but failed to report the lesion in the right precentral gyrus. Although the model’s output correctly flags the patient’s impaired motor function, it does not relate this impairment to its origin in disrupted motor signalling in the brain. Consequently, the output treats the limitation as purely motor-related rather than as a reflection of the underlying neurological cause. The case thus illustrates a limitation of learned statistical and correlational regularities when they are not embedded in the broader causal, evidential, and clinical interpretation required in medicine. A simpler account is that the relevant regularity – that the ICF body-structures code should refer to the primary lesion rather than its secondary consequences – was insufficiently represented in the training data (“long tail” knowledge).

This distinction is crucial: if a clinical decision-support system fails to report the specific neurological lesion, it may overlook rehabilitation strategies that are essential for effective patient care. Consequently, interventions might fail to address the root cause, leading to slower or less effective patient recovery.

These examples highlight the importance of recognizing not only what a model can accomplish but also where gaps in its knowledge or reasoning may lead to clinically relevant inaccuracies.

Discussion

The implementation of AI, particularly LLMs, in healthcare will drive transformative changes in medical practice and theory while presenting significant challenges. A thorough understanding of the key ingredients of LLMs, based on their underlying architecture, including attention mechanisms and MLPs, can be particularly useful in situations where the model’s outputs are unexpected, ambiguous, or counterintuitive and thus necessitate critical analysis, but also in routine cases where seemingly appropriate recommendations may invite automation bias.

Theoretical understanding of how LLMs process and generate information provides a conceptual framework for interpreting their outputs.³³ It allows clinicians and researchers to anticipate how attention mechanisms determine contextual focus, how probabilistic prediction can lead to overly generic or “textbook” responses, and why models occasionally generate contradictory or fabricated information. This type of literacy enables users to distinguish between the model’s apparent confidence and fluency and the actual reliability of its reasoning. Understanding the model’s architecture thus directly supports critical and ethical evaluation of AI-generated content in clinical contexts.

The five key topics presented here contribute to the responsible use of large language models (LLMs) in medicine. Anticipating the contextual focus in medical reasoning (key topic 1) helps users understand why certain information is extracted and prioritized by the LLM, while other aspects are neglected. Awareness of the tendency of LLMs to generate generic responses (key topic 2) draws attention to the risk of overreliance on common patterns, which may not always apply to complex, rare, or atypical cases. LLMs can quickly generate potential differential diagnoses but often show weaknesses, for instance, in aligning these diagnoses with specific epidemiological contexts. Understanding such strengths and weaknesses (key topic 3) is crucial for assessing the validity and evidential value of LLM outputs. At times, LLMs tend to produce ambiguous or contradictory responses. Identifying these limitations (key topic 4) encourages critical prompt design and iterative clarification. Finally, recognizing hallucinations in unfamiliar scenarios (key topic 5) highlights the importance of verifying outputs when the model encounters data outside its training distribution. Together, these insights show that theoretical knowledge can serve as a safeguard, helping users interpret and evaluate LLM outputs more systematically.

This interpretive literacy also encompasses an ethical responsibility. Understanding and identifying where LLMs fail due to inherent limitations empowers more autonomous engagement with LLMs (avoiding so-called “computer paternalism”) and helps to safeguard patient safety.³⁴,³⁵ It also helps to see where such failures may introduce or amplify biases, enabling clinicians to better judge when model outputs risk marginalizing minority groups or reinforcing existing prejudices.³⁶ The ethical use of LLMs should therefore not be defined solely by regulations and guidelines but is equally determined by the user’s competence in engaging with this technology. Such competence includes the ability to discern biases and misinformation that may be concealed within outputs that appear neutral and confident.³⁷

Several limitations of this work should be acknowledged. The examples analyzed were generated and selected using AI-based tools and then critically examined by the human authors; they represent a small, non-systematic sample of possible cases rather than a representative or exhaustive survey. Although the use of GPT-o1 and Consensus.app provided a consistent exploratory framework, it may also have introduced biases related to model behavior and source retrieval. We also acknowledge the circularity inherent in a partially self-referential design, in which the system under study (GPT-o1) is also the source of the illustrative material. This circularity is compounded by a more fundamental problem: because an LLM cannot genuinely introspect, the explanations it produces about its own “reasoning” need not correspond to the computations actually carried out inside the network, as mechanistic-interpretability studies have shown.²⁴ We have therefore treated the model’s outputs as hypotheses, that is, as a source of candidate topics and illustrations, which were critically appraised and cross-checked against independent peer-reviewed literature; they should not be read as evidence about the model’s internal mechanisms. Furthermore, this study did not include feedback from external clinicians, which may further limit the generalizability of the findings. Nevertheless, by prompting an LLM and an AI-assisted literature-search tool to reflect on the requirements of their use in medical contexts, we applied a methodology that may serve as a useful exploratory approach to studying how LLM outputs should be interpreted in medicine.

Future research should build on this methodology through collaborative designs that integrate theoretical analysis with empirical testing in clinical and educational settings. Combining model-generated self-explanations with user evaluation may help to study whether and how theoretical understanding enhances AI literacy. Moreover, developing educational frameworks and training modules on LLM interpretability could strengthen the competencies of students and clinicians to use AI responsibly.

It is sometimes assumed that the rapid evolution of LLMs makes it futile to build a stable understanding of their inner workings. However, the underlying deep-learning architecture (the transformer, with its attention mechanisms and MLP layers) has remained largely unchanged since 2017, and recent progress has come primarily from scaling up data, parameters, and compute rather than from new core mechanisms; indeed, there is ongoing debate about whether returns from further scaling are beginning to plateau.²⁵ Because the architectural principles on which our analysis rests are comparatively stable, the interpretive skills described here are likely to remain relevant even as models continue to scale. It nonetheless remains essential to meet any novel developments with a comprehensive theoretical, practical, and ethical skill set. These competencies will enable healthcare professionals to approach model outputs not as unquestionable truths but as context-dependent and probabilistic responses that warrant critical examination, thereby remaining vigilant regarding their limitations.⁷,³⁸ Such a mindset fosters a reflexive awareness of how AI systems interact with clinical reasoning and decision-making, helping to preserve space for professional judgement, patient values, and situational nuance within emerging forms of human–AI collaboration.³⁹,⁴⁰ This will help maintain human oversight and ethical accountability even as models become more powerful. Ultimately, bridging medical expertise with theoretical and ethical knowledge will be necessary to ensure that AI contributes to, rather than undermines, the integrity of clinical practice.
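As a hedged illustration of the attention mechanism referred to above, the following sketch computes scaled dot-product attention for a single query, following the general formulation of the 2017 transformer paper.²⁵ The two-dimensional vectors are invented toy values, and the function names are our own; real models use learned, high-dimensional projections and many attention heads in parallel.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and each key."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy vectors, invented for illustration only.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

weights = attention_weights(query, keys)
context = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

The key most similar to the query receives the largest weight, which is the sense in which attention determines the contextual focus of the output.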

Conclusions

The examples analyzed in this study illustrate that LLMs hold transformative potential in healthcare, and that theoretical knowledge of their architecture and mechanisms is important for interpreting their outputs responsibly. Understanding the balance between short-head and long-tail knowledge, recognizing generic responses, and identifying hallucinations are important skills. Developing these competencies can support healthcare professionals in using LLMs more effectively and safely, enabling them to integrate AI technologies while maintaining appropriate oversight, mitigating risks, and ensuring ethical standards.

Data availability

Figshare: Extended Data. https://doi.org/10.6084/m9.figshare.31493524.⁴¹ This project contains the following extended data: all prompts, GPT-o1 outputs, and Consensus.app search results (Supplementary Sections A and B). Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

A preliminary version of this work was made available as a preprint at https://www.preprints.org/manuscript/202510.0630/v1.¹ The supplementary material is available at https://doi.org/10.6084/m9.figshare.31493524.⁴¹

References

1. Fuellen G, Jarchow H, Põder J-C: Understanding the Inner Workings of Large Language Models in Medicine. Preprints. 2025.
2. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al.: Large language models in medicine. Nat. Med. 2023; 29(8): 1930–1940.
3. Clusmann J, Kolbinger FR, Muti HS, et al.: The future landscape of large language models in medicine. Commun. Med. 2023; 3.
4. Liu J, Wang C, Liu S: Utility of ChatGPT in Clinical Practice. J. Med. Internet Res. 2023; 25: e48568.
5. Wang L, Wan Z, Ni C, et al.: Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J. Med. Internet Res. 2024; 26: e22769.
6. Jung K-H: Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthc Inform Res. 2025; 31(2): 114–124.
7. Klang E, Tessler I, Freeman R, et al.: If Machines Exceed Us: Health Care at an Inflection Point. NEJM AI. 2024; 1.
8. Ong JCL, Chang SY-H, William W, et al.: Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. 2024; 6(6): e428–e432.
9. Bouderhem R: Shaping the future of AI in healthcare through ethics and governance. Humanit Soc Sci Commun. 2024; 11(1): 416.
10. Goddard K, Roudsari A, Wyatt JC: Automation bias: Empirical results assessing influencing factors. Int. J. Med. Inform. 2014; 83(5): 368–375.
11. Vaccaro M, Almaatouq A, Malone T: When combinations of humans and AI are useful: A systematic review and meta-analysis. Nat. Hum. Behav. 2024; 8(12): 2293–2303.
12. Abdelwanis M, Alarafati HK, Tammam MMS, et al.: Exploring the risks of automation bias in healthcare artificial intelligence applications: A Bowtie analysis. J Saf Sci Resil. 2024; 5(4): 460–469.
13. Ranji SR: Large Language Models—Misdiagnosing Diagnostic Excellence? JAMA Netw. Open. 2024; 7(10): e2440901.
14. Goh E, Gallo R, Hom J, et al.: Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open. 2024; 7(10): e2440969.
15. McDuff D, Schaekermann M, Tu T, et al.: Towards accurate differential diagnosis with large language models. Nature. 2025; 642(8067): 451–457.
16. Ang C-S: Developing AI literacy in healthcare education: bridging the gap in competency assessment. Discov. Educ. 2025; 4(1): 372.
17. Gazquez-Garcia J, Sánchez-Bocanegra CL, Sevillano JL: AI in the Health Sector: Systematic Review of Key Skills for Future Health Professionals. JMIR Med Educ. 2025; 11: e58161.
18. Ahsan Z: Integrating artificial intelligence into medical education: a narrative systematic review of current applications, challenges, and future directions. BMC Med. Educ. 2025; 25(1): 1187.
19. Ong JCL, Chang SY-H, William W, et al.: Medical Ethics of Large Language Models in Medicine. NEJM AI. 2024; 1(7): AIra2400038.
20. Põder J-C, Helgesson G: Ethical Aspects of Generative AI in Medicine. In: Hoffmann CH, Bansal D, editors. AI Ethics in Practice: Navigating Academic Insight, Managerial Expertise, and Philosophical Inquiry. Springer Nature Switzerland; 2025; 139–162.
21. McCoy LG, Ci Ng FY, Sauer CM, et al.: Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review. BMC Med. Educ. 2024; 24(1): 1096.
22. Wang D, Zhang S: Large language models in medical and healthcare fields: applications, advances, and challenges. Artif. Intell. Rev. 2024; 57(11): 299.
23. Xiao H, Zhou F, Liu X, et al.: A comprehensive survey of large language models and multimodal large language models in medicine. Inf. Fusion. 2025; 117: 102888.
24. Lindsey J, Gurnee W, Ameisen E, et al.: On the Biology of a Large Language Model. Transform. Circuits Thread. 2025.
25. Vaswani A, Shazeer N, Parmar N, et al.: Attention is all you need. In: Guyon I, Von Luxburg U, Bengio S, et al., editors. Advances in neural information processing systems. Curran Associates, Inc.; 2017; Vol 30.
26. Zheng Z, Wang Y, Huang Y, et al.: Attention heads of large language models. Patterns. 2025; 6(2): 101176.
27. Webster CS: Natural and artificial intelligence – the psychotechnical agenda of the 21st century. J. Psychol. AI. 2025; 1(1): 2491445.
28. Zhang L, Tashiro S, Mukaino M, et al.: Use of artificial intelligence large language models as a clinical tool in rehabilitation medicine: a comparative test case. J. Rehabil. Med. 2023; 55: jrm13373.
29. Wu Y, Mao K, Zhang Y, et al.: CALLM: Enhancing Clinical Interview Analysis Through Data Augmentation With Large Language Models. IEEE J. Biomed. Health Inform. 2024; 28(12): 7531–7542.
30. Hamm RM: Physicians neglect base rates, and it matters. Behav. Brain Sci. 1996; 19(1): 25–26.
31. Dohmatob E, Feng Y, Subramonian A, et al.: Strong model collapse. In: International Conference on Learning Representations. 2025; pp. 15656–15691.
32. Fuellen G, Kulaga A, Lobentanzer S, et al.: Validation requirements for AI-based intervention-evaluation in aging and longevity research and practice. Ageing Res. Rev. 2025; 104: 102617.
33. Mesinovic M, Watkinson P, Zhu T: Explainability in the age of large language models for healthcare. Commun Eng. 2025; 4(1): 128.
34. Kühler M: Exploring the phenomenon and ethical issues of AI paternalism in health apps. Bioethics. 2022; 36(2): 194–200.
35. Heyen NB, Salloch S: The ethics of machine learning-based clinical decision support: an analysis through the lens of professionalisation theory. BMC Med. Ethics. 2021; 22(1): 112.
36. Mahajan A, Obermeyer Z, Daneshjou R, et al.: Cognitive bias in clinical large language models. NPJ Digit Med. 2025; 8(1): 428.
37. Ning Y, Liu M, Liu N: Advancing ethical AI in healthcare through interpretability. Patterns. 2025; 6(6): 101290.
38. Tun HM, Rahman HA, Naing L, et al.: Trust in Artificial Intelligence–Based Clinical Decision Support Systems Among Health Care Workers: Systematic Review. J. Med. Internet Res. 2025; 27: e69678.
39. McDougall RJ: Computer knows best? The need for value-flexibility in medical AI. J. Med. Ethics. 2019; 45(3): 156.
40. Sokol K, Fackler J, Vogt JE: Artificial intelligence should genuinely support clinical reasoning and decision making to bridge the translational gap. NPJ Digit Med. 2025; 8(1): 345.
41. Fuellen G, Jarchow H, Põder J-C: Extended Data for: Understanding the Inner Workings of Large Language Models in Medicine. Figshare. 2026.

Competing interests

No competing interests were disclosed.

Grant information

The author(s) declared that no grants were involved in supporting this work.

Article Versions (2)

Version 2 (revised). Published: 01 Jul 2026, 15:669. https://doi.org/10.12688/f1000research.178855.2

Version 1. Published: 05 May 2026, 15:669. https://doi.org/10.12688/f1000research.178855.1

Copyright

© 2026 Fuellen G et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite this article

Fuellen G, Jarchow H and Põder JC. Understanding the Inner Workings of Large Language Models in Medicine [version 2; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2026, 15:669 (https://doi.org/10.12688/f1000research.178855.2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer reports for Version 1 (published 05 May 2026)

Reviewer Report 04 Jun 2026

Craig S Webster, The University of Auckland, Auckland, New Zealand

Approved with Reservations

https://doi.org/10.5256/f1000research.197293.r487509

This is an interesting paper that I enjoyed reading. It makes a number of key distinctions which I think are in fact important for the users of LLMs to better understand when they should be trusted and when users need to be more critical. This is an important and practical concern for clinicians using AI-based tools in healthcare, and so from this perspective I think the paper has merit. I like the short-head and long-tail distinction and how this relates to the risk of hallucination – this is of real interest to clinicians who are not technical experts in LLM technology, but need to know how best to use them. The five key topics will also be of interest to clinicians. However, these positives aside, there are a number of areas where the language in the paper needs to be tightened, as below.

Page 3, 2nd paragraph: you mention that LLMs may on occasion work best without human intervention. I think some unpacking of what you mean by this is needed, as most authorities believe that AI use in healthcare must always be supervised by humans. Humans are the ultimate sense makers and decision makers, since the AI has no actual understanding of the tasks it performs or their consequences.

Page 4: You mention synthetic data augmentation to fill gaps in training data sets. It was unclear to me what you meant by synthetic data – do you mean exemplar data made up by humans, or data generated by other AI systems? Synthetic data generated by other AI systems has been shown to reduce the performance of LLMs, even to the extent of so-called model collapse, and even with surprisingly small amounts of synthetic data in the training data set. Hence, I think you need to be clearer about what you mean by synthetic data, and also to explain how you might avoid performance decline if you are using such data, or at least mention the risks. See: Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian and Julia Kempe. Strong Model Collapse, arXiv, 2024. https://arxiv.org/abs/2410.04840

One of the central problems of asking an LLM to explain itself, or to “introspect” is that all LLM responses are simply text outputs designed to have the highest probability of being correct given the training data and the prompt. Hence the LLM makes no distinction between an “introspective” prompt and one asking a more general inquiry. LLMs have no sense of self, or awareness of what they are doing, hence cannot actually introspect in the way we understand it. Often an LLM will give you the textbook answer to some task it just performed because that method was in its training data, but when you inspect the activity of the neural network itself using something like a mechanistic interpretability approach, you find that what it actually did was nothing like what it just claimed. It may claim some tidy, logical method, but what is actually going on inside the network is typically highly complicated, probabilistic, and essentially unintelligible – it just happens to get the right answer most of the time. See: On the Biology of a Large Language Model, https://transformer-circuits.pub/2025/attribution-graphs/biology.html

I think you need to mention this gap between what the LLM claims it is doing, and what is actually going on in the network.

Page 6, key topic 5: I think this is a very interesting discussion, and it underscores the point I was making in my previous comment – the LLM has no understanding of what it is doing. More critically for the use of LLMs in medicine, the rules that the LLM has extracted from the training data through statistical inference are not equivalent to evidence-based medicine or the causal theories of disease! And this is a key point that many clinicians do not appreciate. This is why it doesn’t make the connection between symptoms and underlying causes: it has only a probabilistic or correlational model, not a causal one. I think you need to be very careful about using words like “understand”, “knowledge”, “reasoning” and indeed “meta-knowledge” when describing what the LLM is doing – as technically, it has none of these things (although it may appear to have them). For a discussion of these key distinctions, which are highly relevant to medicine, see: Webster, C. S. (2025). Natural and artificial intelligence – the psychotechnical agenda of the 21st century. Journal of Psychology and AI, 1(1). https://doi.org/10.1080/29974100.2025.2491445

Page 8, 5th paragraph: You make a claim about the rapid evolution of LLMs making it hard to understand their inner workings. Actually, the underlying technology of deep learning models hasn’t changed much in years; what we have seen recently is a scaling up of this technology – whether it continues to scale up or not is a question of debate – although recent evidence suggests that performance of these models is plateauing. However, given that the underlying technology of deep learning remains the same in all these models, the aim of your paper remains relevant, despite the results of scaling up – and I think you should make this point in discussion.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Neuropsychology, clinical education, system redesign, artificial intelligence

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 18 May 2026

Yavuz Selim Kıyak, Gazi University, Ankara, Turkey

Not Approved

https://doi.org/10.5256/f1000research.197293.r482232

The manuscript focuses on an important possible use of LLMs. It asks whether theoretical knowledge of LLM architecture can help doctors better interpret AI-generated outputs. However, the current version includes several methodological concerns.

The main methodological concern is the self-referential design. The authors used GPT-o1 to generate examples of how understanding LLMs can help evaluate LLM outputs. This leads to a circularity problem. The problem is that the tool being examined also becomes a source of evidence for the claims being developed. This might be seen as acceptable for exploratory idea generation but it is not sufficiently strong for a paper.

Another concern is overclaiming contribution. The introduction states that the study addresses a gap in empirical work but the current approach in the manuscript does not include clinicians, users, performance outcomes, comparison groups, or observed decision-making. Therefore, the study cannot show that theoretical knowledge actually improves clinicians’ ability to evaluate LLM outputs.

The authors also used Consensus.app as a validation tool. This is another important concern. An AI-assisted literature search platform can help identify relevant papers but it does not itself validate GPT-generated themes. Another concern is the limited and selective evidence base. Only a small number of “especially illustrative” studies are used to support the final themes. There is also a need for the language of the introduction to be softened. Terms such as “validated” signal strong evidence.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? No
Are sufficient details of methods and analysis provided to allow replication by others? No
If applicable, is the statistical analysis and its interpretation appropriate? Not applicable
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? No

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: medical education, large language models

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.


Reviewer Report

17 Views

04 Jun 2026 | for Version 1

Craig S Webster, The University of Auckland, Auckland, Auckland, New Zealand

17 Views Cite this report Responses(0)

Approved With Reservations

This is an interesting paper that I enjoyed reading. It makes a number of key distinctions which I think are in fact important for the users of LLMs to better understand when they should be trusted and when users need to be more critical. This is an important and practical concern for clinicians using AI-based tools in healthcare, and so from this perspective I think the paper has merit. I like the short-head and long-tail distinction and how this relates to the risk of hallucination – this is of real interest to clinicians who are not technical experts in LLM technology, but need to know how best to use them. The five key topics will also be of interest to clinicians. However, these positives aside, there are a number of areas where the language in the paper needs to be tightened, as below.

Page 3, 2^nd paragraph: you mention that LLMs may on occasion work best without human intervention. I think some unpacking of what you mean by this is needed, as most authorities believe that AI use in healthcare must always be supervised by humans. Humans are the ultimate sense makers and decision makers, since the AI has no actual understanding of the tasks it performs or their consequences.

Page 4: You mention synthetic data augmentation to fill gaps in training data sets. It was unclear to me what you meant by synthetic data – do you mean exemplar data made up by humans, or data generated by other AI systems? Synthetic data generated by other AI systems has been shown to reduce the performance of LLMs, even to the extent of so-called model collapse, and even with surprisingly small amounts of synthetic data in the training data set. Hence, I think you need to be clearer about what you mean by synthetic data, and also to explain how you might avoid performance decline if you are using such data, or at least mention the risks. See: Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian and Julia Kempe. Strong Model Collapse, arXiv, 2024. https://arxiv.org/abs/2410.04840

One of the central problems of asking an LLM to explain itself, or to “introspect” is that all LLM responses are simply text outputs designed to have the highest probability of being correct given the training data and the prompt. Hence the LLM makes no distinction between an “introspective” prompt and one asking a more general inquiry. LLMs have no sense of self, or awareness of what they are doing, hence cannot actually introspect in the way we understand it. Often an LLM will give you the textbook answer to some task it just performed because that method was in its training data, but when you inspect the activity of the neural network itself using something like a mechanistic interpretability approach, you find that what it actually did was nothing like what it just claimed. It may claim some tidy, logical method, but what is actually going on inside the network is typically highly complicated, probabilistic, and essentially unintelligible – it just happens to get the right answer most of the time. See: On the Biology of a Large Language Model, https://transformer-circuits.pub/2025/attribution-graphs/biology.html

I think you need to mention this gap between what the LLM claims it is doing, and what is actually going on in the network.

Page 6, key topic 5: I think this is a very interesting discussion, and it underscores the point I was making in my previous comment – the LLM has no understanding of what it is doing. More critically for the use of LLMs in medicine, the rules that the LLM has extracted from the training data through statistical inference are not equivalent to evidence-based medicine or the causal theories of disease! And this is a key point that many clinicians do not appreciate. This is why it doesn’t make the connection between symptoms and underlying causes, it has only a probabilistic or correlational model, not a causal one. I think you need to be very careful about using words like “understand”, “knowledge”, “reasoning” and indeed “meta-knowledge” when describing what the LLMs is doing – as technically, it has none of these things (although it may appear to have them). For a discussion of these key distinctions, which are highly relevant to medicine, see: Webster, C. S. (2025). Natural and artificial intelligence – the psychotechnical agenda of the 21st century. Journal of Psychology and AI, 1(1). https://doi.org/10.1080/29974100.2025.2491445

Page 8, 5^th paragraph: You make a claim about the rapid evolution of LLMs making it hard to understand their inner workings. Actually, the underlying technology of deep learning models hasn’t changed much in years, what we have seen recently is a scaling up of this technology – whether it continues to scale up or not is a question of debate – although recent evidence suggests that performance of these models is plateauing. However, given that the underlying technology of deep learning remains the same in all these models, the aim of your paper remains relevant, despite the results of scaling up – and I think you should make this point in discussion.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Neuropsychology, clinical education, system redesign, artificial intelligence

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

22 Views

18 May 2026 | for Version 1

Yavuz Selim Kıyak, Gazi University, Ankara, Turkey

Not Approved

The manuscript addresses an important question about the use of LLMs: whether theoretical knowledge of LLM architecture can help doctors better interpret AI-generated outputs. However, the current version raises several methodological concerns.

The main methodological concern is the self-referential design. The authors used GPT-o1 to generate examples of how understanding LLMs can help evaluate LLM outputs. This leads to a circularity problem: the tool being examined also becomes a source of evidence for the claims being developed. This might be seen as acceptable for exploratory idea generation, but it is not sufficiently strong for a paper.

Another concern is overclaiming contribution. The introduction states that the study addresses a gap in empirical work but the current approach in the manuscript does not include clinicians, users, performance outcomes, comparison groups, or observed decision-making. Therefore, the study cannot show that theoretical knowledge actually improves clinicians’ ability to evaluate LLM outputs.

The authors also used Consensus.app as a validation tool. This is another important concern. An AI-assisted literature search platform can help identify relevant papers, but it does not itself validate GPT-generated themes. Another concern is the limited and selective evidence base. Only a small number of “especially illustrative” studies are used to support the final themes. The language of the introduction also needs to be softened: terms such as “validated” suggest strong evidence.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

Is the study design appropriate and is the work technically sound?

No

Are sufficient details of methods and analysis provided to allow replication by others?

No

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Are the conclusions drawn adequately supported by the results?

No

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

medical education, large language models

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

1. Fuellen G, Jarchow H, Põder J-C: Understanding the Inner Workings of Large Language Models in Medicine. Preprints: Preprints. 2025.

2. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al.: Large language models in medicine. Nat. Med. 2023; 29(8): 1930–1940.

3. Clusmann J, Kolbinger FR, Muti HS, et al.: The future landscape of large language models in medicine. Commun. Med. 2023; 3.

4. Liu J, Wang C, Liu S: Utility of ChatGPT in Clinical Practice. J. Med. Internet Res. 2023; 25: e48568.

5. Wang L, Wan Z, Ni C, et al.: Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J. Med. Internet Res. 2024; 26: e22769.

6. Jung K-H: Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthc Inform Res. 2025; 31(2): 114–124.

7. Klang E, Tessler I, Freeman R, et al.: If Machines Exceed Us: Health Care at an Inflection Point. NEJM AI. 2024; 1.

8. Ong JCL, Chang SY-H, William W, et al.: Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. 2024; 6(6): e428–e432.

9. Bouderhem R: Shaping the future of AI in healthcare through ethics and governance. Humanit Soc Sci Commun. 2024; 11(1): 416.

10. Goddard K, Roudsari A, Wyatt JC: Automation bias: Empirical results assessing influencing factors. Int. J. Med. Inform. 2014; 83(5): 368–375.

11. Vaccaro M, Almaatouq A, Malone T: When combinations of humans and AI are useful: A systematic review and meta-analysis. Nat. Hum. Behav. 2024; 8(12): 2293–2303.

12. Abdelwanis M, Alarafati HK, Tammam MMS, et al.: Exploring the risks of automation bias in healthcare artificial intelligence applications: A Bowtie analysis. J Saf Sci Resil. 2024; 5(4): 460–469.

13. Ranji SR: Large Language Models—Misdiagnosing Diagnostic Excellence? JAMA Netw. Open. 2024; 7(10): e2440901.

14. Goh E, Gallo R, Hom J, et al.: Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open. 2024; 7(10): e2440969.

15. McDuff D, Schaekermann M, Tu T, et al.: Towards accurate differential diagnosis with large language models. Nature. 2025; 642(8067): 451–457.

16. Ang C-S: Developing AI literacy in healthcare education: bridging the gap in competency assessment. Discov. Educ. 2025; 4(1): 372.

17. Gazquez-Garcia J, Sánchez-Bocanegra CL, Sevillano JL: AI in the Health Sector: Systematic Review of Key Skills for Future Health Professionals. JMIR Med Educ. 2025; 11: e58161.

18. Ahsan Z: Integrating artificial intelligence into medical education: a narrative systematic review of current applications, challenges, and future directions. BMC Med. Educ. 2025; 25(1): 1187.

19. Ong JCL, Chang SY-H, William W, et al.: Medical Ethics of Large Language Models in Medicine. NEJM AI. 2024; 1(7): AIra2400038.

20. Põder J-C, Helgesson G: Ethical Aspects of Generative AI in Medicine. In: Hoffmann CH, Bansal D, editors. AI Ethics in Practice: Navigating Academic Insight, Managerial Expertise, and Philosophical Inquiry. Springer Nature Switzerland; 2025; 139–162.

21. McCoy LG, Ci Ng FY, Sauer CM, et al.: Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review. BMC Med. Educ. 2024; 24(1): 1096.

22. Wang D, Zhang S: Large language models in medical and healthcare fields: applications, advances, and challenges. Artif. Intell. Rev. 2024; 57(11): 299.

23. Xiao H, Zhou F, Liu X, et al.: A comprehensive survey of large language models and multimodal large language models in medicine. Inf. Fusion. 2025; 117: 102888.

24. Lindsey J, Gurnee W, Ameisen E, et al.: On the Biology of a Large Language Model. Transform. Circuits Thread. 2025; 642(8067): 451–457.

25. Vaswani A, Shazeer N, Parmar N, et al.: Attention is all you need. In: Guyon I, Von Luxburg U, Bengio S, et al., editors. Advances in neural information processing systems. Curran Associates, Inc.; 2017; Vol 30.

26. Zheng Z, Wang Y, Huang Y, et al.: Attention heads of large language models. Patterns. 2025; 6(2): 101176.

27. Webster CS: Natural and artificial intelligence – the psychotechnical agenda of the 21st century. J. Psychol. AI. 2025; 1(1): 2491445.

28. Zhang L, Tashiro S, Mukaino M, et al.: Use of artificial intelligence large language models as a clinical tool in rehabilitation medicine: a comparative test case. J. Rehabil. Med. 2023; 55: jrm13373.

29. Wu Y, Mao K, Zhang Y, et al.: CALLM: Enhancing Clinical Interview Analysis Through Data Augmentation With Large Language Models. IEEE J. Biomed. Health Inform. 2024; 28(12): 7531–7542.

30. Hamm RM: Physicians neglect base rates, and it matters. Behav. Brain Sci. 1996; 19(1): 25–26.

31. Dohmatob E, Feng Y, Subramonian A, et al.: Strong model collapse. In: International Conference on Learning Representations. 2025; pp. 15656–15691.

32. Fuellen G, Kulaga A, Lobentanzer S, et al.: Validation requirements for AI-based intervention-evaluation in aging and longevity research and practice. Ageing Res. Rev. 2025; 104: 102617.

33. Mesinovic M, Watkinson P, Zhu T: Explainability in the age of large language models for healthcare. Commun Eng. 2025; 4(1): 128.

34. Kühler M: Exploring the phenomenon and ethical issues of AI paternalism in health apps. Bioethics. 2022; 36(2): 194–200.

35. Heyen NB, Salloch S: The ethics of machine learning-based clinical decision support: an analysis through the lens of professionalisation theory. BMC Med. Ethics. 2021; 22(1): 112.

36. Mahajan A, Obermeyer Z, Daneshjou R, et al.: Cognitive bias in clinical large language models. NPJ Digit Med. 2025; 8(1): 428.

37. Ning Y, Liu M, Liu N: Advancing ethical AI in healthcare through interpretability. Patterns. 2025; 6(6): 101290.

38. Tun HM, Rahman HA, Naing L, et al.: Trust in Artificial Intelligence–Based Clinical Decision Support Systems Among Health Care Workers: Systematic Review. J. Med. Internet Res. 2025; 27: e69678.

39. McDougall RJ: Computer knows best? The need for value-flexibility in medical AI. J. Med. Ethics. 2019; 45(3): 156.

40. Sokol K, Fackler J, Vogt JE: Artificial intelligence should genuinely support clinical reasoning and decision making to bridge the translational gap. NPJ Digit Med. 2025; 8(1): 345.

41. Fuellen G, Jarchow H, Põder J-C: Extended Data for: Understanding the Inner Workings of Large Language Models in Medicine. Figshare. 2026.
