RAD-CaseBookLLM-08: An open-access dataset of structured large language model–generated radiology differential diagnosis teachings

Thomas Saliba; Guillaume Fahrni

doi:10.12688/f1000research.178297.2


Data Note

Revised


[version 2; peer review: 1 approved, 1 approved with reservations]

Thomas Saliba¹,², Guillaume Fahrni¹

PUBLISHED 10 Apr 2026


¹ Department of Diagnostic and Interventional Radiology, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
² Free University of Brussels, Brussels, Belgium

Thomas Saliba
Roles: Data Curation, Formal Analysis, Investigation, Writing – Original Draft Preparation

Guillaume Fahrni
Roles: Conceptualization, Methodology, Project Administration, Supervision, Validation, Writing – Review & Editing


This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Data: Use and Reuse collection.

This article is included in the AI in Medicine and Healthcare collection.

Abstract

Background

Large language models are increasingly explored in medical education, particularly for generating structured explanatory content. However, openly accessible datasets capturing full-length model outputs in a standardized and reusable format remain limited. In radiology education, differential diagnosis teaching is typically organized around key imaging findings integrated with clinical reasoning. We developed RAD-CaseBookLLM-08, an open dataset of large language model–generated radiology differential diagnosis teachings derived from lesion-based thematic topics.

Methods

The dataset comprises 225 cases across nine radiology subspecialties. Thematic key imaging findings were derived from an established case-based radiology textbook and used as structured prompts. All cases were generated using ChatGPT-4o (OpenAI) in March 2025 via a web-based interface with conversation memory disabled. Each topic was processed in an independent session using an identical prompt template in which only the subspecialty and imaging finding were modified. Outputs were copied verbatim without editing, correction, or validation, and formatting elements were preserved. The dataset is provided in Microsoft Word format, with converted PDF, Markdown, CSV, and JSON versions, and is organized by subspecialty with sequential case labeling. No patient data were included.

Conclusions

RAD-CaseBookLLM-08 provides a structured, time-stamped collection of large language model–generated radiology teaching texts. The dataset may support reproducibility studies, benchmarking of model outputs, prompt engineering evaluation, and analysis of educational structure in machine-generated differential diagnoses. It is openly available under a Creative Commons Zero license via Zenodo.

Keywords

Radiology education; Large language models; ChatGPT; Medical artificial intelligence; Differential diagnosis; Open dataset; Medical education research

Corresponding author: Guillaume Fahrni

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2026 Saliba T and Fahrni G. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Saliba T and Fahrni G. RAD-CaseBookLLM-08: An open-access dataset of structured large language model–generated radiology differential diagnosis teachings [version 2; peer review: 1 approved, 1 approved with reservations]. F1000Research 2026, 15:333 (https://doi.org/10.12688/f1000research.178297.2) First published: 02 Mar 2026, 15:333 (https://doi.org/10.12688/f1000research.178297.1) Latest published: 10 Apr 2026, 15:333 (https://doi.org/10.12688/f1000research.178297.2)

Amendments from Version 1

The Introduction was updated to clarify the primary motivation for creating the dataset and to provide concrete examples of intended research applications, including a planned comparative educational study, longitudinal benchmarking, prompt sensitivity analysis, linguistic and structural analysis, and cross-lingual research. The Methods section was revised to explicitly state that the 25 cases per subspecialty represent the exhaustive set of lesion-based key imaging findings available in each corresponding section of the source textbook, with no selective sampling performed. The rationale for the exclusion of nuclear medicine, fetal imaging, ultrasound imaging, and Roentgen Classics sections was expanded. A sentence was added acknowledging, as an inherent limitation of the web-based interface, that key generation parameters such as temperature and seed could not be controlled or reported. Output variability across repeated runs was explicitly acknowledged as a limitation. A sentence was added noting that model outputs reflect the behaviour of ChatGPT-4o as of March 2025 and may differ with future model updates. Finally, the dataset's converted formats were expanded to include three additional formats: JSON, CSV, and Markdown.

See the authors' detailed response to the review by Shawn Lyo

Introduction

Large language models (LLMs) have recently emerged as powerful tools capable of generating coherent, structured, and context-aware natural language outputs.¹⁻³ Their rapid integration into medical domains has prompted increasing interest in their potential roles in clinical reasoning support, decision-making assistance, and medical education.⁴⁻⁶ In particular, generative models now have the potential to produce structured explanatory content that resembles textbook-style teaching material.⁷,⁸

Radiology education relies heavily on structured diagnostic reasoning. A central pedagogical component is the formulation of differential diagnoses based on key imaging findings integrated with clinical context.⁹,¹⁰ This lesion-based or pattern-based approach is widely used in radiology casebooks and board examination preparation materials. Trainees are typically exposed to thematic imaging findings (e.g., a cavitary pulmonary mass or distal interphalangeal arthropathy) and are expected to develop a prioritized differential diagnosis, recognize distinguishing imaging characteristics, and understand the reasoning leading to the final diagnosis.

While LLMs have demonstrated the ability to generate medical explanations and answer clinical questions, the reproducibility, structure, and educational consistency of LLM-generated differential diagnosis teachings remain insufficiently documented in openly accessible datasets.¹¹ Existing studies often report performance metrics or qualitative assessments, but the underlying generated texts are rarely made publicly available in a structured and reusable format. This limits transparency, benchmarking across model versions, evaluation of prompt sensitivity, and methodological reproducibility.¹²,¹³

Open datasets documenting LLM-generated medical content are particularly important for several reasons. First, LLM outputs are inherently time-sensitive: model updates and parameter adjustments can alter responses over time.¹⁴ Capturing outputs at a defined timepoint enables longitudinal comparison and benchmarking. Second, prompt design significantly influences output structure and reasoning pathways.¹⁵ Publicly sharing prompt iterations enhances reproducibility and allows independent investigation of prompt engineering strategies. Third, openly available datasets support FAIR principles (Findable, Accessible, Interoperable, Reusable) and facilitate secondary analyses, including linguistic evaluation, hallucination detection research, educational structure assessment, and computational benchmarking.¹⁶

To contribute to ongoing efforts toward transparency and reproducibility in medical LLM research, we created RAD-CaseBookLLM-08, a structured dataset of LLM-generated radiology differential diagnosis teachings derived from thematic key imaging findings. The dataset was generated using a standardized prompting protocol applied systematically across multiple radiology subspecialties.

While RAD-CaseBookLLM-08 is not intended as a primary teaching resource, it was primarily developed to support a planned study evaluating learning performance of radiology trainees exposed to LLM-generated differential diagnosis teachings compared to a reference casebook. Beyond this application, the dataset may support additional research tasks. First, it provides a fixed, dated corpus enabling longitudinal benchmarking: the same 225 prompts can be resubmitted to future or alternative models (e.g., GPT-5, open-source LLMs) and outputs compared systematically against this baseline. Second, the standardized prompt structure allows prompt sensitivity analyses, in which alternative prompting strategies applied to identical topics can be compared against the present outputs. Third, the dataset constitutes a naturalistic corpus of LLM-generated medical text suitable for linguistic and structural analysis: examining how a large language model organizes differential diagnosis reasoning, structures pedagogical content, and varies output. Fourth, the dataset may serve as a reference corpus for cross-lingual studies, as the fixed prompt structure and standardized topic set provide a reproducible baseline against which LLM-generated outputs in other languages could be systematically compared, enabling analysis of translation fidelity and terminological consistency.
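The longitudinal benchmarking use case can be approached in many ways. As a minimal sketch, assuming a baseline text from this dataset and a regenerated text for the same prompt are both available as plain strings, a crude lexical similarity can be computed with Python's standard `difflib` module. The function name and the example strings are hypothetical placeholders, and lexical overlap is only a rough proxy that says nothing about clinical correctness.

```python
from difflib import SequenceMatcher


def lexical_similarity(baseline: str, candidate: str) -> float:
    """Return a 0-1 similarity ratio between two texts, compared word by word."""
    # Lowercase and split on whitespace so layout differences matter less.
    base_tokens = baseline.lower().split()
    cand_tokens = candidate.lower().split()
    # autojunk=False turns off difflib's "popular element" heuristic for long inputs.
    matcher = SequenceMatcher(None, base_tokens, cand_tokens, autojunk=False)
    return matcher.ratio()


# Hypothetical placeholder strings, not taken from the dataset.
baseline_text = "Top differentials: diagnosis A, diagnosis B, diagnosis C."
new_text = "Top differentials: diagnosis A, diagnosis D, diagnosis C."

score = lexical_similarity(baseline_text, new_text)
print(f"Lexical similarity: {score:.3f}")
```

A word-level ratio of this kind would flag how far a newer model's answer drifts from the March 2025 baseline; semantic or expert-rated comparisons would be needed for any claim about content quality.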

Methods

Source of thematic topics

Thematic radiological key imaging findings were derived from the case-based structure of the radiology textbook Top 3 Differentials in Radiology: A Case Review (O’Brien, 2010).¹⁷ The source textbook presents radiological cases organized around a central imaging finding, followed by a structured differential diagnosis discussion and final diagnosis. For the purpose of this dataset, only the lesion-based thematic topics, referred to in the book as “Key Imaging Findings” (e.g., “Pharyngeal mucosal mass”), were used as input for the LLM. No textbook images, figure reproductions, or verbatim text excerpts were included in the dataset or provided as input to the LLM.

The following subspecialties were included, each comprising 25 cases: cardiothoracic imaging (chest and cardiac imaging combined), gastrointestinal imaging, genitourinary imaging, musculoskeletal imaging, head and neck imaging, brain and spine imaging, pediatric imaging, breast imaging, and vascular and interventional radiology. The 25 cases per subspecialty represent the complete set of lesion-based key imaging findings available in each corresponding section of the source textbook; no selective sampling was performed. This resulted in a total of nine subspecialty sections and 225 cases overall. The complete dataset is compiled into a single PDF document comprising 360 pages, 66,874 words, and 502,964 characters.
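The size figures above lend themselves to a quick sanity check. The short sketch below recomputes the case total from the nine sections of 25 cases and derives per-case averages; these averages are calculated here for illustration and are not reported by the authors.

```python
# Figures reported in the text above.
SECTIONS = 9
CASES_PER_SECTION = 25
TOTAL_PAGES = 360
TOTAL_WORDS = 66_874
TOTAL_CHARACTERS = 502_964

total_cases = SECTIONS * CASES_PER_SECTION

# Per-case averages derived from the reported totals (illustrative only).
words_per_case = TOTAL_WORDS / total_cases
characters_per_case = TOTAL_CHARACTERS / total_cases
pages_per_case = TOTAL_PAGES / total_cases

print(f"Total cases: {total_cases}")
print(f"Mean words per case: {words_per_case:.1f}")
print(f"Mean characters per case: {characters_per_case:.1f}")
print(f"Mean pages per case: {pages_per_case:.1f}")
```

With the reported figures this works out to roughly 297 words and 1.6 pages per case on average. Whether the character count includes spaces is not stated, so the per-case character figure should be read loosely.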

The sections dedicated to nuclear medicine, fetal imaging, ultrasound imaging, and historical ‘Roentgen Classics’ were not included. The Roentgen Classics section was excluded as it presents single pathognomonic diagnoses rather than differential diagnosis frameworks. Nuclear medicine, fetal imaging, and ultrasound imaging were excluded as these subspecialties follow more specific and distinct teaching approaches. While similar prompting approaches could potentially be applied to these domains, the present dataset was intentionally scoped to the most conventional cross-sectional radiology subspecialties. Expansion to these additional sections may be considered in future iterations of the dataset.

LLM environment

Dataset generation was performed using the following environment:

• Model: ChatGPT-4o
• Provider: OpenAI
• Interface: Web-based interface
• Model access date: March 2025
• Conversation memory: Disabled

Each thematic topic was processed in an independent chat session. No conversation history was reused across topics.

To reduce potential personalization or adaptation effects related to prior interactions, a newly created user account was used exclusively for dataset generation. This measure was implemented to minimize contextual carryover and to improve output independence across cases.

No external plugins, browsing tools, or additional system instructions were activated during generation. The web-based interface was deliberately chosen to reflect real-world usage conditions, as this represents the mode of interaction most commonly adopted by clinicians and trainees in practice.

Prompt development

Prompt engineering was conducted iteratively through internal testing prior to final dataset generation. The objective was to obtain outputs that were structurally consistent, educational in tone, organized by differential diagnosis categories, explicit in diagnostic reasoning, and reproducible across thematic topics.

Multiple candidate prompts were tested and refined. Because complex prompts resulted in variable outputs, the following simple yet precise final prompt, which provided the best results, was retained:

“I am a radiology resident preparing for my final radiology exam. Please provide a concise radiological summary, from an exam-oriented perspective, of the following:

Specialty: [[subspecialty name (e.g., Musculoskeletal)]]

Topic: [[Key Imaging Finding (e.g., Sequestrum)]]”

In this final prompt, only the subspecialty name and Key Imaging Finding were manually updated to correspond to each processed case; the rest of the prompt was left untouched. All prompts were written in English. After the final prompt was chosen, the answers were collected in a single pass: no prompt was submitted more than once, so output variability across repeated runs was not assessed. This is a limitation of the present dataset.
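The template logic can be illustrated with a short sketch. The authors entered each prompt manually in the web interface, so the code below is not their tooling; it only shows, under the assumption that the blank lines in the quoted prompt are plain line breaks, how the fixed wording and the two variable fields fit together. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical.

```python
# Fixed wording quoted from the final prompt; only the two placeholders vary.
PROMPT_TEMPLATE = (
    "I am a radiology resident preparing for my final radiology exam. "
    "Please provide a concise radiological summary, from an exam-oriented "
    "perspective, of the following:\n\n"
    "Specialty: {subspecialty}\n\n"
    "Topic: {finding}"
)


def build_prompt(subspecialty: str, finding: str) -> str:
    """Fill the fixed template with one subspecialty and one key imaging finding."""
    return PROMPT_TEMPLATE.format(subspecialty=subspecialty, finding=finding)


# Example using the subspecialty and finding given in the prompt description.
print(build_prompt("Musculoskeletal", "Sequestrum"))
```

Keeping the wording in a single constant makes it easy to resubmit the same 225 topics to another model with an identical prompt, or to swap in an alternative template for a prompt sensitivity comparison.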

Dataset generation protocol

For each thematic key imaging finding, the following standardized procedure was applied:

1. A new chat session was initiated in the web interface.
2. The finalized structured prompt was entered, specifying the subspecialty and thematic topic.
3. The complete model output was copied verbatim into a Word document.
4. The case number was manually added at the top of the output.
5. Original formatting (including headings, bold text, bullet points, and spacing) was preserved.
6. No editorial modification, correction, summarization, or medical validation was performed.

Interactive or conversational concluding phrases generated by the model (e.g., “Would you like more details on …”) were intentionally retained to preserve authenticity of the output and maintain fidelity to the original generation context.

The dataset therefore represents unaltered LLM-generated content captured at a defined timepoint. It should be noted that, given the continuous evolution of LLMs, the outputs reflect the behaviour of ChatGPT-4o at a specific point in time and should be interpreted accordingly, as model updates may produce different responses to identical prompts in the future.

Dataset structure

The RAD-CaseBookLLM-08 dataset is organized by radiology subspecialty.

For each subspecialty:

• One master document contains the complete list of LLM-generated teachings (n = 25 cases per specialty) corresponding to all thematic key findings within that section.
• Cases are structured sequentially and labeled according to the case numbering system of the source textbook to enable future comparative or benchmarking studies.
• Each case heading in the Word (.docx) version is formatted using the “Title 1” style to allow structured navigation via document navigation panels.

Five file formats are provided:

• Microsoft Word (.docx) format (original format)
• Converted PDF format
• Converted MD format
• Converted CSV format
• Converted JSON format
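For the machine-readable formats, a basic completeness check is to confirm that every subspecialty holds 25 cases. The column layout of the converted CSV is not described here, so the sketch below assumes a hypothetical file with `subspecialty`, `case_number`, and `text` columns; the actual field names should be checked against the Zenodo record before reuse.

```python
import csv
import io
from collections import Counter

EXPECTED_CASES_PER_SUBSPECIALTY = 25  # stated in the dataset description


def count_cases(csv_file) -> Counter:
    """Count rows per subspecialty; the 'subspecialty' column name is an assumption."""
    reader = csv.DictReader(csv_file)
    return Counter(row["subspecialty"] for row in reader)


def incomplete_sections(counts: Counter, expected: int = EXPECTED_CASES_PER_SUBSPECIALTY) -> dict:
    """Return the subspecialties whose case count differs from the expected value."""
    return {name: n for name, n in counts.items() if n != expected}


# Tiny in-memory stand-in for a real file; column names and rows are hypothetical.
sample = io.StringIO(
    "subspecialty,case_number,text\n"
    'Musculoskeletal,2,"Placeholder output, kept verbatim"\n'
    'Breast,1,"Placeholder output"\n'
)
counts = count_cases(sample)
print(counts)
print(incomplete_sections(counts))  # both toy sections are flagged as incomplete
```

With a real file, the same functions could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.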

A summary dataset overview with a list of key imaging findings per specialty is provided in Tables 1–3.

Table 1. List of cardiothoracic, gastrointestinal, and genitourinary key imaging findings.

CARDIOTHORAX	GASTROINTESTINAL	UROGENITAL
Solitary Pulmonary Nodule (SPN)	Hyperdense Liver	Solid Renal Mass
Multiple Pulmonary Nodules	Nodular Liver Contour	Multiple Bilateral Renal Lesions/Masses
Cavitary Pulmonary Mass	Esophageal Diverticulum	Cystic Renal Mass
Miliary Pulmonary Nodules	Solitary Hypodense, Hypovascular Liver Mass	Retroperitoneal Mass
Centrilobular Pulmonary Nodules	Multiple Hypodense Liver Masses	Cortical Nephrocalcinosis
Cystic Lung Disease	Cystic Mass at Porta Hepatis	Medullary Nephrocalcinosis
Lower Lobe Interstitial Lung Disease (ILD)	Esophageal Submucosal Mass	Striated Nephrogram
Upper Lobe Interstitial Lung Disease (ILD)	Esophageal Dilatation	Papillary Necrosis
Hyperlucent Lung	Esophageal Outpouchings	Staghorn Calculus
Anterior Mediastinal Mass	Esophageal Ulcers	Renal Cortical Defect
Middle Mediastinal Mass	Solid Pancreatic Mass	Renal Pelvis Mass
Posterior Mediastinal Mass	Linitis Plastica	Medial Deviation of the Ureters
Chronic Air-Space Disease	Gastric Ulcer	Ureteral Filling Defects
Peripheral Air-Space Disease	Gastric Fold Thickening	Renal Migration Anomaly
Ground-Glass Opacification (GGO)	Cecal Mass	Bladder Filling Defect
Mediastinal/Hilar Lymphadenopathy	Mesenteric Mass	Bilateral Cystic Renal Disease
Calcified Pleural Disease	Terminal Ileal Wall Thickening	Perinephric Fluid Collection
Bronchiectasis	Colonic Wall Thickening	Pear-Shaped Bladder
Perilymphatic Pulmonary Nodules	Small Bowel Wall Thickening	Prostate Enlargement
Pleural-Based Mass	Esophageal Stricture	Bladder Rupture
Parenchymal Disease in a Patient with HIV	Small Bowel Dilatation	Bladder Wall Calcifications
Abnormal Left Ventricular Contour	Cystic Pancreatic Mass	Adrenal Mass
Cardiac Mass	Hypervascular Liver Mass	Fatty Retroperitoneal Mass
Delayed Myocardial Enhancement (DME)	Multiple Splenic Nodules	Dilated Ureter
Cardiac Wall Fat	Intrahepatic Biliary Ductal Strictures	Urethral Stricture

Table 2. List of musculoskeletal, head and neck, and neuro key imaging findings.

MSK	HEAD AND NECK	NEURO
FOG MACHINE (Mnemonic for Multifocal Lytic Lesions)	Enhancing Orbital Mass	Confluent White Matter Lesions
Sequestrum	Orbital Rim Fracture	Confluent White Matter Lesions in a Child
Periosteal Reaction in an Infant	Cavernous Sinus Mass/Enhancement	Ring-Enhancing Lesions in Brain & Spine
Rugger Jersey Spine	Aggressive Sinus Disease with Bony Destruction	Pineal Region Mass
Sacroiliitis	Unilateral Parotid Mass	Sellar/Suprasellar Mass in a Child
Proximal Arthropathy (MCP Joints)	Bilateral Parotid Enlargement	Posterior Fossa Mass in a Child
Distal Arthropathy (IP Joints)	Orbital Muscle Enlargement	Posterior Fossa Mass in an Adult
Erosive Arthropathy of the Foot	Mucosal Space Mass	Posterior Fossa Cyst
Chondrocalcinosis	Masticator Space Mass	Cerebellopontine Angle (CPA) Mass
Vertebra Plana in a Child	Carotid Space Mass	Cerebellar Tonsillar Herniation
Wormian Bones	Retropharyngeal Mass	Cerebrospinal Fluid (CSF)-Lined Cortical Cleft
Madelung Deformity	Clival Mass	Enhancing Intramedullary Spinal Mass
Lucent Metaphyseal Bands	Vascular Injury to the Neck	Intradural Extramedullary (IDEM) Spinal Mass
Medullary/Chondroid Lesion	Globe Lesion in a Child	Diffuse Temporal Lobe Mass
Acro-Osteolysis	Optic Nerve Enlargement and Enhancement	Increased T2 Signal Intensity in Basal Ganglia/Thalami in a Child
Dense Joint Effusion	Pachymeningeal (Dural) Enhancement	Intraparenchymal Hemorrhage (IPH)
Loose Bodies with Erosions	Middle Ear Mass	Corpus Callosal Lesion
Expansile Rib Lesion in a Child	Temporal Bone Trauma with Mastoid Fluid	Subependymal Nodules
Posterior Element Lytic Lesion	Inner Ear Congenital Malformations	Massive Supratentorial CSF Collection in a Newborn
Carpal Dislocation	Floor of the Mouth Mass	Intraventricular Mass
Periarticular Soft Tissue Calcifications	Aggressive Nasal Mass in an Adolescent	Cerebellar Atrophy
Benign Expansile Lytic Lesion	Cystic Neck Mass	Spinal Cord Signal Abnormalities
Multiple Sclerotic Foci in the Pelvis	Jugular Foramen Mass	Cortically Based Enhancing Neoplasm
Vertebral Body Wedge Fracture	Petrous Apex Lesion	Epidural Spinal Mass
Epiphyseal Equivalent Lucent Lesions	Leptomeningeal Enhancement	Prominent Periventricular/Basal Ganglia Cystic Lesions

Table 3. List of pediatrics, vascular and interventional, and breast key imaging findings.

PEDIATRICS	VASCULAR & INTERVENTIONAL	BREAST
Neonatal Lung Disease with Low Lung Volumes	Post-Intervention Vascular Complication	Breast Implant Defect
Neonatal Lung Disease with Increased Lung Volumes	Carotid Artery Stenosis	Suspicious Enhancement on Breast MRI
Cyanotic Infant with Decreased Pulmonary Blood Flow	Renal Transplant Vascular Complications	Complex Cystic Mass in a Lactating Woman
Cyanotic Infant with Increased Pulmonary Blood Flow	Digital Artery Occlusion/Ischemia	Coarse Calcifications in a Partially Circumscribed Breast Mass
Shunt Vascularity	Subclavian Vein Occlusion	Benign-Appearing Calcifications in the Breast
Solid Pulmonary Mass in Pediatrics	Great Vessel Stenosis	Malignant-Appearing Calcifications (Linear, Branching Forms)
Liver Mass in an Infant	Renal Artery Stenosis	Fatty Breast Lesion
Suprarenal Mass in a Child	Intraparenchymal Renal Artery Aneurysms	Well-Circumscribed Breast Mass in a Young Woman
Renal Mass in a Child	Hypervascular Pulmonary Mass	Unilateral Skin Thickening in the Breast
Cystic Renal Lesion (Pediatrics)	Infrarenal Aortic Occlusion	Axillary Lymphadenopathy
Subglottic Narrowing	Popliteal Artery Occlusion	Mass with Central Lucency
Neonatal Distal Bowel Obstruction	Extratesticular Mass	Well-Circumscribed Solid Breast Mass
Enterocolitis in an Immunocompromised Child	Inferior Vena Cava (IVC) Vascular Anomaly/Abnormality	Well-Circumscribed Cystic Breast Mass
Skeletal Dysplasia	Hypervascular Renal Mass	Ductal Mass
“Double Bubble” Sign	Prominent Paraspinal Flow Voids	Postoperative Changes
Posterior Vertebral Body Scalloping	Suprasellar Mass in an Adult	Bilateral Skin Thickening
Presacral Mass	Hypervascular Intracranial Mass	Breast Lesion in a Man
Long Bone Aggressive Lesion	Aortic Dissection	Well-Circumscribed Breast Cancer
Endobronchial Lesion in a Child	Lower Gastrointestinal (GI) Bleeding	Developing Asymmetry
Generalized Increased Bone Density	Vascular Ring/Sling	Infiltrative Breast Mass
Lytic Skull Lesion in a Child	Urinary Obstruction	Breast Lesion with Nipple Discharge
Avascular Necrosis (AVN) of the Femoral Head in Children	TIPS Dysfunction	Unilateral Nipple/Skin Changes
Vascular Anomaly with Esophageal and Tracheal Compression	Biliary Duct Obstruction	Superficial Breast Lesion
Neonatal Cystic Lung Lesion	Traumatic Aortic Injury (TAI)	Large Breast Mass
Esophageal Obstruction in a Neonate	Celiac Axis Stenosis/Occlusion	Complex Cystic and Solid Breast Mass

Ethical Considerations

This dataset does not contain patient data, clinical records, or identifiable human information. No ethics approval was required.

Data availability

The RAD-CaseBookLLM-08 dataset is openly available via Zenodo¹⁸: https://doi.org/10.5281/zenodo.18625031

The dataset includes:

• Subspecialty folders containing LLM-generated teaching texts in converted PDF, CSV, JSON, and MD formats (verbatim outputs).
• Subspecialty folders containing the same LLM-generated teaching texts in Word (.docx) format with structured “Title 1” styles for navigable headings.

These data are released under the Creative Commons Zero (CC0 1.0 Public Domain Dedication) license, enabling unrestricted reuse, redistribution, and adaptation.

Acknowledgments

The authors thank Dr Mustafa Mohamed and Dr Jacopo Ferrari from CHUV University Hospital for their contributions to the dataset generation, and Dr D.C. Rotzinger for his guidance on the study design. This manuscript was formatted with the assistance of a generative AI tool (ChatGPT, OpenAI), which was used only for language editing and formatting. All ideas, data, analyses, and interpretations are the original work of the authors.

References

1. Raiaan MAK, Mukta MSH, Fatema K, et al.: A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access. 2024; 12: 26839–26874.
2. Naveed H, Khan AU, Qiu S, et al.: A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2025 Oct 31; 16(5): 1–72.
3. Routray SK, Javali A, Sharmila KP, et al.: Large Language Models (LLMs): Hypes and Realities. 2023 International Conference on Computer Science and Emerging Technologies (CSET). Bangalore, India: IEEE; 2023 [cited 2026 Feb 13]; pp. 1–6.
4. Vrdoljak J, Boban Z, Vilović M, et al.: A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare. 2025 Mar 10; 13(6): 603.
5. Shool S, Adimi S, Saboori Amleshi R, et al.: A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025 Mar 7; 25(1): 117.
6. Saliba T, Ferrari J, Pozzessere C, et al.: Can advanced large language models support radiology training? A performance assessment of DeepSeek R1. Eur. J. Radiol. Artif. Intell. 2025 Sep; 3: 100024.
7. Dong B, Bai J, Xu T, et al.: Large Language Models in Education: A Systematic Review. 2024 6th International Conference on Computer Science and Technologies in Education (CSTE). Xi’an, China: IEEE; 2024 [cited 2026 Feb 13]; pp. 131–134.
8. García-Méndez S, De Arriba-Pérez F, Somoza-López MDC: A Review on the Use of Large Language Models as Virtual Tutors. Sci. Educ. 2025 Apr; 34(2): 877–892.
9. European Society of Radiology (ESR): ESR statement on new approaches to undergraduate teaching in Radiology. Insights Imaging. 2019 Dec; 10(1): 109.
10. Kainberger F, Kletter K: Radiologie in einem prägraduellen problembasiert-integrierten Medizincurriculum. RöFo - Fortschritte Auf Dem Geb Röntgenstrahlen Bildgeb Verfahr. 2007 Nov; 179(11): 1137–1144.
11. Zhou S, Lin M, Ding S, et al.: Explainable differential diagnosis with dual-inference large language models. Npj Health Syst. 2025 Apr 24; 2(1): 12.
12. Chen F, Cato K, Gürsoy G, et al.: Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM world. AMIA Annu Symp Proc AMIA Symp. 2024; 2024: 262–270.
13. Manchanda J, Boettcher L, Westphalen M, et al.: The Open Source Advantage in Large Language Models (LLMs). arXiv. 2024 [cited 2026 Feb 13].
14. Mousavi SM, Alghisi S, Riccardi G: DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs. Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics; 2024 [cited 2026 Feb 13]; pp. 8014–8029.
15. Fahrni G, Rotzinger DC: Expanding on “A Hitchhiker’s Guide to Good Prompting Practices for Large Language Models in Radiology.” J. Am. Coll. Radiol. 2025 Nov; 22(11): 1258–1259.
16. Mons B, Neylon C, Velterop J, et al.: Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Inf. Serv. Use. 2017 Feb; 37(1): 49–56.
17. O’Brien WT: Top 3 Differentials in Radiology: A Case Review. New York: Thieme Medical Publishers; 2010.
18. Saliba T, Fahrni G: RAD-CaseBookLLM-08. Zenodo. 2026 [cited 2026 Feb 17].


Open Peer Review

Version 1 (published 02 Mar 2026)

Reviewer Report 23 Mar 2026

Shawn Lyo, Hospital of the University of Pennsylvania, Philadelphia, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.196667.r465016

Summary:
This data note describes the creation of RAD-CaseBookLLM-08, an open-access dataset of large language model–generated radiology differential diagnosis teaching texts. The authors generated 225 cases spanning nine radiology subspecialties using a standardized prompt applied to lesion-based “key imaging findings” derived from a case-based radiology textbook. All outputs were produced using ChatGPT-4o via a web-based interface with conversation memory disabled, and each case was generated in an independent session. The model responses were copied verbatim without editing, correction, or validation, and formatting elements were preserved. The resulting dataset is organized by subspecialty and distributed as Word and PDF documents, with the goal of providing a time-stamped, reproducible collection of LLM-generated educational content for potential use in benchmarking, prompt engineering research, and analysis of machine-generated radiology teaching materials.

Overall:
This manuscript describes the creation of an open-access dataset of LLM-generated radiology differential diagnosis teaching texts using a standardized prompting approach. The emphasis on transparency, standardized generation, and open data release is commendable, and the concept of capturing a time-stamped snapshot of LLM-generated educational content is reasonable given the evolving nature of these models.

However, the practical utility of the dataset is not clearly established. While the authors propose that the dataset may support benchmarking, reproducibility, and educational analysis, these use cases are not concretely demonstrated, and it remains unclear how such a snapshot would be used in practice. In addition, the dataset is generated via a web-based interface without control over key parameters (e.g., temperature, model versioning), which limits the reproducibility and interpretability of the snapshot itself. The dataset is also provided as document-based outputs (Word/PDF) without structured or machine-readable formatting, annotation, or validation, further limiting its usability for downstream research. Additional clarification of intended use cases and minimal dataset characterization would strengthen the contribution.

Introduction:

The authors appropriately describe the increasing role of large language models in medical education and their ability to generate structured, textbook-style content.
The manuscript correctly highlights that radiology education is centered around structured diagnostic reasoning and differential diagnosis frameworks.
The authors note that existing studies often do not make full LLM-generated outputs publicly available, which is a reasonable motivation for dataset creation.
The specific purpose of the dataset remains unclear. While the authors reference reproducibility, benchmarking, and prompt engineering, these applications are not concretely defined.
It is not clear what specific research tasks or evaluations this dataset is intended to support.

Methods:

The dataset is derived from lesion-based “key imaging findings” taken from Top 3 Differentials in Radiology, which is a reasonable and clinically relevant framework.
The dataset includes 225 cases across multiple radiology subspecialties, providing broad coverage of common differential diagnosis scenarios.
The rationale for case selection is unclear. It is not specified how the 25 cases per subspecialty were chosen or whether this represents a complete or selective sampling of topics from the source material.
Several subspecialties (e.g., nuclear medicine, fetal imaging, ultrasound) were excluded despite the fact that similar differential diagnosis frameworks could be applied in these domains.
Dataset generation was performed using the ChatGPT web interface rather than an API-based approach. This significantly limits reproducibility, as key parameters such as temperature, seed, and token limits cannot be controlled or reported.
Only a single output was generated per prompt, with no assessment of variability or reproducibility across repeated runs. This is particularly relevant given the known stochasticity of LLM outputs.
The dataset is distributed in Word and PDF formats only, without a machine-readable structure (e.g., JSON or CSV), which limits its usability for computational analysis or benchmarking.
The dataset represents a single timepoint snapshot of model outputs, but no mechanisms are provided to assess temporal reproducibility or compare outputs across model versions.
No annotation, labeling, or metadata are provided (e.g., structured sections, differential categories, or reasoning components), further limiting downstream applications.
There is no characterization of dataset quality, including accuracy, completeness, internal consistency, or hallucination rates.
While the data generation process is transparent, the methodological choices limit reproducibility, standardization, and downstream usability of the dataset.
The dataset is derived from a structured radiology textbook. A limited comparison between the outputs and the textbook would have helped demonstrate the dataset’s potential utility for benchmarking, educational evaluation, or error analysis.

Is the rationale for creating the dataset(s) clearly described?
Yes

Are the protocols appropriate and is the work technically sound?
Partly

Are sufficient details of methods and materials provided to allow replication by others?
Partly

Are the datasets clearly presented in a useable and accessible format?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Neuroradiology, AI, Deep Learning, Large Language Models, Education

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 10 Apr 2026

Guillaume Fahrni, Department of Diagnostic and Interventional Radiology, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland

We sincerely thank the Reviewer for the thorough, constructive, and insightful evaluation of our manuscript. The comments have helped us meaningfully improve the manuscript. We address each point below.

1. Introduction: "The specific purpose of the dataset remains unclear. While the authors reference reproducibility, benchmarking, and prompt engineering, these applications are not concretely defined."

We thank the reviewer for this comment. We agree that the intended applications of the dataset were insufficiently concrete in the original manuscript. The closing paragraph of the introduction described potential use cases in broad terms without providing specific examples of research tasks, which limited the reader's ability to assess the dataset's practical utility.
We have therefore revised this paragraph to clarify both the primary motivation for creating the dataset and its additional potential applications. Specifically, we now state that the dataset was developed primarily to support a planned comparative study evaluating radiology trainee learning performance when using LLM-generated teaching material versus a reference casebook. We further outline four additional concrete research applications: (1) longitudinal benchmarking against future or alternative LLM outputs using the same prompt set, (2) prompt sensitivity analysis by applying alternative prompting strategies to identical topics, (3) linguistic and structural analysis of how LLMs organize radiology differential diagnosis content, and (4) cross-lingual studies using the dataset as a fixed reference baseline to compare outputs generated in other languages, enabling analysis of translation fidelity and terminological consistency.

2. Introduction: "It is not clear what specific research tasks or evaluations this dataset is intended to support."

This concern is addressed in our response to point 1 above, as both comments relate to the same underlying issue regarding the clarity of intended use cases.

3. Methods: "The rationale for case selection is unclear. It is not specified how the 25 cases per subspecialty were chosen or whether this represents a complete or selective sampling of topics from the source material."

We thank the reviewer for this question. This point was indeed not explicitly stated in the original manuscript. To clarify, the 25 cases per subspecialty were not the result of selective sampling but represent the exhaustive set of lesion-based key imaging findings available in each corresponding section of the source textbook. No cases were excluded within the included subspecialties. We have added a sentence to the Methods section to make this explicit.

4. Methods: "Several subspecialties (e.g., nuclear medicine, fetal imaging, ultrasound) were excluded despite the fact that similar differential diagnosis frameworks could be applied in these domains."

We thank the reviewer for raising this point. We agree that the original manuscript did not provide sufficient justification for the exclusion of these sections. We have revised the relevant paragraph in the Methods section to clarify the rationale for each exclusion.
Specifically, the Roentgen Classics section was excluded as it presents single pathognomonic diagnoses rather than differential diagnosis frameworks, making it incompatible with the dataset's structure. Nuclear medicine, fetal imaging, and ultrasound imaging were excluded as these subspecialties follow more specific and distinct teaching approaches that differ from conventional cross-sectional radiology differential diagnosis frameworks. We acknowledge, as the reviewer notes, that similar prompting approaches could potentially be applied to these domains, and we explicitly state in the revised manuscript that expansion to these additional sections may be considered in future iterations of the dataset.

5. Methods: "Dataset generation was performed using the ChatGPT web interface rather than an API-based approach. This significantly limits reproducibility, as key parameters such as temperature, seed, and token limits cannot be controlled or reported."

We thank the reviewer for this comment. We fully acknowledge that the use of the web-based interface prevents reporting of key generation parameters such as temperature, seed, and token limits, and that this represents a limitation with respect to strict technical reproducibility. We have added a sentence to the Methods section explicitly acknowledging this.
However, we would like to highlight that the choice of the web-based interface was deliberate rather than incidental. While an API-based approach would offer greater parameter control, it would not reflect how these tools are actually used in clinical and educational practice. The vast majority of clinicians and trainees interact with LLMs through consumer-facing web interfaces, and we believe a dataset generated under these conditions is more representative of real-world outputs. The present dataset is therefore intentionally designed as a naturalistic snapshot of LLM-generated content as it would be encountered in practice.
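To illustrate the reporting gap discussed in this point, the following is a minimal sketch of a per-case provenance record in which sampling parameters that the web interface does not expose are stored as explicitly missing rather than omitted. The `GenerationRecord` class and all of its field names are hypothetical and are not part of the published dataset; only the model name, the interface, the generation month, and the disabled memory setting come from the manuscript.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class GenerationRecord:
    """Provenance for one generated case (hypothetical schema)."""
    model: str
    interface: str                       # "web" or "api"
    generated: str                       # month of generation
    memory_disabled: bool
    temperature: Optional[float] = None  # None means "not exposed by the interface"
    seed: Optional[int] = None
    max_tokens: Optional[int] = None

    def unreported(self) -> list[str]:
        """Return the names of sampling parameters that could not be recorded."""
        names = ("temperature", "seed", "max_tokens")
        return [name for name in names if getattr(self, name) is None]


# Values taken from the manuscript; the three sampling parameters stay unknown.
web_record = GenerationRecord(
    model="ChatGPT-4o",
    interface="web",
    generated="2025-03",
    memory_disabled=True,
)

print(json.dumps(asdict(web_record), indent=2))
print("Unreported parameters:", web_record.unreported())
```

Serializing the unknown values as `null` keeps the limitation visible to anyone who later compares this snapshot with API-generated outputs, for which these fields could be filled in.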

6. Methods: "Only a single output was generated per prompt, with no assessment of variability or reproducibility across repeated runs."

We thank the reviewer for this comment. We agree that generating a single output per prompt does not allow assessment of variability across repeated runs, and we acknowledge this as a limitation. We have added a sentence to the Methods section to state this explicitly.
We would, however, note that assessing output variability was not a primary objective of this dataset, which was intentionally designed as a single time-stamped snapshot of LLM-generated content rather than a reproducibility study. Generating multiple outputs per prompt would constitute a different and complementary type of dataset, and we agree this would be a valuable direction for future iterations.
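As a rough illustration of what a variability assessment across repeated runs might look like, the sketch below averages a character-level similarity ratio over all pairs of outputs for one prompt. It assumes the repeated outputs are available as plain strings; the function name `pairwise_similarity` and the placeholder texts are invented for this example, and `difflib.SequenceMatcher` is only one of many possible similarity measures.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def pairwise_similarity(outputs):
    """Return the mean SequenceMatcher ratio over all pairs of outputs.

    A value of 1.0 means every run produced identical text; values near 0.0
    mean the runs share almost no matching character sequences.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


# Placeholder strings standing in for three repeated runs of one prompt.
runs = [
    "placeholder teaching text, run one",
    "placeholder teaching text, run two",
    "placeholder teaching text, run three",
]
print(f"mean pairwise similarity: {pairwise_similarity(runs):.2f}")
```

A surface-level ratio like this captures wording overlap only; whether two runs list the same differential diagnoses would need content-level review.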

7. Methods: "The dataset is distributed in Word and PDF formats only, without a machine-readable structure (e.g., JSON or CSV), which limits its usability for computational analysis or benchmarking."

We thank the reviewer for this comment; this is a great suggestion. In response, we have added three new machine-readable versions of the dataset to the Zenodo repository: JSON, CSV, and Markdown (.md) formats. These formats cover a range of common research use cases, from computational and NLP-based analyses to structured data processing. The Dataset Structure section of the Methods has been updated to reflect this addition.
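A minimal sketch of how case-level records of this kind could be handled in JSON and CSV form is given below. The field names (`subspecialty`, `case_number`, `topic`, `text`) are assumptions based on the metadata mentioned in this response, not the actual schema of the Zenodo files, and the records are placeholders. The completeness check encodes the figures stated in the manuscript: nine subspecialties with 25 cases each, for 225 cases in total.

```python
import csv
import io
import json
from collections import Counter

FIELDS = ["subspecialty", "case_number", "topic", "text"]  # assumed field names

# Placeholder records; real files would hold the verbatim model outputs.
records = [
    {"subspecialty": "subspecialty_a", "case_number": 1,
     "topic": "placeholder finding 1", "text": "placeholder output 1"},
    {"subspecialty": "subspecialty_a", "case_number": 2,
     "topic": "placeholder finding 2", "text": "placeholder output 2"},
    {"subspecialty": "subspecialty_b", "case_number": 1,
     "topic": "placeholder finding 3", "text": "placeholder output 3"},
]


def to_csv(rows):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of dicts (all values become strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


def cases_per_subspecialty(rows):
    """Count how many cases each subspecialty contributes."""
    return Counter(row["subspecialty"] for row in rows)


def is_complete(rows, groups=9, per_group=25):
    """Check the expected shape: nine subspecialties with 25 cases each."""
    counts = cases_per_subspecialty(rows)
    return len(counts) == groups and all(n == per_group for n in counts.values())


loaded = json.loads(json.dumps(records))   # JSON round trip
csv_rows = from_csv(to_csv(loaded))        # CSV round trip
print(cases_per_subspecialty(loaded))
print("complete:", is_complete(loaded))
```

Note that the CSV round trip returns every value as a string, so numeric fields such as the case number need converting back before comparison with the JSON version.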

8. Methods: "The dataset represents a single timepoint snapshot of model outputs, but no mechanisms are provided to assess temporal reproducibility or compare outputs across model versions."

We thank the reviewer for this comment. We agree that LLM outputs are inherently time-sensitive, as models are continuously updated and the same prompts submitted at a later timepoint or to a different model version would likely yield different results. This is precisely why capturing a dated snapshot has value: the present dataset provides a fixed baseline against which future outputs can be compared. We have added a sentence to the Methods section acknowledging this characteristic explicitly.
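One simple way such a baseline comparison could be carried out is a line-level diff between the archived output and a later output for the same prompt. The sketch below uses `difflib.unified_diff` from the Python standard library; the function name `diff_against_baseline` and the sample texts are placeholders introduced here, not material from the dataset.

```python
from difflib import unified_diff


def diff_against_baseline(baseline, later, label="later run"):
    """Return unified-diff lines between the archived text and a later output.

    An empty list means the two texts are identical line for line.
    """
    return list(unified_diff(
        baseline.splitlines(),
        later.splitlines(),
        fromfile="baseline (March 2025)",
        tofile=label,
        lineterm="",
    ))


# Placeholder texts standing in for one archived case and a regenerated one.
archived = "line one\nline two"
regenerated = "line one\nline 2 changed"

for line in diff_against_baseline(archived, regenerated):
    print(line)
```

A diff of this kind shows where wording changed between timepoints; judging whether the changes matter clinically or educationally would still require expert review.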

9. Methods: "No annotation, labeling, or metadata are provided (e.g., structured sections, differential categories, or reasoning components), further limiting downstream applications."

We thank the reviewer for this comment. We would, however, note that annotating and restructuring the outputs would be at odds with the core principle of this dataset, which is to capture verbatim, unmodified LLM-generated content. Introducing manual annotations or category labels would constitute a form of human intervention over the raw outputs, undermining the dataset's value as a naturalistic snapshot. We further note that the addition of JSON and CSV formats provides the maximum degree of machine-readable structure achievable without intervening in the content itself, including case-level metadata such as subspecialty, case number, and topic. Systematic annotation of differential categories and reasoning components across 225 cases would require expert medical validation and constitutes a separate research endeavour beyond the scope of a data note.

10. Methods: "There is no characterization of dataset quality, including accuracy, completeness, internal consistency, or hallucination rates."

We thank the reviewer for this comment. Regarding quality assessment, we would argue that such evaluation is inherently task-dependent and cannot be assessed in the abstract independently of a specific research application. More importantly, this type of validation is precisely the object of a planned follow-up study currently in preparation, as stated in the revised manuscript. Performing this analysis within the present data note would be equivalent to conducting a full benchmarking study, which exceeds the scope of this type of publication.

11. Methods: "The dataset is derived from a structured radiology textbook. A limited comparison between the outputs and the textbook would have helped demonstrate the dataset's potential utility for benchmarking, educational evaluation, or error analysis."

We thank the reviewer for this suggestion. We fully agree that a comparison between the LLM-generated outputs and the source textbook would be highly informative, and we note that this is precisely the primary objective of the planned follow-up study currently in preparation. The present dataset was intentionally structured to enable this comparison, as cases are labeled according to the case numbering system of the source textbook. We therefore respectfully consider this point already addressed by the planned study rather than a limitation of the dataset itself.
Competing Interests: No competing interests were disclosed.
Report a concern
Respond or Comment

COMMENTS ON THIS REPORT

Author Response 10 Apr 2026

Guillaume Fahrni, Department of Diagnostic and Interventional Radiology, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland

10 Apr 2026

Author Response

We sincerely thank the Reviewer for the thorough, constructive, and insightful evaluation of our manuscript. The comments have helped us meaningfully improve the manuscript. We address each point below.

... Continue reading We sincerely thank the Reviewer for the thorough, constructive, and insightful evaluation of our manuscript. The comments have helped us meaningfully improve the manuscript. We address each point below.

1. Introduction: "The specific purpose of the dataset remains unclear. While the authors reference reproducibility, benchmarking, and prompt engineering, these applications are not concretely defined."

We thank the reviewer for this comment. We agree that the intended applications of the dataset were insufficiently concrete in the original manuscript. The closing paragraph of the introduction described potential use cases in broad terms without providing specific examples of research tasks, which limited the reader's ability to assess the dataset's practical utility.
We have therefore revised this paragraph to clarify both the primary motivation for creating the dataset and its additional potential applications. Specifically, we now state that the dataset was developed primarily to support a planned comparative study evaluating radiology trainee learning performance when using LLM-generated teaching material versus a reference casebook. We further outline four additional concrete research applications: (1) longitudinal benchmarking against future or alternative LLM outputs using the same prompt set, (2) prompt sensitivity analysis by applying alternative prompting strategies to identical topics, (3) linguistic and structural analysis of how LLMs organize radiology differential diagnosis content, and (4) cross-lingual studies using the dataset as a fixed reference baseline to compare outputs generated in other languages, enabling analysis of translation fidelity and terminological consistency.

2. Introduction: "It is not clear what specific research tasks or evaluations this dataset is intended to support."

This concern is addressed in our response to point 1 above, as both comments relate to the same underlying issue regarding the clarity of intended use cases.

3. Methods: "The rationale for case selection is unclear. It is not specified how the 25 cases per subspecialty were chosen or whether this represents a complete or selective sampling of topics from the source material."

We thank the reviewer for this question. This point was indeed not explicitly stated in the original manuscript. To clarify, the 25 cases per subspecialty were not the result of selective sampling but represent the exhaustive set of lesion-based key imaging findings available in each corresponding section of the source textbook. No cases were excluded within the included subspecialties. We have added a sentence to the Methods section to make this explicit.

4. Methods: "Several subspecialties (e.g., nuclear medicine, fetal imaging, ultrasound) were excluded despite the fact that similar differential diagnosis frameworks could be applied in these domains."
We thank the reviewer for raising this point. We agree that the original manuscript did not provide sufficient justification for the exclusion of these sections. We have revised the relevant paragraph in the Methods section to clarify the rationale for each exclusion.
Specifically, the Roentgen Classics section was excluded as it presents single pathognomonic diagnoses rather than differential diagnosis frameworks, making it incompatible with the dataset's structure. Nuclear medicine, fetal imaging, and ultrasound imaging were excluded as these subspecialties follow more specific and distinct teaching approaches that differ from conventional cross-sectional radiology differential diagnosis frameworks. We acknowledge, as the reviewer notes, that similar prompting approaches could potentially be applied to these domains, and we explicitly state in the revised manuscript that expansion to these additional sections may be considered in future iterations of the dataset.

5. Methods: "Dataset generation was performed using the ChatGPT web interface rather than an API-based approach. This significantly limits reproducibility, as key parameters such as temperature, seed, and token limits cannot be controlled or reported."

We thank the reviewer for this comment. We fully acknowledge that the use of the web-based interface prevents reporting of key generation parameters such as temperature, seed, and token limits, and that this represents a limitation with respect to strict technical reproducibility. We have added a sentence to the Methods section explicitly acknowledging this.
However, we would like to highlight that the choice of the web-based interface was deliberate rather than incidental. While an API-based approach would offer greater parameter control, it would not reflect how these tools are actually used in clinical and educational practice. The vast majority of clinicians and trainees interact with LLMs through consumer-facing web interfaces, and we believe a dataset generated under these conditions is more representative of real-world outputs. The present dataset is therefore intentionally designed as a naturalistic snapshot of LLM-generated content as it would be encountered in practice.

6. Methods: "Only a single output was generated per prompt, with no assessment of variability or reproducibility across repeated runs."

We thank the reviewer for this comment. We agree that generating a single output per prompt does not allow assessment of variability across repeated runs, and we acknowledge this as a limitation. We have added a sentence to the Methods section to state this explicitly.
We would however note that assessing output variability was not a primary objective of this dataset, which was intentionally designed as a single time-stamped snapshot of LLM-generated content rather than a reproducibility study. Generating multiple outputs per prompt would constitute a different and complementary type of dataset, and we agree this would be a valuable direction for future iterations of this dataset.

7. Methods: "The dataset is distributed in Word and PDF formats only, without a machine-readable structure (e.g., JSON or CSV), which limits its usability for computational analysis or benchmarking."

We thank the reviewer for this comment, this is a great suggestion. In response, we have added three new machine-readable versions of the dataset to the Zenodo repository: JSON, CSV, and Markdown (.md) formats. These formats cover a range of common research use cases, from computational and NLP-based analyses to structured data processing. The Dataset Structure section of the Methods has been updated to reflect this addition.

8. Methods: "The dataset represents a single timepoint snapshot of model outputs, but no mechanisms are provided to assess temporal reproducibility or compare outputs across model versions."

We thank the reviewer for this comment. We agree that LLM outputs are inherently time-sensitive, as models are continuously updated and the same prompts submitted at a later timepoint or to a different model version would likely yield different results. This is precisely why capturing a dated snapshot has value, the present dataset provides a fixed baseline against which future outputs can be compared. We have added a sentence to the Methods section acknowledging this characteristic explicitly.

9. Methods: "No annotation, labeling, or metadata are provided (e.g., structured sections, differential categories, or reasoning components), further limiting downstream applications."

We thank the reviewer for this comment. We would however note that annotating and restructuring the outputs would be at odds with the core principle of this dataset, which is to capture verbatim, unmodified LLM-generated content. Introducing manual annotations or category labels would constitute a form of human intervention over the raw outputs, undermining the dataset's value as a naturalistic snapshot. We further note that the addition of JSON and CSV formats provides the maximum degree of machine-readable structure achievable without intervening in the content itself, including case-level metadata such as subspecialty, case number, and topic. Systematic annotation of differential categories and reasoning components across 225 cases would require expert medical validation and constitutes a separate research endeavour beyond the scope of a data note.

10. Methods: "There is no characterization of dataset quality, including accuracy, completeness, internal consistency, or hallucination rates."

We thank the reviewer for this comment. Regarding quality assessment, we would argue that such evaluation is inherently task-dependent and cannot be assessed in the abstract independently of a specific research application. More importantly, this type of validation is precisely the object of a planned follow-up study currently in preparation, as stated in the revised manuscript. Performing this analysis within the present data note would be equivalent to conducting a full benchmarking study, which exceeds the scope of this type of publication.

11. Methods: "The dataset is derived from a structured radiology textbook. A limited comparison between the outputs and the textbook would have helped demonstrate the dataset's potential utility for benchmarking, educational evaluation, or error analysis."

We thank the reviewer for this suggestion. We fully agree that a comparison between the LLM-generated outputs and the source textbook would be highly informative, and we note that this is precisely the primary objective of the planned follow-up study currently in preparation. The present dataset was intentionally structured to enable this comparison, as cases are labeled according to the case numbering system of the source textbook. We therefore respectfully consider this point already addressed by the planned study rather than a limitation of the dataset itself.
Competing Interests: No competing interests were disclosed.

Reviewer Report 17 Mar 2026

Craig S Webster, The University of Auckland, Auckland, New Zealand

Approved

https://doi.org/10.5256/f1000research.196667.r465013

Title: The title is a noun cluster which makes it hard to understand. Better to try to use some small words to break up the nouns, e.g. “An open-access dataset of radiology differential diagnosis teachings generated with a large-language model” or something similar.

Page 1, Methods: Again noun clusters “thematic key imaging findings” – better to say something like key findings of thematic imaging analysis?

Page 2, top: No patient data were included – this seems counter-intuitive. I was expecting radiographs from actual patients to be part of this, but it is only much later that you explain why this isn’t the case. I think this fact needs to be clearer in the abstract (only needs a few more words).

Page 3, top: “context-aware natural language outputs…” – many would argue that LLMs are not good with context and are aware of nothing – so this seems like poor choice of words here.

Page 3, bottom: We don’t need to know the number of characters – the number of words is fine.

Page 5, top: Are all 25 cases drawn from the textbook you mentioned? Or did you generate variations on each case found in the textbook?

Tables: Please number the rows in your tables to correspond to the 1 to 25 cases. I also think it is confusing that you head the tables “imaging findings” when you did not actually use any radiographs here.

Is the rationale for creating the dataset(s) clearly described?

Yes
Are the protocols appropriate and is the work technically sound?

Yes
Are sufficient details of methods and materials provided to allow replication by others?

Partly
Are the datasets clearly presented in a useable and accessible format?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: AI, medical education, system redesign

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


Version 2

VERSION 2 PUBLISHED 02 Mar 2026

Open Peer Review

Reviewer Status

Reviewer Reports

Invited Reviewers: 1, 2
Version 2 (revision), 10 Apr 2026
Version 1, 02 Mar 2026: reports available from Reviewers 1 and 2

Craig S Webster, The University of Auckland, Auckland, New Zealand
Shawn Lyo, Hospital of the University of Pennsylvania, Philadelphia, USA

Reviewer Report

23 Mar 2026 | for Version 1

Shawn Lyo, Hospital of the University of Pennsylvania, Philadelphia, USA

Approved With Reservations

Introduction:

The authors appropriately describe the increasing role of large language models in medical education and their ability to generate structured, textbook-style content.
The manuscript correctly highlights that radiology education is centered around structured diagnostic reasoning and differential diagnosis frameworks.
The authors note that existing studies often do not make full LLM-generated outputs publicly available, which is a reasonable motivation for dataset creation.
The specific purpose of the dataset remains unclear. While the authors reference reproducibility, benchmarking, and prompt engineering, these applications are not concretely defined.
It is not clear what specific research tasks or evaluations this dataset is intended to support.

Methods:

The dataset is derived from lesion-based “key imaging findings” taken from Top 3 Differentials in Radiology, which is a reasonable and clinically relevant framework.
The dataset includes 225 cases across multiple radiology subspecialties, providing broad coverage of common differential diagnosis scenarios.
The rationale for case selection is unclear. It is not specified how the 25 cases per subspecialty were chosen or whether this represents a complete or selective sampling of topics from the source material.
Several subspecialties (e.g., nuclear medicine, fetal imaging, ultrasound) were excluded despite the fact that similar differential diagnosis frameworks could be applied in these domains.
Dataset generation was performed using the ChatGPT web interface rather than an API-based approach. This significantly limits reproducibility, as key parameters such as temperature, seed, and token limits cannot be controlled or reported.
Only a single output was generated per prompt, with no assessment of variability or reproducibility across repeated runs. This is particularly relevant given the known stochasticity of LLM outputs.
The dataset is distributed in Word and PDF formats only, without a machine-readable structure (e.g., JSON or CSV), which limits its usability for computational analysis or benchmarking.
The dataset represents a single timepoint snapshot of model outputs, but no mechanisms are provided to assess temporal reproducibility or compare outputs across model versions.
No annotation, labeling, or metadata are provided (e.g., structured sections, differential categories, or reasoning components), further limiting downstream applications.
There is no characterization of dataset quality, including accuracy, completeness, internal consistency, or hallucination rates.
While the data generation process is transparent, the methodological choices limit reproducibility, standardization, and downstream usability of the dataset.
The dataset is derived from a structured radiology textbook. A limited comparison between the outputs and the textbook would have helped demonstrate the dataset’s potential utility for benchmarking, educational evaluation, or error analysis.

Is the rationale for creating the dataset(s) clearly described?

Yes
Are the protocols appropriate and is the work technically sound?

Partly
Are sufficient details of methods and materials provided to allow replication by others?

Partly
Are the datasets clearly presented in a useable and accessible format?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Neuroradiology, AI, Deep Learning, Large Language Models, Education

Responses (1)

Author Response

10 Apr 2026

Guillaume Fahrni, Department of Diagnostic and Interventional Radiology, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland

We sincerely thank the Reviewer for the thorough, constructive, and insightful evaluation of our manuscript. The comments have helped us meaningfully improve the manuscript. We address each point below.

1. Introduction: "The specific purpose of the dataset remains unclear. While the authors reference reproducibility, benchmarking, and prompt engineering, these applications are not concretely defined."

We thank the reviewer for this comment. We agree that the intended applications of the dataset were insufficiently concrete in the original manuscript. The closing paragraph of the introduction described potential use cases in broad terms without providing specific examples of research tasks, which limited the reader's ability to assess the dataset's practical utility.
We have therefore revised this paragraph to clarify both the primary motivation for creating the dataset and its additional potential applications. Specifically, we now state that the dataset was developed primarily to support a planned comparative study evaluating radiology trainee learning performance when using LLM-generated teaching material versus a reference casebook. We further outline four additional concrete research applications: (1) longitudinal benchmarking against future or alternative LLM outputs using the same prompt set, (2) prompt sensitivity analysis by applying alternative prompting strategies to identical topics, (3) linguistic and structural analysis of how LLMs organize radiology differential diagnosis content, and (4) cross-lingual studies using the dataset as a fixed reference baseline to compare outputs generated in other languages, enabling analysis of translation fidelity and terminological consistency.

2. Introduction: "It is not clear what specific research tasks or evaluations this dataset is intended to support."

This concern is addressed in our response to point 1 above, as both comments relate to the same underlying issue regarding the clarity of intended use cases.

3. Methods: "The rationale for case selection is unclear. It is not specified how the 25 cases per subspecialty were chosen or whether this represents a complete or selective sampling of topics from the source material."

We thank the reviewer for this question. This point was indeed not explicitly stated in the original manuscript. To clarify, the 25 cases per subspecialty were not the result of selective sampling but represent the exhaustive set of lesion-based key imaging findings available in each corresponding section of the source textbook. No cases were excluded within the included subspecialties. We have added a sentence to the Methods section to make this explicit.

4. Methods: "Several subspecialties (e.g., nuclear medicine, fetal imaging, ultrasound) were excluded despite the fact that similar differential diagnosis frameworks could be applied in these domains."
We thank the reviewer for raising this point. We agree that the original manuscript did not provide sufficient justification for the exclusion of these sections. We have revised the relevant paragraph in the Methods section to clarify the rationale for each exclusion.
Specifically, the Roentgen Classics section was excluded as it presents single pathognomonic diagnoses rather than differential diagnosis frameworks, making it incompatible with the dataset's structure. Nuclear medicine, fetal imaging, and ultrasound imaging were excluded as these subspecialties follow more specific and distinct teaching approaches that differ from conventional cross-sectional radiology differential diagnosis frameworks. We acknowledge, as the reviewer notes, that similar prompting approaches could potentially be applied to these domains, and we explicitly state in the revised manuscript that expansion to these additional sections may be considered in future iterations of the dataset.

5. Methods: "Dataset generation was performed using the ChatGPT web interface rather than an API-based approach. This significantly limits reproducibility, as key parameters such as temperature, seed, and token limits cannot be controlled or reported."

We thank the reviewer for this comment. We fully acknowledge that the use of the web-based interface prevents reporting of key generation parameters such as temperature, seed, and token limits, and that this represents a limitation with respect to strict technical reproducibility. We have added a sentence to the Methods section explicitly acknowledging this.
However, we would like to highlight that the choice of the web-based interface was deliberate rather than incidental. While an API-based approach would offer greater parameter control, it would not reflect how these tools are actually used in clinical and educational practice. The vast majority of clinicians and trainees interact with LLMs through consumer-facing web interfaces, and we believe a dataset generated under these conditions is more representative of real-world outputs. The present dataset is therefore intentionally designed as a naturalistic snapshot of LLM-generated content as it would be encountered in practice.

6. Methods: "Only a single output was generated per prompt, with no assessment of variability or reproducibility across repeated runs."

We thank the reviewer for this comment. We agree that generating a single output per prompt does not allow assessment of variability across repeated runs, and we acknowledge this as a limitation. We have added a sentence to the Methods section to state this explicitly.
We would however note that assessing output variability was not a primary objective of this dataset, which was intentionally designed as a single time-stamped snapshot of LLM-generated content rather than a reproducibility study. Generating multiple outputs per prompt would constitute a different and complementary type of dataset, and we agree this would be a valuable direction for future iterations of this dataset.

7. Methods: "The dataset is distributed in Word and PDF formats only, without a machine-readable structure (e.g., JSON or CSV), which limits its usability for computational analysis or benchmarking."

We thank the reviewer for this comment; this is a great suggestion. In response, we have added three new machine-readable versions of the dataset to the Zenodo repository: JSON, CSV, and Markdown (.md) formats. These formats cover a range of common research use cases, from computational and NLP-based analyses to structured data processing. The Dataset Structure section of the Methods has been updated to reflect this addition.

8. Methods: "The dataset represents a single timepoint snapshot of model outputs, but no mechanisms are provided to assess temporal reproducibility or compare outputs across model versions."

We thank the reviewer for this comment. We agree that LLM outputs are inherently time-sensitive, as models are continuously updated and the same prompts submitted at a later timepoint or to a different model version would likely yield different results. This is precisely why capturing a dated snapshot has value: the present dataset provides a fixed baseline against which future outputs can be compared. We have added a sentence to the Methods section acknowledging this characteristic explicitly.

9. Methods: "No annotation, labeling, or metadata are provided (e.g., structured sections, differential categories, or reasoning components), further limiting downstream applications."

We thank the reviewer for this comment. We would however note that annotating and restructuring the outputs would be at odds with the core principle of this dataset, which is to capture verbatim, unmodified LLM-generated content. Introducing manual annotations or category labels would constitute a form of human intervention over the raw outputs, undermining the dataset's value as a naturalistic snapshot. We further note that the addition of JSON and CSV formats provides the maximum degree of machine-readable structure achievable without intervening in the content itself, including case-level metadata such as subspecialty, case number, and topic. Systematic annotation of differential categories and reasoning components across 225 cases would require expert medical validation and constitutes a separate research endeavour beyond the scope of a data note.

10. Methods: "There is no characterization of dataset quality, including accuracy, completeness, internal consistency, or hallucination rates."

We thank the reviewer for this comment. Regarding quality assessment, we would argue that such evaluation is inherently task-dependent and cannot be assessed in the abstract independently of a specific research application. More importantly, this type of validation is precisely the object of a planned follow-up study currently in preparation, as stated in the revised manuscript. Performing this analysis within the present data note would be equivalent to conducting a full benchmarking study, which exceeds the scope of this type of publication.

11. Methods: "The dataset is derived from a structured radiology textbook. A limited comparison between the outputs and the textbook would have helped demonstrate the dataset's potential utility for benchmarking, educational evaluation, or error analysis."

We thank the reviewer for this suggestion. We fully agree that a comparison between the LLM-generated outputs and the source textbook would be highly informative, and we note that this is precisely the primary objective of the planned follow-up study currently in preparation. The present dataset was intentionally structured to enable this comparison, as cases are labeled according to the case numbering system of the source textbook. We therefore respectfully consider this point already addressed by the planned study rather than a limitation of the dataset itself.
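
As an illustration of the machine-readable formats and the fixed-baseline comparison discussed in points 7 to 9, the following is a minimal sketch of how such records might be loaded and compared with outputs from a later run. It is not taken from the Zenodo repository: the key names (`subspecialty`, `case_number`, `topic`, `text`), the sample records, and the helper functions are all assumptions made for illustration, and the word-level similarity ratio is only one of many possible comparison measures.

```python
import json
from collections import Counter
from difflib import SequenceMatcher

# Assumed key names; the published JSON schema may differ.
REQUIRED_KEYS = {"subspecialty", "case_number", "topic", "text"}

# Placeholder records standing in for the baseline file (not real dataset content).
BASELINE_JSON = """
[
  {"subspecialty": "Neuro", "case_number": 1, "topic": "placeholder finding A",
   "text": "first differential second differential third differential"},
  {"subspecialty": "Neuro", "case_number": 2, "topic": "placeholder finding B",
   "text": "alpha beta gamma delta"},
  {"subspecialty": "Breast", "case_number": 1, "topic": "placeholder finding C",
   "text": "one two three four"}
]
"""


def load_cases(raw_json: str) -> dict:
    """Parse a JSON array of case records, indexed by (subspecialty, case_number)."""
    cases = {}
    for record in json.loads(raw_json):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record is missing keys: {sorted(missing)}")
        cases[(record["subspecialty"], record["case_number"])] = record
    return cases


def word_similarity(baseline_text: str, candidate_text: str) -> float:
    """Return a word-level similarity ratio between 0.0 and 1.0."""
    matcher = SequenceMatcher(None, baseline_text.split(), candidate_text.split())
    return matcher.ratio()


def compare_to_baseline(baseline: dict, rerun_texts: dict) -> dict:
    """Score each rerun text against the baseline case that shares its key."""
    return {
        key: word_similarity(baseline[key]["text"], text)
        for key, text in rerun_texts.items()
        if key in baseline  # keys absent from the baseline are skipped
    }


if __name__ == "__main__":
    baseline = load_cases(BASELINE_JSON)
    # Number of cases per subspecialty in the sample.
    print(Counter(subspecialty for subspecialty, _ in baseline))
    # Hypothetical output from a later model version for one case.
    rerun = {("Neuro", 2): "alpha beta gamma epsilon"}
    print(compare_to_baseline(baseline, rerun))
```

Indexing by subspecialty and case number relies only on the case-level metadata the response describes, so the raw text is left untouched. A score near 1.0 would indicate that a later output closely matches the baseline wording, but it says nothing about clinical accuracy.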

Competing Interests

No competing interests were disclosed.


Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, or sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

1. Raiaan MAK, Mukta MSH, Fatema K, et al.: A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access. 2024; 12: 26839–26874.

2. Naveed H, Khan AU, Qiu S, et al.: A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2025 Oct 31; 16(5): 1–72.

3. Routray SK, Javali A, Sharmila KP, et al.: Large Language Models (LLMs): Hypes and Realities. 2023 International Conference on Computer Science and Emerging Technologies (CSET). Bangalore, India: IEEE; 2023 [cited 2026 Feb 13]; pp. 1–6.

4. Vrdoljak J, Boban Z, Vilović M, et al.: A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare. 2025 Mar 10; 13(6): 603.

5. Shool S, Adimi S, Saboori Amleshi R, et al.: A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025 Mar 7; 25(1): 117.

6. Saliba T, Ferrari J, Pozzessere C, et al.: Can advanced large language models support radiology training? A performance assessment of DeepSeek R1. Eur. J. Radiol. Artif. Intell. 2025 Sep; 3: 100024.

7. Dong B, Bai J, Xu T, et al.: Large Language Models in Education: A Systematic Review. 2024 6th International Conference on Computer Science and Technologies in Education (CSTE). Xi’an, China: IEEE; 2024 [cited 2026 Feb 13]; pp. 131–134.

8. García-Méndez S, De Arriba-Pérez F, Somoza-López MDC: A Review on the Use of Large Language Models as Virtual Tutors. Sci. Educ. 2025 Apr; 34(2): 877–892.

9. European Society of Radiology (ESR): ESR statement on new approaches to undergraduate teaching in Radiology. Insights Imaging. 2019 Dec; 10(1): 109.

10. Kainberger F, Kletter K: Radiologie in einem prägraduellen problembasiert-integrierten Medizincurriculum. RöFo - Fortschritte Auf Dem Geb Röntgenstrahlen Bildgeb Verfahr. 2007 Nov; 179(11): 1137–1144.

11. Zhou S, Lin M, Ding S, et al.: Explainable differential diagnosis with dual-inference large language models. Npj Health Syst. 2025 Apr 24; 2(1): 12.

12. Chen F, Cato K, Gürsoy G, et al.: Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM world. AMIA Annu Symp Proc AMIA Symp. 2024; 2024: 262–270.

13. Manchanda J, Boettcher L, Westphalen M, et al.: The Open Source Advantage in Large Language Models (LLMs). arXiv. 2024 [cited 2026 Feb 13].

14. Mousavi SM, Alghisi S, Riccardi G: DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs. Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics; 2024 [cited 2026 Feb 13]; pp. 8014–8029.

15. Fahrni G, Rotzinger DC: Expanding on “A Hitchhiker’s Guide to Good Prompting Practices for Large Language Models in Radiology.” J. Am. Coll. Radiol. 2025 Nov; 22(11): 1258–1259.

16. Mons B, Neylon C, Velterop J, et al.: Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Inf. Serv. Use. 2017 Feb; 37(1): 49–56.

17. O’Brien WT: Top 3 Differentials in Radiology: A Case Review. New York: Thieme Medical Publishers; 2010.

18. Saliba T, Fahrni G: RAD-CaseBookLLM-08. Zenodo. 2026 [cited 2026 Feb 17].
