Keywords
Radiology education; Large language models; ChatGPT; Medical artificial intelligence; Differential diagnosis; Open dataset; Medical education research
This article is included in the Artificial Intelligence and Machine Learning gateway.
This article is included in the AI in Medicine and Healthcare collection.
This article is included in the Data: Use and Reuse collection.
Large language models are increasingly explored in medical education, particularly for generating structured explanatory content. However, openly accessible datasets capturing full-length model outputs in a standardized and reusable format remain limited. In radiology education, differential diagnosis teaching is typically organized around key imaging findings integrated with clinical reasoning. We developed RAD-CaseBookLLM-08, an open dataset of large language model–generated radiology differential diagnosis teachings derived from lesion-based thematic topics.
The dataset comprises 225 cases across nine radiology subspecialties. Thematic key imaging findings were derived from an established case-based radiology textbook and used as structured prompts. All cases were generated using ChatGPT-4o (OpenAI) in March 2025 via a web-based interface with conversation memory disabled. Each topic was processed in an independent session using an identical prompt template in which only the subspecialty and imaging finding were modified. Outputs were copied verbatim without editing, correction, or validation, and formatting elements were preserved. The dataset is provided in Microsoft Word and Portable Document Format files and is organized by subspecialty with sequential case labeling. No patient data were included.
RAD-CaseBookLLM-08 provides a structured, time-stamped collection of large language model–generated radiology teaching texts. The dataset may support reproducibility studies, benchmarking of model outputs, prompt engineering evaluation, and analysis of educational structure in machine-generated differential diagnoses. It is openly available under a Creative Commons Zero license via Zenodo.
The Introduction was updated to clarify the primary motivation for creating the dataset and to provide concrete examples of intended research applications, including a planned comparative educational study, longitudinal benchmarking, prompt sensitivity analysis, linguistic and structural analysis, and cross-lingual research. The Methods section was revised to explicitly state that the 25 cases per subspecialty represent the exhaustive set of lesion-based key imaging findings available in each corresponding section of the source textbook, with no selective sampling performed. The rationale for the exclusion of the nuclear medicine, fetal imaging, ultrasound imaging, and Roentgen Classics sections was expanded. A sentence was added acknowledging, as an inherent limitation of the web-based interface, that key generation parameters such as temperature and seed could not be controlled or reported. Output variability across repeated runs was explicitly acknowledged as a limitation. A sentence was added noting that model outputs reflect the behaviour of ChatGPT-4o as of March 2025 and may differ with future model updates. Finally, the set of converted dataset formats was expanded to include three additional formats: JSON, CSV, and Markdown.
Large language models (LLMs) have recently emerged as powerful tools capable of generating coherent, structured, and context-aware natural language outputs.1–3 Their rapid integration into medical domains has prompted increasing interest in their potential roles in clinical reasoning support, decision-making assistance, and medical education.4–6 In particular, generative models now have the potential to produce structured explanatory content that resembles textbook-style teaching material.7,8
Radiology education relies heavily on structured diagnostic reasoning. A central pedagogical component is the formulation of differential diagnoses based on key imaging findings integrated with clinical context.9,10 This lesion-based or pattern-based approach is widely used in radiology casebooks and board examination preparation materials. Trainees are typically exposed to thematic imaging findings (e.g., a cavitary pulmonary mass or distal interphalangeal arthropathy) and are expected to develop a prioritized differential diagnosis, recognize distinguishing imaging characteristics, and understand the reasoning leading to the final diagnosis.
While LLMs have demonstrated the ability to generate medical explanations and answer clinical questions, the reproducibility, structure, and educational consistency of LLM-generated differential diagnosis teachings remain insufficiently documented in openly accessible datasets.11 Existing studies often report performance metrics or qualitative assessments, but the underlying generated texts are rarely made publicly available in a structured and reusable format. This limits transparency, benchmarking across model versions, evaluation of prompt sensitivity, and methodological reproducibility.12,13
Open datasets documenting LLM-generated medical content are particularly important for several reasons. First, LLM outputs are inherently time-sensitive: model updates and parameter adjustments can alter responses over time.14 Capturing outputs at a defined timepoint enables longitudinal comparison and benchmarking. Second, prompt design significantly influences output structure and reasoning pathways.15 Publicly sharing prompt iterations enhances reproducibility and allows independent investigation of prompt engineering strategies. Third, openly available datasets support FAIR principles (Findable, Accessible, Interoperable, Reusable) and facilitate secondary analyses, including linguistic evaluation, hallucination detection research, educational structure assessment, and computational benchmarking.16
To contribute to ongoing efforts toward transparency and reproducibility in medical LLM research, we created RAD-CaseBookLLM-08, a structured dataset of LLM-generated radiology differential diagnosis teachings derived from thematic key imaging findings. The dataset was generated using a standardized prompting protocol applied systematically across multiple radiology subspecialties.
While RAD-CaseBookLLM-08 is not intended as a primary teaching resource, it was primarily developed to support a planned study evaluating learning performance of radiology trainees exposed to LLM-generated differential diagnosis teachings compared to a reference casebook. Beyond this application, the dataset may support additional research tasks. First, it provides a fixed, dated corpus enabling longitudinal benchmarking: the same 225 prompts can be resubmitted to future or alternative models (e.g., GPT-5, open-source LLMs) and outputs compared systematically against this baseline. Second, the standardized prompt structure allows prompt sensitivity analyses, in which alternative prompting strategies applied to identical topics can be compared against the present outputs. Third, the dataset constitutes a naturalistic corpus of LLM-generated medical text suitable for linguistic and structural analysis: examining how a large language model organizes differential diagnosis reasoning, structures pedagogical content, and varies output. Fourth, the dataset may serve as a reference corpus for cross-lingual studies, as the fixed prompt structure and standardized topic set provide a reproducible baseline against which LLM-generated outputs in other languages could be systematically compared, enabling analysis of translation fidelity and terminological consistency.
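As an illustration of the longitudinal benchmarking use case, the following minimal sketch compares a baseline output with a rerun of the same prompt using a word-level similarity ratio. It is not part of the dataset or of the authors' workflow: the function names `similarity` and `compare_runs` and the dictionary layout (case label mapped to output text) are assumptions made for this example, and a surface similarity score says nothing about clinical correctness.

```python
from difflib import SequenceMatcher


def similarity(baseline: str, candidate: str) -> float:
    """Return a 0..1 word-level similarity ratio between two texts."""
    # Compare lowercase word sequences so that whitespace and case
    # differences do not dominate the score.
    a = baseline.lower().split()
    b = candidate.lower().split()
    return SequenceMatcher(None, a, b).ratio()


def compare_runs(baseline: dict, rerun: dict) -> dict:
    """Score every case present in both corpora, keyed by case label.

    Both arguments map a case label (hypothetical layout) to the full
    model output for that case.
    """
    shared = sorted(baseline.keys() & rerun.keys())
    return {case: similarity(baseline[case], rerun[case]) for case in shared}
```

A ratio close to 1 indicates nearly identical wording, whereas low values flag cases whose outputs diverged and may merit manual review.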
Thematic radiological key imaging findings were derived from the case-based structure of the radiology textbook Top 3 Differentials in Radiology: A Case Review (O’Brien, 2010).17 The source textbook presents radiological cases organized around a central imaging finding, followed by a structured differential diagnosis discussion and final diagnosis. For the purpose of this dataset, only the lesion-based thematic topics, referred to in the book as “Key Imaging Findings” (e.g., “Pharyngeal mucosal mass”), were used as input for the LLM. No textbook images, figure reproductions, or verbatim text excerpts were included in the dataset or used as input for the LLM.
The following subspecialties were included, each comprising 25 cases: chest and cardiac imaging, gastrointestinal imaging, genitourinary imaging, musculoskeletal imaging, head and neck imaging, brain and spine imaging, pediatric imaging, breast imaging, and vascular and interventional radiology. The 25 cases per subspecialty represent the complete set of lesion-based key imaging findings available in each corresponding section of the source textbook; no selective sampling was performed. This resulted in a total of nine subspecialty sections and 225 cases overall (9 × 25). The complete dataset is compiled into a single PDF document comprising 360 pages, 66,874 words, and 502,964 characters.
The sections dedicated to nuclear medicine, fetal imaging, ultrasound imaging, and historical ‘Roentgen Classics’ were not included. The Roentgen Classics section was excluded as it presents single pathognomonic diagnoses rather than differential diagnosis frameworks. Nuclear medicine, fetal imaging, and ultrasound imaging were excluded as these subspecialties follow more specific and distinct teaching approaches. While similar prompting approaches could potentially be applied to these domains, the present dataset was intentionally scoped to the most conventional cross-sectional radiology subspecialties. Expansion to these additional sections may be considered in future iterations of the dataset.
Dataset generation was performed using the following environment:
• Model: ChatGPT-4o
• Provider: OpenAI
• Interface: Web-based interface
• Model access date: March 2025
• Conversation memory: Disabled
Each thematic topic was processed in an independent chat session. No conversation history was reused across topics.
To reduce potential personalization or adaptation effects related to prior interactions, a newly created user account was used exclusively for dataset generation. This measure was implemented to minimize contextual carryover and to improve output independence across cases.
No external plugins, browsing tools, or additional system instructions were activated during generation. The web-based interface was deliberately chosen to reflect real-world usage conditions, as this represents the mode of interaction most commonly adopted by clinicians and trainees in practice.
Prompt engineering was conducted iteratively through internal testing prior to final dataset generation. The objective was to obtain outputs that were structurally consistent, educational in tone, organized by differential diagnosis categories, explicit in diagnostic reasoning, and reproducible across thematic topics.
Multiple candidate prompts were tested and refined. Because complex prompts resulted in variable outputs, the following simple yet precise final prompt, which provided the best results, was retained:
“I am a radiology resident preparing for my final radiology exam. Please provide a concise radiological summary, from an exam-oriented perspective, of the following:
Specialty: [[subspecialty name (e.g., Musculoskeletal)]]
Topic: [[Key Imaging Finding (e.g., Sequestrum)]]”
In this final prompt, only the subspecialty name and Key Imaging Finding were manually updated to correspond to each processed case; the rest of the prompt was left untouched. All prompts were written in English. After the final prompt was chosen, each answer was collected from a single run: no prompt was submitted more than once. Output variability across repeated runs was therefore not assessed, which represents a limitation of the present dataset.
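For readers who wish to resubmit the same prompts programmatically, the template can be expressed as a small helper. This is a sketch only: the prompts in this dataset were entered manually in the web interface, and the names `PROMPT_TEMPLATE` and `build_prompt`, as well as the exact line breaks between the three parts of the prompt, are assumptions made for this example.

```python
# Fixed wording of the final prompt; only the two placeholders change per case.
PROMPT_TEMPLATE = (
    "I am a radiology resident preparing for my final radiology exam. "
    "Please provide a concise radiological summary, from an exam-oriented "
    "perspective, of the following:\n"
    "Specialty: {subspecialty}\n"
    "Topic: {finding}"
)


def build_prompt(subspecialty: str, finding: str) -> str:
    """Fill the template with one subspecialty and one key imaging finding."""
    return PROMPT_TEMPLATE.format(
        subspecialty=subspecialty.strip(),
        finding=finding.strip(),
    )


if __name__ == "__main__":
    # Example values taken from the prompt description above.
    print(build_prompt("Musculoskeletal", "Sequestrum"))
```

Keeping the fixed wording in a single constant makes it easier to verify that only the subspecialty and the key imaging finding differ between cases.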
For each thematic key imaging finding, the following standardized procedure was applied:
1. A new chat session was initiated in the web interface.
2. The finalized structured prompt was entered, specifying the subspecialty and thematic topic.
3. The complete model output was copied verbatim into a Word document.
4. The case number was manually added at the top of the output.
5. Original formatting (including headings, bold text, bullet points, and spacing) was preserved.
6. No editorial modification, correction, summarization, or medical validation was performed.
Interactive or conversational concluding phrases generated by the model (e.g., “Would you like more details on …”) were intentionally retained to preserve authenticity of the output and maintain fidelity to the original generation context.
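For secondary analyses that focus on the teaching content alone, such closing phrases can be separated from the body at analysis time while leaving the dataset itself untouched. The sketch below is illustrative only: the function name `split_closing_phrase` and the pattern list are assumptions, and only the first pattern corresponds to the example quoted above.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for conversational closing lines. Only
# "Would you like ..." is documented in the text; the second pattern
# is an assumed variant and should be adapted after inspecting the data.
CLOSING_PATTERNS = [
    re.compile(r"^would you like\b", re.IGNORECASE),
    re.compile(r"^let me know if\b", re.IGNORECASE),
]


def split_closing_phrase(output: str) -> Tuple[str, Optional[str]]:
    """Return (body, closing) where closing is the final line if it
    matches one of the patterns, and None otherwise."""
    lines = output.rstrip().splitlines()
    if not lines:
        return output, None
    last = lines[-1].strip()
    if any(pattern.search(last) for pattern in CLOSING_PATTERNS):
        body = "\n".join(lines[:-1]).rstrip()
        return body, last
    return output, None
```

Because the function returns the closing phrase instead of discarding it, the proportion of outputs ending with a conversational prompt can itself be analysed.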
The dataset therefore represents unaltered LLM-generated content captured at a defined timepoint. It should be noted that, given the continuous evolution of LLMs, the outputs reflect the behaviour of ChatGPT-4o at a specific point in time and should be interpreted accordingly, as model updates may produce different responses to identical prompts in the future.
The RAD-CaseBookLLM-08 dataset is organized by radiology subspecialty.
For each subspecialty:
• One master document contains the complete list of LLM-generated teachings (n = 25 cases per subspecialty) corresponding to all thematic key findings within that section.
• Cases are structured sequentially and labeled according to the case numbering system of the source textbook to enable future comparative or benchmarking studies.
• Each case heading in the Word (.docx) version is formatted using the “Title 1” style to allow structured navigation via document navigation panels.
Five file formats are provided:
• Microsoft Word (.docx) format (original format)
• Converted PDF format
• Converted MD format
• Converted CSV format
• Converted JSON format
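As a starting point for computational reuse, a master document in Markdown format could be split into individual cases as sketched below. The internal layout of the converted files is not described here, so the heading pattern (a line beginning with optional "#" characters followed by "Case" and a number) and the function name `split_cases` are assumptions that should be checked against the actual files before use.

```python
import re

# Assumed heading layout, e.g. "# Case 12" or "Case 12"; verify against
# the real files, since the exact case label format is not documented.
CASE_HEADING = re.compile(
    r"^#*[ \t]*Case[ \t]+(\d+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_cases(markdown_text: str) -> dict:
    """Map each case number to the text between its heading and the next."""
    matches = list(CASE_HEADING.finditer(markdown_text))
    cases = {}
    for i, match in enumerate(matches):
        # A case ends where the next heading starts, or at end of file.
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(markdown_text)
        cases[int(match.group(1))] = markdown_text[match.end():end].strip()
    return cases
```

If the assumed pattern holds, each subspecialty file should yield 25 entries, which offers a quick integrity check after download.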
A summary overview of the dataset, with a list of key imaging findings per subspecialty, is provided in Tables 1–3.
This dataset does not contain patient data, clinical records, or identifiable human information. No ethics approval was required.
The RAD-CaseBookLLM-08 dataset is openly available via Zenodo18: doi.org/10.5281/zenodo.18625031
The dataset includes:
• Subspecialty folders containing LLM-generated teaching texts in converted PDF, CSV, JSON and MD format (verbatim outputs).
• Subspecialty folders containing the same LLM-generated teaching texts in Word (.docx) format with structured “Title 1” styles for navigable headings.
These data are released under the Creative Commons Zero (CC0 1.0 Public Domain Dedication) license, enabling unrestricted reuse, redistribution, and adaptation.
The authors thank Dr Mustafa Mohamed and Dr Jacopo Ferrari from CHUV University Hospital for their contributions to the dataset generation, and Dr D.C. Rotzinger for his guidance on the study design. This manuscript was formatted with the assistance of a generative AI tool (ChatGPT, OpenAI), which was used only for language editing and formatting. All ideas, data, analyses, and interpretations are the original work of the authors.
Version history: Version 1, 02 Mar 26; Version 2 (revision), 10 Apr 26.