Keywords
Radiology education; Large language models; ChatGPT; Medical artificial intelligence; Differential diagnosis; Open dataset; Medical education research
This article is included in the Artificial Intelligence and Machine Learning gateway.
This article is included in the Data: Use and Reuse collection.
This article is included in the AI in Medicine and Healthcare collection.
Large language models are increasingly explored in medical education, particularly for generating structured explanatory content. However, openly accessible datasets capturing full-length model outputs in a standardized and reusable format remain limited. In radiology education, differential diagnosis teaching is typically organized around key imaging findings integrated with clinical reasoning. We developed RAD-CaseBookLLM-08, an open dataset of large language model–generated radiology differential diagnosis teachings derived from lesion-based thematic topics.
The dataset comprises 225 cases across nine radiology subspecialties. Thematic key imaging findings were derived from an established case-based radiology textbook and used as structured prompts. All cases were generated using ChatGPT-4o (OpenAI) in March 2025 via a web-based interface with conversation memory disabled. Each topic was processed in an independent session using an identical prompt template in which only the subspecialty and imaging finding were modified. Outputs were copied verbatim without editing, correction, or validation, and formatting elements were preserved. The dataset is provided in Microsoft Word and Portable Document Format files and is organized by subspecialty with sequential case labeling. No patient data were included.
RAD-CaseBookLLM-08 provides a structured, time-stamped collection of large language model–generated radiology teaching texts. The dataset may support reproducibility studies, benchmarking of model outputs, prompt engineering evaluation, and analysis of educational structure in machine-generated differential diagnoses. It is openly available under a Creative Commons Zero license via Zenodo.
Large language models (LLMs) have recently emerged as powerful tools capable of generating coherent, structured, and context-aware natural language outputs.1–3 Their rapid integration into medical domains has prompted increasing interest in their potential roles in clinical reasoning support, decision-making assistance, and medical education.4–6 In particular, generative models can now produce structured explanatory content that resembles textbook-style teaching material.7,8
Radiology education relies heavily on structured diagnostic reasoning. A central pedagogical component is the formulation of differential diagnoses based on key imaging findings integrated with clinical context.9,10 This lesion-based or pattern-based approach is widely used in radiology casebooks and board examination preparation materials. Trainees are typically exposed to thematic imaging findings (e.g., a cavitary pulmonary mass or distal interphalangeal arthropathy) and are expected to develop a prioritized differential diagnosis, recognize distinguishing imaging characteristics, and understand the reasoning leading to the final diagnosis.
While LLMs have demonstrated the ability to generate medical explanations and answer clinical questions, the reproducibility, structure, and educational consistency of LLM-generated differential diagnosis teachings remain insufficiently documented in openly accessible datasets.11 Existing studies often report performance metrics or qualitative assessments, but the underlying generated texts are rarely made publicly available in a structured and reusable format. This limits transparency, benchmarking across model versions, evaluation of prompt sensitivity, and methodological reproducibility.12,13
Open datasets documenting LLM-generated medical content are particularly important for several reasons. First, LLM outputs are inherently time-sensitive: model updates and parameter adjustments can alter responses over time.14 Capturing outputs at a defined timepoint enables longitudinal comparison and benchmarking. Second, prompt design significantly influences output structure and reasoning pathways.15 Publicly sharing prompt iterations enhances reproducibility and allows independent investigation of prompt engineering strategies. Third, openly available datasets support FAIR principles (Findable, Accessible, Interoperable, Reusable) and facilitate secondary analyses, including linguistic evaluation, hallucination detection research, educational structure assessment, and computational benchmarking.16
To contribute to ongoing efforts toward transparency and reproducibility in medical LLM research, we created RAD-CaseBookLLM-08, a structured dataset of LLM-generated radiology differential diagnosis teachings derived from thematic key imaging findings. The dataset was generated using a standardized prompting protocol applied systematically across multiple radiology subspecialties.
While RAD-CaseBookLLM-08 is not intended as a primary teaching resource, it provides a structured dataset suitable for research applications. The dataset can be used to study the characteristics of LLM-generated educational text, compare such outputs with conventional radiology teaching materials, investigate prompt engineering strategies, and analyze the organization, clarity, and pedagogical value of machine-generated differential diagnoses. By making the dataset openly available, we aim to support reproducibility, benchmarking, and further exploration of AI-assisted medical education.
Thematic radiological key imaging findings were derived from the case-based structure of the radiology textbook Top 3 Differentials in Radiology: A Case Review (O’Brien, 2010).17 The source textbook presents radiological cases organized around a central imaging finding, followed by a structured differential diagnosis discussion and final diagnosis. For the purpose of this dataset, only the lesion-based thematic topics, referred to in the book as “Key Imaging Findings” (e.g., “Pharyngeal mucosal mass”), were used as input for the LLM. No textbook images, figure reproductions, or verbatim text excerpts were included in the dataset or provided as input to the LLM.
The following subspecialties were included, each comprising 25 cases: chest and cardiac imaging, gastrointestinal imaging, genitourinary imaging, musculoskeletal imaging, head and neck imaging, brain and spine imaging, pediatric imaging, breast imaging, and vascular and interventional radiology. This yields nine subspecialty sections and 225 cases overall. The complete dataset is compiled into a single PDF document comprising 360 pages, 66,874 words, and 502,964 characters.
The sections dedicated to nuclear medicine, fetal imaging, ultrasound imaging, and historical “Roentgen Classics” were excluded. These exclusions were made to maintain consistency with lesion-based cross-sectional radiological differential diagnosis teaching and to focus on subspecialties most commonly represented in structured diagnostic reasoning frameworks.
Dataset generation was performed using the following environment:
• Model: ChatGPT-4o
• Provider: OpenAI
• Interface: Web-based interface
• Model access date: March 2025
• Conversation memory: Disabled
Each thematic topic was processed in an independent chat session. No conversation history was reused across topics.
To reduce potential personalization or adaptation effects related to prior interactions, a newly created user account was used exclusively for dataset generation. This measure was implemented to minimize contextual carryover and to improve output independence across cases.
No external plugins, browsing tools, or additional system instructions were activated during generation.
Prompt engineering was conducted iteratively through internal testing prior to final dataset generation. The objective was to obtain outputs that were structurally consistent, educational in tone, organized by differential diagnosis categories, explicit in diagnostic reasoning, and reproducible across thematic topics.
Multiple candidate prompts were tested and refined. Because complex prompts resulted in variable outputs, the following simple yet precise final prompt, which provided the best results, was retained:
“I am a radiology resident preparing for my final radiology exam. Please provide a concise radiological summary, from an exam-oriented perspective, of the following:
Specialty: [[subspecialty name (e.g., Musculoskeletal)]]
Topic: [[Key Imaging Finding (e.g., Sequestrum)]]”
In this final prompt, only the subspecialty name and Key Imaging Finding were manually updated to correspond to each processed case; the rest of the prompt was left unchanged. All prompts were written in English. After the final prompt was chosen, all outputs were collected in a single generation pass: each prompt was submitted once, with no repeated attempts.
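The template substitution described above can be sketched in a few lines of Python. This is an illustrative sketch only: the outputs in the dataset were produced by typing the prompt into the web interface, and the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical. It also assumes that the double-bracket placeholders were replaced by the plain subspecialty and finding text.

```python
# Fixed wording of the final prompt, copied from the text above.
# The two {placeholders} stand in for the [[...]] fields (assumed to be
# replaced by plain text, without the brackets).
PROMPT_TEMPLATE = (
    "I am a radiology resident preparing for my final radiology exam. "
    "Please provide a concise radiological summary, from an exam-oriented "
    "perspective, of the following:\n"
    "Specialty: {subspecialty}\n"
    "Topic: {finding}"
)


def build_prompt(subspecialty: str, finding: str) -> str:
    """Fill the fixed template; only the two case-specific fields vary."""
    subspecialty = subspecialty.strip()
    finding = finding.strip()
    if not subspecialty or not finding:
        raise ValueError("Both subspecialty and finding must be non-empty.")
    return PROMPT_TEMPLATE.format(subspecialty=subspecialty, finding=finding)


if __name__ == "__main__":
    # Example values taken from the prompt description above.
    print(build_prompt("Musculoskeletal", "Sequestrum"))
```

Keeping the fixed wording in a single constant makes it easy to verify that only the two case-specific fields differ between prompts.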
For each thematic key imaging finding, the following standardized procedure was applied:
1. A new chat session was initiated in the web interface.
2. The finalized structured prompt was entered, specifying the subspecialty and thematic topic.
3. The complete model output was copied verbatim into a Microsoft Word document.
4. The case number was manually added at the top of the output.
5. Original formatting (including headings, bold text, bullet points, and spacing) was preserved.
6. No editorial modification, correction, summarization, or medical validation was performed.
Interactive or conversational concluding phrases generated by the model (e.g., “Would you like more details on …”) were intentionally retained to preserve authenticity of the output and maintain fidelity to the original generation context.
The dataset therefore represents unaltered LLM-generated content captured at a defined timepoint.
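Because conversational closing phrases were retained, secondary analyses may want to separate them from the teaching text. The following Python sketch shows one possible way to do so. It is not part of the dataset protocol: the function name `split_follow_up` is hypothetical, only the opener "Would you like" is attested in the text above (the other patterns are assumptions), and the sketch assumes that the closing phrase forms the last paragraph, separated from the rest by a blank line.

```python
import re
from typing import Tuple

# Openers of conversational closing phrases. Only "Would you like" appears
# in the article; the other two are assumed for illustration.
FOLLOW_UP_PATTERN = re.compile(
    r"^\s*(would you like|let me know if|do you want)\b", re.IGNORECASE
)


def split_follow_up(output: str) -> Tuple[str, str]:
    """Return (teaching_text, follow_up).

    follow_up is the final paragraph when it starts with a known opener,
    otherwise an empty string. The teaching text is left unmodified apart
    from stripping surrounding whitespace.
    """
    text = output.strip()
    head, sep, tail = text.rpartition("\n\n")
    if sep and FOLLOW_UP_PATTERN.match(tail):
        return head.rstrip(), tail.strip()
    return text, ""


if __name__ == "__main__":
    sample = "Key points\n- Finding A\n\nWould you like more details on imaging?"
    body, follow_up = split_follow_up(sample)
    print(repr(body))
    print(repr(follow_up))
```

Any such filtering should be applied to a working copy only, so that the verbatim outputs remain available for comparison.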
The RAD-CaseBookLLM-08 dataset is organized by radiology subspecialty.
For each subspecialty:
• One master document contains the complete list of LLM-generated teachings (n = 25 cases per specialty) corresponding to all thematic key findings within that section.
• Cases are structured sequentially and labeled according to the case numbering system of the source textbook to enable future comparative or benchmarking studies.
• Each case heading in the Word (.docx) version is formatted using the “Title 1” style to allow structured navigation via document navigation panels.
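The heading-based structure of the Word files also lends itself to programmatic splitting into individual cases. The sketch below uses only the Python standard library and reads the `word/document.xml` part of a .docx archive. It is a minimal sketch under stated assumptions: the function names are hypothetical, and the internal style identifier of the "Title 1" style is not stated in this article, so `heading_style_id` defaults to `"Heading1"` as a guess that should be checked against `word/styles.xml` in the actual files.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of WordprocessingML elements (w:p, w:pStyle, w:t, ...).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def split_cases(document_xml, heading_style_id="Heading1"):
    """Group paragraph texts under each case heading.

    heading_style_id is an assumed style identifier; adjust it to the
    value used for case headings in the real documents.
    """
    root = ET.fromstring(document_xml)
    cases = []
    for para in root.iter(f"{{{W_NS}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if style_id == heading_style_id:
            # A heading paragraph starts a new case.
            cases.append({"heading": text, "paragraphs": []})
        elif cases and text.strip():
            # Non-empty body paragraphs belong to the most recent case.
            cases[-1]["paragraphs"].append(text)
    return cases
```

A call such as `split_cases(read_document_xml("chest.docx"))` (the file name is a placeholder) would return one dictionary per case. Text that precedes the first heading is ignored in this sketch, and list or bold formatting is not preserved.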
Two file formats are provided: Portable Document Format (PDF) and Microsoft Word (.docx).
A summary dataset overview with a list of key imaging findings per specialty is provided in Tables 1–3.
This dataset does not contain patient data, clinical records, or identifiable human information. No ethics approval was required.
The RAD-CaseBookLLM-08 dataset is openly available via Zenodo18: doi.org/10.5281/zenodo.18625031
The dataset includes:
• Subspecialty folders containing LLM-generated teaching texts in PDF format (verbatim outputs).
• Subspecialty folders containing the same LLM-generated teaching texts in Word (.docx) format with structured “Title 1” styles for navigable headings.
These data are released under the Creative Commons Zero (CC0 1.0 Public Domain Dedication) license, enabling unrestricted reuse, redistribution, and adaptation.
The authors thank Dr Mustafa Mohamed and Dr Jacopo Ferrari from CHUV University Hospital for their contributions to the dataset generation, and Dr D.C. Rotzinger for his guidance on the study design. This manuscript was formatted with the assistance of a generative AI tool (ChatGPT, OpenAI), which was used only for language editing and formatting. All ideas, data, analyses, and interpretations are the original work of the authors.
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Partly
Are sufficient details of methods and materials provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Neuroradiology, AI, Deep Learning, Large Language Models, Education
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of methods and materials provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: AI, medical education, system redesign
Alongside their report, reviewers assign a status to the article:
| Version | Invited Reviewer 1 | Invited Reviewer 2 |
|---|---|---|
| Version 2 (revision), 10 Apr 26 | | |
| Version 1, 02 Mar 26 | read | read |