Data Note

Dataset of multi-focus (Z-stack) images derived from liquid-based cervical cancer cytology specimens

[version 1; peer review: awaiting peer review]
PUBLISHED 10 Apr 2026

Abstract

Background

In cervical cancer screening, cytotechnologists and cytopathologists integrate three-dimensional information by continuously adjusting the microscope’s focus to evaluate chromatin structure and nuclear morphology. However, most existing public datasets consist of single-focus 2D images, which do not fully reflect this clinical diagnostic workflow. This study presents the Cervical Cancer Cell Image Database: Multi-focus Cytology Dataset (CCCID) to bridge this gap.

Methods

Cervical specimens were processed using the BD SurePath™ LBC technique and Papanicolaou staining. Digitization was performed using a NanoZoomer-XR scanner. For 639 unique fields of view (FOVs), a Z-stack consisting of 11 focal planes was captured at 1.0 μm intervals, resulting in 7,029 images (384 × 384 pixels). Ground-truth labels were established only when six board-certified expert cytotechnologists reached 100% consensus.

Conclusions

The CCCID provides a high-reliability benchmark for developing machine-learning models that utilize axial (Z-axis) information. It is highly valuable for advancing three-dimensional nuclear morphology analysis, cell segmentation in overlapping clusters, and the evaluation of focus-fusion algorithms in digital cytopathology.

Keywords

Multi-focus imaging; Z-stack; Cervical cancer; Liquid-based cytology; Deep learning; Pap smear; cytology; cytopathology.

Introduction

Cervical cancer remains a leading cause of cancer-related deaths among women globally, and cytology is central to its early detection.1 Liquid-based cytology (LBC) is a widely used standard method because of its specimen uniformity. Although image analysis using deep learning has flourished, existing public datasets such as the Herlev dataset,2 Cervix93,3 SIPaKMeD,4 and CRIC Cervix5 primarily consist of single-focus 2D static images. These do not fully reflect the actual diagnostic process in which cytotechnologists and pathologists integrate three-dimensional information by continuously adjusting the focus to evaluate chromatin structure, nuclear membrane irregularities, and cell overlap.6 Even with LBC, cells and nuclei retain a thickness of several to over 10 micrometers, so single-plane 2D images risk losing critical information, such as 3D nuclear morphology. Therefore, this dataset was constructed to provide multi-focus image sequences (Z-stacks) of cervical LBC specimens. By including continuous focal depths for each field of view, this dataset enables the development of analytical methods and artificial intelligence models that consider 3D morphological information, thus reflecting conditions closer to real-world clinical practice.

Materials and methods

  • 1. Specimen Preparation and LBC Processing

    Cervical cytology specimens were collected from patients at Nobeoka Prefectural Hospital and Kawasaki Medical University. All samples were processed using the BD SurePath™ LBC system (BD Diagnostics, Burlington, NC, USA), which employs a density gradient enrichment process to provide a representative monolayer of cells. The processed slides were stained using the standard Papanicolaou staining method to visualize nuclear and cytoplasmic features. This study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Boards of all participating institutions, including the Ethics Committee of Kyushu University of Medical Science (Approval No. 17–19). Given the retrospective nature of the study and the use of fully anonymized samples, written informed consent was waived, and consent was instead obtained through an opt-out procedure approved by the ethics committees. The study included only adult participants, and all data were anonymized prior to image extraction and annotation.

  • 2. Whole-slide imaging and Digital Acquisition

    The prepared LBC slides were digitized using a NanoZoomer-XR whole slide imaging scanner (Hamamatsu Photonics, Shizuoka, Japan). Scanning was performed using a 20× objective lens. The proprietary software of the scanner was used to generate whole-slide images in the .ndpi format.

  • 3. Fields of view (FOVs) Selection and Multi-focus (Z-stack) Extraction

    Specific FOVs representing typical cytological features of each Bethesda category were identified from whole-slide images. For each selected FOV, a multi-focus image sequence (Z-stack) was generated. The extraction process captured 11 distinct focal planes with a vertical interval of 1.0 μm between adjacent planes. The focal range was centered on the optimal focus determined by the expert system, covering a total depth of 10 μm (5 μm above and below the center). The extracted images were cropped into 384 × 384 pixel tiles and converted from the raw .ndpi format to the JPG format with a resolution of 96 dpi for standardized use in machine-learning workflows.
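
The geometry described above (11 planes, 1.0 μm apart, centered on the best-focus plane) can be sketched as a mapping from plane index to axial offset. This is a minimal illustration, not the authors' extraction code: the function name is hypothetical, and the assumption that index 0 lies at one end of the range (with the sign convention chosen arbitrarily) is not stated in the source.

```python
def focal_offsets_um(num_planes: int = 11, step_um: float = 1.0) -> list[float]:
    """Return each focal plane's offset from the stack center, in micrometers.

    Assumption: plane indices run monotonically from one end of the focal
    range to the other; which end is "above" the center is not specified.
    """
    center = (num_planes - 1) / 2
    return [(index - center) * step_um for index in range(num_planes)]


offsets = focal_offsets_um()
depth_covered_um = offsets[-1] - offsets[0]
print(offsets)
print(depth_covered_um)
```

Under these assumptions the middle index (5) corresponds to the best-focus plane, and the first and last planes are 10 μm apart.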

  • 4. Annotation and Consensus Validation

    The dataset was annotated by six board-certified expert cytotechnologists. To facilitate the annotation process, custom software, “Annotation Image Creation Tool,” developed by Proassist Ltd. (Osaka, Japan), was used. The annotation protocol was as follows. Each expert independently reviewed the multi-focus image sequences for each FOV. The FOVs were classified into one of the following four categories based on the Bethesda System: negative for intraepithelial lesion or malignancy (NILM), low-grade squamous intraepithelial lesion (LSIL), high-grade squamous intraepithelial lesion (HSIL), and squamous cell carcinoma (SCC). In this dataset, the fourth category is labeled “SCC_etc” and includes both SCC and adenocarcinoma (AC). A strict consensus rule was applied for inclusion in the final dataset: an FOV was included only if all six experts reached 100% agreement on the diagnostic classification. The FOVs that did not achieve unanimous agreement were excluded from the dataset to ensure the highest possible ground-truth reliability.
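
The unanimous-consensus rule can be expressed as a simple filter. The sketch below is illustrative only: the vote records, FOV identifiers, and function name are hypothetical, and it does not reproduce the Annotation Image Creation Tool.

```python
CATEGORIES = ("NILM", "LSIL", "HSIL", "SCC_etc")


def unanimous_label(votes, n_raters=6):
    """Return the shared label if all raters agree, otherwise None."""
    if len(votes) != n_raters:
        raise ValueError(f"expected {n_raters} votes, got {len(votes)}")
    unknown = set(votes) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return votes[0] if len(set(votes)) == 1 else None


# Hypothetical votes from six raters for two FOVs.
votes_by_fov = {
    "fov_a": ["HSIL"] * 6,             # unanimous: kept
    "fov_b": ["LSIL"] * 5 + ["HSIL"],  # one dissenting vote: excluded
}

accepted = {}
for fov_id, votes in votes_by_fov.items():
    label = unanimous_label(votes)
    if label is not None:
        accepted[fov_id] = label

print(accepted)
```

A single dissenting vote is enough to exclude an FOV, which is why borderline cases do not appear in the final dataset.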

  • 5. Final Dataset Organization

    The final validated dataset, consisting of 639 FOVs (7,029 total images), was organized into the directory structure described in the Data description section. Each file was placed in the folder for its diagnostic class and cell type and named according to its FOV serial number, center coordinates, and Z-stack index to facilitate automated processing.

Data description

CCCID is organized into four main directories named according to the Bethesda System: “NILM,” “LSIL,” “HSIL,” and “SCC_etc” (Table 1). Each directory contains subfolders corresponding to specific cellular or pathological types (Table 2). The “NILM” directory contains nine subfolders representing normal or benign cellular components: 1. Superficial-Intermediate Cells, 2. Parabasal-Basal Cells, 3. Glandular Cells (Isolated), 4. Glandular Cells (Cluster), 5. Squamous Metaplastic Cells, 6. Repair Cells, 7. Atrophic Vaginitis, 8. Macrophages, and 9. Neutrophils. The “LSIL” directory contains four subfolders: 10. Superficial-type Dysplastic Cells, 11. Intermediate-type Dysplastic Cells, 12. Superficial-type Koilocyte, and 13. Intermediate-type Koilocyte. The “HSIL” directory contains two subfolders: 14. Deep-layer Dysplastic Cells and 15. Carcinoma in situ Cells. The “SCC_etc” directory contains two subfolders: 16. Squamous Cell Carcinoma and 17. Adenocarcinoma.

Table 1. Dataset composition of the cervical cancer cell image database: Multi-focus cytology dataset.

| Category | Folder name | No. of FOVs | Images per FOV | Total JPG files |
|---|---|---|---|---|
| Negative for intraepithelial lesion or malignancy | NILM | 273 | 11 | 3,003 |
| Low-grade squamous intraepithelial lesion | LSIL | 93 | 11 | 1,023 |
| High-grade squamous intraepithelial lesion | HSIL | 69 | 11 | 759 |
| Squamous cell carcinoma and Adenocarcinoma | SCC_etc | 204 | 11 | 2,244 |
| Total | | 639 | | 7,029 |

Table 2. Detailed cellular composition of the cervical cancer cell image database: Multi-focus cytology dataset.

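| Main folder name | Subfolder name (cell type) | No. of FOVs | Images per FOV | Total JPG files |
|---|---|---|---|---|
| NILM | 1. Superficial-Intermediate Cells | 30 | 11 | 330 |
| NILM | 2. Parabasal-Basal Cells | 30 | 11 | 330 |
| NILM | 3. Glandular Cells (Isolated) | 30 | 11 | 330 |
| NILM | 4. Glandular Cells (Cluster) | 13 | 11 | 143 |
| NILM | 5. Squamous Metaplastic Cells | 21 | 11 | 231 |
| NILM | 6. Repair Cells | 6 | 11 | 66 |
| NILM | 7. Atrophic Vaginitis | 21 | 11 | 231 |
| NILM | 8. Macrophages | 91 | 11 | 1,001 |
| NILM | 9. Neutrophils | 31 | 11 | 341 |
| LSIL | 10. Superficial-type Dysplastic Cells | 11 | 11 | 121 |
| LSIL | 11. Intermediate-type Dysplastic Cells | 29 | 11 | 319 |
| LSIL | 12. Superficial-type Koilocyte | 32 | 11 | 352 |
| LSIL | 13. Intermediate-type Koilocyte | 21 | 11 | 231 |
| HSIL | 14. Deep-layer Dysplastic Cells | 11 | 11 | 121 |
| HSIL | 15. Carcinoma In Situ Cells | 58 | 11 | 638 |
| SCC_etc | 16. Squamous Cell Carcinoma | 175 | 11 | 1,925 |
| SCC_etc | 17. Adenocarcinoma | 29 | 11 | 319 |
| Total | | 639 | | 7,029 |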

Each subfolder contains individual image files in JPG format, representing specific FOVs and their corresponding multi-focus planes.

The image files follow the naming convention: [Serial Number]_([Center X],[Center Y],[Z-stack Index]).jpg.

  • [Serial Number]: A unique sequential identifier assigned to each captured FOV.

  • [Center X, Center Y]: X and Y coordinates of the center of the FOV within the specimen (μm).

  • [Z-stack Index]: The focal plane number, ranging from 0 to 10 (representing 11 layers).

For example, the file “001_(3059,11012,0).jpg” represents the first captured image, located at specimen coordinates (3059 μm, 11012 μm), with a focal plane index of 0. The files “001_(3059,11012,0).jpg” through “001_(3059,11012,10).jpg” constitute the complete 11-layer Z-stack for the first FOV.
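
The naming convention lends itself to automated parsing. The following sketch assumes that every file name matches the pattern of the example above, with non-negative integer coordinates; the class and function names are hypothetical. Because the source does not say whether serial numbers are unique across subfolders, the grouping is intended to be applied within a single subfolder.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class TileName(NamedTuple):
    serial: str        # kept as text to preserve zero padding, e.g. "001"
    center_x_um: int
    center_y_um: int
    z_index: int       # focal plane number, 0 to 10


# Assumed pattern, modeled on "001_(3059,11012,0).jpg".
_TILE_PATTERN = re.compile(r"^(\d+)_\((\d+),\s*(\d+),\s*(\d+)\)\.jpg$")


def parse_tile_name(filename: str) -> TileName:
    """Split a tile file name into serial number, center coordinates, and Z index."""
    match = _TILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    serial, x, y, z = match.groups()
    return TileName(serial, int(x), int(y), int(z))


def group_into_stacks(filenames):
    """Group file names by FOV and order each group by Z-stack index."""
    stacks = defaultdict(list)
    for name in filenames:
        tile = parse_tile_name(name)
        key = (tile.serial, tile.center_x_um, tile.center_y_um)
        stacks[key].append((tile.z_index, name))
    return {key: [name for _, name in sorted(items)] for key, items in stacks.items()}


names = [f"001_(3059,11012,{z}).jpg" for z in range(11)]
first = parse_tile_name(names[0])
stacks = group_into_stacks(reversed(names))
print(first)
print(len(stacks[("001", 3059, 11012)]))
```

Sorting on the parsed integer index, rather than on the file name, avoids placing plane 10 before plane 2.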

All images are 384 × 384 pixels with a resolution of 96 dpi. The Z-stack index 0 to 10 corresponds to a focal range captured at 1 μm intervals, providing a comprehensive volumetric view of the cellular and nuclear morphology. A representative image is shown in Figure 1.


Figure 1. Representative multi-focus (Z-stack) images for each category contained in the Cervical Cancer Cell Image Database: Multi-focus Cytology Dataset.

Negative for intraepithelial lesion or malignancy (NILM), low-grade squamous intraepithelial lesion (LSIL), high-grade squamous intraepithelial lesion (HSIL), squamous cell carcinoma (SCC), and adenocarcinoma (AC).

Value of the data

  • CCCID provides unique multi-focus (Z-stack) image sequences of cervical cytology, addressing the limitations of existing single-focus 2D datasets by reflecting the actual diagnostic process used by cytotechnologists and cytopathologists.

  • These data are valuable for developing and evaluating deep learning models that require three-dimensional morphological information, such as chromatin distribution and nuclear membrane irregularities, which are often blurred or lost in single-plane images.

  • The dataset features high ground-truth reliability, as every included image was validated through 100% consensus among six expert cytotechnologists, minimizing interobserver variability in the training data.

  • Researchers can reuse this dataset to benchmark computer-aided diagnosis systems, specifically for testing algorithms designed to handle overlapping cells or thick cell clusters common in LBC specimens.

  • The inclusion of 11 focus layers at 1 μm intervals allows for the exploration of focus-fusion (Extended Depth of Field) algorithms and the study of how vertical focal shifts impact the accuracy of automated cell classification.
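
As one illustration of how the focal planes might be compared, the sketch below scores each plane with a simple focus measure (the summed squared response of a 4-neighbor Laplacian) and picks the sharpest one. This is a whole-plane selection, not a per-pixel extended-depth-of-field fusion, and it is not a method described in the source. Planes are represented as nested lists of grayscale values to keep the example dependency-free; the synthetic data are hypothetical.

```python
def laplacian_energy(plane):
    """Sum of squared 4-neighbor Laplacian responses over interior pixels."""
    height, width = len(plane), len(plane[0])
    total = 0.0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            lap = (plane[r - 1][c] + plane[r + 1][c]
                   + plane[r][c - 1] + plane[r][c + 1]
                   - 4 * plane[r][c])
            total += lap * lap
    return total


def sharpest_plane_index(stack):
    """Return the index of the plane with the highest focus score."""
    scores = [laplacian_energy(plane) for plane in stack]
    return max(range(len(stack)), key=scores.__getitem__)


# Synthetic 5 x 5 planes: flat, high contrast, and low contrast.
flat = [[10] * 5 for _ in range(5)]
crisp = [[0 if (r + c) % 2 == 0 else 255 for c in range(5)] for r in range(5)]
soft = [[100 if (r + c) % 2 == 0 else 110 for c in range(5)] for r in range(5)]

best = sharpest_plane_index([flat, crisp, soft])
print(best)
```

In practice an array library would replace the nested loops, and the same score could be computed per tile or per pixel as a starting point for focus fusion.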

Limitations

Although CCCID provides a high-quality multi-focus resource, several limitations should be noted. First, the data were collected using a single whole-slide imaging scanner (NanoZoomer-XR) and a single LBC preparation method (BD SurePath™). Therefore, the visual characteristics of the images may differ from those produced by other scanners or preparation techniques, such as ThinPrep. Second, the dataset focuses on typical diagnostic images for which six expert cytotechnologists reached 100% consensus. This means that highly atypical or “borderline” cases, which often cause diagnostic disagreement in clinical practice, were intentionally excluded. Third, the number of FOVs varies across categories, resulting in class imbalance among the different cell types. Finally, the Z-stack range is fixed at 11 layers with 1 μm intervals.
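
The class imbalance can be quantified directly from the FOV counts in Table 1. The sketch below computes inverse-frequency class weights, one common way to compensate for imbalance during training; the weighting scheme is an assumption chosen for illustration and is not prescribed by the dataset.

```python
# FOV counts per main category, taken from Table 1.
FOV_COUNTS = {"NILM": 273, "LSIL": 93, "HSIL": 69, "SCC_etc": 204}
IMAGES_PER_FOV = 11


def inverse_frequency_weights(counts):
    """Weight each class by total / (number of classes * class count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


total_fovs = sum(FOV_COUNTS.values())
total_images = total_fovs * IMAGES_PER_FOV
weights = inverse_frequency_weights(FOV_COUNTS)

print(total_fovs, total_images)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With this scheme the smallest category (HSIL) receives the largest weight and the largest category (NILM) the smallest. The 11 planes of one FOV are not independent samples, so weights are computed per FOV rather than per image.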

Ethical considerations

In all cases, informed consent was obtained from the patients using an opt-out procedure. This study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Boards of all participating institutions, including the Ethics Committee of Kyushu University of Medical Science (Approval No. 17–19). All patient data were fully anonymized prior to the image extraction and annotation process to ensure the protection of personal information.

Consent to publish

Cytology samples were used in anonymized form. Consent for the use of clinical samples and associated images for research and publication was obtained through an opt-out procedure approved by the institutional ethics committees, including the Ethics Committee of Kyushu University of Medical Science (Approval No. 17–19).

how to cite this article
Onishi T, Miyamoto T, Osawa Y et al. Dataset of multi-focus (Z-stack) images derived from liquid-based cervical cancer cytology specimens [version 1; peer review: awaiting peer review]. F1000Research 2026, 15:502 (https://doi.org/10.12688/f1000research.179164.1)