ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Software Tool Article
Revised

HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats

[version 2; peer review: 2 approved]
PUBLISHED 04 Dec 2019
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the RPackage gateway.

This article is included in the Bioconductor gateway.

Abstract

Benchmarking is a crucial step during computational analysis and method development. Recently, a number of new methods have been developed for analyzing high-dimensional cytometry data. However, it can be difficult for analysts and developers to find and access well-characterized benchmark datasets. Here, we present HDCytoData, a Bioconductor package providing streamlined access to several publicly available high-dimensional cytometry benchmark datasets. The package is designed to be extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. Currently, the package includes a set of experimental and semi-simulated datasets, which have been used in our previous work to evaluate methods for clustering and differential analyses. Datasets are formatted into standard SummarizedExperiment and flowSet Bioconductor object formats, which include complete metadata within the objects. Access is provided through Bioconductor's ExperimentHub interface. The package is freely available from http://bioconductor.org/packages/HDCytoData.

Keywords

benchmarking, high-dimensional cytometry, Bioconductor, ExperimentHub, clustering, differential analyses

Revised Amendments from Version 1

We have revised the manuscript, package vignettes, and package help files to address the issues raised by the reviewers. In particular, we have added two new vignettes titled (i) “Examples and use cases”, which includes reproducible code for the example previously included in the text, as well as new examples on clustering and differential analyses, and (ii) “Contribution guidelines”, which explains the procedure and required data files for contributing new datasets. The text has been clarified in a number of locations to better explain the motivation for creating the HDCytoData package, and more clearly explain aspects that may be non-intuitive for users who are less familiar with high-dimensional flow and mass cytometry data. Specific responses to the issues raised by the reviewers are listed in the responses to the reviewers.

See the authors' detailed response to the review by Laurent Gatto
See the authors' detailed response to the review by Shila Ghazanfar

Introduction

Benchmarking analyses are frequently used to evaluate and compare the performance of computational methods, for example by users interested in selecting a suitable method, or by developers to demonstrate performance improvements of a newly developed method. A critical part of any benchmark is the selection of appropriate benchmark datasets1,2. In some cases, suitable publicly available datasets may be found in the literature. Alternatively, new experimental or simulated datasets containing a known ground truth may be created by the authors of the benchmark1,2.

High-dimensional cytometry refers to a set of recently developed technologies that enable measurement of expression levels of up to dozens of proteins in hundreds to thousands of cells per second, using targeted antibodies labeled with various types of reporter tags. This includes multi-color flow cytometry, mass cytometry (or CyTOF), and sequence-based cytometry (or genomic cytometry). Due to the large size and high dimensionality of the resulting data, numerous computational methods have been developed for analyzing these datasets3. Many of these methods are based on the fundamental concept of analyzing cells in terms of cell populations, for example using clustering to define cell populations, or detecting differential cell populations between conditions.

In our previous work, we have collected a number of benchmark datasets to evaluate methods for clustering4 and differential analyses5 in high-dimensional cytometry data. This includes publicly available datasets previously published by other groups or our experimental collaborators, as well as new semi-simulated datasets that we generated. In these previous publications, we recorded links to original data sources and made all data available via FlowRepository6. FlowRepository is a widely used resource in the cytometry community, which provides a permanent record of publicly available datasets associated with peer-reviewed publications, and which has also been used by other authors to distribute benchmark datasets (e.g., 7,8). However, FlowRepository is primarily accessed via a web interface, and downloading and loading data for further analysis in R requires customized code and matching of metadata (e.g., sample information), which can hinder accessibility and reproducibility.

Here, we introduce the HDCytoData package, which provides a resource for re-distributing high-dimensional cytometry benchmark datasets through Bioconductor’s ExperimentHub9, in order to improve accessibility. ExperimentHub provides a flexible platform for hosting datasets in the form of R/Bioconductor objects, which can be directly loaded within an R session. We have formatted the datasets in HDCytoData into standard SummarizedExperiment and flowSet Bioconductor object formats1012, which include all required metadata within the objects and facilitate interoperability with R/Bioconductor-based workflows. The data objects are intended to be static, with no major updates following release. We envisage that these datasets will be useful for future benchmarking studies, as well as other activities such as teaching, examples, and tutorials. The package is extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. It is designed to be accessible for users who are familiar with R and Bioconductor, but who may not have used ExperimentHub packages before. The package is freely available from http://bioconductor.org/packages/HDCytoData.

Methods

Implementation

The benchmark datasets currently included in the HDCytoData package consist of experimental and semi-simulated data, and can be grouped into datasets useful for benchmarking algorithms for (i) clustering and (ii) differential analyses. Table 1 and Table 2 provide an overview of the datasets.

Table 1. Summary of benchmark datasets for evaluating clustering algorithms.

For more details on these datasets, see Table 2 in 4, or the HDCytoData help files.

DatasetExperimentHub
ID
Number
of cells
Number of
dimensions
Number of
reference
cell
populations
Type of
ground truth
FlowRepository
ID
Original
reference
Levine_
32dim
EH2240 – EH2241265,6273214Manual gatingFR-FCM-ZZPH13
Levine_
13dim
EH2242 – EH2243167,0441324Manual gatingFR-FCM-ZZPH13
Samusik_
01
EH2244 – EH224586,8643924Manual gatingFR-FCM-ZZPH14
Samusik_
all
EH2246 – EH2247841,6443924Manual gatingFR-FCM-ZZPH14
Nilsson_
rare
EH2248 – EH224944,140131 (rare
population)
Manual gatingFR-FCM-ZZPH15
Mosmann_
rare
EH2250 – EH2251396,460141 (rare
population)
Manual gatingFR-FCM-ZZPH16

Table 2. Summary of benchmark datasets for evaluating methods for differential analyses.

For more details on these datasets, see Supplementary Note 1 in 5, or the HDCytoData help files.

DatasetExperimentHub
ID
Type of dataNumber
of cells
Number of
dimensions
Type of
ground
truth
Type of
differential
analysis
FlowRepository
ID
Original
reference
Krieg_Anti_
PD_1
EH2252 – EH2253Experimental85,71524 (cell
type)
QualitativeDifferential
abundance
FR-FCM-ZYL817
Bodenmiller_
BCR_XL
EH2254 – EH2255Experimental172,79124 (10 cell
type; 14 cell
state)
QualitativeDifferential
states
FR-FCM-ZYL818
Weber_AML_
sim
EH3025 – EH3046Semi-
simulated
(multiple
simulation
scenarios)
157,593
(excluding
spike-in)
16 (cell
type)
Spike-in
cell labels
Differential
abundance
FR-FCM-ZYL85
Weber_BCR_
XL_sim
EH3047 – EH3064Semi-
simulated
(multiple
simulation
scenarios)
85,331
(main
simulation;
excluding
spike-in)
24 (10 cell
type; 14 cell
state)
Spike-in
cell labels
Differential
states
FR-FCM-ZYL85

The raw datasets were collected from various sources (Table 1 and Table 2), and have been extensively reformatted and documented for inclusion in the HDCytoData package. Each dataset is stored in both SummarizedExperiment and flowSet formats, since these are the most commonly used R/Bioconductor data structures for high-dimensional cytometry data (and there is generally no straightforward way to convert between the two). The objects each contain one or more tables of expression values, as well as all required metadata. Following standard conventions used for cytometry data19, rows contain cells, and columns contain protein markers. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population labels (where available), and labels identifying ‘spiked in’ cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type, cell state, as well as non protein marker columns). Note that raw expression values should be transformed prior to performing any downstream analyses. Standard transformations include the inverse hyperbolic sine (asinh) with cofactor parameter equal to 5 for mass cytometry or 150 for flow cytometry data (20, Supplementary Figure S2); several other alternatives also exist21.

Most of these datasets include a known ground truth, enabling the calculation of statistical performance metrics. The ground truth information consists of reference cell population labels for the clustering datasets, and labels identifying computationally ‘spiked in’ cells for the differential analysis datasets. The datasets without a ground truth instead consist of experimental datasets that contain a known biological signal, which can be used to evaluate methods in qualitative terms; i.e., whether methods can reproduce the known biological result.

Extensive documentation is available via the help files for each dataset—including descriptions of the datasets, details on accessor functions required to access the expression tables and metadata, and links to original sources. In addition, reproducible R scripts demonstrating how the formatted SummarizedExperiment and flowSet objects were generated from the original raw data files from FlowRepository are included within the source code of the package.

New datasets may be contributed by ourselves or other authors in the future. The procedure for external contributions is described in the vignette titled “Contribution guidelines”, available from Bioconductor. This vignette describes the submission procedure (via GitHub), as well as the required files (data objects in SummarizedExperiment and flowSet formats containing all necessary metadata, reproducible R scripts showing how the formatted objects were generated from the original raw data files, documentation, and package metadata).

Operation

The HDCytoData package can be installed by following standard Bioconductor package installation procedures. All datasets listed in Table 1 and Table 2 are available in Bioconductor version 3.10 and above. Minimum system requirements include a recent version of R (3.6 or later; this paper was prepared using R version 3.6.1), on a Mac, Windows, or Linux system. Example installation code is shown below.

# install BiocManager              
install.packages("BiocManager")    
                                   
# install HDCytoData package       
BiocManager::install("HDCytoData") 

Once the HDCytoData package is installed, the datasets can be downloaded from ExperimentHub and loaded directly into an R session using only a few lines of R code. This can be done by either (i) referring to named functions for each dataset, or (ii) creating an ExperimentHub instance and referring to the dataset IDs. Example code for each option for one of the datasets is shown below. Note that each dataset is available in both SummarizedExperiment and flowSet formats. After an object has been downloaded, the ExperimentHub client stores it in a local cache for faster retrieval. File sizes for these datasets range from 2.4 MB (Nilsson_rare) to 194.5 MB (Samusik_all) (see help files). The local download cache can be cleared using the removeCache function from the ExperimentHub package (see HDCytoData package help file or main vignette). For more details on accessing ExperimentHub resources, refer to the ExperimentHub vignette available from Bioconductor.

# load HDCytoData package                                   
library(HDCytoData)                                         
                                                            
# option 1: load datasets using named functions             
d_SE <- Bodenmiller_BCR_XL_SE()                             
d_flowSet <- Bodenmiller_BCR_XL_flowSet()                   
                                                            
# option 2: load datasets by creating ExperimentHub instance
ehub <- ExperimentHub()                                     
query(ehub, "HDCytoData")                                   
d_SE <- ehub[["EH2254"]]                                    
d_flowSet <- ehub[["EH2255"]]                               

Once the datasets have been downloaded and loaded, they are available to the user as R objects within the R session. They can then be inspected and manipulated using standard accessor and subsetting functions (for either the SummarizedExperiment or flowSet object class). Example code to inspect a SummarizedExperiment is displayed below. For more details on how to load and inspect datasets, including the expected output from each function shown here, refer to the HDCytoData package main vignette available from Bioconductor.

# inspect SummarizedExperiment object
d_SE                                 
assays(d_SE)                         
rowData(d_SE)                        
colData(d_SE)                        
metadata(d_SE)                       

Documentation describing each dataset is available in the help files for the objects, which can be accessed using the standard R help interface, as shown below.

# display documentation (help files)
?Bodenmiller_BCR_XL                 
help(Bodenmiller_BCR_XL)            

Use cases

The datasets currently included in the HDCytoData package (Table 1 and Table 2) can be used to benchmark methods for either (i) clustering or (ii) differential analyses. In addition, these datasets may be useful for other activities such as teaching, examples, and tutorials (e.g., demonstrating how to use a new computational tool).

For the clustering benchmark datasets (Table 1), performance can be evaluated by calculating metrics such as the mean F1 score or adjusted Rand index, which measure the similarity between two sets of cell labels (i.e., the cluster labels and the ground truth or reference cell population labels)1. A short example is shown in the vignette titled “Examples and use cases”, available from Bioconductor. For more extensive examples and evaluations, see the GitHub repository accompanying our previous study4.

These datasets can also be used to generate visualizations demonstrating the performance of dimension reduction algorithms. For example, Figure 1 compares three different dimension reduction algorithms (principal component analysis [PCA], t-distributed stochastic neighbor embedding [tSNE]22,23, and uniform manifold approximation and projection [UMAP]24,25), for one of the datasets (Levine_32dim), with colors indicating the ground truth cell population labels. The figure shows a clear visual separation between the populations, with varying performance for the different algorithms. Reproducible R code for this figure is available in the “Examples and use cases” vignette, and the GitHub repository http://github.com/lmweber/HDCytoData-example.

c8a7e038-78b0-46d8-b2f5-3c1759d59bdf_figure1.gif

Figure 1. Example of use case for datasets in the HDCytoData package.

This example compares three different dimension reduction algorithms — principal component analysis (PCA), t-distributed stochastic neighbor embedding (tSNE), and uniform manifold approximation and projection (UMAP) — for visualizing cell populations in the Levine_32dim dataset (Table 1). Colors indicate the known ground truth cell populations.

For the differential analysis benchmark datasets (Table 2), methods can be evaluated by their ability to recover the known differential signals, either in quantitative terms using the ground truth spike-in cell labels (for the semi-simulated datasets), or in qualitative terms (for the experimental datasets). The differential signals consist of either differential abundance of cell populations, or differential states within cell populations (i.e., differential expression of additional functional markers within cell populations), providing conceptually distinct differential analysis tasks. A short example showing how to perform differential analyses on these datasets is provided in the “Examples and use cases” vignette. For more extensive examples and evaluations, see the GitHub repository accompanying our previous study5.

Summary

The HDCytoData package is an extensible resource providing streamlined access to a number of publicly available benchmark datasets used in our previous work on high-dimensional cytometry data analysis. Datasets are provided in standard Bioconductor object formats, and are hosted on Bioconductor’s ExperimentHub platform. In the future, it may make sense to develop similar packages for other data types, e.g., imaging mass cytometry, once several well-characterized benchmark datasets become available. By facilitating access to these datasets, we hope they will be useful for other researchers interested in designing rigorous benchmarks for method development or other computational analyses, as well as other activities such as teaching, examples, and tutorials.

Data availability

All data underlying the results are available as part of the article and no additional source data are required.

Software availability

Software available from: http://bioconductor.org/packages/HDCytoData

Source code available from: https://github.com/lmweber/HDCytoData

Archived source code at time of publication: https://doi.org/10.5281/zenodo.355105126

Licence: MIT License

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 19 Aug 2019
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Weber LM and Soneson C. HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 2; peer review: 2 approved]. F1000Research 2019, 8:1459 (https://doi.org/10.12688/f1000research.20210.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 2
VERSION 2
PUBLISHED 04 Dec 2019
Revised
Views
10
Cite
Reviewer Report 10 Dec 2019
Shila Ghazanfar, Cancer Research UK Cambridge Institute (CRUK CI), Cambridge, UK 
Approved
VIEWS 10
The authors have done an excellent job in addressing my comments and concerns. The manuscript contains further detail, including information suitable to those less familiar with high dimensional cytometry data, and two new vignettes have been added to the HDCytoData ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Ghazanfar S. Reviewer Report For: HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 2; peer review: 2 approved]. F1000Research 2019, 8:1459 (https://doi.org/10.5256/f1000research.23700.r57458)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
10
Cite
Reviewer Report 04 Dec 2019
Laurent Gatto, De Duve Institute, University of Louvain (UCLouvain), Brussels, Belgium 
Approved
VIEWS 10
Thank you for the amendments.

I hope the new vignette and ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Gatto L. Reviewer Report For: HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 2; peer review: 2 approved]. F1000Research 2019, 8:1459 (https://doi.org/10.5256/f1000research.23700.r57457)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Version 1
VERSION 1
PUBLISHED 19 Aug 2019
Views
30
Cite
Reviewer Report 02 Sep 2019
Laurent Gatto, De Duve Institute, University of Louvain (UCLouvain), Brussels, Belgium 
Approved with Reservations
VIEWS 30
Weber and Soneson present HDCytoData, a Bioconductor data package providing pre-formatted high-dimensional cytometry data. The preparation of the datasets as SummarizedExperiment and flowSet objects makes these amendable for benchmarking, a crucial step when developing new methods.

My ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Gatto L. Reviewer Report For: HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 2; peer review: 2 approved]. F1000Research 2019, 8:1459 (https://doi.org/10.5256/f1000research.22200.r52679)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 04 Dec 2019
    Lukas Weber, Institute of Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
    04 Dec 2019
    Author Response
    Thank you for your comments and suggestions. As suggested, we have provided significant additional material on the procedure for contributing new datasets. We have expanded the section in the text ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 04 Dec 2019
    Lukas Weber, Institute of Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
    04 Dec 2019
    Author Response
    Thank you for your comments and suggestions. As suggested, we have provided significant additional material on the procedure for contributing new datasets. We have expanded the section in the text ... Continue reading
Views
30
Cite
Reviewer Report 28 Aug 2019
Shila Ghazanfar, Cancer Research UK Cambridge Institute (CRUK CI), Cambridge, UK 
Approved with Reservations
VIEWS 30
Weber and Soneson have written a software article presenting HDCytoData (currently version 1.4.0), a Bioconductor package aimed at making multiple high-dimensional cytometry (HDC) datasets available in a consistent R friendly format as either SummarizedExperiment or flowSet objects. The authors have ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Ghazanfar S. Reviewer Report For: HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats [version 2; peer review: 2 approved]. F1000Research 2019, 8:1459 (https://doi.org/10.5256/f1000research.22200.r52681)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 04 Dec 2019
    Lukas Weber, Institute of Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
    04 Dec 2019
    Author Response
    Thank you for your comments and suggestions. We have updated the text, vignettes, and help files to clarify each of the issues raised above. Below are also responses to the ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 04 Dec 2019
    Lukas Weber, Institute of Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
    04 Dec 2019
    Author Response
    Thank you for your comments and suggestions. We have updated the text, vignettes, and help files to clarify each of the issues raised above. Below are also responses to the ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 19 Aug 2019
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.