HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats

Benchmarking is a crucial step during computational analysis and method development. Recently, a number of new methods have been developed for analyzing high-dimensional cytometry data. However, it can be difficult for analysts and developers to find and access well-characterized benchmark datasets. Here, we present HDCytoData, a Bioconductor package providing streamlined access to several publicly available high-dimensional cytometry benchmark datasets. The package is designed to be extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. Currently, the package includes a set of experimental and semi-simulated datasets, which have been used in our previous work to evaluate methods for clustering and differential analyses. Datasets are formatted into standard SummarizedExperiment and flowSet Bioconductor object formats, which include complete metadata within the objects. Access is provided through Bioconductor's ExperimentHub interface. The package is freely available from http://bioconductor.org/packages/HDCytoData.


Introduction
Benchmarking analyses are frequently used to evaluate and compare the performance of computational methods, for example by users interested in selecting a suitable method, or by developers to demonstrate performance improvements of a newly developed method. A critical part of any benchmark is the selection of appropriate benchmark datasets 1,2. In some cases, suitable publicly available datasets may be found in the literature. Alternatively, new experimental or simulated datasets containing a known ground truth may be created by the authors of the benchmark 1,2.
High-dimensional cytometry refers to a set of recently developed technologies that enable measurement of expression levels of up to dozens of proteins in hundreds to thousands of cells per second, using targeted antibodies labeled with various types of reporter tags. This includes multi-color flow cytometry, mass cytometry (or CyTOF), and sequence-based cytometry (or genomic cytometry). Due to the large size and high dimensionality of the resulting data, numerous computational methods have been developed for analyzing these datasets 3. Many of these methods are based on the fundamental concept of analyzing cells in terms of cell populations, for example using clustering to define cell populations, or detecting differential cell populations between conditions. In our previous work, we have collected a number of benchmark datasets to evaluate methods for clustering 4 and differential analyses 5 in high-dimensional cytometry data. This includes publicly available datasets previously published by other groups or our experimental collaborators, as well as new semi-simulated datasets that we generated. In these previous publications, we recorded links to original data sources and made all data available via FlowRepository 6. FlowRepository is a widely used resource in the cytometry community, which provides a permanent record of publicly available datasets associated with peer-reviewed publications, and which has also been used by other authors to distribute benchmark datasets (e.g., 7,8). However, FlowRepository is primarily accessed via a web interface, and downloading and loading data for further analysis in R requires customized code and matching of metadata (e.g., sample information), which can hinder accessibility and reproducibility.
Here, we introduce the HDCytoData package, which provides a resource for re-distributing high-dimensional cytometry benchmark datasets through Bioconductor's ExperimentHub 9, in order to improve accessibility. ExperimentHub provides a flexible platform for hosting datasets in the form of R/Bioconductor objects, which can be directly loaded within an R session. We have formatted the datasets in HDCytoData into standard SummarizedExperiment and flowSet Bioconductor object formats 10-12, which include all required metadata within the objects and facilitate interoperability with R/Bioconductor-based workflows. The data objects are intended to be static, with no major updates following release. We envisage that these datasets will be useful for future benchmarking studies, as well as other activities such as teaching, examples, and tutorials. The package is extensible, allowing new datasets to be contributed by ourselves or other researchers in the future. It is designed to be accessible for users who are familiar with R and Bioconductor, but who may not have used ExperimentHub packages before. The package is freely available from http://bioconductor.org/packages/HDCytoData.

Implementation
The benchmark datasets currently included in the HDCytoData package consist of experimental and semi-simulated data, and can be grouped into datasets useful for benchmarking algorithms for (i) clustering and (ii) differential analyses. Table 1 and Table 2 provide an overview of the datasets. The raw datasets were collected from various sources (Table 1 and Table 2), and have been extensively reformatted and documented for inclusion in the HDCytoData package. Each dataset is stored in both SummarizedExperiment and flowSet formats, since these are the most commonly used R/Bioconductor data structures for high-dimensional cytometry data (and there is generally no straightforward way to convert between the two). The objects each contain one or more tables of expression values, as well as all required metadata. Following standard conventions used for cytometry data 19, rows contain cells, and columns contain protein markers. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population labels (where available), and labels identifying 'spiked in' cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type, cell state, or none for non-protein-marker columns).
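To make the layout described above concrete, the following base R sketch builds a toy stand-in with cells in rows, markers in columns, and separate row and column metadata tables. It is only an illustration: the field names (sample_id, group_id, population_id, channel_name, marker_name, marker_class), the marker names, and all values are assumptions chosen for this example, and the packaged datasets are SummarizedExperiment or flowSet objects rather than plain matrices and data frames.

```r
# Toy stand-in for the layout described above, using only base R.
# All names and values here are illustrative assumptions.
set.seed(1)
n_cells <- 6

# Expression table: rows are cells, columns are protein markers, plus one
# non-marker column of the kind found in raw .fcs files.
expr <- matrix(
  rexp(n_cells * 4, rate = 0.1),
  nrow = n_cells,
  dimnames = list(NULL, c("CD3", "CD4", "pS6", "Time"))
)

# Row metadata: one row per cell.
row_meta <- data.frame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  group_id = rep(c("control", "treated"), each = 3),
  population_id = c("T cells", "T cells", "B cells",
                    "T cells", "B cells", "B cells")
)

# Column metadata: one row per column of the expression table.
col_meta <- data.frame(
  channel_name = c("Ch1", "Ch2", "Ch3", "Time"),
  marker_name = colnames(expr),
  marker_class = c("type", "type", "state", "none")
)

# The metadata tables must line up with the expression table.
stopifnot(nrow(expr) == nrow(row_meta), ncol(expr) == nrow(col_meta))

# Keep only protein marker columns, and only the cells from one sample.
expr_markers <- expr[, col_meta$marker_class != "none", drop = FALSE]
expr_sample1 <- expr[row_meta$sample_id == "sample1", , drop = FALSE]
```

Keeping the metadata aligned with the expression table by position is what allows subsetting by sample or by marker class with a single logical index, which is the same idea the Bioconductor object classes enforce formally.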
Note that raw expression values should be transformed prior to performing any downstream analyses. Standard transformations include the inverse hyperbolic sine (asinh) with cofactor parameter equal to 5 for mass cytometry or 150 for flow cytometry data (20, Supplementary Figure S2); several other alternatives also exist 21.
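As a minimal sketch of the transformation mentioned above, the asinh transform with a cofactor divides each raw value by the cofactor before applying the inverse hyperbolic sine. The function name asinh_transform and the example values are hypothetical; only the cofactors (5 and 150) come from the text.

```r
# asinh transform with a cofactor: asinh(x / cofactor).
# 'asinh_transform' is a hypothetical helper name used for illustration.
asinh_transform <- function(x, cofactor = 5) {
  stopifnot(is.numeric(x), is.numeric(cofactor), cofactor > 0)
  asinh(x / cofactor)
}

# Made-up raw intensities, for illustration only.
raw_values <- c(0, 5, 50, 500)

# Cofactor 5 (mass cytometry) and cofactor 150 (flow cytometry).
transformed_mass <- asinh_transform(raw_values, cofactor = 5)
transformed_flow <- asinh_transform(raw_values, cofactor = 150)
```

Because the function is vectorized, it can be applied directly to a whole expression matrix; in practice it would normally be applied only to the protein marker columns.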
Most of these datasets include a known ground truth, enabling the calculation of statistical performance metrics. The ground truth information consists of reference cell population labels for the clustering datasets, and labels identifying computationally 'spiked in' cells for the differential analysis datasets. The datasets without a ground truth instead consist of experimental datasets that contain a known biological signal, which can be used to evaluate methods in qualitative terms; i.e., whether methods can reproduce the known biological result.
Extensive documentation is available via the help files for each dataset, including descriptions of the datasets, details on accessor functions required to access the expression tables and metadata, and links to original sources. In addition, reproducible R scripts demonstrating how the formatted SummarizedExperiment and flowSet objects were generated from the original raw data files from FlowRepository are included within the source code of the package.
New datasets may be contributed by ourselves or other authors in the future. The procedure for external contributions is described in the vignette titled "Contribution guidelines", available from Bioconductor. This vignette describes the submission procedure (via GitHub), as well as the required files (data objects in SummarizedExperiment and flowSet formats containing all necessary metadata, reproducible R scripts showing how the formatted objects were generated from the original raw data files, documentation, and package metadata).

Operation
The HDCytoData package can be installed by following standard Bioconductor package installation procedures. All datasets listed in Table 1 and Table 2 are available in Bioconductor version 3.10 and above. Minimum system requirements include a recent version of R (3.6 or later; this paper was prepared using R version 3.6.1), on a Mac, Windows, or Linux system. Example installation code is shown below.

# install BiocManager
install.packages("BiocManager")
# install HDCytoData package
BiocManager::install("HDCytoData")

Once the HDCytoData package is installed, the datasets can be downloaded from ExperimentHub and loaded directly into an R session using only a few lines of R code. This can be done by either (i) referring to named functions for each dataset, or (ii) creating an ExperimentHub instance and referring to the dataset IDs. Example code for each option for one of the datasets is shown below. Note that each dataset is available in both SummarizedExperiment and flowSet formats. After an object has been downloaded, the ExperimentHub client stores it in a local cache for faster retrieval.
Once the datasets have been downloaded and loaded, they are available to the user as R objects within the R session. They can then be inspected and manipulated using standard accessor and subsetting functions (for either the SummarizedExperiment or flowSet object class). Example code to inspect a SummarizedExperiment is displayed below. For more details on how to load and inspect datasets, including the expected output from each function shown here, refer to the HDCytoData package main vignette available from Bioconductor.
Documentation describing each dataset is available in the help files for the objects, which can be accessed using the standard R help interface, as shown below.
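The loading, inspection, and help steps described above can be sketched as follows. This is a hedged illustration rather than the package's canonical example: it assumes that the Levine_32dim dataset is exposed through functions named Levine_32dim_SE() and Levine_32dim_flowSet() and that a help topic exists under the same name; the main vignette remains the authoritative reference. Dataset IDs are deliberately not hard-coded, since they should be read from the ExperimentHub query result.

```r
# Sketch only: needs the HDCytoData package and, on first use, an internet
# connection. The guard lets the script run (and do nothing) without the package.
has_hdcytodata <- requireNamespace("HDCytoData", quietly = TRUE)

if (has_hdcytodata) {
  library(HDCytoData)

  # Option (i): named functions, one per dataset and format
  # (function names assumed, see the lead-in above).
  d_SE <- Levine_32dim_SE()
  d_flowSet <- Levine_32dim_flowSet()

  # Option (ii): ExperimentHub instance; the query lists datasets and IDs.
  ehub <- ExperimentHub()
  hd <- query(ehub, "HDCytoData")
  print(hd)
  # A dataset can then be retrieved with ehub[["<ID>"]], using an ID taken
  # from the listing printed above.

  # Inspect the SummarizedExperiment with standard accessors.
  print(d_SE)
  print(dim(assay(d_SE)))       # number of cells x number of columns
  print(rowData(d_SE))          # per-cell metadata
  print(colData(d_SE))          # per-column (marker) metadata
  print(names(metadata(d_SE)))  # any additional metadata entries

  # Open the help file for the dataset (topic name assumed).
  print(help("Levine_32dim_SE", package = "HDCytoData"))
}
```

On later calls the objects are read from the local ExperimentHub cache, so only the first call for each dataset triggers a download.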

Use cases
The datasets currently included in the HDCytoData package (Table 1 and Table 2) can be used to benchmark methods for either (i) clustering or (ii) differential analyses. In addition, these datasets may be useful for other activities such as teaching, examples, and tutorials (e.g., demonstrating how to use a new computational tool).
For the clustering benchmark datasets (Table 1), performance can be evaluated by calculating metrics such as the mean F1 score or adjusted Rand index, which measure the similarity between two sets of cell labels (i.e., the cluster labels and the ground truth or reference cell population labels) 1.
[Figure caption, partially recovered: ... (Table 1). Colors indicate the known ground truth cell populations.]
For the differential analysis benchmark datasets (Table 2), methods can be evaluated by their ability to recover the known differential signals, either in quantitative terms using the ground truth spike-in cell labels (for the semi-simulated datasets), or in qualitative terms (for the experimental datasets). The differential signals consist of either differential abundance of cell populations, or differential states within cell populations (i.e., differential expression of additional functional markers within cell populations), providing conceptually distinct differential analysis tasks. A short example showing how to perform differential analyses on these datasets is provided in the "Examples and use cases" vignette. For more extensive examples and evaluations, see the GitHub repository accompanying our previous study 5.
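For the clustering evaluation described above, the two metrics can be sketched in base R as follows. The function names and the toy labels are hypothetical. The adjusted Rand index follows the standard contingency-table formula. The mean F1 score shown here is a simplified variant that matches each reference population to its best-scoring cluster; this matching rule is an assumption made for illustration, and the matching strategy used in the previous benchmarking work may differ.

```r
# Adjusted Rand index from the contingency table of two label vectors.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  tab <- table(labels_a, labels_b)
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(sum(tab), 2)
  max_index <- (sum_rows + sum_cols) / 2
  # Undefined (NaN) when both labelings consist of a single group.
  (sum_cells - expected) / (max_index - expected)
}

# Mean F1 score: for each reference population, take the best F1 over all
# clusters, then average over populations (simplified matching, see above).
mean_f1_score <- function(truth, cluster) {
  stopifnot(length(truth) == length(cluster))
  tab <- unclass(table(truth, cluster))
  precision <- sweep(tab, 2, colSums(tab), "/")
  recall <- sweep(tab, 1, rowSums(tab), "/")
  f1 <- 2 * precision * recall / (precision + recall)
  f1[is.nan(f1)] <- 0  # population and cluster share no cells
  mean(apply(f1, 1, max))
}

# Toy labels for six cells (illustrative only).
truth <- c("T", "T", "T", "B", "B", "B")
cluster_perfect <- c(1, 1, 1, 2, 2, 2)
cluster_imperfect <- c(1, 1, 2, 2, 2, 2)

ari_perfect <- adjusted_rand_index(truth, cluster_perfect)
ari_imperfect <- adjusted_rand_index(truth, cluster_imperfect)
f1_perfect <- mean_f1_score(truth, cluster_perfect)
f1_imperfect <- mean_f1_score(truth, cluster_imperfect)
```

Both metrics equal 1 for a clustering that reproduces the reference labels exactly and decrease as cells are assigned to the wrong cluster; neither depends on the names given to the clusters.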

Summary
The HDCytoData package is an extensible resource providing streamlined access to a number of publicly available benchmark datasets used in our previous work on high-dimensional cytometry data analysis. Datasets are provided in standard Bioconductor object formats, and are hosted on Bioconductor's ExperimentHub platform. In the future, it may make sense to develop similar packages for other data types, e.g., imaging mass cytometry, once several well-characterized benchmark datasets become available. By facilitating access to these datasets, we hope they will be useful for other researchers interested in designing rigorous benchmarks for method development or other computational analyses, as well as other activities such as teaching, examples, and tutorials.

Data availability
All data underlying the results are available as part of the article and no additional source data are required.

Laurent Gatto
De Duve Institute, University of Louvain (UCLouvain), Brussels, Belgium
Weber and Soneson present HDCytoData, a Bioconductor data package providing pre-formatted high-dimensional cytometry data. The preparation of the datasets as SummarizedExperiment and flowSet objects makes these amenable to benchmarking, a crucial step when developing new methods.
My main comment centres around the contribution of new data. While the curated/formatted data in the package have already been useful to the authors in their previous work, the ambition is to make it possible for others to benefit from them and, to enable this in the longer term, to expand the package with additional data. These contributions are anticipated to come from the original authors and, ideally, also from new contributors.
The contribution procedure, while crucial, (1) isn't described very clearly and, at least in its current form, (2) only applies to seasoned R users/programmers. These two points constitute a serious barrier to external contributions.
Indeed, the only information that is provided is a list of three required artefacts (objects, scripts and documentation), without details as to how to produce these, nor how to provide them. I would suggest adding a 'How to contribute' vignette to the package, describing all these aspects, including an example for one of the existing datasets. I would also suggest including a contribution code of conduct, given that external contributions are explicitly advertised.
I would suggest asking new contributors to send a pull request (PR) on GitHub, with possible alternative methods for those who aren't familiar with GitHub. The use of a PR provides traceability (as opposed to an email, for instance) and publicly recognises the external contribution, as PRs are publicly recorded on GitHub. I would also suggest explicitly defining how external contributions are to be acknowledged in the contribution guide (for example addition as a 'contributor' in the DESCRIPTION file).
These additions will clarify what is expected for a contribution to be considered, how it will be managed by the authors, and how it will be acknowledged, thus hopefully facilitating the process.

Minor suggestions:
How can a potential user find out if/when new data have been added to the package? While `?HDCytoData` gives a list of datasets, a function returning a vector or dataframe with dataset names and possibly some annotation would be useful for programmatic access (given here that `data(package = "HDCytoData")` doesn't work for data on ExperimentHub).
It could be useful to expand the 'Use cases' section with (1) example calculations of the F1 scores and Rand indices for the clustering example and (2) adding a similar short example for the differential analysis use case.
I am curious as to why the content of the lmweber/HDCytoData-example isn't included as a vignette in the HDCytoData package (and thus lacking the usual control and documentation that comes with R packages).

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational biology, method development, research software engineering.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 26 Nov 2019
Lukas Weber, SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Thank you for your comments and suggestions. As suggested, we have provided significant additional material on the procedure for contributing new datasets. We have expanded the section in the text on external contributions, and added an additional Bioconductor vignette titled "Contribution guidelines". This vignette describes the required files (data objects, scripts, documentation, metadata), as well as the submission procedure. We have requested that contributions be submitted via GitHub issues and pull requests, clarified the acknowledgment procedure, and added a code of conduct.
Regarding the minor suggestions, we have also (i) updated the main vignette and package help file to show how to programmatically retrieve a data frame of all available datasets, and (ii) added a new vignette titled "Examples and use cases", which includes the example from the previous repository (https://github.com/lmweber/HDCytoData-example), as well as new examples showing how to use the datasets in the HDCytoData package to evaluate clustering performance (e.g. adjusted Rand index) and perform differential analyses.
Competing Interests: No competing interests were disclosed.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: statistics, high throughput genomics, transcriptomics, R software, high-dimensional data analysis
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 26 Nov 2019
Lukas Weber, SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
Thank you for your comments and suggestions. We have updated the text, vignettes, and help files to clarify each of the issues raised above. Below are also responses to the specific questions:
(1) Code to process the raw .fcs files from FlowRepository into the SummarizedExperiment and flowSet formats is provided in the 'make-data' scripts saved in the 'inst/scripts' directory in the source code of the HDCytoData package. Here we have followed the standard setup for ExperimentHub packages (i.e. processed data objects that are ready to load into R, together with reproducible scripts saved in 'inst/scripts'), as described in the ExperimentHub vignettes. We believe this is a useful setup for these datasets. FlowRepository is primarily intended as a permanent public repository for .fcs files associated with peer-reviewed publications, which cannot be updated. FlowRepository is also primarily accessed via the web interface, so it would be much less user-friendly to only provide scripts that re-format the .fcs files after downloading. Providing these datasets as pre-formatted SummarizedExperiment and flowSet objects makes them much more easily accessible for users.
(2) In principle, any data types that can be formatted into SummarizedExperiment and flowSet formats could be added to the package. However, we believe it makes sense to keep the scope of the package relatively limited, to facilitate modularity and maintainability. For now, we plan to include only the current set of technologies, although in the future it may make sense to develop similar packages for other data types (e.g. imaging mass cytometry) (see Summary).
(3) According to the policies of FlowRepository, original .fcs files stored in FlowRepository cannot be updated after publication of the associated peer-reviewed paper. Similarly, data objects stored in ExperimentHub can only be updated manually by contacting the ExperimentHub maintainers. Therefore, we do not expect any major updates to the datasets currently stored in the HDCytoData package (except possibly minor bug fixes). We have included some additional text explaining this. We have also included a new vignette on "Contribution guidelines" (see comments for Reviewer 2).
(4) While users could indeed use the HDCytoData package to load these datasets in a consistent way for other purposes, we believe the main use cases for these particular datasets are for benchmarking and teaching / examples / tutorials. These datasets are well-characterized and have been studied in a number of previous publications, making them ideal for benchmarking. Formatting the datasets into consistent SummarizedExperiment and flowSet formats requires significant effort, so we expect this will mainly be worthwhile for datasets that can be re-used a number of times, e.g. for benchmarking.
(5-7) We have updated the text, main vignette, and help files to mention the size of the data files. The datasets range in size from 2.4 MB to 194.5 MB. We have also explained how to clear the local download cache, and updated the text to mention the expected level of experience with Bioconductor. We are not aware of a bulk download option in the ExperimentHub interface, so we have not included this. (If this functionality were added in the future, we believe it would better belong in the ExperimentHub package than in HDCytoData.)
(8) There is no simple way to convert between the SummarizedExperiment and flowSet formats. This is one of the major contributions of this package: we have pre-processed the datasets into both of these formats (with reproducible code saved in the 'inst/scripts' directory), so that users do not need to do this manually. We have included additional text to mention this.
(9) The additional columns of raw data (which are labeled as "none" in the "marker_class" column) contain additional information from the raw .fcs files from the mass cytometry machine, including barcodes for sample deconvolution, and event length and DNA content to identify live single cells. These columns are usually stored in the expression matrices in the original raw .fcs files, so we have also left them in the objects, e.g. for users who wish to check the pre-processing steps. We labeled these columns as "marker_class = none" to make them easier to identify, especially for users who are not already familiar with mass cytometry data. We have updated the help files to clarify that these columns are not needed for downstream analyses.
(10) Compensation for fluorescence spillover has already been performed by the original authors of the flow cytometry datasets, so users of these datasets do not need to perform this. However, users still need to apply a transformation (e.g. arcsinh), which we have described in the vignettes and help files.
Competing Interests: No competing interests were disclosed.