Integration of single-cell RNA-Seq and CyTOF data characterises heterogeneity of rare cell subpopulations

Emmanouela Repapi; Devika Agarwal; Giorgio Napolitani; David Sims; Stephen Taylor

doi:10.12688/f1000research.121829.2


Research Article

Revised


[version 2; peer review: 1 approved, 1 approved with reservations]

Emmanouela Repapi ¹, Devika Agarwal¹, Giorgio Napolitani², David Sims¹, Stephen Taylor¹


PUBLISHED 04 Nov 2022

Author details

¹ Medical Research Council Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, Oxfordshire, OX3 9DS, UK
² Department of Haematology, King's College London, London, SE1 1UL, UK

Emmanouela Repapi
Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Project Administration, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Devika Agarwal
Roles: Data Curation, Formal Analysis, Validation, Visualization, Writing – Review & Editing

Giorgio Napolitani
Roles: Conceptualization, Supervision, Writing – Review & Editing

David Sims
Roles: Supervision, Writing – Review & Editing

Stephen Taylor
Roles: Funding Acquisition, Supervision, Writing – Review & Editing


This article is included in the Cell & Molecular Biology gateway.

Abstract

Background: The simultaneous measurement of cellular proteins and transcriptomes of single cell data has become an exciting new possibility with the advent of highly multiplexed multi-omics methodologies. However, mass cytometry (CyTOF) is a well-established, affordable technique for the analysis of proteomic data, which is well suited for the discovery and characterisation of very rare subpopulations of cells with a wealth of publicly available datasets.
Methods: We present and evaluate the multimodal integration of single cell RNA-Seq and CyTOF datasets coming from both matched and unmatched samples, using two publicly available datasets.
Results: We demonstrate that the integration of well annotated CyTOF data with single cell RNA sequencing can aid in the identification and annotation of cell populations with high accuracy. Furthermore, we show that the integration can provide imputed measurements of protein markers which are comparable to the current gold standard of antibody derived tags (ADT) from CITE-Seq for both matched and unmatched datasets. Using this methodology, we identify and transcriptionally characterise a rare subpopulation of CD11c positive B cells in high resolution using publicly available data and we unravel its heterogeneity in a single cell setting without the need to sort the cells in advance, in a manner which had not been previously possible.
Conclusions: This approach provides the framework for using available proteomic and transcriptomic datasets in a unified and unbiased fashion to assist ongoing and future studies of cellular characterisation and biomarker identification.

Keywords

single cell transcriptomics, single cell proteomics, CyTOF, mass cytometry, single cell integration, CD11c B cells

Corresponding author: Emmanouela Repapi

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by UKRI (MR/S005471/1; a Rutherford Fund Fellowship to E.R.), and Wellcome (209235, https://doi.org/10.35802/209235; a Collaborative Award to D.S.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2022 Repapi E et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Repapi E, Agarwal D, Napolitani G et al. Integration of single-cell RNA-Seq and CyTOF data characterises heterogeneity of rare cell subpopulations [version 2; peer review: 1 approved, 1 approved with reservations]. F1000Research 2022, 11:560 (https://doi.org/10.12688/f1000research.121829.2) First published: 23 May 2022, 11:560 (https://doi.org/10.12688/f1000research.121829.1) Latest published: 02 May 2023, 11:560 (https://doi.org/10.12688/f1000research.121829.3)

Revised Amendments from Version 1

We have added a section in the Methods describing alternative options of the parameters that were explored for the integration steps (detailed in Supplementary Table 1 which was added to this version). We have revised the methods section for clarity and expanded the “Integration of datasets of unmatched samples from different conditions” section of the results considerably to explore best options for the integration of unmatched samples. Furthermore, in the same section, we have included an additional integration of all the twenty-four PBMC samples with the CyTOF samples of healthy controls from COMBAT (“refPBMC Mixed and CyTOF HC integration”) to further assess the accuracy of the integration using different reference datasets.
We also revised Figure 3 and added Table 2, five Supplementary Figures and a Supplementary Table to address the helpful comments of the Reviewer.

See the authors' detailed response to the review by Tallulah Andrews
See the authors' detailed response to the review by Xiang Chen

Introduction

In the last decade the analysis of transcriptomes at single cell resolution has revolutionised molecular biology and become one of the most widely used techniques for answering fundamental biological questions of cell identities and functions.¹–³ Single-cell RNA sequencing (scRNA-Seq) and mass cytometry (CyTOF) have both contributed immensely to the characterisation of cellular states, the identification of novel populations and the understanding of cellular differentiation trajectories.¹,²,⁴–⁶ Protein and transcriptomic measurements in single cells, however, have been shown to provide different and complementary information regarding cell states.⁷,⁸ Joining the two fields can provide insightful observations on biological systems and functions at a cellular level.

The simultaneous quantification of the transcriptome and proteins of the same cells has now marked the beginning of a new era in which our understanding of cellular systems and processes will be transformed by coupling the phenotypic and transcriptomic landscapes of biology. However, in these early days of multiplex technologies where cost and sample accessibility can be a limiting factor, there is still a lot to be gained from the integration of independent and already available datasets. Mass cytometry can target both intracellular epitopes and epigenetic markers with high precision and also has the advantage of being able to identify novel and very rare populations while profiling up to hundreds of thousands of cells per sample.⁷,⁹,¹⁰ Most importantly, the large quantity and high quality of single-modality single-cell data that have already been generated in the last few years have the potential to reveal novel biological insights when integrated appropriately.³,¹¹ Therefore, the integration of CyTOF and scRNA-Seq can provide an affordable route to a unified view across modalities and aid in the functional characterisation of novel populations.

Here we present the multimodal integration of scRNA-Seq and CyTOF datasets coming from both matched samples (same patients profiled with the two platforms in the same dataset/study) and unmatched samples (different patients from distinct datasets/studies). We use the publicly available datasets from the multi-omic blood atlas of coronavirus disease 2019 (COVID-19) patients (COMBAT Study)¹² to present the integration of matched datasets and show how the imputed values projected from the integration compare to the current gold standard measurements of ADT, coming from the available CITE-Seq experiment. Furthermore, we integrate the COMBAT mass cytometry dataset with an unmatched multimodal human PBMC atlas¹³ as an example of how scRNA-Seq and CyTOF datasets originating from different samples can also be integrated with high accuracy. We demonstrate that the integration of the CyTOF with the scRNA-Seq can guide the difficult task of annotating scRNA-Seq datasets and show that the imputed values generated are correlated to the CITE-Seq protein levels. Finally, we use this integration technique to transcriptionally characterise, in a large meta-analysis of 12 COVID-19 scRNA-Seq datasets, a rare and heterogeneous subpopulation of CD11c+ B cells originally identified in the COMBAT mass cytometry dataset. This demonstrates the potential of the methodology to couple the unique strength of mass cytometry to identify very rare subpopulations with that of scRNA-Seq to functionally describe previously undefined subpopulations with unprecedented granularity.

Methods

Datasets and preprocessing

COMBAT datasets

For the integration of matched datasets, we downloaded mass cytometry (CyTOF) and single-cell combined transcriptome (RNA) with surface proteome (ADT) data of peripheral blood mononuclear cells (PBMCs) from the Covid-19 Multi-omics Blood ATlas (COMBAT) Consortium¹² (DOI 10.5281/zenodo.5139560). The datasets contained 91 overlapping samples from healthy controls and community COVID-19 cases in the recovery phase (never admitted to hospital) as well as from patients with mild, severe and critical COVID-19.

The scRNA-Seq and CyTOF datasets had been previously annotated, and the published annotations were retained as gold standard annotations. Both datasets were filtered to exclude cells with unclassified or uncertain annotations. Furthermore, subsets of cells that were extremely rare in the scRNA-Seq dataset (frequency less than 0.05%) and for which there were no overlapping features to identify them (namely reticulocytes and mast cells) were also removed prior to the integration. The scRNA-Seq and CITE-Seq datasets were further filtered to exclude low-quality cells based on i) number of genes, ii) number of features of ADT, iii) number of total UMI (RNA) or iv) number of total UMI of ADT lower than 0.001 of their relevant distributions. As this dataset is a subset of the published dataset (containing only the overlapping samples between scRNA-Seq and CyTOF), the raw RNA-Seq values were downloaded and re-integrated to mitigate batch effects. The batch correction was performed using the Seurat v3 Log normalize method with 50 PCA dimensions and 3000 highly variable genes and with the reciprocal PCA (RPCA) reduction for the anchor finding step.¹⁴ The batch correction was done on the 89 overlapping samples between scRNA-Seq and CyTOF (after the exclusion of 2 samples with fewer than 80 cells) on 182,000 cells by randomly subsampling 2000 cells from each sample and with the healthy controls as a reference. The published cell annotations from the COMBAT Consortium, which were created from expert immunological knowledge using data from all three modalities (GEX, ADT and VDJ),¹² were used for comparisons and the data were not re-clustered. The ADT dataset contained normalised values (normalisation using the DSB algorithm,¹⁵ details in Ahern et al.¹²) for 192 ADT tags, of which 39 had an equivalent CyTOF antibody.

The CyTOF dataset contained batch corrected and transformed values (arcsinh, with cofactor 5), for a panel of 45 markers (details for the batch correction can be found in Ahern et al.¹²). From those, 40 had an equivalent gene sequenced in the mRNA data (Table 1) and were used for the integration. The IgA and IgG antibodies had very poor staining in the mass cytometry experiment and were not included in any analysis. One channel included a combination of antibodies for IgD and TCRgd that was not used for the integration of the major cell types but was used for the integration of the B cells.
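
As a minimal illustration of the arcsinh transformation with cofactor 5 mentioned above, the following Python sketch applies it to a few made-up intensities. The function names and the example values are hypothetical and are not taken from the COMBAT pipeline; the batch correction step is not shown.

```python
import math

COFACTOR = 5.0  # cofactor stated in the text for the CyTOF data


def arcsinh_transform(intensity, cofactor=COFACTOR):
    """Return asinh(intensity / cofactor) for one raw channel value."""
    return math.asinh(intensity / cofactor)


def inverse_arcsinh(value, cofactor=COFACTOR):
    """Map a transformed value back to the raw intensity scale."""
    return cofactor * math.sinh(value)


# Hypothetical raw intensities for a single marker.
raw = [0.0, 5.0, 50.0, 500.0]
transformed = [arcsinh_transform(x) for x in raw]
```

The transformation is close to linear near zero and close to logarithmic for large values, which is why it is commonly used to compress the wide dynamic range of cytometry intensities.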

Table 1. Table with gene and protein correspondences.

The gene names that correspond to the proteins from the mass cytometry (CyTOF) dataset and to the antibody derived tags (ADT) names from the COMBAT and refPBMC datasets.

CyTOF	Gene	ADT - COMBAT	ADT - refPBMC
BCL-2	BCL2	NA	NA
CCR7	CCR7	CCR7	NA
CD103	ITGAE	CD103	CD103
CD11c	ITGAX	CD11c	CD11c
CD123	IL3RA	CD123	CD123
CD127	IL7R	CD127	CD127
CD14	CD14	CD14	CD14
CD141	THBD	CD141	CD141
CD16	FCGR3A	CD16	CD16
CD161	KLRB1	CD161	CD161
CD19	CD19	CD19	CD19
CD20	MS4A1	CD20	CD20
CD25	IL2RA	CD25	CD25
CD27	CD27	CD27	CD27
CD28	CD28	CD28	CD28
CD3	CD3E	CD3	CD3-1
CD33	CD33	CD33	NA
CD38	CD38	CD38	CD38-1
CD39	ENTPD1	CD39	CD39
CD4	CD4	CD4	CD4-2
CD45	PTPRC	CD45	CD45-2
CD45RA	NA	CD45RA	CD45RA
CD45RO	NA	CD45RO	CD45RA
CD56	NCAM1	CD56	CD56-1
CD57	B3GAT1	CD57	CD57
CD66	CEACAM8	CD66	CD66a/c/e
CD69	CD69	CD69	CD69
CD8	CD8A	CD8	CD8
CD99	CD99	CD99	CD99
CLA	SELPLG	NA	NA
CTLA4	CTLA4	CTLA4	NA
CX3CR1	CX3CR1	CX3CR1	CX3CR1
FOXP3	FOXP3	NA	NA
GZB	GZMB	NA	NA
HLA-DR	HLA-DRA	HLA-DR	HLA-DR
IgA	not used	not used	NA
IgD/IgD-TCRgd	IGHD (only for B cell integration)	IgD (only for B cell integration)	IgD (only for B cell integration)
IgG	not used	not used	NA
IgM	IGHM	IgM	IgM
Ki-67	MKI67	NA	NA
KLRG1	KLRG1	KLRG1	NA
PD1	PDCD1	PD1	NA
Siglec-8	SIGLEC8	NA	Siglec-8
Va7-2	TRAV1-2	Va7-2	TCR-V-7.2
Vd2	TRDV2	Vd2	TCR-V-2

For the B cell datasets, we extracted all cells from the scRNA-Seq that had been annotated as B cells (32,728 cells) and performed a batch correction with the healthy controls as a reference using Seurat v3¹⁴ (89 samples after the exclusion of 2 samples with fewer than 80 cells). For the CyTOF dataset we subsampled 700 B cells per sample (57,159 cells).

Human PBMC Atlas datasets

The processed (after cellranger and QC) and annotated cell count matrices and the metadata of the scRNA-Seq were downloaded from the GEO database (GEO accession GSE164378). The normalised and processed data object for the 3prime CITE-Seq dataset (ADT) was downloaded from https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat (from here on named refPBMC). For the scRNA-Seq, each of the 24 samples was individually log normalised and scaled with linear regression of the covariate nUMI. To identify shared cell types across samples, batch correction and sample integration were performed for donor and time across the 24 samples using the Seurat v3 Log normalize anchor based reciprocal PCA (RPCA) workflow with the unvaccinated samples (Day 0 samples) designated as the reference dataset, with 50 PCA dimensions and 3000 highly variable genes.

From the 161,764 cells, 146,480 cells were retained after filtering out cells annotated as doublets and those with fewer than 10 UMI counts for the common features between the scRNA and CyTOF datasets. Additionally, the celltype.l2 annotations from the publication were renamed and grouped to best match the annotations from the COMBAT CyTOF dataset for visualisation and evaluation of the integration of the two modalities, but still included the cell types exclusive to the refPBMC Atlas (erythroid cells, hematopoietic stem and progenitor cells-HSPCs, innate lymphoid cells-ILCs and platelets). The renamed annotations were: B (B naïve, B intermediate and B memory); CD4 (including all subclusters of CD4); CD8 (with all subclusters of CD8); dendritic cells (DCs) (ASDC, cDC1, cDC2, pDC); and NK (NK, NK Proliferating and NK CD56 Bright).
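
The filtering and regrouping described above can be sketched in Python as follows. This is only an illustration: the label spellings follow the text rather than the actual metadata, the prefix rule used for the CD4 and CD8 subclusters and the "Doublet" label are assumptions, and all function names are hypothetical.

```python
# Grouping rules paraphrased from the text; label spellings are assumptions.
EXACT_GROUPS = {
    "B naïve": "B", "B intermediate": "B", "B memory": "B",
    "ASDC": "DC", "cDC1": "DC", "cDC2": "DC", "pDC": "DC",
    "NK": "NK", "NK Proliferating": "NK", "NK CD56 Bright": "NK",
}
# Hypothetical rule: any label starting with "CD4" or "CD8" joins that group.
PREFIX_GROUPS = (("CD4", "CD4"), ("CD8", "CD8"))


def group_annotation(label):
    """Map a fine-grained label to its coarse group, or keep it unchanged."""
    if label in EXACT_GROUPS:
        return EXACT_GROUPS[label]
    for prefix, group in PREFIX_GROUPS:
        if label.startswith(prefix):
            return group
    return label  # e.g. cell types exclusive to the reference atlas


def passes_filter(annotation, n_common_umi, min_umi=10):
    """Drop doublets and cells with fewer than min_umi UMIs on shared features."""
    return annotation != "Doublet" and n_common_umi >= min_umi
```

Labels that match no rule are returned unchanged, which mirrors the statement that atlas-specific cell types were kept under their own names.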

For the Day 0 integration, the refPBMC was filtered to include only the unvaccinated (Day 0) samples (47,962 cells after filtering of cells on the same thresholds as the full dataset). The unvaccinated samples were regressed for the nUMI covariate, batch corrected and integrated across the 8 donors using the same method and parameters as for the full dataset. The same process was followed for integrating the other two conditions (Day 2 and Day 7) separately.

COVID-19 datasets for the B cell CD11c+ subpopulation characterisation

For the characterisation of the B cell subpopulations, the count matrices, standardised metadata and the predicted celltypes were obtained from the processed HDF5 objects generated as part of the COVID-19 meta-analysis study carried out by Tian et al.¹⁶ (downloaded from https://atlas.fredhutch.org/fredhutch/covid/). The RNA count data matrices of all PBMC datasets were extracted for the cells which were predicted to be B cells for the predicted.celltype.l1 and we further removed cells annotated as Plasmablasts in the predicted.celltype.l2. Additionally, to exclude any non-B cells that might have erroneously been predicted as B cells in the original meta-analysis, only cells that had zero UMI counts of CD3E, GNLY, CD14, FCER1A, FCGR3A, LYZ, PPBP and CD8A genes were retained, as previously done by Stewart et al.¹⁷ After filtering for non-B cells, datasets with fewer than 60 cells were removed, leaving 13 datasets for downstream analysis.
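
A minimal Python sketch of the two filters described above (zero UMI counts for the listed non-B-cell genes, and removal of datasets with fewer than 60 cells) is given below. The gene list and the cut-off come from the text; the data structures and function names are hypothetical, and the original analysis was carried out in R.

```python
# Genes listed in the text; a cell is retained only if all have zero UMIs.
EXCLUSION_GENES = ("CD3E", "GNLY", "CD14", "FCER1A",
                   "FCGR3A", "LYZ", "PPBP", "CD8A")


def is_clean_b_cell(counts, exclusion_genes=EXCLUSION_GENES):
    """True if the cell has zero UMIs for every exclusion gene.

    counts: dict mapping gene name to UMI count (missing genes count as 0).
    """
    return all(counts.get(gene, 0) == 0 for gene in exclusion_genes)


def drop_small_datasets(cells_by_dataset, min_cells=60):
    """Remove datasets that are left with fewer than min_cells cells."""
    return {name: cells for name, cells in cells_by_dataset.items()
            if len(cells) >= min_cells}


# Toy example with hypothetical counts.
cell_a = {"MS4A1": 7, "CD79A": 3}
cell_b = {"MS4A1": 2, "LYZ": 1}
kept = [c for c in (cell_a, cell_b) if is_clean_b_cell(c)]
```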

The PBMC dataset of Su et al.¹⁸ was processed separately in order to use the ADT tags as validation of the CD11c+ subpopulation (25,443 cells). The dataset was split by batch and then individually log-normalised, scaled and linearly regressed for the covariates nUMI and sex. Then it was batch corrected with the Seurat v3 Log normalize anchor based workflow with 1000 common highly variable genes identified using the SelectIntegrationFeatures function and 15 CCA dimensions. UMAP dimensional reduction analysis was performed on the integrated object after scaling and regressing out sex with 15 PCA dimensions.

Each of the 12 remaining datasets was first individually log-normalised and scaled with linear regression for the covariates nUMI, sample and sex. The normalised and scaled dataset objects were then integrated and batch corrected with the Seurat v3 Log normalize anchor based workflow with 1000 common highly variable genes and 15 CCA dimensions. Subsequently, the integrated assay was filtered to only include samples from COVID-19 patients (77,485 cells) and it was rescaled and regressed for the covariate sex. UMAP dimensional reduction analysis was performed with 15 PCA dimensions on the batch corrected log normalised counts for downstream analyses.

Integration of CyTOF and scRNA-Seq datasets

For all analyses, the scRNA-Seq cells that had very low expression for all common protein-mRNA genes (less than 5 reads in total on all genes) were filtered out. The anchors between the query (scRNA-Seq) and reference (CyTOF) datasets were found based on a canonical correlation analysis (CCA) (with 20 dimensions for COMBAT and 25 dimensions for the reference PBMC dataset) using the FindTransferAnchors function from Seurat v3¹⁴ (RRID:SCR_016341) and the defaults for the remaining parameters. Subsequently, the anchors were used to project the annotations of the CyTOF to the scRNA-Seq using the TransferData function with the integrated (batch corrected) PCA of the scRNA-Seq for the weight reduction and the CyTOF annotations as refdata. Equivalently, the TransferData function was also used to impute the CyTOF markers on the scRNA-Seq cells with the same parameters but using the CyTOF intensity values of the overlapping features as refdata. Finally, the imputed values were merged with the CyTOF data using the merge function and centered in order to run the UMAP¹⁹ and PAGA²⁰ analyses to visualize all the cells together. The PAGA analyses were run using the scanpy toolkit (v1.9.1).²¹ We would like to highlight here that the ADT data is not used in any of the steps of the integration, but at the evaluation stage only.
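
To make the transfer step more concrete, the sketch below shows the general idea of projecting reference information onto a query cell through weighted anchors: a continuous marker is imputed as a weighted average of reference values, and a label is chosen by the largest share of weight. This is a simplified Python illustration under stated assumptions, not Seurat's TransferData implementation; the weights, values and function names are hypothetical.

```python
from collections import defaultdict


def transfer_continuous(weights, ref_values):
    """Impute a marker as the weighted average of reference (CyTOF) values.

    weights: list of (reference_index, weight) pairs for one query cell.
    """
    total = sum(w for _, w in weights)
    return sum(w * ref_values[i] for i, w in weights) / total


def transfer_label(weights, ref_labels):
    """Return (label, score): the label holding the largest share of weight."""
    total = sum(w for _, w in weights)
    share = defaultdict(float)
    for i, w in weights:
        share[ref_labels[i]] += w / total
    label = max(share, key=share.get)
    return label, share[label]


# Hypothetical reference cells and anchor weights for one query cell.
ref_cd11c = [0.2, 0.4, 2.0]
ref_labels = ["B", "B", "cMono"]
query_weights = [(0, 0.5), (1, 0.25), (2, 0.25)]

imputed = transfer_continuous(query_weights, ref_cd11c)
label, score = transfer_label(query_weights, ref_labels)
```

Because the imputed value is a weighted average, it always lies within the range of the reference intensities that contribute to it.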

Alternative options for both the anchor finding and data transferring steps of the integration were assessed at the initial stages of the analysis. These included: a) using the reciprocal PCA (rpca) reduction method in the FindTransferAnchors function (Analysis1); b) using the pcaproject reduction method in the FindTransferAnchors function (Analysis2); c) using the canonical correlation analysis (CCA) reduction as a weight reduction method for the TransferData function (Analysis3); d) using the pcaproject reduction method in both the FindTransferAnchors and the TransferData functions (Analysis4) and e) using the non-integrated (non-batch corrected) PCA of the scRNA-Seq for the weight reduction in the TransferData function (Analysis5). These options were tried independently, keeping all the remaining parameters identical (apart from the ones listed above), and were evaluated using the Adjusted Rand Index (ARI) and F1 scores. Analysis1 gave very similar results to the parameters described above, but all the remaining analyses were found to be suboptimal (Supplementary Table 1).

To assess the robustness of the integration for various cell numbers, the COMBAT scRNA-Seq and CyTOF datasets were downsampled to 9,100/22,750/45,500/91,000 and 182,000 cells for each modality by randomly subsampling 100/250/500/1000/2000 cells from each sample. The main results were presented for the dataset of 45,500 cells, as the most representative scenario of scRNA-Seq studies.
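
The downsampling scheme can be illustrated with a short Python sketch that draws a fixed number of cells per sample. The stated totals correspond to 91 samples multiplied by the per-sample draw; the helper name, the seed and the toy data are hypothetical, and capping the draw at the sample size is an assumption for samples with too few cells.

```python
import random
from collections import defaultdict


def subsample_per_sample(cell_to_sample, n_per_sample, seed=0):
    """Randomly keep at most n_per_sample cell IDs from each sample."""
    rng = random.Random(seed)
    by_sample = defaultdict(list)
    for cell_id, sample_id in cell_to_sample.items():
        by_sample[sample_id].append(cell_id)
    kept = []
    for sample_id in sorted(by_sample):
        cells = sorted(by_sample[sample_id])
        k = min(n_per_sample, len(cells))  # assumption: cap at sample size
        kept.extend(rng.sample(cells, k))
    return kept


# Toy data: 3 hypothetical samples with 10 cells each.
toy = {f"s{s}_c{c}": f"s{s}" for s in range(3) for c in range(10)}
kept = subsample_per_sample(toy, n_per_sample=4)

# Totals implied by the text: 91 samples times the per-sample draw.
totals = {n: 91 * n for n in (100, 250, 500, 1000, 2000)}
```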

For the transcriptional characterisation of the B cell subpopulations, we integrated the CyTOF COMBAT dataset separately with 1) the scRNA-Seq COMBAT dataset, 2) the Su et al. PBMC scRNA-Seq dataset and 3) a meta-analysis of 12 PBMC COVID-19 scRNA-Seq datasets. The first two integrations served in finding a transcriptional signature for the CD11c+ cells, whereas the integration with the meta-analysis of the COVID-19 datasets was done to further define the heterogeneity within the CD11c+ subpopulation. For all integrations, a reduced set of 28 B cell and Plasma markers was utilised based on the features that were used to annotate these subpopulations in Ahern et al.¹² The anchors and the projections of annotations and intensity values from the CyTOF were performed using a canonical correlation analysis (CCA) with 15 dimensions and k.weight=30 for the COMBAT and Su et al. integrations and k.weight=20 for the integration with the 12 COVID-19 scRNA-Seq datasets.

Evaluation criteria

In order to evaluate how well the predicted annotations converged to the true cell types we used the Hubert-Arabie Adjusted Rand Index (ARI) for measuring the correspondence between two partitions of an object set²² and the F1 score. The ARI metric is adjusted for chance and random clusterings are expected to have an ARI equal to 0 and identical clusterings to 1. The metric compared the projected cell types from the CyTOF to the published cell annotations (including both common and dataset specific clusters) and was calculated using the implementation in the mclust R package²³ (v5.4.9). The F1 score between the predicted and published annotations was calculated as the harmonic mean of precision and sensitivity for each cluster. For F1, the DN T cells and the cells for which there was not an equivalent annotation in both datasets were all grouped as Other. The cells from the cluster Other were retained to calculate the precision and sensitivity (misclassified cells) for each cluster but the F1 measure was not calculated for this cluster as it was a mix of multiple clusters. Finally, the F1 scores of two extra populations, the predicted Naïve CD4 and Naïve CD8 populations were calculated over T cells only.
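
For readers who want to see the two metrics spelled out, the following Python sketch computes the Hubert-Arabie ARI from pair counts and a per-label F1 score as the harmonic mean of precision and sensitivity. It is an illustrative re-implementation with hypothetical function names and toy labels; the analysis in the text used the mclust R package for the ARI, and the special handling of the Other group is not reproduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie ARI between two labelings of the same cells."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def f1_per_label(true_labels, predicted_labels):
    """F1 per true label: harmonic mean of precision and sensitivity."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in set(true_labels):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + sensitivity
        scores[label] = 2 * precision * sensitivity / denom if denom else 0.0
    return scores


# Toy labels: one CD8 cell is wrongly predicted as CD4.
published = ["B", "B", "CD4", "CD4", "CD8", "CD8"]
predicted = ["B", "B", "CD4", "CD4", "CD8", "CD4"]
ari = adjusted_rand_index(published, predicted)
f1 = f1_per_label(published, predicted)
```

In the toy example the single misclassified cell lowers the precision of CD4 and the sensitivity of CD8, while B cells keep a perfect score.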

Pseudobulk correlations

To compare the imputed CyTOF intensities with the ADT normalised values at the pseudobulk level, we summarised the values (using the mean) on the pseudobulk clusters’ annotations (as annotated in the COMBAT study¹²) and the sample ID, using the function aggregateAcrossCells from the package scuttle²⁴ (v 1.4). Then the values were correlated using Pearson’s correlation.
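
The pseudobulk comparison can be sketched in Python as below: per-cell values are averaged within each cluster and sample combination, and the resulting pseudobulk profiles are compared with Pearson's correlation. This mirrors the idea only; the text used aggregateAcrossCells from the scuttle R package, and the function names and toy values here are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pseudobulk_mean(values, clusters, samples):
    """Average per-cell values within each (cluster, sample) group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, cluster, sample in zip(values, clusters, samples):
        sums[(cluster, sample)] += value
        counts[(cluster, sample)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Hypothetical per-cell values for one marker.
clusters = ["B", "B", "NK", "NK", "B", "NK"]
samples = ["s1", "s1", "s1", "s1", "s2", "s2"]
imputed = [2.0, 4.0, 0.5, 1.5, 3.5, 0.8]
adt = [1.0, 2.0, 0.2, 0.4, 1.8, 0.3]

imp_pb = pseudobulk_mean(imputed, clusters, samples)
adt_pb = pseudobulk_mean(adt, clusters, samples)
keys = sorted(imp_pb)  # same group order for both modalities
r = pearson([imp_pb[k] for k in keys], [adt_pb[k] for k in keys])
```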

Transcriptional characterisation of cells

CD11c+ cells were determined using a threshold defined by the imputed protein values. We reasoned that the distribution of the imputed values for CD11c followed the CyTOF distribution of that marker, since the imputation of the markers is a weighted average of the original values. Therefore, we first selected the CD11c+ populations of the CyTOF annotations. Then, for those cells, we chose the 1st quartile of the CD11c intensity distribution, which gave an intensity threshold of 1.6 (Supplementary Figure 1, Extended data⁴²) to define the CD11c+ cells. Accordingly, all cells with an imputed CD11c value greater than 1.6 in the Su et al. dataset were defined as CD11c+ cells. For the integration with the 12 COVID-19 datasets, the predicted annotations were used to define the subpopulations of interest (the cluster “CLA+ Switched memory B cells” comprised only 23 cells and was removed from the downstream analysis). The transcriptional identity of the subpopulations was ascertained by finding differentially expressed genes (DEGs) using Seurat’s implementation of the Wilcoxon rank-sum test (FindMarkers default parameters, adjusted p value < 0.05). A gene set enrichment analysis (GSEA) was performed using the differential gene expression (DGE) results generated from FindMarkers (with parameters logfc.threshold=0 and min.pct = 0) and with the R package clusterProfiler²⁵ (v 4.3.3) using the fgsea algorithm with the maximal size of genes annotated for testing set to 800 and minimum gene set size set to 10.
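
The thresholding logic for CD11c can be sketched in Python as follows: the first quartile of the reference intensities among annotated CD11c+ cells becomes the cut-off, and query cells with an imputed value above it are called positive. The intensities are made up for illustration, the quantile convention ("inclusive" interpolation) is an assumption, and the function names are hypothetical.

```python
from statistics import quantiles


def first_quartile(values):
    """First quartile using the 'inclusive' method (one of several conventions)."""
    return quantiles(values, n=4, method="inclusive")[0]


def call_positive(imputed_values, threshold):
    """Flag cells whose imputed value is strictly greater than the threshold."""
    return [value > threshold for value in imputed_values]


# Hypothetical CyTOF CD11c intensities of cells annotated as CD11c+.
cytof_cd11c_pos = [1.2, 1.6, 2.0, 2.4, 2.8]
threshold = first_quartile(cytof_cd11c_pos)

# Hypothetical imputed CD11c values on scRNA-Seq cells.
flags = call_positive([0.4, 1.7, 1.6, 3.0], threshold)
```

Using the first quartile of the positive population, rather than its minimum, makes the cut-off less sensitive to a few low-intensity cells in the reference annotation.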

The R scripts used to run the analyses presented in this article are available from GitHub and archived with Zenodo.⁴³

Results

The ultimate aim of multimodal integrations is to take advantage of the complementary information in the datasets for a holistic understanding of cellular processes of interest. Stuart et al. demonstrated that transferring protein measurements across datasets can be a powerful tool to further our understanding of cell subpopulations. They also showed that this transfer can lead to accurate protein predictions even in the absence of a strong correlation between the RNA and protein of a gene, if alternative combinations of genes exhibit expression patterns that are correlated with cellular immunophenotypes.¹⁴ However, these integrations were performed based on RNA features alone. Here we extend this work to show that the integration of transcripts and proteins from RNA and CyTOF datasets based on mRNA-protein correspondences can also yield accurate predictions of cell populations and imputed protein features.

We used mass cytometry and CITE-Seq data from the COMBAT Consortium¹² to showcase and validate the integration of the two modalities. The two datasets were downsampled to 45,500 cells for each modality (sampled as 500 cells per sample) to approximate a balanced real-life scenario and were integrated with the Seurat v3 software,¹⁴ using CyTOF as the reference and scRNA-Seq as the query. After finding the anchors between the two datasets, we used them to transfer the protein intensities to the RNA dataset, creating imputed values of the protein markers for each cell. We then merged the datasets to obtain a co-embedding of the two datasets in the same space (Figure 1A).

Figure 1. CyTOF can help define main populations of scRNA-Seq with high accuracy.

A. UMAP of the integrated COMBAT dataset coloured according to cell type (left) and split by modality (right). Red cells originate from the CyTOF (mass cytometry) dataset and the blue cells from the scRNA-Seq dataset, showing a good overlap between the two modalities.

B. Riverplot of the predicted labels (annotations) for each cell (right) compared to their actual major cell type (published annotations) in the COMBAT dataset (left).

C. Barplot of the F1 scores of the major cell types for different sample sizes of the COMBAT dataset. Each bar shows the F1 score for the integration using a downsampled subset of the data (number of cells according to colour).

D. Frequencies of the observed clusters (per sample cluster frequency based on published annotations) (x axes) compared to the frequencies as predicted by the integration from the CyTOF annotations (y axes) (per sample predicted cluster frequency) in the COMBAT dataset.

We observed that all major populations of the integrated dataset appear well separated in the uniform manifold approximation and projection (UMAP) visualization and that all common populations are represented equally from both modalities (Figure 1A). Importantly, the T and NK populations are clearly distinguished, separating CD4, CD8 and NK cells. Interestingly, dataset specific subpopulations, such as the basophils from the CyTOF and the platelets from the RNA, form dataset-specific, distinct islands in the UMAP and are not mixed with other populations. The only populations that do not separate clearly and seem to be dispersed across the island of the monocytes are the conventional dendritic cells (cDCs) (cDC1 and cDC2 populations). To further validate the relationships of the cell types in a topologically meaningful embedding, we also generated the PAGA-initialized single-cell embeddings and PAGA graphs (Supplementary Figure 2), which confirmed the above observations. These results are in line with the findings in Hao et al. that CD8+ and CD4+ T cells were partially mixed when analysing the transcriptome but separated clearly when clustering on protein data, whereas cDCs formed distinct clusters when analysing RNA but were interspersed with other cell types based on surface protein abundance from the CITE-Seq dataset.¹³

Mass cytometry and RNA integration facilitates the identification of major cell populations

One of the hardest tasks in the analysis of scRNA-Seq experiments is the annotation of cell types post clustering. Certain populations, such as CD4/CD8 T and CD8/NK cells, are notoriously difficult to disentangle from the transcriptional data alone.¹³,²⁶,²⁷ Using the anchors provided by Seurat, the cell annotations of CyTOF can be projected to the scRNA dataset to inform the cluster identification of the latter. To validate the accuracy of these predictions, we compared the projected cell types from the CyTOF to the published cell annotations from the COMBAT Consortium, which were created using expert immunological knowledge to guide a curated manual integration of the data from the different modalities (GEX, ADT and VDJ)¹² (Figure 1B-D). Using the adjusted Rand Index (ARI) as a measure of the similarity between two data annotations, we verified a high correspondence of the two label lists, with an ARI of 0.832. Furthermore, we calculated the F1 score of each population to assess the precision and sensitivity of the projections (Figure 1C). B cells and plasmablasts were the cell populations with the best scores, followed by classical monocytes, NK and CD4 T cells, all having a score above 0.9. Non classical monocytes, CD8 and MAIT cells had slightly reduced scores but were still well defined. On the other hand, 40% of gamma delta (γδ) T cells (GDT) were wrongly predicted as CD8 cells, due to the lack of common markers defining this population. TCRgd, the main marker for distinguishing GDT cells, could not be used for the integration because its antibody was added in the same channel as IgD for efficiency, and thus there was no longer a one-to-one correspondence between the genes and the protein channels. Furthermore, a large proportion of the DC cells were misclassified as classical monocytes, in line with previous observations that cDCs are well defined from transcriptomic signatures, but not easily separated based on surface protein abundance.¹³ Applying a filter for the prediction score (0.6) markedly improved the F1 scores of DC and GDT to 0.798 and 0.665, respectively, but reduced the number of cells with a prediction by 10.3%. These observations were further validated by comparing the observed frequencies of the common cell types in the scRNA to the predicted annotations from CyTOF per sample (Figure 1D).
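
The prediction score filter can be illustrated with a small Python sketch that keeps only cells whose score reaches the cut-off and reports the fraction of cells left without a prediction. The 0.6 cut-off comes from the text; whether the comparison is inclusive is an assumption, and the labels, scores and function name are hypothetical.

```python
def filter_by_score(predictions, min_score=0.6):
    """Keep (label, score) pairs whose score reaches min_score.

    Returns the retained pairs and the fraction of cells that were removed.
    """
    kept = [(label, score) for label, score in predictions
            if score >= min_score]  # assumption: inclusive cut-off
    removed_fraction = 1 - len(kept) / len(predictions)
    return kept, removed_fraction


# Hypothetical predicted labels with their prediction scores.
predictions = [("DC", 0.55), ("DC", 0.91), ("GDT", 0.62),
               ("cMono", 0.98), ("CD8", 0.40)]
kept, removed_fraction = filter_by_score(predictions)
```

The trade-off described in the text is visible here as well: a stricter cut-off removes uncertain calls but leaves more cells without any label.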

To quantify the amount by which the predictions are influenced by the number of cells in the two modalities, alternative scenarios were also examined. Smaller subsets were used to assess the robustness of the projections for lower numbers of cells and two scenarios of larger datasets were used to evaluate whether an increased number of cells further improves the predictions. Therefore, the analysis was repeated for 9,100, 22,750, 91,000 and 182,000 cells per modality. Most predictions remain consistently high even for the lower subsets of cells and increasing the number of cells does not seem to help all the clusters with low scores (Figure 1C). Finally, increasing the number of cells in the CyTOF dataset to 182,000 cells with the same resolution for the scRNA (45,500 cells) also does not provide a significant improvement in the scores (data not shown).

Imputation of common and dataset specific markers

We next explored the potential of the transcriptomic and proteomic integration to reliably impute the protein markers on the scRNA data. The aim was to assess the accuracy of imputed markers for features that were used in the integration (common genes-proteins between RNA and CyTOF) and for features that did not have an equivalent gene, such as CD45RA and CD45RO. To evaluate the imputation, the transferred protein intensities were compared to the ADT values from the CITE-Seq data, which was considered the “ground truth” of protein measurements.

As expected, the accuracy of the imputation, measured using Pearson’s correlation of the imputed protein with the ADT, reflected to a certain degree the strength of the correlation between the RNA and real protein measurements (ADT) (Figure 2A). For genes whose correlation between RNA and ADT was very low in the CITE-Seq data (below 0.2), the imputed protein values were not a good proxy for the ADT either. This included markers that were not relevant in the dataset (CD66 is a marker for neutrophils, of which there are none in this dataset) and antibodies that did not perform well in the ADT (CTLA4 and CCR7). Notable exceptions were CD141 (a marker for the cDC1 dendritic cells) and CD57 (a marker of cytotoxicity in T and NK populations), for which the integration recovered good proxies for the protein even though the correlation between the transcriptome and the protein was poor.
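As an illustration of this comparison, the following Python sketch computes Pearson's correlation for one gene-protein pair and assigns the RNA-to-ADT correlation to the poor, medium or strong band used in Figure 2A. The helper names and the per-cell values are hypothetical and are not taken from the datasets analysed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (>= 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # a constant vector has no defined correlation
        return float("nan")
    return sxy / sqrt(sxx * syy)


def correlation_band(r):
    """Band of an RNA-to-ADT correlation: poor (<0.2), strong (>0.7), else medium."""
    if r < 0.2:
        return "poor"
    if r > 0.7:
        return "strong"
    return "medium"


# Hypothetical values for six cells and a single marker.
rna = [0, 0, 1, 0, 2, 0]                  # sparse transcript counts
adt = [0.1, 0.2, 2.5, 2.8, 3.0, 0.3]      # measured protein (ADT)
imputed = [0.2, 0.1, 2.4, 2.6, 2.9, 0.2]  # imputed CyTOF intensity

r_rna_adt = pearson(rna, adt)
r_imputed_adt = pearson(imputed, adt)
print(correlation_band(r_rna_adt), round(r_rna_adt, 2), round(r_imputed_adt, 2))
```

Each point in a plot such as Figure 2A would correspond to one such pair of coefficients, with the RNA-to-ADT value on the x axis and the imputed-to-ADT value on the y axis.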

Figure 2. Imputation of common and dataset specific markers.

A. Scatterplot showing the correlation coefficients of the RNA to the ADT for each gene-protein pair (x axes) versus the correlations of the imputed CyTOF (mass cytometry) values to their equivalent ADT (y axes). The three coloured bands depict the strength of the correlations between RNA and ADT values of each gene-protein pair in regions of poor (Pearson’s correlations<0.2, red), medium (0.2<Pearson’s correlations<0.7, orange) and strong (Pearson’s correlations>0.7, green) correlations.

B. UMAP of the integrated dataset (constrained to the cells from the scRNA-Seq alone for all subpanels) showing the RNA expression (1st column) of CD4 (top row) and CD8 (bottom row) versus the ADT (2nd column) and the imputed CyTOF (3rd column).

C. Density plots of the CD4 expression (top 1st column), the CD4 imputed CyTOF intensity (top 2nd column) and the CD4 ADT (top 3rd column) followed by a violin plot of the CD4 imputed intensities according to the published annotations (bottom).

D. Scatterplots of pseudobulk populations per sample and cell type for CD45RO (left) and CD45RA (right). The averaged imputed value of the proteins is depicted on the x axes and the averaged observed ADT value for the same populations on the y axes.

Interestingly, most of the protein markers had a stronger correlation with their ADT counterpart than they did with their respective gene, highlighting that the anchor-based integration was able to recapitulate the protein intensities using the population structure similarities and not directly using the correlations of the proteins with the genes. As a result, we speculated that this integration can be used even for populations for which the correlation between the genes and the proteins is low to moderate. As additional evidence for this, and to further assess whether this methodology could define populations which are transcriptionally challenging to separate, we focused on the CD4 and CD8 separation. As shown in Figure 2B–C, the imputed values of CD4 provide a very strong signal for the CD4 cells, emphasising the differences between positive and negative populations, and are strongly correlated with the ADT values (Pearson’s correlation 0.80), even though the gene is very lowly expressed and very poorly correlated with the protein values (Pearson’s correlations of expression to the ADT: 0.34 and to the imputed CyTOF: 0.24). In more detail, only 33.2% of the annotated CD4 cells had any expression of CD4 (more than 0 reads), whereas 90% of them had a high value for the imputed protein (imputed intensity > 2), suggesting that the imputed values could also be used to annotate scRNA-Seq datasets. Intriguingly, 12% of the CD8 cells also appeared to be CD4 positive using the imputed values, representing a CD8 subpopulation that was misaligned by the integration. Most of the CD8 cells with erroneously high imputed CD4 came from the CD8 T central memory (TCM) subpopulation, which has the lowest expression of CD8 amongst the CD8 subpopulations (Supplementary Figure 3, Extended data⁴²).
A differential gene expression analysis between the two subpopulations (CD8.TCM vs CD4.TCM) revealed only 4 genes with a significant adjusted p-value and a logFC above 1 (CD8, CD8B, CD4 and CD161), suggesting a very similar transcriptional signature between the two populations, which explains this discordance.
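The detection rates quoted above (cells with more than 0 reads versus cells with an imputed intensity above 2) amount to a simple fraction over the annotated cells. A minimal sketch, using a hypothetical helper and made-up values for six annotated CD4 cells:

```python
def fraction_above(values, cutoff):
    """Fraction of cells whose value is strictly greater than the cutoff."""
    if not values:
        raise ValueError("no cells supplied")
    return sum(v > cutoff for v in values) / len(values)


# Hypothetical values for six cells annotated as CD4 T cells.
cd4_rna_counts = [0, 0, 1, 0, 3, 0]
cd4_imputed = [2.5, 3.1, 2.8, 1.2, 3.3, 2.9]

rna_detected = fraction_above(cd4_rna_counts, 0)   # any CD4 read
protein_high = fraction_above(cd4_imputed, 2)      # imputed intensity > 2
print(rna_detected, protein_high)
```

The cutoff of 2 mirrors the value quoted in the text; an appropriate cutoff for another marker or dataset would have to be chosen from its own intensity distribution.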

Finally, we examined the correlations between the imputed CD45RA and CD45RO with their respective ADT. Even though these two markers were not used for the integration, weighted projected values could be calculated for each cell based on its anchors. Both CD45RO and CD45RA displayed a moderate Pearson correlation with the equivalent ADT (0.595 and 0.563 respectively, Supplementary Figure 4, Extended data⁴²), but when the correlations were repeated on a pseudobulk level per subpopulation and sample, both markers exhibited a stronger correlation with the ADT, with coefficients of 0.759 and 0.723 respectively (Figure 2D), suggesting that even though the individual single cell predictions can be noisy, the bulk estimates are much more accurate. This was supported by the observation that the imputations accentuated the differences between the positive and negative populations, creating a smoothing effect on the population imputed antibody intensities (Supplementary Figure 5, Extended data⁴²). CD45RA and CD45RO are distinct isoforms of the CD45 protein and are encoded by the PTPRC gene. Following the same pattern as their ADT equivalent, neither imputed marker correlated with the PTPRC gene expression and they were shown to be mutually exclusive (Supplementary Figure 6). Furthermore, CD45RA and CD45RO are primarily markers that are used for the identification of naïve and memory populations of T cells. To provide additional evidence that the T cell subpopulation structure was captured correctly by the integration, we calculated the F1 scores of the predicted Naïve CD4 and CD8 populations. Both populations had a high F1 score, with a value of 0.781 for Naïve CD4 and 0.801 for Naïve CD8. These observations further highlight the ability of the imputation to determine protein expression from the overall RNA profile of the cells rather than specific gene RNA expression patterns.
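The pseudobulk comparison can be sketched as follows: single-cell values are averaged per sample and cell type, and the correlation is then computed on the averages. This is a minimal Python illustration with hypothetical sample names, cell types and intensities, not the code used for Figure 2D.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pseudobulk_means(samples, cell_types, values):
    """Average a per-cell value within each (sample, cell type) group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for sample, cell_type, value in zip(samples, cell_types, values):
        sums[(sample, cell_type)] += value
        counts[(sample, cell_type)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical data: two samples, two cell types, two cells per group.
samples = ["S1"] * 4 + ["S2"] * 4
cell_types = ["naive", "naive", "memory", "memory"] * 2
imputed = [0.2, 0.6, 2.2, 1.8, 0.5, 0.3, 2.1, 1.9]
adt = [0.8, 0.2, 1.6, 2.4, 0.1, 0.7, 2.5, 1.7]

bulk_imputed = pseudobulk_means(samples, cell_types, imputed)
bulk_adt = pseudobulk_means(samples, cell_types, adt)
groups = sorted(bulk_imputed)

r_single_cell = pearson(imputed, adt)
r_pseudobulk = pearson([bulk_imputed[g] for g in groups],
                       [bulk_adt[g] for g in groups])
print(round(r_single_cell, 3), round(r_pseudobulk, 3))
```

In this toy example the averaging removes most of the per-cell noise, so the pseudobulk coefficient is higher than the single-cell one, which is the direction of the effect reported above.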

Integration of datasets of unmatched samples from different conditions

To extend the utility of this method, we hypothesized that the integration could be used to inform transcriptomic datasets coming from unmatched samples and different conditions. If so, publicly available CyTOF datasets of relevant samples could also help the identification of subpopulations in scRNA-Seq data of independent studies. To evaluate this type of integration, we used a publicly available dataset of twenty-four PBMC samples (hereafter named refPBMC) from eight volunteers collected at three time points: day 0 (immediately before); 3 days and 7 days after the administration of a VSV-vectored human immunodeficiency virus (HIV) vaccine,¹³,²⁸,²⁹ which we integrated with the CyTOF from the COMBAT Consortium¹² (of healthy and COVID-19 samples) (from here on named “Mixed integration”). Consistent with our previous analysis, the integration successfully assigned the correct cell identities to 92.4% of the cells including all the cells which belonged to clusters that could not be resolved using the CyTOF information (erythrocytes, circulating innate lymphoid cells (ILC), HSPCs, platelets) and 92.7% of the cells excluding these populations of cells. The ARI between our predicted clusters and the published clusters (based on the WNN integration of RNA and ADT¹³) was 0.87, further confirming a good agreement between the annotations.

To evaluate if the main source of variability between the modalities is the immunological landscape of different individuals or the variety of conditions that these samples are coming from (COVID-19 infection vs HIV vaccination), we repeated the integration a) on the samples from day 0 (unvaccinated subjects only) with the CyTOF samples of healthy controls from COMBAT (in the following analysis named “Day 0 integration”) b) all the twenty-four PBMC samples with the CyTOF samples of healthy controls from COMBAT (in the following analysis named “refPBMC Mixed and CyTOF HC integration”). Qualitatively the three integrations are very distinct, where in the mixed integration the populations of B cells and monocytes form islands where the cells are coming from one modality only, whereas there is a much better overlap of cells in the unvaccinated and healthy controls (Day 0) integration for most major cell types (Figure 3A). This observation suggests that the variability observed in the integration of the mixed datasets is the result of biological variation between infection and vaccination effects, and not the result of the integration of different samples. Interestingly, most of the predicted annotations were similarly accurate in all three integrations apart from the plasmablast (PB) cluster (Figure 3B and Supplementary Figure 7, Extended data⁴²). This subpopulation had an F1 score of 0.10 in the Mixed integration, whereas for the Day 0 integration the F1 was 0.83 and for the refPBMC Mixed and CyTOF HC integration it was 0.68. Ahern et al. showed an expanded PB population in COVID-19 patients indicating that abnormal frequencies in the reference population could create artefacts in the projected populations. 
In the Mixed integration, many monocytes were erroneously predicted to be PBs and had very low prediction scores (Supplementary Figure 8, Extended data⁴²) highlighting that on integrations of unmatched datasets and different conditions, applying a prediction score threshold may be necessary to mitigate the effect of unbalanced population frequencies due to disease. Comparing the PAGA embeddings (Supplementary Figure 9), the Day 0 integration (COMBAT HC and unvaccinated refPBMC) shows the PB separating well and appearing the closest to the B cells, whereas in the Mixed integration (all COMBAT and refPBMC) the plasmablasts appear closer to the monocyte populations, highlighting the inaccuracies of the integration for this subpopulation. Contrary to what we would expect, for the Mixed integration we observed an equal representation of the misclassified PB on all samples, independent of day of collection (pre/post-vaccination) (Supplementary Figure 7, Extended data⁴²) providing further support that the reference population was the cause of the inflated population of PBs.
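A prediction score threshold of the kind discussed above can be applied as a simple post-processing step on the transferred labels. The sketch below is illustrative only: the function name, the "unassigned" placeholder and the toy scores are assumptions, and whether the cutoff is inclusive is a choice made here rather than one stated in the text.

```python
def apply_score_threshold(labels, scores, threshold=0.6, unassigned="unassigned"):
    """Keep a predicted label only if its prediction score reaches the threshold.

    Returns the filtered labels and the fraction of cells that keep a label.
    """
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equally long")
    filtered = [label if score >= threshold else unassigned
                for label, score in zip(labels, scores)]
    kept = sum(label != unassigned for label in filtered)
    return filtered, kept / len(labels)


# Hypothetical predictions: a low-scoring "PB" call is dropped.
predicted = ["PB", "CD14 Mono", "B", "PB"]
prediction_scores = [0.35, 0.90, 0.80, 0.95]

filtered, fraction_kept = apply_score_threshold(predicted, prediction_scores)
print(filtered, fraction_kept)
```

The returned fraction makes the trade-off explicit: a stricter threshold removes more low-confidence calls but leaves fewer cells with a prediction.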

Figure 3. Integration of unmatched datasets can provide accurate annotations and imputed markers.

A. UMAP of the integrated dataset coloured according to cell type (left) and split by modality (right). Red cells originate from the CyTOF dataset, whereas blue cells from the scRNA-Seq dataset. The 1st row shows the mixed integration of the COMBAT CyTOF with the reference PBMC dataset (unvaccinated and vaccinated individuals); the 2nd row the Day 0 integration of the COMBAT CyTOF healthy samples with the refPBMC unvaccinated (healthy) samples only and the 3rd row the refPBMC Mixed-CyTOF HC integration of the COMBAT CyTOF healthy samples with all the refPBMC samples (unvaccinated and vaccinated individuals).

B. F1 score comparison for the Day 0 integration (dark blue); the Mixed integration (dark green) and the refPBMC Mixed-CyTOF HC integration (light green). All the integration scores are highly concordant for most cell types apart from the plasmablasts (PB).

C. Scatterplot comparing the correlation coefficients between the imputed CyTOF and the ADT for each gene-protein pair in the COMBAT integration (x axes) versus the same correlations for the unmatched dataset (refPBMC Mixed-CyTOF HC integration) (y axes). The colour of the dot (and protein name) reflects the strength of the correlation of the RNA with the ADT values in the COMBAT dataset (red for poor correlation, orange for medium and green for strong correlation) and the size of the dot reflects the correlation of the RNA with the ADT values in the refPBMC dataset.

Having demonstrated the integration’s ability to accurately annotate the cell populations, we next imputed the CyTOF markers on the refPBMC dataset (refPBMC Mixed and CyTOF HC integration) to assess the granularity at which these integrations can disentangle the underlying biological variation for unmatched datasets. To quantitatively validate the imputations, we calculated the correlations of the imputed protein intensities to the “ground truth” ADT measurements of the reference PBMC dataset. We noted a remarkable similarity between the correlations of the imputed markers with the refPBMC-ADT in the unmatched scenario and their respective correlations with the COMBAT-ADT in the matched integration (Figure 3C). Surprisingly, a number of imputed markers even outperformed their counterparts in the COMBAT integration, such as CD99, CD161 and CD14, suggesting differences in the efficiency of the staining of those antibodies between the two ADT datasets.

Finally, we proceeded to assess whether the batch correction of the RNA dataset influences the ability of the method to annotate rare subpopulations accurately. We compared the ARI and F1 scores of the “refPBMC Mixed and CyTOF HC integration” (in which the batch correction had been done on all samples together) with two alternative scenarios: a) of the COMBAT CyTOF healthy control samples being integrated separately with each individual sample of the refPBMC (per sample integration) and b) of the COMBAT CyTOF healthy control samples being integrated with each condition (Day 0/3/7) of the refPBMC separately (per condition integration) (Table 2). In each case the predicted labels of the per sample (or per condition) integration were concatenated to compare them to the published annotations. The ARIs from the three scenarios were very similar (ranging from 0.863 to 0.879) and showed little variation within each sample (Supplementary Figure 10). However, the F1 scores of the DCs and the rare PBs showed considerable variation, ranging from 0.594 to 0.899 for the DCs and from 0.569 to 0.829 for the PBs. For both populations the “per sample” integration had the lowest F1 score, indicating the reduced power to detect rare or low frequency populations for small cell numbers. Interestingly, for PBs, the “per condition” integration showed a marked improvement in F1 score compared to the integration with all conditions together (0.829 vs 0.677). Comparing the correlations of the imputed CyTOF with the ADT values in the three scenarios, all markers apart from CD127 show an equal or higher correlation when all samples are integrated together compared to the per sample and per condition integrations (Supplementary Figure 11). These results suggest no significant improvement of the integration when performed on a per sample or per condition basis, compared to the integration with the batch corrected dataset of all conditions together.

Table 2. Table of ARI and F1 scores of refPBMC integration scenarios.

Table with the ARI and F1 scores for the three scenarios of integration: the CyTOF HC dataset integrated with each sample of the refPBMCs separately; the CyTOF HC with each condition (Day 0/3/7) of the refPBMC samples; and the CyTOF HC with all refPBMC samples together (refPBMC Mixed and CyTOF HC integration). The results reflect the ARI and F1 scores of the concatenated labels of all the samples together, not the average between all samples/conditions.

Metric	Cell type	Per sample	Per condition	Across all biological conditions
ARI		0.863	0.879	0.871
F1	B	0.998	0.999	0.999
	CD14 Mono	0.956	0.964	0.958
	CD16 Mono	0.915	0.876	0.842
	CD4	0.946	0.946	0.946
	CD8	0.884	0.890	0.886
	DC	0.594	0.871	0.899
	GDT	0.333	0.308	0.239
	MAIT	0.693	0.881	0.879
	NK	0.946	0.973	0.972
	PB	0.569	0.829	0.677

Characterisation of a rare B cell subpopulation of CD11c+ cells in COVID-19

One of the most interesting applications of the integration of proteomic with transcriptomic datasets is the ability to use the former to identify subpopulations of interest and then characterise them in the latter, coupling phenotypic definitions to transcriptomic datasets. CD11c+ B cells are a heterogeneous subset of B cells (usually defined as CD19+, CD21- and CD27-) which is present at low frequency in healthy individuals³⁰,³¹ but is increased in several conditions including malaria, HIV and autoimmune diseases.³¹–³⁵ These cells have assumed many names in the literature, including ‘atypical (memory) B cells’ (or atBC), ‘age-associated B cells’ (ABCs), ‘memory precursors’ and ‘exhausted memory cells’,³¹–³³,³⁶ and more recently have been subdivided into DN2 (double negative 2/extrafollicular ASC precursor B cells) and activated naïve cells,³² but there is no general consensus as to their definition. In the COMBAT Consortium, the authors reported subpopulations of the CD11c+ B cells that were significantly more abundant in critical COVID-19 patients compared to healthy volunteers.¹² Atypical B cells along with DN2 and activated naïve CD11c+ cells have also been shown to be increased in COVID-19,³⁷–³⁹ but their transcriptional heterogeneity after COVID-19 infection has not been explored. We therefore set out to find and transcriptionally characterise these cells in the scRNA-Seq dataset.

Using only the B cell subpopulations from both COMBAT CyTOF and scRNA-Seq datasets (excluding the plasmablasts), we were able to integrate the two modalities using as common features the markers that were selected for the B cell subclustering analysis from the COMBAT Consortium.¹² In order to validate the results of the integration, we then downloaded the RNA and CITE-Seq data from an additional dataset from Su et al. of healthy and COVID-19 samples¹⁸ and repeated the integration using the Su scRNA-Seq and the COMBAT CyTOF data. Even though both scRNA-Seq datasets had a low gene expression of CD11c, both independent integrations revealed a strong presence of CD11c+ subpopulations as defined by the imputed protein marker, which was also validated using the ADT values of the two studies (Figure 4A).

Figure 4. Transcriptional characterization of CD11c+ B cells.

A. UMAP of the COMBAT scRNA-Seq (1st row) and Su et al. scRNA-Seq (2nd row) datasets showing the RNA expression of CD11c (1st column) versus the ADT (2nd column) and the imputed CyTOF (mass cytometry) (3rd column) validating the imputed CD11c predictions.

B. Venn diagram of the differential gene expression of CD11c+ vs CD11c- B cells for the COMBAT and Su et al. datasets. The average logFCs of the differentially expressed genes in both datasets (430 genes) are plotted as a scatterplot.

C. GSEA of differentially expressed CD11c+ genes from COMBAT and Su et al. datasets with two published gene list signatures of atypical memory B cells (MBCs) vs Naïve and vs Classical B cells of malaria infected individuals (Portugal et al.) and of CD11c+ B cells vs CD11c- from healthy donors (Golinski et al.) showing significant enrichment of these gene sets.

Due to the low frequencies of the CD11c+ subpopulations, the naïve-like CD11c+ and the switched memory CD11c+ subpopulation could only be found in a fraction of the samples in both COMBAT and Su et al. datasets (Supplementary Figure 12, Extended data⁴²). Therefore, we used the imputed values to identify the CD11c+ population of interest (details in the Methods section). Differential gene expression of the CD11c+ cells uncovered a transcriptional signature that had 430 genes in common between the two datasets (99 downregulated and 331 upregulated), all of which had the same direction of effects (Figure 4B). Moreover, we performed a gene set enrichment analysis (GSEA) of the two lists of differentially expressed genes from the COMBAT and Su et al. datasets with two published signatures of CD11c+ cells from healthy controls³⁰ and from atypical BCs of malaria patients.³⁴ Both show a significant enrichment for the CD11c+/atBC signatures (NES>2 and p-values<10^-17), further confirming the identity of those cells (Figure 4C).
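The overlap and direction-of-effect check between two differential expression results can be sketched in a few lines of Python. The gene identifiers and logFC values below are placeholders, not results from the COMBAT or Su et al. datasets, and the function name is an assumption.

```python
def shared_signature(de_a, de_b):
    """Compare two sets of significant genes given as {gene: logFC} dicts.

    Returns the shared genes split into concordantly upregulated,
    concordantly downregulated and discordant groups.
    """
    shared = set(de_a) & set(de_b)
    up = sorted(g for g in shared if de_a[g] > 0 and de_b[g] > 0)
    down = sorted(g for g in shared if de_a[g] < 0 and de_b[g] < 0)
    discordant = sorted(shared - set(up) - set(down))
    return {"up": up, "down": down, "discordant": discordant}


# Placeholder results for CD11c+ vs CD11c- B cells in two datasets.
dataset_1 = {"GENE_A": 1.2, "GENE_B": 0.8, "GENE_C": -0.9, "GENE_D": 0.5}
dataset_2 = {"GENE_A": 0.9, "GENE_B": 1.1, "GENE_C": -0.4, "GENE_E": -0.7}

signature = shared_signature(dataset_1, dataset_2)
print({group: len(genes) for group, genes in signature.items()})
```

A signature in which the discordant group is empty, as reported above for the 430 shared genes, indicates that the two datasets agree on the direction of every shared effect.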

Amongst the genes with the largest logFC in both studies were FGR, CIB1, EMP3, MPP6 and RHOB (Figure 4B), genes that have been previously associated with the DN2 phenotype.¹⁷ Moreover, DN2 cells are known to express high levels of inhibitory receptors, such as those belonging to the family of Fc-receptor-like (FCRL) molecules,¹⁷,⁴⁰ consistent with the upregulation of FCRL3 and FCRL5 in the CD11c+ subpopulation. TCL1A, on the other hand, is a marker of transitional and naïve cells, which is known to be absent in the other B cell populations.¹⁷ T-bet (TBX21) is a transcription factor that has been shown to be expressed in CD11c+ B cells, although not all CD11c+ B cells are T-bet+.³⁰ In the COMBAT dataset TBX21 (T-bet) was filtered out due to low expression (expressed in less than 10% of the cells in both subsets) and could not be tested for differential expression between the two subsets. In contrast, TBX21 was significantly differentially expressed between the CD11c+ and CD11c- populations in the Su et al. dataset with a logFC of 0.51 (adjusted p-value<10^-14). Additionally, many known transcriptional markers correlating with T-bet expression in B cells, such as ACTB, ALOX5AP, GSTP1 and LAPTM5,¹⁷ were enriched in CD11c+ cells for both datasets (Supplementary Figure 13, Extended data⁴²).

To further explore the transcriptional heterogeneity of the CD11c+ population in COVID-19 patients, we integrated the CyTOF dataset with B cell populations of a meta-analysis of 12 scRNA-Seq datasets of COVID-19 studies.¹⁶ The analysis recovered all three CD11c+ subpopulations that had been found in the CyTOF COMBAT dataset. It identified 153 naïve CD11c+ (activated naïve cells/naïve-like CD11c+ B cells, defined as CD27-, IgD+, CD11c+); 400 DN CD11c+ (DN2, defined as CD27-, IgD-, CD11c+) and a less well characterised population of 440 switched memory CD11c+ cells (CD27+, IgD-, CD11c+ cells). All three subpopulations had a clear signature of genes associating with the CD11c+ phenotype, such as a higher expression of CD19 and CD20 (MS4A1) as well as CD11c+ hallmark cell surface receptors CD84, CD68, CD86³⁰; inhibitory receptors CD72, FCGR2B, FCRL3, FCRL5, LILRB1³³ and transcription factors TBX21, ZEB2 and ZBTB32³⁴,³⁵ (Supplementary Figure 14, Extended data⁴²).
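The phenotypic definitions of the three CD11c+ subpopulations given above translate into a simple gating rule on the imputed protein values. The sketch below is a hypothetical illustration: the single positivity cutoff of 2.0 applied to all three markers is an assumption (the actual gating is described in the Methods section), as are the function name and the example intensities.

```python
def gate_cd11c_subset(cd11c, cd27, igd, threshold=2.0):
    """Assign a B cell to a CD11c+ subset from imputed marker intensities.

    The shared threshold is an assumed, illustrative cutoff; in practice
    each marker would need its own gate.
    """
    if cd11c <= threshold:
        return "CD11c-"
    cd27_pos = cd27 > threshold
    igd_pos = igd > threshold
    if not cd27_pos and igd_pos:
        return "naive-like CD11c+"        # CD27-, IgD+, CD11c+
    if not cd27_pos and not igd_pos:
        return "DN CD11c+"                # CD27-, IgD-, CD11c+ (DN2)
    if cd27_pos and not igd_pos:
        return "switched memory CD11c+"   # CD27+, IgD-, CD11c+
    return "other CD11c+"                 # CD27+, IgD+: not defined in the text


# Hypothetical imputed intensities (CD11c, CD27, IgD) for four B cells.
cells = [(3.1, 0.4, 2.9), (2.8, 0.3, 0.5), (3.4, 2.6, 0.2), (0.6, 0.2, 3.0)]
print([gate_cd11c_subset(*cell) for cell in cells])
```

Cells that are CD27+ and IgD+ fall outside the three definitions quoted in the text, so the sketch reports them separately instead of forcing them into one of the named subsets.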

Pairwise analysis of the 3 subpopulations uncovered 91 differentially expressed genes between the DN CD11c+ cells and the switched memory CD11c+; 44 between the naïve-like CD11c+ and the switched memory CD11c+ (Figure 5A) and a very similar transcriptional profile for the DN2 and the naïve-like CD11c+ cells (no significant differentially expressed genes between the two), in line with the phenotypic similarities of these two populations that have been previously reported.³² The similarity between naïve-like and DN2 CD11c+ cells, and the differences between naïve-like CD11c+ cells and the naïve (CD11c-) cells (529 differentially expressed genes - data not shown), indicates that these naïve-like CD11c+ (or activated naïve) cells are transcriptionally closer to the DN2 than they are to the naïve counterparts. This suggests that these might not be naïve cells, but rather a small subset of CD11c+ cells sharing a limited set of phenotypic markers with classical naïve cells.

Figure 5. Differential gene expression between 3 rare CD11c+ subpopulations of B cells.

A. Volcano plots of DGE between Naïve-like CD11c+ vs Switched memory CD11c+ (left) and DN CD11c+ and Switched memory CD11c+ cells (right) in the meta-analysis of the 12 COVID-19 scRNA-Seq datasets. Significant genes shown in red.

B. Dotplot showing the top genes from A. for the 3 predicted subpopulations of CD11c+ cells in the dataset of Su et al. The genes with an asterisk were also shown to be significant in the comparison of Naïve-like CD11c+ cells vs Switched memory CD11c+ (p-adjusted<0.05).

The differentially expressed genes between the three subsets uncovered a subset specific expression of CXCR3, which was upregulated in the switched memory subset of CD11c+ cells (Figure 5A). CXCR3, an inflammatory chemokine receptor, has been shown to be upregulated in CD11c+ cells in multiple studies,³¹,³⁸,⁴⁰,⁴¹ with heterogeneous expression amongst these cells,³⁸,⁴⁰ but had not previously been associated with a specific subset of CD11c+ cells. Similarly, MPP6, a transcriptional marker of T-bet+ (TBX21) cells,³⁴ was downregulated in switched memory CD11c+ cells, providing further evidence for the observed heterogeneity of CD11c+ cells in terms of expression of TBX21.³² Seven of the differentially expressed genes between the naïve-like CD11c+ cells and the switched memory CD11c+ cells (CD27, LTB, TSC22D3, ACTG1, FCRL3, FCRL5 and KLF2) were also independently validated in the smaller dataset of Su et al. with significant adjusted p-values (Figure 5B) and many other important markers such as CXCR3 and MPP6 had the same direction of effects.

To our knowledge, this is the first study to transcriptionally characterise these very rare subpopulations of cells after COVID-19 infection by leveraging information from two independent single cell technologies and using only publicly available datasets. Importantly, this subpopulation of CD11c+ cells has never been identified without prior sorting of B cells. These results demonstrate that the multimodal integration substantially improves our ability to resolve cell states, allowing us to identify and characterise previously unreported B cell subpopulations.

Discussion

In the last decade, single cell methodologies have transformed all fields of biology. Mass cytometry and single cell RNA-Seq have been very widely used in the past, but the integration of the two modalities has not been widely explored so far. In this work, we show the unique advantages that can be gained from formally integrating these two modalities. We demonstrate that CyTOF datasets can be used to annotate and produce high quality imputed proteomics values for both matched and unmatched scRNA datasets. More importantly, we prove that the imputed values can be used to help define rare subpopulations in order to transcriptionally characterise them in depth using publicly available datasets.

The integration of modalities using imputed or projected values can give us unprecedented power to further our understanding of the transcriptional and proteomic immunological landscapes of health and disease and provide a wealth of interesting hypotheses to be tested, but caution is warranted when using this methodology without validation. We observed great potential in integrating well defined populations of cells, such as CD4 and CD8 T cells, but in cases where the common markers were not sufficient to separate the cellular heterogeneity, the agreement between the projected cellular annotations and the real ones was suboptimal (such as GDT cells; Figure 1B-D). Fundamentally, this integration will always depend on the quality and quantity of the common CyTOF-RNA features.

An additional limitation of this methodology is that the frequencies of rare subpopulations between different conditions cannot be estimated with high accuracy for medium size scRNA-Seq studies (Supplementary Figure 12, Extended data⁴²). Populations that were less than 1% of the B cells had a high proportion of samples for which no cells could be salvaged for those subpopulations. However, we demonstrated that meta-analyses of RNA datasets can bypass this constraint and managed to unravel the transcriptional heterogeneity of these populations. Exploiting this methodology with the abundance of scRNA-Seq data from the Human Cell Atlas¹ could provide endless opportunities to harness the differences between rare subpopulations in health and disease.

We should note here that the integration of CyTOF with scRNA-Seq datasets is not intended to substitute the invaluable information gained from multi-modal technologies such as CITE-Seq; measurements of antibody derived tags will inherently be more accurate than imputed values. However, there are instances where mass cytometry can provide imputed measures which would not be available in CITE-Seq, such as when rare subpopulations of cells are of interest, as we demonstrate in Figures 4 and 5. Additionally, intracellular protein modifications of cells, which are not possible to measure in CITE-Seq, could be projected to the RNA datasets, in a similar manner to our imputation of non-overlapping features CD45RA and CD45RO to identify epigenetic states of the cells. Furthermore, extending the integration of scRNA-Seq datasets with Imaging Mass Cytometry (IMC) datasets could facilitate new and exciting discoveries of spatial interactions of healthy and disease states.

In this article, we have successfully identified and characterised a previously elusive subpopulation of CD11c+ B cells. To our knowledge, this is the first study showing that imputed protein values from a CyTOF and scRNA-Seq integration can lead to an in-depth transcriptional characterisation of a heterogeneous rare subpopulation of cells. Even though the frequencies of these subpopulations vary between datasets, we show that the transcriptional signature that defines those cells can be robustly identified with this integration. To date, there are only a handful of scRNA-Seq experiments that try to describe this CD11c+ population but all are very limited in the numbers of samples and cells used and none of them has managed to isolate all three subpopulations of CD11c+ cells.¹⁷,³⁵,⁴⁰,⁴¹ In addition, in these studies the subpopulations of interest had to be sorted prior to sequencing to enrich for rare subtypes. Cell sorting, however, is a laborious process that can introduce bias in the process of selection of cells, especially in the case of less well-defined subpopulations, as is the case with CD11c+ subpopulations, explaining why none of these studies was able to identify all three subpopulations. Managing to couple phenotype-based studies with single cell transcriptomics in an unbiased way can give us unprecedented insights into the cellular machinery of all sections of biology.

Data availability

Source data

The Ahern et al.¹² data was obtained from https://doi.org/10.5281/zenodo.5139560.

The human PBMC atlas datasets¹³ were obtained from the GEO database accession GSE164378 and from https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat.

The Tian et al.¹⁶ data was obtained from https://atlas.fredhutch.org/fredhutch/covid/.

Extended data

Zenodo: Supplementary Figures for Repapi et al. https://doi.org/10.5281/zenodo.7236116.⁴²

This project contains the following extended data:

- supplementary-figures_Repapi_et_al.pdf (PDF file containing Supplementary Figures 1–14)
- supplementary-table_Repapi_et_al.pdf (PDF file containing Supplementary Table 1)

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Analysis code

Analysis code available from: https://github.com/emmanuelaaaaa/CyTOF_scRNA_integration

Archived analysis code at time of publication: https://doi.org/10.5281/zenodo.6546982 ⁴³

License: MIT

Acknowledgements

We thank Fabiola Curion from the Institute of Computational Biology, Helmholtz Center Munich in Germany for her critical support throughout this project and for her expert advice on the manuscript.

References

1. Regev A, et al.: The human cell atlas. eLife. 2017; 6.
2. Aldridge S, Teichmann SA: Single cell transcriptomics comes of age. Nat. Commun. 2020; 11: 1–4.
3. Lähnemann D, et al.: Eleven grand challenges in single-cell data science. Genome Biol. 2020; 21: 53.
4. Buettner F, et al.: Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 2015; 33: 155–160.
5. Bodenmiller B, et al.: Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators. Nat. Biotechnol. 2012; 30: 858–867.
6. Damond N, et al.: A Map of Human Type 1 Diabetes Progression by Imaging Mass Cytometry. Cell Metab. 2019; 29: 755–768.e5.
7. Kashima Y, et al.: Potentiality of multiple modalities for single-cell analyses to evaluate the tumor microenvironment in clinical specimens. Sci. Rep. 2021; 11: 1–11.
8. Reimegård J, et al.: A combined approach for single-cell mRNA and intracellular protein expression analysis. Commun. Biol. 2021; 4: 624.
9. Labib M, Kelley SO: Single-cell analysis targeting the proteome. Nat. Rev. Chem. 2020; 4: 143–158.
10. Levy E, Slavov N: Single cell protein analysis for systems biology. Essays Biochem. 2018; 62: 595–605.
11. Adossa N, Khan S, Rytkönen KT, et al.: Computational strategies for single-cell multi-omics integration. Comput. Struct. Biotechnol. J. 2021; 19: 2588–2596.
12. Ahern DJ, et al.: A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell. 2022; 185: 916–938.e58.
13. Hao Y, et al.: Integrated analysis of multimodal single-cell data. Cell. 2021; 184: 3573–3587.e29.
14. Stuart T, et al.: Comprehensive Integration of Single-Cell Data. Cell. 2019; 177: 1888–1902.e21.
15. Mulè MP, Martins AJ, Tsang JS: Normalizing and denoising protein expression data from droplet-based single cell profiling. Nat. Commun. 2022; 13: 1–12.
16. Tian Y, et al.: Single-cell immunology of SARS-CoV-2 infection. Nat. Biotechnol. 2021; 40: 30–41.
17. Stewart A, et al.: Single-Cell Transcriptomic Analyses Define Distinct Peripheral B Cell Subsets and Discrete Development Pathways. Front. Immunol. 2021; 12: 743.
18. Su Y, et al.: Multi-Omics Resolves a Sharp Disease-State Shift between Mild and Moderate COVID-19. Cell. 2020; 183: 1479–1495.e20.
19. McInnes L, Healy J, Melville J: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.
20. Wolf FA, Hamey FK, Plass M, et al.: PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 2019; 20(1): 1–9.
21. Wolf FA, Angerer P, Theis FJ: SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 2018; 19(1): 1–5.
22. Hubert L, Arabie P: Comparing partitions. J. Classif. 1985; 2(1): 193–218.
23. Scrucca L, Fop M, Murphy TB, et al.: Mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016; 8: 289–317.
24. McCarthy DJ, Campbell KR, Lun ATL, et al.: Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017; 33(8): 1179–1186.
25. Yu G, Wang LG, Han Y, et al.: clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012; 16: 284–287.
26. Mereu E, et al.: Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 2020; 38: 747–755.
27. Ding J, et al.: Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat. Biotechnol. 2020; 38: 737–746.
28. Elizaga ML, et al.: Safety and tolerability of HIV-1 multiantigen pDNA vaccine given with IL-12 plasmid DNA via electroporation, boosted with a recombinant vesicular stomatitis virus HIV Gag vaccine in healthy volunteers in a randomized, controlled clinical trial. PLoS One. 2018; 13: e0202753.
29. Li SS, et al.: DNA priming increases frequency of T-cell responses to a vesicular stomatitis virus HIV vaccine with specific enhancement of CD8 T-cell responses by interleukin-12 plasmid DNA. Clin. Vaccine Immunol. 2017; 24.
30. Golinski ML, et al.: CD11c+ B Cells Are Mainly Memory Cells, Precursors of Antibody Secreting Cells in Healthy Donors. Front. Immunol. 2020; 11: 32.
31. Karnell JL, et al.: Role of CD11c+ T-bet+ B cells in human health and disease. Cell. Immunol. 2017; 321: 40–45.
32. Sanz I, et al.: Challenges and opportunities for consistent classification of human B cell and plasma cell populations. Front. Immunol. 2019; 10: 2458.
33. Portugal S, et al.: Malaria-associated atypical memory B cells exhibit markedly reduced B cell receptor signaling and effector function. eLife. 2015; 4.
34. Jenks SA, et al.: Distinct Effector B Cells Induced by Unregulated Toll-like Receptor 7 Contribute to Pathogenic Responses in Systemic Lupus Erythematosus. Immunity. 2018; 49: 725–739.e6.
35. Holla P, et al.: Shared transcriptional profiles of atypical B cells suggest common drivers of expansion and function in malaria, HIV, and autoimmunity. Sci. Adv. 2021; 7: 8384–8410.
36. Schulte-Schrepping J, et al.: Severe COVID-19 Is Marked by a Dysregulated Myeloid Cell Compartment. Cell. 2020; 182: 1419–1440.e23.
37. Oliviero B, et al.: Expansion of atypical memory B cells is a prominent feature of COVID-19. Cell. Mol. Immunol. 2020; 17: 1101–1103.
38. Wildner NH, et al.: B cell analysis in SARS-CoV-2 versus malaria: Increased frequencies of plasmablasts and atypical memory B cells in COVID-19. J. Leukoc. Biol. 2021; 109: 77–90.
39. Woodruff MC, et al.: Extrafollicular B cell responses correlate with neutralizing antibodies and morbidity in COVID-19. Nat. Immunol. 2020; 21: 1506–1516.
40. Sutton HJ, et al.: Atypical B cells are part of an alternative lineage of B cells that participates in responses to vaccination and infection in humans. Cell Rep. 2021; 34: 108684.
41. He B, et al.: Rapid isolation and immune profiling of SARS-CoV-2 specific memory B cell in convalescent COVID-19 patients via LIBRA-seq. Signal Transduct. Target. Ther. 2021; 6: 1–12.
42. Repapi E, Agarwal D, Napolitani G, et al.: Supplementary Figures and Table for Repapi et al. 2022. [Dataset]. 2022.
43. Repapi E, Agarwal D: emmanuelaaaaa/CyTOF_scRNA_integration: v2.0 (v2.0). Zenodo. [Analysis code]. 2022.

Competing interests

No competing interests were disclosed.

Grant information

This work was supported by UKRI (MR/S005471/1; a Rutherford Fund Fellowship to E.R.) and Wellcome (209235, https://doi.org/10.35802/209235; a Collaborative Award to D.S.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (3)

version 3

Revised

Published: 02 May 2023, 11:560

https://doi.org/10.12688/f1000research.121829.3

version 2

Revised

Published: 04 Nov 2022, 11:560

https://doi.org/10.12688/f1000research.121829.2

version 1

Published: 23 May 2022, 11:560

https://doi.org/10.12688/f1000research.121829.1

© 2022 Repapi E et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


how to cite this article

Repapi E, Agarwal D, Napolitani G et al. Integration of single-cell RNA-Seq and CyTOF data characterises heterogeneity of rare cell subpopulations [version 2; peer review: 1 approved, 1 approved with reservations]. F1000Research 2022, 11:560 (https://doi.org/10.12688/f1000research.121829.2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.


Open Peer Review

Version 2


Reviewer Report 26 Jan 2023

Xiang Chen, Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.140163.r158770

While I agree with Dr. Andrews that the authors demonstrated that the existing software could be used to effectively impute the relative protein levels using the scRNA-seq data, I have several major concerns regarding the aims and the experimental design ...

1) What is the value of this approach in real data analysis? Single cell analyses, such as clustering and pseudotemporal analysis of scRNA-seq data, rarely rely on a small set of known markers. Instead, the whole transcriptome (or at least the top variably expressed genes) is used in the analysis. Although the relative expression of known biomarkers is useful in annotation, there are many existing algorithms to annotate the cell type based on the single cell transcriptome. Therefore, what is the additional value of imputed protein levels of a limited set of biomarkers to a given scRNA-seq dataset? If the purpose is to validate the annotation, I don’t think an imputed biomarker expression is sufficient to replace the actual protein-level measurement, given the widely available FACS sorting as well as newly developed single cell analysis platforms. The authors did provide an example of characterizing the CD11c+ B cells in the manuscript. However, the authors’ approach (using an arbitrary threshold to separate the B cells into CD11c+ and CD11c– cells) is very different from their claim in the abstract that they “identify and transcriptionally characterize a rare subpopulation of CD11c positive B cells”. Specifically, they have not established that these CD11c strong cells form a unique rare subpopulation using either scRNA-seq data or their imputed protein levels. If this subset of B cells does form a unique rare subpopulation, is it possible to identify it using standard clustering of the scRNA-seq data only? Anyway, Figure 4A did suggest a high level of concordance between the RNA level and measured protein level for CD11c and I am not sure what the imputed protein level added to the identification of such a rare subpopulation.
2) The experimental design needs justification. This manuscript applied standard approaches to existing datasets and did not run any experimental validation of the findings. Therefore, an appropriate experimental design is critical to justify the findings. For example, although the authors claimed that the published annotations were retained as the gold standard, there are apparently more cell types reported in the COMBAT Cell paper than were included in this manuscript. Why did the authors decide to merge several clusters (i.e., different B cells, different CD4+ T, different CD8+ T) into a single identity while not merging others (i.e., CD14+ monocyte and CD16+ monocyte)? If the existing imputation approach only works at a coarser level of subpopulation finding than existing scRNA-seq data can achieve, why do we need the imputed data?
3) The selection of algorithm parameters needs justification. Similarly, the selection of the parameters is important for scRNA-seq analysis. Please describe the rationale behind the selected parameters (if they differ from the defaults). For example, why were different k.weight values used for the COMBAT and the COVID-19 scRNA-seq data? If the parameters were selected to give the best F1/ARI score on the datasets reported here, is it possible that they overfit these specific datasets and are not generalizable?

- Is the work clearly and accurately presented and does it cite the current literature? Yes
- Is the study design appropriate and is the work technically sound? Partly
- Are sufficient details of methods and analysis provided to allow replication by others? Partly
- If applicable, is the statistical analysis and its interpretation appropriate? Partly
- Are all the source data underlying the results available to ensure full reproducibility? Yes
- Are the conclusions drawn adequately supported by the results? Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Cancer multi-omics, single cell analysis, computational method development, machine learning

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


Author Response 02 May 2023

Emmanouela Repapi, Medical Research Council Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK


Firstly, we would like to thank you for your detailed review of our manuscript and your helpful suggestions for improving it. Below are our responses to the individual comments.

1) We apologise that the aims of this methodology may not have been clearly outlined in the manuscript. The integration of CyTOF and scRNA-Seq can provide an affordable alternative to a unified view across modalities within the same cells, with similar advantages to the integration of proteomic and transcriptomic modalities of single cells which have been developed over the last few years (such as CITE-Seq, PLAYR, REAP-seq, ECCITE-Seq and SPARC) (Frei et al., 2016; Hao et al., 2021; Mimitou et al., 2019; Peterson et al., 2017; Reimegård et al., 2021). More specifically, in our manuscript we demonstrate the value of our approach in real data analysis in two ways. Firstly, by aiding the annotation of populations that are difficult to disentangle from RNA data alone (such as CD4 populations). Secondly, by identifying rare subpopulations that have not been studied in such a detail that a transcriptomic signature exists, such as the subpopulations of the CD11c positive B cells, namely the naïve-like CD11c+; the switched memory CD11c+; and the DN CD11c+. As a result, we are consolidating the annotations between the two modalities that are currently largely disjoint. This is done by transferring the annotations of the CD11c+ subpopulations that are ‘protein driven’ (based on the markers CD11c, CD27 and IgD) on the scRNA-Seq dataset and then identifying their transcriptional signature, with the caveat, however, that these signatures have not been experimentally validated. In this way the integration of modalities can lead to the understanding of the causal relationships between -omics data, a point that has been noted as an important milestone in cellular biology that can have implications in medical diagnostics and treatments (Colomé-Tatché & Theis, 2018). We have amended the Introduction of our manuscript to clearly state our aims. 
Even though using reference datasets can contribute to the annotation of well-established populations (like CD4), the remaining outcomes are not something that existing algorithms for annotating cell types from single cell transcriptomes can achieve. We have amended the Discussion to highlight these important differences from the already available algorithms for annotation based on single cell transcriptomes. We also agree with the Reviewer that the integration of CyTOF with scRNA-Seq datasets is not intended to substitute for the invaluable information gained from actual protein level measurements; instead, it is aimed at generating hypotheses that can then be validated using actual protein level measurements. We have revised the Discussion of the manuscript to emphasise this critical point.

The Reviewer also commented on the threshold that was used for selecting the positive CD11c cells. The thinking behind using this specific threshold for annotating the CD11c+ cells was that the distribution of the imputed values for CD11c followed the CyTOF distribution of that marker since the imputation of the markers is a weighted average of the original values and therefore, we could use this information to select the threshold. We appreciate, nonetheless, that using this rather arbitrary threshold on the imputed CyTOF may not be a convincing enough strategy for selecting the CD11c+ B cells. For this reason, we proceeded by validating the annotation of this population firstly by showing an overlap with the measured ADT levels (Figure 4A); secondly by showing a common transcriptional signature between two independent integrations of the COMBAT and the Su et al. datasets using the same threshold (Figure 4B) and thirdly by overlaying this transcriptional signature with published datasets using a gene set enrichment analysis (GSEA) and showing a very significant enrichment (Figure 4C). We would like to highlight here that the published signatures of CD11c+ cells were the result of sorting the CD11c+ populations in order to enrich for these rare phenotypes and therefore can be considered as the ‘ground truth’ of this transcriptional signature.
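The weighted-average argument above can be illustrated with a minimal sketch. It is not the Seurat implementation: the weights, the CD11c intensities, and the threshold below are all hypothetical values chosen for illustration. The point it shows is that a weighted average can never leave the range of the reference (CyTOF) values, which is why a cut-off read from the CyTOF distribution remains meaningful on the imputed values.

```python
def impute_marker(weights, reference_values):
    """Impute one marker per query (scRNA-Seq) cell as a weighted
    average of the values measured in reference (CyTOF) cells."""
    imputed = []
    for row in weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("each query cell needs a positive total weight")
        weighted_sum = sum(w * v for w, v in zip(row, reference_values))
        imputed.append(weighted_sum / total)
    return imputed


# Hypothetical CD11c intensities for four CyTOF cells (arbitrary units).
cytof_cd11c = [0.1, 0.2, 3.0, 3.4]

# Hypothetical weights linking three scRNA-Seq cells to the CyTOF cells.
weights = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]

imputed = impute_marker(weights, cytof_cd11c)

# Hypothetical cut-off chosen on the CyTOF distribution and reused as-is.
threshold = 1.5
is_positive = [value > threshold for value in imputed]
```

Because every imputed value lies between the smallest and largest CyTOF value, the same threshold can be applied to both, although cells with mixed neighbourhoods (the third cell here) land at intermediate values and are the ones most sensitive to where the cut-off is placed.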

Although we observed concordance between the expression levels and the measured protein levels (ADT) of CD11c in B cells (Figure 4A), the expression levels of CD11c were found to be very low and therefore unable to separate these subpopulations. To address this more directly, we have included Supplementary Figure 13, which shows the difference in expression between CD11c+ and CD11c- cells: a large proportion of the cells predicted to be CD11c+ have no expression of CD11c (ITGAX). In more detail, only 48.4% of the predicted CD11c+ cells had any CD11c expression at all (expression levels greater than 0) in the COMBAT dataset, and only 40.4% in the Su et al. dataset. We have amended our Results to include these comparisons.
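The percentages quoted above are the fraction of predicted CD11c+ cells with non-zero ITGAX expression. A minimal sketch of that calculation follows; the counts and predictions are made-up toy values, not data from the COMBAT or Su et al. datasets, and the function name is hypothetical.

```python
def fraction_expressing(expression, predicted_positive):
    """Fraction of predicted-positive cells whose expression is > 0."""
    selected = [x for x, positive in zip(expression, predicted_positive) if positive]
    if not selected:
        raise ValueError("no cells were predicted positive")
    return sum(1 for x in selected if x > 0) / len(selected)


# Toy ITGAX counts for eight cells (hypothetical values).
itgax_counts = [0, 2, 0, 1, 0, 0, 3, 0]
# Toy CD11c+ calls derived from imputed protein values (hypothetical).
predicted_cd11c_pos = [True, True, True, True, False, False, True, False]

frac = fraction_expressing(itgax_counts, predicted_cd11c_pos)
```

A value well below 1 means that many cells called CD11c+ from the imputed protein would be missed by any rule based on ITGAX transcript detection alone.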

Furthermore, we would like to clarify that this B cell subpopulation was not identified in the original COMBAT paper from scRNA-Seq data only (Ahern et al., 2022). Even though it was noted that subpopulations of the CD11c+ B cells in mass cytometry were found to be significantly increased in community and convalescent COVID-19 samples, this was not verified from the transcriptional point of view, most likely due to the differences in annotation between the mass cytometry and scRNA-Seq datasets. This further highlights the need to consolidate the annotations between modalities to facilitate bridging the gap between different approaches of cellular biology.

Finally, we would like to emphasise here that we do not postulate that the CD11c+ cells form a homogeneous rare subpopulation. On the contrary, we are claiming that even though this subpopulation has a transcriptional signature which separates it from the remaining B cells (CD11c- cells) (Figure 4B), it is very likely that there is further heterogeneity within the CD11c+ cells. This has been explored in Figure 5 but further validation is necessary to confirm this hypothesis.

2) The Reviewer is correct in noting that in this manuscript there is no experimental validation of the findings. Instead, for the first two sections of the Results, we used: a) the published annotations as the gold standard to compare to the transferred labels and b) datasets with measured ADT protein levels, which we compared to the imputed protein levels, to validate the recommended methodology. Both these comparisons confirmed high agreement with the ‘ground truth’ results in most cases. However, there are instances where our methodology did not perform well, for example in identifying the GDT cells (in all scenarios) and the PB cells (in the unmatched scenario). These cases are examined at length in our Results, and the caveats of this methodology are discussed in the Discussion.

The Reviewer is also correct in that there were more COMBAT cell types than the ones included in this manuscript. The merging of the cell types was based on the granularity of the cell types used in Supplementary Figure 1 (A-B) of the Ahern et al. paper, in which the ADT is integrated with the CyTOF to validate the cell composition between the two datasets. The intention of this manuscript is to provide an affordable alternative to the costly ADT measurements and therefore we show an equivalent integration of the RNA with the CyTOF datasets, using the same groupings. Similarly, for the Human PBMC Atlas datasets, the annotations were merged to match the COMBAT ones, to have comparable results in terms of the ARI and F1 metrics. We realise that this was not explained properly in the manuscript, which was a clear omission on our part. Therefore, we have now added this explanation to the Methods section.

Finally, in the Discussion we highlight that the integration is heavily dependent on the quality and quantity of the common CyTOF-RNA features. If there are no available markers to separate specific subpopulations in the CyTOF dataset or if the efficiency of the markers is suboptimal or if there is no RNA-CyTOF corresponding pair, then these subpopulations will not be identifiable in the integration either. This is specifically discussed for the case of the GDT cells in which the TCRgd antibody was added in the same channel as IgD, and thus there was no one to one correspondence between the genes and the protein channels anymore. This caused a suboptimal annotation of the GDT populations. We have changed the Discussion to clarify these cases where the integration would be suboptimal. Nonetheless, in the manuscript we demonstrate that when there is a good availability of markers, the subpopulations of interest can also be found for more detailed annotations, as we have done with the CD11c+ B cell subpopulation. With this example we have shown that it is possible to achieve good integration in finer subpopulations and that the imputation is also possible for more detailed population definitions.

3) We thank the Reviewer for highlighting this important detail of the manuscript. We completely agree that the selection of parameters is often a critical step in assessing the validity of a methodology. This point was also raised by Dr. Andrews, who highlighted that it would be important to mention the alternative options explored for the dimensionality reduction. To this end, we included Supplementary Table 1 with five alternative options and how they compare to the selected choice. These comparisons were performed using the ARI and F1 scores so the Reviewer has sound reasons for speculating that this might have led to the data being overfitted for this dataset. However, the results were then replicated in a second dataset, the Human PBMC Atlas dataset (Hao et al., 2021), where no comparisons were done between the different dimensionality reduction methods for optimisation. Instead, we used the same choice of dimensionality reduction and demonstrated a high agreement of the predicted annotations with the published ‘ground truth’ ones (ARI 0.87).
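For readers unfamiliar with the two agreement metrics mentioned above, the following is a minimal, standard-library sketch of the adjusted Rand index (following the Hubert and Arabie formulation) and a per-label F1 score. It is an illustration only, not the code used in the manuscript, and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pairs of cells that share a cluster in both, in truth, in prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


def f1_for_label(labels_true, labels_pred, label):
    """F1 score for one cell type, treating it as the positive class."""
    pairs = list(zip(labels_true, labels_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical published annotations and transferred labels for six cells.
published = ["B", "B", "B", "T", "T", "T"]
transferred = ["B", "B", "T", "T", "T", "T"]

ari = adjusted_rand_index(published, transferred)
f1_b = f1_for_label(published, transferred, "B")
```

The ARI is invariant to renaming clusters and is corrected for chance agreement, whereas F1 is computed per cell type, so the two together give a global and a per-population view of label transfer quality.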

In agreement with the Reviewer’s comments, we tried to retain the default parameters wherever possible to have generalisable results. Nevertheless, there were two parameters that we decreased for the section ‘Characterisation of a rare B cell subpopulation of CD11c+ cells in COVID-19’. Firstly, it was necessary to decrease the number of dimensions for the CCA because we only had a reduced set of 26 B cell and Plasma markers as common features (using the marker set of 28 proteins that was used to annotate these subpopulations in Ahern et al). The number of dimensions for the CCA needs to be less than the number of features used in the FindTransferAnchors function and, therefore, we reduced that number from 30, which is the default, to 15. Since in this application there is no ground truth in terms of the annotations, this number was not optimised in any way and no ARI or F1 was calculated for these comparisons. Consequently, in this instance there is no risk of overfitting the data. Secondly, we felt that the k.weight parameter needed further tuning for the application of identifying rare subpopulations such as the subpopulations of CD11c+ cells. This was decided based on the discussion on the github repository[1], where it is suggested that the reduction of k.weight can help in the identification of rare cell types. Indeed, for very rare subpopulations, having the default number of neighbours that are being considered when weighting anchors (k.weight=50) can bias the cell annotation predictions from cells with similar transcriptome that, however, come from distinct subpopulations. In line with the suggestion from the package developers, we observed that reducing the k.weight parameter resulted in increasing the number of cells that were predicted to belong to the CD11c+ subpopulations of cells. 
In more detail, for the integration of the COMBAT datasets, we observed that by changing the k.weight from 50 to 30, the numbers of cells increased from 3 to 10 for the Naïve-like CD11c+, from 29 to 58 for the Switched memory CD11c+ and they decreased from 685 to 655 for the DN CD11c+ cells. Moreover, for the integration with Su et al. the numbers of cells increased from 21 to 54 for the Naïve-like CD11c+, from 25 to 42 for the Switched memory CD11c+ and from 452 to 475 for the DN CD11c+ cells. Further reduction of this parameter did not seem to significantly change these predictions. Finally, reducing the k.weight from 50 to 30 for the integration with the 12 COVID-19 scRNA-Seq datasets showed an increase of the subpopulations from 20 to 69 for the Naïve-like CD11c+, from 267 to 332 for the Switched memory CD11c+ and from 346 to 404 for the DN CD11c+ cells. Selecting a k.weight=20, further increased the Naïve-like CD11c+ to 153 and the Switched memory CD11c+ to 440 (the DN CD11c+ cells remained stable at 400), justifying decreasing this parameter further. We appreciate that this line of thought was not clearly explained in the manuscript and we have included a reasoning for the reduced k.weight in the Methods section. We would like to highlight here that this section does not include any ‘ground truth’ from ADT values and therefore the transcriptional characterisation of these subpopulations was not optimised in any way. Additionally, it does not substitute any experimental validation that would need to be performed to better understand these subpopulations. The aim of this section, instead, is to demonstrate how this methodology can aid in understanding heterogeneity in very rare subpopulations and can create interesting hypotheses that can be taken forward for further validation.
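As a worked piece of arithmetic on the cell counts quoted above for k.weight=50 versus k.weight=30, the fold changes can be tabulated as follows. The counts are taken from the text; the dictionary layout and variable names are illustrative only.

```python
# (cells at k.weight=50, cells at k.weight=30), as reported in the text above.
reported_counts = {
    ("COMBAT", "Naive-like CD11c+"): (3, 10),
    ("COMBAT", "Switched memory CD11c+"): (29, 58),
    ("COMBAT", "DN CD11c+"): (685, 655),
    ("Su et al.", "Naive-like CD11c+"): (21, 54),
    ("Su et al.", "Switched memory CD11c+"): (25, 42),
    ("Su et al.", "DN CD11c+"): (452, 475),
    ("12 COVID-19 datasets", "Naive-like CD11c+"): (20, 69),
    ("12 COVID-19 datasets", "Switched memory CD11c+"): (267, 332),
    ("12 COVID-19 datasets", "DN CD11c+"): (346, 404),
}

# Fold change in predicted cell numbers when lowering k.weight from 50 to 30.
fold_change = {
    key: after / before for key, (before, after) in reported_counts.items()
}

for (dataset, subpopulation), fc in fold_change.items():
    print(f"{dataset:22s} {subpopulation:24s} {fc:.2f}x")

n_increased = sum(1 for fc in fold_change.values() if fc > 1)
```

On these numbers the rarest population (Naive-like CD11c+) changes the most in relative terms, while the larger DN CD11c+ population changes by less than 20% in every integration, which is consistent with the stated rationale that a smaller k.weight mainly helps very rare subpopulations.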

References
Ahern, D. J., Ai, Z., Ainsworth, M., Allan, C., Allcock, A., Angus, B., … Zurke, Y.-X. (2022). A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell, 185(5), 916-938.e58. https://doi.org/10.1016/j.cell.2022.01.012
Colomé-Tatché, M., & Theis, F. J. (2018, February 1). Statistical single cell multi-omics integration. Current Opinion in Systems Biology. Elsevier Ltd. https://doi.org/10.1016/j.coisb.2018.01.003
Frei, A. P., Bava, F.-A., Zunder, E. R., Hsieh, E. W. Y., Chen, S.-Y., Nolan, G. P., & Gherardini, P. F. (2016). Highly multiplexed simultaneous detection of RNAs and proteins in single cells. Nature Methods, 13(3), 269–275. https://doi.org/10.1038/nmeth.3742
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck, W. M., Zheng, S., Butler, A., … Satija, R. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13), 3573-3587.e29. https://doi.org/10.1016/j.cell.2021.04.048
Mimitou, E. P., Cheng, A., Montalbano, A., Hao, S., Stoeckius, M., Legut, M., … Smibert, P. (2019). Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nature Methods, 16(5), 409–412. https://doi.org/10.1038/s41592-019-0392-0
Peterson, V. M., Zhang, K. X., Kumar, N., Wong, J., Li, L., Wilson, D. C., … Klappenbach, J. A. (2017). Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology, 35(10), 936–939. https://doi.org/10.1038/nbt.3973
Reimegård, J., Tarbier, M., Danielsson, M., Schuster, J., Baskaran, S., Panagiotou, S., … Gallant, C. J. (2021). A combined approach for single-cell mRNA and intracellular protein expression analysis. Communications Biology, 4(1). https://doi.org/10.1038/S42003-021-02142-W

[1] https://github.com/satijalab/seurat/issues/1636
Firstly we would like to thank you for your detailed review of our manuscript, and your helpful suggestions for improving this manuscript. Below we are the responses to the individual comments.

1) We apologise that the aims of this methodology may not have been clearly outlined in the manuscript. The integration of CyTOF and scRNA-Seq can provide an affordable alternative to a unified view across modalities within the same cells, with advantages similar to those of the single-cell methods integrating proteomic and transcriptomic modalities that have been developed over the last few years (such as CITE-Seq, PLAYR, REAP-seq, ECCITE-Seq and SPARC) (Frei et al., 2016; Hao et al., 2021; Mimitou et al., 2019; Peterson et al., 2017; Reimegård et al., 2021). More specifically, in our manuscript we demonstrate the value of our approach in real data analysis in two ways. Firstly, by aiding the annotation of populations that are difficult to disentangle from RNA data alone (such as CD4 populations). Secondly, by identifying rare subpopulations that have not been studied in enough detail for a transcriptomic signature to exist, such as the subpopulations of the CD11c-positive B cells, namely the naïve-like CD11c+, the switched memory CD11c+ and the DN CD11c+ cells. As a result, we are consolidating the annotations between the two modalities, which are currently largely disjoint. This is done by transferring the ‘protein driven’ annotations of the CD11c+ subpopulations (based on the markers CD11c, CD27 and IgD) onto the scRNA-Seq dataset and then identifying their transcriptional signature, with the caveat, however, that these signatures have not been experimentally validated. In this way the integration of modalities can lead to the understanding of the causal relationships between -omics data, a point that has been noted as an important milestone in cellular biology that can have implications in medical diagnostics and treatments (Colomé-Tatché & Theis, 2018). We have amended the Introduction of our manuscript to clearly state our aims.
Even though using reference datasets can contribute to the annotation of well-established populations (like CD4), the remaining outcomes are not something that the existing algorithms that annotate cell types based on single cell transcriptomes can achieve. We have amended the Discussion to highlight these important differences from the already available algorithms for annotation based on single cell transcriptomes. We also agree with the Reviewer that the integration of CyTOF with scRNA-Seq datasets is not intended to substitute the invaluable information gained from any actual protein level measurements; instead, it is aimed at creating hypotheses that can then be validated using actual protein level measurements. We have revised the Discussion of the manuscript to emphasise this critical point.

The Reviewer also commented on the threshold that was used for selecting the CD11c-positive cells. The thinking behind using this specific threshold for annotating the CD11c+ cells was that the distribution of the imputed values for CD11c followed the CyTOF distribution of that marker, since the imputation of the markers is a weighted average of the original values, and therefore we could use this information to select the threshold. We appreciate, nonetheless, that using this rather arbitrary threshold on the imputed CyTOF values may not be a convincing enough strategy for selecting the CD11c+ B cells. For this reason, we proceeded by validating the annotation of this population: firstly, by showing an overlap with the measured ADT levels (Figure 4A); secondly, by showing a common transcriptional signature between two independent integrations of the COMBAT and the Su et al. datasets using the same threshold (Figure 4B); and thirdly, by overlaying this transcriptional signature with published datasets using a gene set enrichment analysis (GSEA) and showing a very significant enrichment (Figure 4C). We would like to highlight here that the published signatures of CD11c+ cells were the result of sorting the CD11c+ populations in order to enrich for these rare phenotypes and therefore can be considered as the ‘ground truth’ of this transcriptional signature.
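The reasoning about the threshold can be illustrated with a minimal sketch. It assumes only what is stated above, namely that an imputed marker value is a weighted average of measured CyTOF values; the function names, the toy intensities, the weights and the threshold of 1.5 are all hypothetical and are not taken from the manuscript. Because a weighted average with non-negative weights cannot fall outside the range of the values being averaged, the imputed values stay on the CyTOF scale, which is why a cut-off read from the CyTOF distribution can be applied to them.

```python
def impute_marker(weights, reference_values):
    """Impute one marker for one query cell as a weighted average
    of the values measured in the reference (CyTOF) cells."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * v for w, v in zip(weights, reference_values)) / total


def call_positive(imputed_values, threshold):
    """Flag cells whose imputed marker value exceeds the threshold."""
    return [value > threshold for value in imputed_values]


# Hypothetical CD11c intensities for four reference cells (toy numbers).
cytof_cd11c = [0.1, 0.3, 2.8, 3.1]

# Hypothetical non-negative weights linking two query cells to the reference cells.
weights_per_query = [
    [0.7, 0.3, 0.0, 0.0],  # query cell resembling CD11c-low reference cells
    [0.0, 0.1, 0.5, 0.4],  # query cell resembling CD11c-high reference cells
]

imputed = [impute_marker(w, cytof_cd11c) for w in weights_per_query]

# Placeholder threshold chosen on the CyTOF scale; not the value used in the manuscript.
threshold = 1.5
calls = call_positive(imputed, threshold)
print(imputed, calls)
```

In this toy case the first query cell is called negative and the second positive, and both imputed values lie between the smallest and largest reference intensities.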

Although we observed a concordance between the expression levels and the measured protein levels (ADT) of CD11c in B cells (Figure 4A), the expression levels of CD11c were found to be very low and therefore unable to separate these subpopulations. To address this more concretely, we have included Supplementary Figure 13 with the difference in expression between CD11c+ and CD11c- cells, showing that a large proportion of the cells that were predicted to be CD11c+ had no expression of CD11c (ITGAX). In more detail, only 48.4% of the predicted CD11c+ cells had any CD11c expression at all (expression levels greater than 0) in the COMBAT dataset and only 40.4% of the predicted CD11c+ cells had any expression at all in the Su et al. dataset. We have amended our Results to include these comparisons.
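The percentages quoted above are of the following kind: among the cells predicted to be CD11c+, the share whose ITGAX expression is greater than 0. A minimal sketch of that calculation is given here; the function name and the toy values are hypothetical, and the sketch does not reproduce the 48.4% or 40.4% figures.

```python
def fraction_expressing(expression, is_predicted_positive):
    """Fraction of predicted-positive cells with expression greater than 0."""
    selected = [x for x, positive in zip(expression, is_predicted_positive) if positive]
    if not selected:
        raise ValueError("no cells were predicted positive")
    return sum(1 for x in selected if x > 0) / len(selected)


# Toy ITGAX expression values and predicted CD11c+ calls for six cells.
itgax = [0.0, 1.2, 0.0, 0.4, 0.0, 2.0]
predicted_cd11c_pos = [True, True, True, True, False, False]

share = fraction_expressing(itgax, predicted_cd11c_pos)
print(f"{share:.1%} of predicted CD11c+ cells have ITGAX expression > 0")
```

A low share of this kind indicates that thresholding the transcript alone would miss many of the cells that the integration assigns to the CD11c+ population.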

Furthermore, we would like to clarify that this B cell subpopulation was not identified in the original COMBAT paper from scRNA-Seq data only (Ahern et al., 2022). Even though it was noted that subpopulations of the CD11c+ B cells in mass cytometry were found to be significantly increased in community and convalescent COVID-19 samples, this was not verified from the transcriptional point of view, most likely due to the differences in annotation between the mass cytometry and scRNA-Seq datasets. This further highlights the need to consolidate the annotations between modalities to facilitate bridging the gap between different approaches of cellular biology.

Finally, we would like to emphasise here that we do not postulate that the CD11c+ cells form a homogeneous rare subpopulation. On the contrary, we are claiming that even though this subpopulation has a transcriptional signature which separates it from the remaining B cells (CD11c- cells) (Figure 4B), it is very likely that there is further heterogeneity within the CD11c+ cells. This has been explored in Figure 5 but further validation is necessary to confirm this hypothesis.

2) The Reviewer is correct in noting that in this manuscript there is no experimental validation of the findings. Instead, for the first two sections of the Results, we used: a) the published annotations as the gold standard to compare to the transferred labels and b) datasets with measured ADT protein levels, which we compared to the imputed protein levels, to validate the recommended methodology. Both these comparisons confirmed high agreement with the ‘ground truth’ results in most cases. However, there are instances where our methodology did not perform well, for example in identifying the GDT cells (in all scenarios) and the PB cells (in the unmatched scenario). These cases are examined at length in our Results and the caveats of this methodology are covered in the Discussion.

The Reviewer is also correct in that there were more COMBAT cell types than the ones included in this manuscript. The merging of the cell types was based on the granularity of the cell types used in Supplementary Figure 1 (A-B) of the Ahern et al. paper, in which the ADT is integrated with the CyTOF to validate the cell composition between the two datasets. The intention of this manuscript is to provide an affordable alternative to the costly ADT measurements and therefore we show an equivalent integration between the RNA and the CyTOF datasets, using the same groupings. Similarly, for the Human PBMC Atlas datasets, the annotations were merged to match the COMBAT ones, to have comparable results in terms of the ARI and F1 metrics. We realise that this is not explained properly in the manuscript, which was a clear omission on our part. Therefore, we have now added this explanation in the Methods section.

Finally, in the Discussion we highlight that the integration is heavily dependent on the quality and quantity of the common CyTOF-RNA features. If there are no available markers to separate specific subpopulations in the CyTOF dataset, or if the efficiency of the markers is suboptimal, or if there is no corresponding RNA-CyTOF pair, then these subpopulations will not be identifiable in the integration either. This is specifically discussed for the case of the GDT cells, in which the TCRgd antibody was added in the same channel as IgD, and thus there was no longer a one-to-one correspondence between the genes and the protein channels. This caused a suboptimal annotation of the GDT populations. We have changed the Discussion to clarify these cases where the integration would be suboptimal. Nonetheless, in the manuscript we demonstrate that when there is a good availability of markers, the subpopulations of interest can also be found for more detailed annotations, as we have done with the CD11c+ B cell subpopulation. With this example we have shown that it is possible to achieve good integration in finer subpopulations and that the imputation is also possible for more detailed population definitions.

3) We thank the Reviewer for highlighting this important detail of the manuscript. We completely agree that the selection of parameters is often a critical step in assessing the validity of a methodology. This point was also raised by Dr. Andrews, who highlighted that it would be important to mention the alternative options explored for the dimensionality reduction. To this end, we included Supplementary Table 1 with five alternative options and how they compare to the selected choice. These comparisons were performed using the ARI and F1 scores, so the Reviewer has sound reasons for speculating that this might have led to the data being overfitted for this dataset. However, the results were then replicated in a second dataset, the Human PBMC Atlas dataset (Hao et al., 2021), where no comparisons were made between the different dimensionality reduction methods for optimisation. Instead, we used the same choice of dimensionality reduction and demonstrated a high agreement of the predicted annotations with the published ‘ground truth’ ones (ARI 0.87).
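For readers unfamiliar with the two agreement metrics mentioned above, a minimal standard-library sketch of the adjusted Rand index (ARI) and of a per-label F1 score is shown here. The labels are toy values, and how the manuscript aggregates F1 across cell types is not stated in this response, so the per-label form is an assumption made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(truth)
    if n != len(predicted) or n < 2:
        raise ValueError("need two labelings of equal length with at least 2 cells")
    # Pairs of cells that fall together in both labelings, in truth, and in predicted.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = together_truth * together_pred / comb(n, 2)
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (maximum - expected)


def f1_for_label(truth, predicted, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(truth, predicted))
    fp = sum(t != label and p == label for t, p in zip(truth, predicted))
    fn = sum(t == label and p != label for t, p in zip(truth, predicted))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy published annotations versus transferred labels for six cells.
published = ["B", "B", "T", "T", "NK", "NK"]
transferred = ["B", "B", "T", "NK", "NK", "NK"]

ari = adjusted_rand_index(published, transferred)
f1_t = f1_for_label(published, transferred, "T")
print(round(ari, 3), round(f1_t, 3))
```

An ARI of 1 corresponds to identical partitions and values near 0 to chance-level agreement, which is the scale on which the reported ARI of 0.87 should be read.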

In agreement with the Reviewer’s comments, we tried to retain the default parameters wherever possible to have generalisable results. Nevertheless, there were two parameters that we decreased for the section ‘Characterisation of a rare B cell subpopulation of CD11c+ cells in COVID-19’. Firstly, it was necessary to decrease the number of dimensions for the CCA because we only had a reduced set of 26 B cell and Plasma markers as common features (using the marker set of 28 proteins that was used to annotate these subpopulations in Ahern et al.). The number of dimensions for the CCA needs to be less than the number of features used in the FindTransferAnchors function and, therefore, we reduced that number from 30, which is the default, to 15. Since in this application there is no ground truth in terms of the annotations, this number was not optimised in any way and no ARI or F1 was calculated for these comparisons. Consequently, in this instance there is no risk of overfitting the data. Secondly, we felt that the k.weight parameter needed further tuning for the application of identifying rare subpopulations such as the subpopulations of CD11c+ cells. This was decided based on the discussion on the GitHub repository[1], where it is suggested that the reduction of k.weight can help in the identification of rare cell types. Indeed, for very rare subpopulations, having the default number of neighbours that are considered when weighting anchors (k.weight=50) can bias the cell annotation predictions by drawing on cells with a similar transcriptome that, however, come from distinct subpopulations. In line with the suggestion from the package developers, we observed that reducing the k.weight parameter resulted in increasing the number of cells that were predicted to belong to the CD11c+ subpopulations of cells.
In more detail, for the integration of the COMBAT datasets, we observed that by changing the k.weight from 50 to 30, the numbers of cells increased from 3 to 10 for the Naïve-like CD11c+ and from 29 to 58 for the Switched memory CD11c+, and they decreased from 685 to 655 for the DN CD11c+ cells. Moreover, for the integration with Su et al. the numbers of cells increased from 21 to 54 for the Naïve-like CD11c+, from 25 to 42 for the Switched memory CD11c+ and from 452 to 475 for the DN CD11c+ cells. Further reduction of this parameter did not seem to significantly change these predictions. Finally, reducing the k.weight from 50 to 30 for the integration with the 12 COVID-19 scRNA-Seq datasets showed an increase of the subpopulations from 20 to 69 for the Naïve-like CD11c+, from 267 to 332 for the Switched memory CD11c+ and from 346 to 404 for the DN CD11c+ cells. Selecting k.weight=20 further increased the Naïve-like CD11c+ to 153 and the Switched memory CD11c+ to 440 (the DN CD11c+ cells remained stable at 400), justifying decreasing this parameter further. We appreciate that this line of thought was not clearly explained in the manuscript and we have included a reasoning for the reduced k.weight in the Methods section. We would like to highlight here that this section does not include any ‘ground truth’ from ADT values and therefore the transcriptional characterisation of these subpopulations was not optimised in any way. Additionally, it does not substitute any experimental validation that would need to be performed to better understand these subpopulations. The aim of this section, instead, is to demonstrate how this methodology can aid in understanding heterogeneity in very rare subpopulations and can create interesting hypotheses that can be taken forward for further validation.
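The intuition for why a smaller neighbourhood can favour rare populations can be shown with a toy example. The sketch below is not Seurat's anchor-weighting algorithm: it is a simplified weighted nearest-neighbour vote on a one-dimensional axis, with a hypothetical linear kernel and made-up anchor positions, intended only to show how a handful of rare anchors can be outvoted once many more distant anchors from an abundant population are allowed to contribute.

```python
def transfer_label(query, anchors, k):
    """Predict a label for `query` from its k nearest anchors.

    `anchors` is a list of (position, label) pairs on a 1-D toy axis.
    Each of the k nearest anchors votes with weight 1 - d / d_k, where d_k is
    the distance to the k-th nearest anchor (a hypothetical, simplified kernel).
    """
    if not anchors or k < 1:
        raise ValueError("need at least one anchor and k >= 1")
    nearest = sorted(anchors, key=lambda anchor: abs(anchor[0] - query))[:k]
    d_k = abs(nearest[-1][0] - query)
    votes = {}
    for position, label in nearest:
        distance = abs(position - query)
        weight = 1.0 - distance / d_k if d_k > 0 else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


# Three anchors from a rare population close to the query cell,
# and sixty anchors from an abundant population further away.
rare = [(0.0, "rare"), (0.1, "rare"), (0.2, "rare")]
common = [(1.0 + 0.05 * i, "common") for i in range(60)]
anchors = rare + common

small_k = transfer_label(0.1, anchors, k=5)
large_k = transfer_label(0.1, anchors, k=50)
print(small_k, large_k)
```

In this toy setting the query cell sits among the rare anchors, yet it is assigned to the abundant population when 50 neighbours vote and to the rare population when only 5 do. The values of k here are illustrative and do not correspond to the k.weight settings of 50, 30 and 20 discussed above.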

References
Ahern, D. J., Ai, Z., Ainsworth, M., Allan, C., Allcock, A., Angus, B., … Zurke, Y.-X. (2022). A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell, 185(5), 916-938.e58. https://doi.org/10.1016/j.cell.2022.01.012
Colomé-Tatché, M., & Theis, F. J. (2018). Statistical single cell multi-omics integration. Current Opinion in Systems Biology. https://doi.org/10.1016/j.coisb.2018.01.003
Frei, A. P., Bava, F.-A., Zunder, E. R., Hsieh, E. W. Y., Chen, S.-Y., Nolan, G. P., & Gherardini, P. F. (2016). Highly multiplexed simultaneous detection of RNAs and proteins in single cells. Nature Methods, 13(3), 269–275. https://doi.org/10.1038/nmeth.3742
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck, W. M., Zheng, S., Butler, A., … Satija, R. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13), 3573-3587.e29. https://doi.org/10.1016/j.cell.2021.04.048
Mimitou, E. P., Cheng, A., Montalbano, A., Hao, S., Stoeckius, M., Legut, M., … Smibert, P. (2019). Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nature Methods, 16(5), 409–412. https://doi.org/10.1038/s41592-019-0392-0
Peterson, V. M., Zhang, K. X., Kumar, N., Wong, J., Li, L., Wilson, D. C., … Klappenbach, J. A. (2017). Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology, 35(10), 936–939. https://doi.org/10.1038/nbt.3973
Reimegård, J., Tarbier, M., Danielsson, M., Schuster, J., Baskaran, S., Panagiotou, S., … Gallant, C. J. (2021). A combined approach for single-cell mRNA and intracellular protein expression analysis. Communications Biology, 4(1). https://doi.org/10.1038/s42003-021-02142-w

[1] https://github.com/satijalab/seurat/issues/1636
Competing Interests: No competing interests to disclose.

COMMENTS ON THIS REPORT

Author Response 02 May 2023

Emmanouela Repapi, Medical Research Council Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK

02 May 2023

Author Response

Firstly we would like to thank you for your detailed review of our manuscript, and your helpful suggestions for improving this manuscript. Below we are the responses to the individual ... Continue reading Firstly we would like to thank you for your detailed review of our manuscript, and your helpful suggestions for improving this manuscript. Below we are the responses to the individual comments.

1) We apologise that the aims of this methodology may not have been clearly outlined in the manuscript. The integration of CyTOF and scRNA-Seq can provide an affordable alternative to a unified view across modalities within the same cells, with similar advantages to the integration of proteomic and transcriptomic modalities of single cells which have been developed over the last few years (such as CITE-Seq, PLAYR, REAP-seq, ECCITE-Seq and SPARC) (Frei et al., 2016; Hao et al., 2021; Mimitou et al., 2019; Peterson et al., 2017; Reimegård et al., 2021). More specifically, in our manuscript we demonstrate the value of our approach in real data analysis in two ways. Firstly, by aiding the annotation of populations that are difficult to disentangle from RNA data alone (such as CD4 populations). Secondly, by identifying rare subpopulations that have not been studied in such a detail that a transcriptomic signature exists, such as the subpopulations of the CD11c positive B cells, namely the naïve-like CD11c+; the switched memory CD11c+; and the DN CD11c+. As a result, we are consolidating the annotations between the two modalities that are currently largely disjoint. This is done by transferring the annotations of the CD11c+ subpopulations that are ‘protein driven’ (based on the markers CD11c, CD27 and IgD) on the scRNA-Seq dataset and then identifying their transcriptional signature, with the caveat, however, that these signatures have not been experimentally validated. In this way the integration of modalities can lead to the understanding of the causal relationships between -omics data, a point that has been noted as an important milestone in cellular biology that can have implications in medical diagnostics and treatments (Colomé-Tatché & Theis, 2018). We have amended the Introduction of our manuscript to clearly state our aims. 
Even though using reference datasets can contribute to the annotation of well-established populations (like CD4), the remaining outcomes are not something that the existing algorithms that annotate cell types based on single cell transcriptomes can do. We have amended the Discussion to highlight these important differences to the already available algorithms for annotation based on single cell transcriptomes. We also agree with the Reviewer that the integration of CyTOF with scRNA-Seq datasets is not intended to substitute the invaluable information gained from any actual protein level measurements, instead it is aimed at creating hypotheses that can then be validated using actual protein level measurements. We have revised the Discussion of the manuscript to emphasise this critical point.

The Reviewer also commented on the threshold that was used for selecting the positive CD11c cells. The thinking behind using this specific threshold for annotating the CD11c+ cells was that the distribution of the imputed values for CD11c followed the CyTOF distribution of that marker since the imputation of the markers is a weighted average of the original values and therefore, we could use this information to select the threshold. We appreciate, nonetheless, that using this rather arbitrary threshold on the imputed CyTOF may not be a convincing enough strategy for selecting the CD11c+ B cells. For this reason, we proceeded by validating the annotation of this population firstly by showing an overlap with the measured ADT levels (Figure 4A); secondly by showing a common transcriptional signature between two independent integrations of the COMBAT and the Su et al. datasets using the same threshold (Figure 4B) and thirdly by overlaying this transcriptional signature with published datasets using a gene set enrichment analysis (GSEA) and showing a very significant enrichment (Figure 4C). We would like to highlight here that the published signatures of CD11c+ cells were the result of sorting the CD11c+ populations in order to enrich for these rare phenotypes and therefore can be considered as the ‘ground truth’ of this transcriptional signature.

Although we observed a concordance in the expression levels and the measured protein levels (ADT) of CD11c in B cells (Figure 4A), the expression levels of CD11c were found to be very low and therefore incapable to separate these subpopulations. To address this in a more concise manner, we have included Supplementary Figure 13 with the difference in expression between CD11c+ and CD11c- cells showing a large proportion of the cells that were predicted to be CD11c+ to have no expression of CD11c (ITGAX). In more detail, only 48.4% of the predicted CD11c+ cells had any CD11c expression at all (expression levels greater than 0) in the COMBAT dataset and only 40.4% of the predicted CD11c+ cells had any expression at all in the Su et al. dataset. We have amended our Results to include these comparisons.

Furthermore, we would like to clarify that this B cell subpopulation was not identified in the original COMBAT paper from scRNA-Seq data only (Ahern et al., 2022). Even though it was noted that subpopulations of the CD11c+ B cells in mass cytometry were found to be significantly increased in community and convalescent COVID-19 samples, this was not verified from the transcriptional point of view, most likely due to the differences in annotation between the mass cytometry and scRNA-Seq datasets. This further highlights the need to consolidate the annotations between modalities to facilitate bridging the gap between different approaches of cellular biology.

Finally, we would like to emphasise here that we do not postulate that the CD11c+ cells form a homogeneous rare subpopulation. On the contrary, we are claiming that even though this subpopulation has a transcriptional signature which separates it from the remaining B cells (CD11c- cells) (Figure 4B), it is very likely that there is further heterogeneity within the CD11c+ cells. This has been explored in Figure 5 but further validation is necessary to confirm this hypothesis.

2) The Reviewer is correct in noting that in this manuscript there is no experimental validation of the findings. Instead, for the first two sections of the Results, we used: a) the published annotations as golden standard to compare to the transferred labels and b) datasets with measured ADT protein levels which we compared to the imputed protein levels, to validate the recommended methodology. Both these comparisons confirmed high agreement to the ‘ground truth’ results in most cases. However, there are instances where our methodology did not perform well, for example in identifying the GDT cells (in all scenarios) and the PB cells (in the unmatched scenario). These cases are being examined at length in our Results and the caveats of this methodology are being discussed in the Discussion.

The Reviewer is also correct in that there were more COMBAT cell types than the ones included in this manuscript. The reasoning behind the merging of the celltypes was based on the granularity of the celltypes used in the Supplementary Figure 1 (A-B) of the Ahern et al. paper in which the ADT is integrated with the CyTOF to validate the cell composition between the two datasets. The intension of this manuscript is to provide an affordable alternative to the costly ADT measurements and therefore we show an equivalent integration between the RNA with the CyTOF datasets, using the same groupings. Similarly, for the Human PBMC Atlas datasets, the annotations were merged to match the COMBAT ones, to have comparable results in terms of the ARI and F1 metrics. We realise that this is not explained properly in the manuscript which was a clear omission on our part. Therefore, we have now added this explanation in the Methods section.

Finally, in the Discussion we highlight that the integration is heavily dependent on the quality and quantity of the common CyTOF-RNA features. If there are no available markers to separate specific subpopulations in the CyTOF dataset or if the efficiency of the markers is suboptimal or if there is no RNA-CyTOF corresponding pair, then these subpopulations will not be identifiable in the integration either. This is specifically discussed for the case of the GDT cells in which the TCRgd antibody was added in the same channel as IgD, and thus there was no one to one correspondence between the genes and the protein channels anymore. This caused a suboptimal annotation of the GDT populations. We have changed the Discussion to clarify these cases where the integration would be suboptimal. Nonetheless, in the manuscript we demonstrate that when there is a good availability of markers, the subpopulations of interest can also be found for more detailed annotations, as we have done with the CD11c+ B cell subpopulation. With this example we have shown that it is possible to achieve good integration in finer subpopulations and that the imputation is also possible for more detailed population definitions.

3) We thank the Reviewer for highlighting this important detail of the manuscript. We completely agree that the selection of parameters is often a critical step in assessing the validity of a methodology. This point was also raised by Dr. Andrews, who highlighted that it would be important to mention the alternative options explored for the dimensionality reduction. To this end, we included Supplementary Table 1 with five alternative options and how they compare to the selected choice. These comparisons were performed using the ARI and F1 scores so the Reviewer has sound reasons for speculating that this might have led to the data being overfitted for this dataset. However, the results were then replicated in a second dataset, the Human PBMC Atlas dataset (Hao et al., 2021), where no comparisons were done between the different dimensionality reduction methods for optimisation. Instead, we used the same choice of dimensionality reduction and demonstrated a high agreement of the predicted annotations with the published ‘ground truth’ ones (ARI 0.87).

In agreement with the Reviewer’s comments, we tried to retain the default parameters wherever possible to have generalisable results. Nevertheless, there were two parameters that we decreased for the section ‘Characterisation of a rare B cell subpopulation of CD11c+ cells in COVID-19’. Firstly, it was necessary to decrease the number of dimensions for the CCA because we only had a reduced set of 26 B cell and Plasma markers as common features (using the marker set of 28 proteins that was used to annotate these subpopulations in Ahern et al). The number of dimensions for the CCA needs to be less than the number of features used in the FindTransferAnchors function and, therefore, we reduced that number from 30, which is the default, to 15. Since in this application there is no ground truth in terms of the annotations, this number was not optimised in any way and no ARI or F1 was calculated for these comparisons. Consequently, in this instance there is no risk of overfitting the data. Secondly, we felt that the k.weight parameter needed further tuning for the application of identifying rare subpopulations such as the subpopulations of CD11c+ cells. This was decided based on the discussion on the github repository[1], where it is suggested that the reduction of k.weight can help in the identification of rare cell types. Indeed, for very rare subpopulations, having the default number of neighbours that are being considered when weighting anchors (k.weight=50) can bias the cell annotation predictions from cells with similar transcriptome that, however, come from distinct subpopulations. In line with the suggestion from the package developers, we observed that reducing the k.weight parameter resulted in increasing the number of cells that were predicted to belong to the CD11c+ subpopulations of cells. 
In more detail, for the integration of the COMBAT datasets, we observed that by changing the k.weight from 50 to 30, the numbers of cells increased from 3 to 10 for the Naïve-like CD11c+, from 29 to 58 for the Switched memory CD11c+ and they decreased from 685 to 655 for the DN CD11c+ cells. Moreover, for the integration with Su et al. the numbers of cells increased from 21 to 54 for the Naïve-like CD11c+, from 25 to 42 for the Switched memory CD11c+ and from 452 to 475 for the DN CD11c+ cells. Further reduction of this parameter did not seem to significantly change these predictions. Finally, reducing the k.weight from 50 to 30 for the integration with the 12 COVID-19 scRNA-Seq datasets showed an increase of the subpopulations from 20 to 69 for the Naïve-like CD11c+, from 267 to 332 for the Switched memory CD11c+ and from 346 to 404 for the DN CD11c+ cells. Selecting a k.weight=20, further increased the Naïve-like CD11c+ to 153 and the Switched memory CD11c+ to 440 (the DN CD11c+ cells remained stable at 400), justifying decreasing this parameter further. We appreciate that this line of thought was not clearly explained in the manuscript and we have included a reasoning for the reduced k.weight in the Methods section. We would like to highlight here that this section does not include any ‘ground truth’ from ADT values and therefore the transcriptional characterisation of these subpopulations was not optimised in any way. Additionally, it does not substitute any experimental validation that would need to be performed to better understand these subpopulations. The aim of this section, instead, is to demonstrate how this methodology can aid in understanding heterogeneity in very rare subpopulations and can create interesting hypotheses that can be taken forward for further validation.

References
Ahern, D. J., Ai, Z., Ainsworth, M., Allan, C., Allcock, A., Angus, B., … Zurke, Y.-X. (2022). A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell, 185(5), 916-938.e58. https://doi.org/10.1016/J.CELL.2022.01.012/ATTACHMENT/E8167B96-BF2B-4A6E-9D0F-FDF06FB48280/MMC10.PDF
Colomé-Tatché, M., & Theis, F. J. (2018, February 1). Statistical single cell multi-omics integration. Current Opinion in Systems Biology. Elsevier Ltd. https://doi.org/10.1016/j.coisb.2018.01.003
Frei, A. P., Bava, F.-A., Zunder, E. R., Hsieh, E. W. Y., Chen, S.-Y., Nolan, G. P., & Gherardini, P. F. (2016). Highly multiplexed simultaneous detection of RNAs and proteins in single cells. Nature Methods, 13(3), 269–275. https://doi.org/10.1038/nmeth.3742
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck, W. M., Zheng, S., Butler, A., … Satija, R. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13), 3573-3587.e29. https://doi.org/10.1016/j.cell.2021.04.048
Mimitou, E. P., Cheng, A., Montalbano, A., Hao, S., Stoeckius, M., Legut, M., … Smibert, P. (2019). Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nature Methods 2019 16:5, 16(5), 409–412. https://doi.org/10.1038/s41592-019-0392-0
Peterson, V. M., Zhang, K. X., Kumar, N., Wong, J., Li, L., Wilson, D. C., … Klappenbach, J. A. (2017). Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology, 35(10), 936–939. https://doi.org/10.1038/nbt.3973
Reimegård, J., Tarbier, M., Danielsson, M., Schuster, J., Baskaran, S., Panagiotou, S., … Gallant, C. J. (2021). A combined approach for single-cell mRNA and intracellular protein expression analysis. Communications Biology, 4(1). https://doi.org/10.1038/S42003-021-02142-W

[1] https://github.com/satijalab/seurat/issues/1636
Firstly we would like to thank you for your detailed review of our manuscript, and your helpful suggestions for improving this manuscript. Below we are the responses to the individual comments.

1) We apologise that the aims of this methodology may not have been clearly outlined in the manuscript. The integration of CyTOF and scRNA-Seq can provide an affordable alternative to a unified view across modalities within the same cells, with similar advantages to the integration of proteomic and transcriptomic modalities of single cells which have been developed over the last few years (such as CITE-Seq, PLAYR, REAP-seq, ECCITE-Seq and SPARC) (Frei et al., 2016; Hao et al., 2021; Mimitou et al., 2019; Peterson et al., 2017; Reimegård et al., 2021). More specifically, in our manuscript we demonstrate the value of our approach in real data analysis in two ways. Firstly, by aiding the annotation of populations that are difficult to disentangle from RNA data alone (such as CD4 populations). Secondly, by identifying rare subpopulations that have not been studied in such a detail that a transcriptomic signature exists, such as the subpopulations of the CD11c positive B cells, namely the naïve-like CD11c+; the switched memory CD11c+; and the DN CD11c+. As a result, we are consolidating the annotations between the two modalities that are currently largely disjoint. This is done by transferring the annotations of the CD11c+ subpopulations that are ‘protein driven’ (based on the markers CD11c, CD27 and IgD) on the scRNA-Seq dataset and then identifying their transcriptional signature, with the caveat, however, that these signatures have not been experimentally validated. In this way the integration of modalities can lead to the understanding of the causal relationships between -omics data, a point that has been noted as an important milestone in cellular biology that can have implications in medical diagnostics and treatments (Colomé-Tatché & Theis, 2018). We have amended the Introduction of our manuscript to clearly state our aims. 
Even though using reference datasets can contribute to the annotation of well-established populations (like CD4), the remaining outcomes are not something that the existing algorithms that annotate cell types based on single cell transcriptomes can do. We have amended the Discussion to highlight these important differences to the already available algorithms for annotation based on single cell transcriptomes. We also agree with the Reviewer that the integration of CyTOF with scRNA-Seq datasets is not intended to substitute the invaluable information gained from any actual protein level measurements, instead it is aimed at creating hypotheses that can then be validated using actual protein level measurements. We have revised the Discussion of the manuscript to emphasise this critical point.

The Reviewer also commented on the threshold that was used for selecting the CD11c-positive cells. The reasoning behind this specific threshold for annotating the CD11c+ cells was that the distribution of the imputed values for CD11c followed the CyTOF distribution of that marker, since the imputed value of a marker is a weighted average of the original values; we could therefore use this information to select the threshold. We appreciate, nonetheless, that applying this rather arbitrary threshold to the imputed CyTOF values may not be a convincing enough strategy for selecting the CD11c+ B cells. For this reason, we validated the annotation of this population in three ways: firstly, by showing an overlap with the measured ADT levels (Figure 4A); secondly, by showing a common transcriptional signature between two independent integrations of the COMBAT and the Su et al. datasets using the same threshold (Figure 4B); and thirdly, by comparing this transcriptional signature with published datasets using a gene set enrichment analysis (GSEA) and showing a very significant enrichment (Figure 4C). We would like to highlight here that the published signatures of CD11c+ cells were obtained by sorting the CD11c+ populations in order to enrich for these rare phenotypes and can therefore be considered the ‘ground truth’ for this transcriptional signature.
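
The weighted-average argument in the paragraph above can be illustrated with a minimal Python sketch. It is not the Seurat implementation: the function name, the weights, the intensities and the threshold are all hypothetical, chosen only to show that a weighted average of reference values stays within the range of those values, so a gate defined on the CyTOF scale remains meaningful for the imputed values.

```python
def impute_weighted_average(weights, reference_values):
    """Impute one marker per query cell as a weighted average of reference cells.

    weights: one row per query cell, holding non-negative weights over the
             reference cells (hypothetical values, not Seurat anchor weights).
    reference_values: measured intensities of the marker in the reference cells.
    """
    imputed = []
    for row in weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("each query cell needs at least one positive weight")
        imputed.append(sum(w * v for w, v in zip(row, reference_values)) / total)
    return imputed


# Toy reference intensities for one marker (made-up numbers, not study data).
reference_cd11c = [0.1, 0.3, 2.8, 3.5]
# Toy weights linking three query cells to the four reference cells.
toy_weights = [
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.1, 0.5, 0.4],
    [0.25, 0.25, 0.25, 0.25],
]
imputed_cd11c = impute_weighted_average(toy_weights, reference_cd11c)

# A weighted average cannot leave the range spanned by the reference values.
assert all(min(reference_cd11c) <= x <= max(reference_cd11c) for x in imputed_cd11c)

threshold = 2.0  # hypothetical gate, chosen on the reference (CyTOF) scale
predicted_positive = [x >= threshold for x in imputed_cd11c]
```

In this toy case only the second query cell, whose weight sits mostly on high-intensity reference cells, passes the gate.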

Although we observed a concordance between the expression levels and the measured protein levels (ADT) of CD11c in B cells (Figure 4A), the expression levels of CD11c were found to be very low and therefore unable to separate these subpopulations. To address this more directly, we have included Supplementary Figure 13 with the difference in expression between CD11c+ and CD11c- cells, which shows that a large proportion of the cells predicted to be CD11c+ have no expression of CD11c (ITGAX). In more detail, only 48.4% of the predicted CD11c+ cells had any CD11c expression at all (expression levels greater than 0) in the COMBAT dataset and only 40.4% of the predicted CD11c+ cells had any expression at all in the Su et al. dataset. We have amended our Results to include these comparisons.
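
As a hedged illustration of how a percentage of this kind can be computed, the sketch below counts the fraction of predicted-positive cells with non-zero expression of a gene. The function name and the toy vectors are hypothetical; the 48.4% and 40.4% figures above come from the authors' data, not from this example.

```python
def fraction_expressing(expression, is_predicted_positive):
    """Fraction of predicted-positive cells with expression greater than 0."""
    selected = [e for e, flag in zip(expression, is_predicted_positive) if flag]
    if not selected:
        raise ValueError("no cells were predicted positive")
    return sum(1 for e in selected if e > 0) / len(selected)


# Toy per-cell ITGAX counts and toy CD11c+ predictions (made-up values).
itgax_counts = [0, 2, 0, 1, 0, 0, 3, 0]
predicted_cd11c_positive = [True, True, True, True, False, False, False, True]

fraction = fraction_expressing(itgax_counts, predicted_cd11c_positive)
print(f"{fraction:.1%} of predicted CD11c+ cells express ITGAX")
```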

Furthermore, we would like to clarify that this B cell subpopulation was not identified in the original COMBAT paper from scRNA-Seq data only (Ahern et al., 2022). Even though it was noted that subpopulations of the CD11c+ B cells in mass cytometry were found to be significantly increased in community and convalescent COVID-19 samples, this was not verified from the transcriptional point of view, most likely due to the differences in annotation between the mass cytometry and scRNA-Seq datasets. This further highlights the need to consolidate the annotations between modalities to facilitate bridging the gap between different approaches of cellular biology.

Finally, we would like to emphasise here that we do not postulate that the CD11c+ cells form a homogeneous rare subpopulation. On the contrary, we are claiming that even though this subpopulation has a transcriptional signature which separates it from the remaining B cells (CD11c- cells) (Figure 4B), it is very likely that there is further heterogeneity within the CD11c+ cells. This has been explored in Figure 5 but further validation is necessary to confirm this hypothesis.

2) The Reviewer is correct in noting that in this manuscript there is no experimental validation of the findings. Instead, for the first two sections of the Results, we validated the recommended methodology using: a) the published annotations as a gold standard against which to compare the transferred labels and b) datasets with measured ADT protein levels, which we compared to the imputed protein levels. Both these comparisons confirmed high agreement with the ‘ground truth’ results in most cases. However, there are instances where our methodology did not perform well, for example in identifying the GDT cells (in all scenarios) and the PB cells (in the unmatched scenario). These cases are examined at length in our Results and the caveats of this methodology are addressed in the Discussion.

The Reviewer is also correct in that there were more COMBAT cell types than the ones included in this manuscript. The merging of the cell types was based on the granularity of the cell types used in Supplementary Figure 1 (A-B) of the Ahern et al. paper, in which the ADT is integrated with the CyTOF to validate the cell composition between the two datasets. The intention of this manuscript is to provide an affordable alternative to the costly ADT measurements and therefore we show an equivalent integration of the RNA with the CyTOF datasets, using the same groupings. Similarly, for the Human PBMC Atlas datasets, the annotations were merged to match the COMBAT ones, to have comparable results in terms of the ARI and F1 metrics. We realise that this is not explained properly in the manuscript, which was a clear omission on our part. Therefore, we have now added this explanation to the Methods section.

Finally, in the Discussion we highlight that the integration is heavily dependent on the quality and quantity of the common CyTOF-RNA features. If there are no available markers to separate specific subpopulations in the CyTOF dataset, or if the efficiency of the markers is suboptimal, or if there is no corresponding RNA-CyTOF pair, then these subpopulations will not be identifiable in the integration either. This is specifically discussed for the case of the GDT cells, in which the TCRgd antibody was added in the same channel as IgD, so that there was no longer a one-to-one correspondence between the genes and the protein channels. This caused a suboptimal annotation of the GDT populations. We have changed the Discussion to clarify these cases where the integration would be suboptimal. Nonetheless, in the manuscript we demonstrate that when there is a good availability of markers, the subpopulations of interest can also be found for more detailed annotations, as we have done with the CD11c+ B cell subpopulation. With this example we have shown that it is possible to achieve good integration in finer subpopulations and that the imputation is also possible for more detailed population definitions.

3) We thank the Reviewer for highlighting this important detail of the manuscript. We completely agree that the selection of parameters is often a critical step in assessing the validity of a methodology. This point was also raised by Dr. Andrews, who highlighted that it would be important to mention the alternative options explored for the dimensionality reduction. To this end, we included Supplementary Table 1 with five alternative options and how they compare to the selected choice. These comparisons were performed using the ARI and F1 scores so the Reviewer has sound reasons for speculating that this might have led to the data being overfitted for this dataset. However, the results were then replicated in a second dataset, the Human PBMC Atlas dataset (Hao et al., 2021), where no comparisons were done between the different dimensionality reduction methods for optimisation. Instead, we used the same choice of dimensionality reduction and demonstrated a high agreement of the predicted annotations with the published ‘ground truth’ ones (ARI 0.87).

In agreement with the Reviewer’s comments, we tried to retain the default parameters wherever possible to have generalisable results. Nevertheless, there were two parameters that we decreased for the section ‘Characterisation of a rare B cell subpopulation of CD11c+ cells in COVID-19’. Firstly, it was necessary to decrease the number of dimensions for the CCA because we only had a reduced set of 26 B cell and Plasma markers as common features (using the marker set of 28 proteins that was used to annotate these subpopulations in Ahern et al.). The number of dimensions for the CCA needs to be less than the number of features used in the FindTransferAnchors function and, therefore, we reduced that number from 30, which is the default, to 15. Since in this application there is no ground truth in terms of the annotations, this number was not optimised in any way and no ARI or F1 was calculated for these comparisons. Consequently, in this instance there is no risk of overfitting the data. Secondly, we felt that the k.weight parameter needed further tuning for the application of identifying rare subpopulations such as the subpopulations of CD11c+ cells. This was decided based on the discussion on the GitHub repository[1], where it is suggested that reducing k.weight can help in the identification of rare cell types. Indeed, for very rare subpopulations, using the default number of neighbours considered when weighting anchors (k.weight=50) can bias the cell annotation predictions through cells that have a similar transcriptome but come from distinct subpopulations. In line with the suggestion from the package developers, we observed that reducing the k.weight parameter increased the number of cells that were predicted to belong to the CD11c+ subpopulations of cells.
In more detail, for the integration of the COMBAT datasets, we observed that by changing the k.weight from 50 to 30, the numbers of cells increased from 3 to 10 for the Naïve-like CD11c+ and from 29 to 58 for the Switched memory CD11c+, while they decreased from 685 to 655 for the DN CD11c+ cells. Moreover, for the integration with Su et al. the numbers of cells increased from 21 to 54 for the Naïve-like CD11c+, from 25 to 42 for the Switched memory CD11c+ and from 452 to 475 for the DN CD11c+ cells. Further reduction of this parameter did not seem to significantly change these predictions. Finally, reducing the k.weight from 50 to 30 for the integration with the 12 COVID-19 scRNA-Seq datasets showed an increase of the subpopulations from 20 to 69 for the Naïve-like CD11c+, from 267 to 332 for the Switched memory CD11c+ and from 346 to 404 for the DN CD11c+ cells. Selecting a k.weight=20 further increased the Naïve-like CD11c+ to 153 and the Switched memory CD11c+ to 440 (the DN CD11c+ cells remained stable at 400), justifying decreasing this parameter further. We appreciate that this line of thought was not clearly explained in the manuscript and we have included the reasoning for the reduced k.weight in the Methods section. We would like to highlight here that this section does not include any ‘ground truth’ from ADT values and therefore the transcriptional characterisation of these subpopulations was not optimised in any way. Additionally, it does not substitute any experimental validation that would need to be performed to better understand these subpopulations. The aim of this section, instead, is to demonstrate how this methodology can aid in understanding heterogeneity in very rare subpopulations and can generate interesting hypotheses that can be taken forward for further validation.
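
The counts quoted above for k.weight=50 versus k.weight=30 can be tabulated to make the direction and size of each change easier to compare. The sketch below is bookkeeping on the numbers reported in this response only; the dictionary layout and the function name are hypothetical, and it does not rerun any integration.

```python
# Cell counts quoted in the response, as (k.weight=50, k.weight=30) pairs.
reported_counts = {
    "COMBAT": {
        "Naive-like CD11c+": (3, 10),
        "Switched memory CD11c+": (29, 58),
        "DN CD11c+": (685, 655),
    },
    "Su et al.": {
        "Naive-like CD11c+": (21, 54),
        "Switched memory CD11c+": (25, 42),
        "DN CD11c+": (452, 475),
    },
    "12 COVID-19 datasets": {
        "Naive-like CD11c+": (20, 69),
        "Switched memory CD11c+": (267, 332),
        "DN CD11c+": (346, 404),
    },
}


def summarise_changes(counts):
    """Return the absolute difference and ratio of counts for every population."""
    summary = {}
    for dataset, populations in counts.items():
        summary[dataset] = {}
        for population, (at_50, at_30) in populations.items():
            summary[dataset][population] = {
                "difference": at_30 - at_50,
                "ratio": at_30 / at_50,
            }
    return summary


changes = summarise_changes(reported_counts)
for dataset, populations in changes.items():
    for population, stats in populations.items():
        print(f"{dataset:22s} {population:24s} "
              f"{stats['difference']:+5d}  x{stats['ratio']:.2f}")
```

In this tabulation the COMBAT DN CD11c+ population is the only one whose count falls when k.weight is reduced from 50 to 30.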

References
Ahern, D. J., Ai, Z., Ainsworth, M., Allan, C., Allcock, A., Angus, B., … Zurke, Y.-X. (2022). A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell, 185(5), 916-938.e58. https://doi.org/10.1016/j.cell.2022.01.012
Colomé-Tatché, M., & Theis, F. J. (2018). Statistical single cell multi-omics integration. Current Opinion in Systems Biology. https://doi.org/10.1016/j.coisb.2018.01.003
Frei, A. P., Bava, F.-A., Zunder, E. R., Hsieh, E. W. Y., Chen, S.-Y., Nolan, G. P., & Gherardini, P. F. (2016). Highly multiplexed simultaneous detection of RNAs and proteins in single cells. Nature Methods, 13(3), 269–275. https://doi.org/10.1038/nmeth.3742
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck, W. M., Zheng, S., Butler, A., … Satija, R. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13), 3573-3587.e29. https://doi.org/10.1016/j.cell.2021.04.048
Mimitou, E. P., Cheng, A., Montalbano, A., Hao, S., Stoeckius, M., Legut, M., … Smibert, P. (2019). Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nature Methods, 16(5), 409–412. https://doi.org/10.1038/s41592-019-0392-0
Peterson, V. M., Zhang, K. X., Kumar, N., Wong, J., Li, L., Wilson, D. C., … Klappenbach, J. A. (2017). Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology, 35(10), 936–939. https://doi.org/10.1038/nbt.3973
Reimegård, J., Tarbier, M., Danielsson, M., Schuster, J., Baskaran, S., Panagiotou, S., … Gallant, C. J. (2021). A combined approach for single-cell mRNA and intracellular protein expression analysis. Communications Biology, 4(1). https://doi.org/10.1038/s42003-021-02142-w

[1] https://github.com/satijalab/seurat/issues/1636
Competing Interests: No competing interests to disclose.

Reviewer Report 15 Nov 2022

Tallulah Andrews, Western University of Ontario, London, Canada

Approved

https://doi.org/10.5256/f1000research.140163.r155043

I thank the authors for addressing all of my questions and concerns and found the comparison of the ...

VERSION 1

PUBLISHED 23 May 2022

Reviewer Report 01 Sep 2022

Tallulah Andrews, Western University of Ontario, London, Canada

Approved with Reservations

https://doi.org/10.5256/f1000research.133733.r147057

The authors demonstrate the effective imputation of protein-level expression in single-cell RNAseq data through the integration of CyTOF and the query scRNAseq dataset using the Seurat CCA approach. While they do not expand upon or improve existing methodology, they do effectively demonstrate that their approach is relatively accurate and useful for PBMC datasets from multiple conditions by comparing the imputed values to independent protein measurements from CITE-seq / ADT. The work is technically sound and will be of interest to some aspects of the community - particularly single-cell eQTL analyses which extensively employ PBMC data. I have only a few minor questions and concerns remaining:

The authors mention that "Alternative options for the reduction were assessed but were found to be suboptimal (data not shown)."

Which options were explored and how was it determined that they were suboptimal? Did the authors consider alternative integration methods, e.g. LIGER¹?
Annotation and CyTOF intensity values were transferred to the RPCA integrated single-cell object combining many unannotated samples. However, RPCA integration can merge small rare cell populations into bigger clusters particularly if that cell-type is poorly represented / absent from one of the samples. Based on the results for the integration of unmatched samples, it seems that transferring data to individual novel samples rather than integrated maps may be more accurate, could the authors do a systematic comparison of imputing values for un-integrated novel samples vs integrated novel samples for a single biological condition vs integrated novel samples across all biological conditions? This would provide very valuable information for other researchers attempting to apply this approach to their own data.
Re: "For F1, the cells for which there was not an equivalent annotation in both datasets were all grouped as Other."

Were the "Other" cells retained for the ARI & F1 calculations? This could lead to misleading results as one of the known issues with CCA integration is the incorrect merging of cell-types that are present in only one dataset -> thus incorrectly merging Other with Other in this case despite them being two very different cell-types. The authors should ensure ARI and F1 score are calculated after excluding cells labelled as "Other".
Why did the authors calculate F1 scores for Naïve CD4 and Naïve CD8 populations only over T cells? Were there a significant number of non-T cells predicted to be Naïve CD4 / Naïve CD8 cells?
In the caption of Figure 1D, the authors should specify what is represented by the grey dashed line.
In Figure 1D the frequencies of cell-types are dependent on each other since they must sum to 1. Thus is it the case that the outlier from the 1:1 relationship in CD4, CD8, and NK plots represent the same sample or are these three different outlier samples?
Figure 1 and Figure 3 both show populations that appear clearly distinct in the UMAP but that have very low accuracy of prediction (Basophils and PB respectively). This seems incongruous but may be the result of known issues with UMAPs creating misleading visualizations. The authors should consider using an approach such as PAGA² to visualize the relationships between the cell-types more accurately, as this may clarify that the Basophils and PBs are not as distinct as they appear in the UMAP.
Figure 2 could be improved by including plots of imputed CD45RO vs CD45RA, and of each vs the RNA expression of the common gene (PTPRC), to more clearly illustrate the ability of the imputation to determine protein expression from the overall RNA profile of the cells rather than from the respective RNA expression.
The very high agreement between CyTOF and imputed ADT values in the unmatched dataset, suggests the failure to correctly infer PB cell-type annotations is not a consequence of poor integration but of poor PB annotation in the original dataset. Could the authors confirm this by examining the accuracy of the protein expression imputation (e.g. Figure 3D) specifically for the falsely-predicted PB cells, as well as examining the PB-specific markers from the COMBAT dataset within the falsely-predicted PB cells vs the correctly predicted PB cells?

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

References

1. Welch J, Kozareva V, Ferreira A, Vanderburg C, et al.: Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell. 2019; 177 (7): 1873-1887.e17
2. Wolf F, Hamey F, Plass M, Solana J, et al.: PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biology. 2019; 20 (1).

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics, Single-cell RNAseq analysis and tool development.

Author Response 04 Nov 2022

Emmanouela Repapi, Medical Research Council Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK

Thank you for reviewing our manuscript and for your helpful suggestions. Below are point-by-point responses to the individual comments.

We agree with the Reviewer that this needs further explanation and we have added more details in the Methods section along with a Supplementary Table comparing the options tested using the ARI and F1 metrics. We also tried LIGER with both the iNMF (Welch et al., 2019) and the online iNMF (Gao et al., 2021) algorithms. However, as discussed earlier this year in their GitHub repository[1], LIGER does not currently have the option to directly transfer labels from a reference to a query dataset as part of its workflow. The closest equivalent is to cluster the integrated object and assign labels to the clusters via the majority rule from the reference. With this workaround, the integrations did not perform well in producing a clustering in high agreement with the published annotation labels (ARI of 0.546 and 0.345 respectively). As a result, we felt that the comparison with the Seurat pipeline would not be on equal terms and could not be formally included in the paper.
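
The majority-rule workaround described above can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not LIGER code: the function names are hypothetical, cluster assignments are taken as given, and reference cells carry a label while query cells carry `None`.

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(cluster_ids, reference_labels):
    """Map each cluster to the most frequent reference label it contains.

    cluster_ids: cluster assignment for every cell in the integrated object.
    reference_labels: label for reference cells, None for query cells.
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, reference_labels):
        if label is not None:
            votes[cluster][label] += 1
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in votes.items()}


def transfer_labels(cluster_ids, reference_labels):
    """Give every cell the majority label of its cluster (None if no votes)."""
    cluster_to_label = majority_label_per_cluster(cluster_ids, reference_labels)
    return [cluster_to_label.get(cluster) for cluster in cluster_ids]


# Toy integrated object: eight cells, three clusters (made-up values).
toy_clusters = [0, 0, 0, 0, 1, 1, 1, 2]
toy_reference_labels = ["B", "B", "T", None, "T", "T", None, None]

transferred = transfer_labels(toy_clusters, toy_reference_labels)
print(transferred)
```

In the toy data, the single reference "T" cell in cluster 0 is relabelled "B" by the vote, and cluster 2 receives no label because it contains no reference cells. Both effects show how a cluster-level vote is coarser than a direct per-cell label transfer.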

We thank the Reviewer for highlighting this useful exploration for the benefit of future readers. We agree that this will add value to the paper and provide important information for anyone trying to utilise this approach. We have revised the text to include these important comparisons in the Results and added Table 2 with the ARI and F1 metrics comparing them. For most populations the F1 scores remain relatively consistent between the three scenarios, apart from the DC and PB populations, which showed higher variability. For these subpopulations the “per sample” integration had the lowest scores, indicating the reduced power to detect rare or low frequency populations when cell numbers are small. Furthermore, we have added Supplementary Figure 11 to show that the imputations of the CyTOF features are very similar between the three scenarios.

We apologise that these evaluations were not explained clearly. We have revised the Methods section to reflect more accurately that the non-common (Other) cell types were only grouped together for the F1 calculations. The ARI metric compared the projected cell types from the CyTOF to the published cell annotations (including both common and dataset-specific clusters) without any changes to the clusters’ names. Removing the dataset-specific clusters results in higher estimates of ARI in all scenarios of our study, but we felt that including all clusters gives a more representative view of cluster agreement, as it takes into account dataset-specific cells that have been misannotated with one of the common labels. For the F1 measure, the cells from the cluster Other were retained to calculate the precision and sensitivity for each cluster (again to ensure that misclassified cells are accurately represented), but the F1 measure was not calculated for this cluster as it was a mix of multiple clusters. Similarly to the ARI, the F1 scores increase with the removal of the dataset-specific clusters, but the results would then not take into account the misannotated cells. For this reason we have kept all cells for the calculations of both metrics.
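
The effect of retaining the dataset-specific cells can be illustrated with a small, self-contained Python sketch. It is an assumption-laden toy, not the authors' code: the function names, the labels and the six example cells are hypothetical. The ARI is computed on the unmodified labels, while for F1 every non-common label is collapsed to "Other", kept in the counts, and not scored itself.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


def f1_per_common_type(true_labels, predicted_labels, common_types):
    """F1 per common cell type; non-common cells are kept as 'Other'."""
    def collapse(label):
        return label if label in common_types else "Other"

    pairs = [(collapse(t), collapse(p)) for t, p in zip(true_labels, predicted_labels)]
    scores = {}
    for cell_type in sorted(common_types):
        tp = sum(t == cell_type and p == cell_type for t, p in pairs)
        fp = sum(t != cell_type and p == cell_type for t, p in pairs)
        fn = sum(t == cell_type and p != cell_type for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[cell_type] = 2 * tp / denominator if denominator else 0.0
    return scores


# Toy example: "pDC" exists only in the published annotation (made-up data),
# so those cells are necessarily given one of the common labels.
published = ["B", "B", "T", "T", "pDC", "pDC"]
predicted = ["B", "B", "T", "T", "B", "T"]
common = {"B", "T"}

ari_all = adjusted_rand_index(published, predicted)
f1_all = f1_per_common_type(published, predicted, common)

# Same metrics after dropping the dataset-specific cells.
kept = [i for i, label in enumerate(published) if label in common]
ari_common = adjusted_rand_index([published[i] for i in kept],
                                 [predicted[i] for i in kept])
f1_common = f1_per_common_type([published[i] for i in kept],
                               [predicted[i] for i in kept], common)
```

In this toy case both metrics are perfect once the "pDC" cells are dropped and lower when they are kept, because the retained cells count as false positives for the common labels they were assigned.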

The F1 scores of the Naïve CD4 and CD8 were calculated over the T cells to reflect the conventional way of calculating the frequencies of those clusters over T cells alone. No predicted Naïve CD8 T cells had a non-T cell identity and only 15 out of the total 10294 cells that were predicted to be Naïve CD4 T cells were non-T cells.

We have added the description in the legend of Figure 1D.

Indeed the frequencies of cell types are dependent on each other, and this is the case for the outliers in the subsets of CD4 and CD8 cells which come from the same patient sample. However, the outlier of the NK sample is a distinct sample. Given that this was not a widespread effect, we did not consider removing that sample from the integration.

The Reviewer raises an important point regarding known issues with the UMAP algorithm. To address this potential ambiguity, we have added the relevant PAGA figures in the Supplementary Figures. The basophils appear well separated in all graphs, since it is a population with a very distinct signature, even though it is only present in the CyTOF dataset. However, it is apparent that although the plasmablasts (PB) separate well in the COMBAT integration (Supplementary Figure 2), this is not always the case in the integration with the refPBMC samples (Supplementary Figure 9). In line with the F1 scores for this population, the Day 0 integration (COMBAT HC and unvaccinated refPBMC) shows the PB separating well and appearing closest to the B cells. On the other hand, in the Mixed integration (all COMBAT and refPBMC) the PBs appear closer to the monocyte populations, highlighting the inaccuracies of the integration, which were also seen in the predicted labels, where some monocytes were wrongly predicted as PBs by the model.

We agree with the Reviewer that these figures could further enhance our argument of the ability of the imputation to determine protein expression from the overall RNA profile and have added these in our Supplementary Figure 6 and amended the text to highlight this.

We apologise that this section of the results was not comprehensive enough in evaluating the issue with the PBs’ misannotations and we thank the Reviewer for highlighting this omission so that we can address it. We have expanded the results section of the text to include another scenario where all the twenty-four refPBMC samples are integrated only with the CyTOF samples of healthy controls (from COMBAT). In this integration, we observed a much higher F1 score on the predicted PBs (0.68 vs 0.1 for the Mixed integration), highlighting that the reference dataset is key for the proper annotation of the query dataset and that abnormal frequencies in the reference population can create artefacts in the projected populations. However, that does not fully address the question of the discordance between the F1 score of PBs in the Mixed integration and the high imputation values that we showed in Figure 3C. The reason for this discordance is that the misannotated PBs are a very low proportion of cells (0.016) out of the total refPBMCs that the correlations are calculated on. Therefore, the effect of this subpopulation on the correlations with the ADT values is minuscule. As we showed in the revised version of our manuscript, the failure to correctly infer PB cell-type annotations is a consequence of the inflated PBs in the reference population (due to the COVID-19 samples) and not of poor PB annotation in the original dataset.

References

Gao, C., Liu, J., Kriebel, A. R., Preissl, S., Luo, C., Castanon, R., … Welch, J. D. (2021). Iterative single-cell multi-omic integration using online learning. Nature Biotechnology, 1–8.

Welch, J. D., Kozareva, V., Ferreira, A., Vanderburg, C., Martin, C., & Macosko, E. Z. (2019). Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell, 177(7), 1873-1887.e17.

[1] https://github.com/welch-lab/liger/issues/262
Competing Interests: None

COMMENTS ON THIS REPORT

Author Response 04 Nov 2022

Εμμανουέλα Ρεπαπή, Medical Research Council Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK

Thank you for reviewing our manuscript and for your helpful suggestions. Below are point-by-point responses to the individual comments.

We agree with the Reviewer that this needs further explanation and we have added more details in the Methods section along with a Supplementary Table comparing the options tested using the ARI and F1 metrics. We also tried LIGER with both the iNMF (Welch et al., 2019) and the online iNMF (Gao et al., 2021) algorithms. However, as discussed earlier this year in its GitHub repository[1], LIGER does not currently have the option to directly transfer labels from a reference to a query dataset as part of its workflow. The closest equivalent is to cluster the integrated object and assign labels to the clusters via the majority rule from the reference. As a result, the integrations did not perform well in producing a clustering in high agreement with the published annotation labels (ARI of 0.546 and 0.345 respectively). We therefore felt that the comparison with the Seurat pipeline would not be on equal terms and could not be formally included in the paper.
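For illustration only, the majority-rule workaround described above can be sketched in a few lines of Python. This is not code from the manuscript or from LIGER; the function name `majority_rule_labels` and the toy inputs are hypothetical, and the sketch assumes that reference and query cells have already been clustered together.

```python
from collections import Counter, defaultdict


def majority_rule_labels(clusters, ref_labels):
    """Give every cell the most common reference label found in its cluster.

    clusters   : cluster id per cell (reference and query cells together)
    ref_labels : reference label per cell, or None for query cells
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(clusters, ref_labels):
        if label is not None:  # only reference cells vote
            votes[cluster][label] += 1
    # Clusters without any reference cell stay unlabelled (None)
    cluster_to_label = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [cluster_to_label.get(c) for c in clusters]


# Hypothetical toy data: cluster 1 holds one PB and two monocyte reference cells
clusters = [0, 0, 0, 1, 1, 1, 1]
ref_labels = ["B", "B", None, "PB", "Mono", "Mono", None]
transferred = majority_rule_labels(clusters, ref_labels)
print(transferred)
```

In this toy input the single PB reference cell shares a cluster with two monocytes, so the whole cluster is labelled as monocytes. This is one way in which a cluster-level majority vote can lose a rare population that a per-cell label transfer might retain.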

We thank the Reviewer for highlighting this useful exploration for the benefit of future readers. We agree that this will add value to the paper and provide important information for anyone trying to utilise this approach. We have revised the text to include these important comparisons in the Results and added Table 2 with the ARI and F1 metrics comparing them. For most populations the F1 scores remain relatively consistent between the three scenarios, apart from the DC and PB populations, which showed higher variability. For these subpopulations the “per sample” integration had the lowest scores, indicating reduced power to detect rare or low-frequency populations when cell numbers are small. Furthermore, we have added Supplementary Figure 11 to show that the imputations of the CyTOF features are very similar between the three scenarios.

We apologize that these evaluations were not explained clearly. We have revised the Methods section to reflect more accurately that the non-common (Other) cell types were only grouped together for the F1 calculations. The ARI metric compared the projected cell types from the CyTOF to the published cell annotations (including both common and dataset-specific clusters) without any changes to the clusters’ names. Removing the dataset-specific clusters results in higher estimates of ARI in all scenarios of our study, but we felt that including all clusters gives a more representative view of cluster agreement, as it accounts for dataset-specific cells which have been misannotated with one of the common labels. For the F1 measure, the cells from the cluster Other were retained to calculate the precision and sensitivity for each cluster (again to ensure that misclassified cells are accurately represented), but the F1 measure was not calculated for this cluster as it was a mix of multiple clusters. As with the ARI, the F1 scores increase when the dataset-specific clusters are removed, but the results then do not take the misannotated cells into account. For this reason we have kept all cells for the calculations of both metrics.
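A minimal Python sketch of the F1 calculation described above is given here for illustration. It is an assumption-laden toy, not the authors' code: the function name `per_class_f1`, the label names and the example cells are all hypothetical. Dataset-specific true labels are collapsed into Other, those cells still count as false positives for the common classes, and no F1 is reported for Other itself.

```python
def per_class_f1(true_labels, pred_labels, common, other="Other"):
    """Per-class F1 for the common classes, keeping dataset-specific cells.

    True labels outside `common` are collapsed into `other`; they remain in
    the calculation, but no score is reported for `other` itself.
    """
    true = [t if t in common else other for t in true_labels]
    scores = {}
    for cls in sorted(common):
        pairs = list(zip(true, pred_labels))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[cls] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Hypothetical example: two basophils (CyTOF-only) are mislabelled as B and NK
true = ["B", "B", "NK", "Baso", "Baso"]
pred = ["B", "B", "NK", "B", "NK"]
common = {"B", "NK"}

with_other = per_class_f1(true, pred, common)

# Same calculation after dropping the dataset-specific cells
kept = [(t, p) for t, p in zip(true, pred) if t in common]
without_other = per_class_f1([t for t, _ in kept], [p for _, p in kept], common)

print(with_other, without_other)
```

In this toy case, dropping the dataset-specific cells raises both F1 scores to 1.0, because the two mislabelled basophils no longer count as false positives. This is the inflation that keeping all cells is meant to avoid.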

The F1 scores of the Naïve CD4 and CD8 were calculated over the T cells to reflect the conventional way of calculating the frequencies of those clusters over T cells alone. No predicted Naïve CD8 T cells had a non-T cell identity and only 15 out of the total 10294 cells that were predicted to be Naïve CD4 T cells were non-T cells.

We have added the description in the legend of Figure 1D.

Indeed the frequencies of cell types are dependent on each other, and this is the case for the outliers in the subsets of CD4 and CD8 cells which come from the same patient sample. However, the outlier of the NK sample is a distinct sample. Given that this was not a widespread effect, we did not consider removing that sample from the integration.

The Reviewer raises an important point regarding known issues with the UMAP algorithm. To address this potential ambiguity, we have added the relevant PAGA figures in the Supplementary Figures. The basophils appear well separated in all graphs, since it is a population with a very distinct signature, even though it is only present in the CyTOF dataset. However, it is obvious that although the plasmablasts (PB) are separating well in the COMBAT integration (Supplementary Figure 2), this is not always the case in the integration with the refPBMC samples (Supplementary Figure 9). Similarly to the results of the F1 score for this population, the Day 0 integration (COMBAT HC and unvaccinated refPBMC) shows the PB separating well and appearing the closest to the B cells. On the other hand, in the Mixed integration (all COMBAT and refPBMC) the PBs appear closer to the monocyte populations, highlighting the inaccuracies of the integration which were also shown in the predicted labels where some monocytes were wrongly predicted as PBs by the model.

We agree with the Reviewer that these figures could further enhance our argument of the ability of the imputation to determine protein expression from the overall RNA profile and have added these in our Supplementary Figure 6 and amended the text to highlight this.

We apologise that this section of the results did not evaluate the PB misannotations more comprehensively, and we thank the Reviewer for highlighting this omission so that we can address it. We have expanded the Results section to include another scenario in which all twenty-four refPBMC samples are integrated only with the CyTOF samples of healthy controls (from COMBAT). In this integration, we observed a much higher F1 score for the predicted PBs (0.68 vs 0.1 for the Mixed integration), highlighting that the reference dataset is key for the proper annotation of the query dataset and that abnormal frequencies in the reference population can create artefacts in the projected populations. However, this does not fully address the discordance between the F1 score of PBs in the Mixed integration and the high imputation values that we showed in Figure 3C. The reason for this discordance is that the misannotated PBs are a very low proportion (0.016) of the total refPBMC cells on which the correlations are calculated. Therefore, the effect of this subpopulation on the correlations with the ADT values is minuscule. As we showed in the revised version of our manuscript, the failure to correctly infer PB cell-type annotations is a consequence of the inflated PBs in the reference population (due to the COVID-19 samples) and not of poor PB annotation in the original dataset.

References

Gao, C., Liu, J., Kriebel, A. R., Preissl, S., Luo, C., Castanon, R., … Welch, J. D. (2021). Iterative single-cell multi-omic integration using online learning. Nature Biotechnology, 1–8.

Welch, J. D., Kozareva, V., Ferreira, A., Vanderburg, C., Martin, C., & Macosko, E. Z. (2019). Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell, 177(7), 1873-1887.e17.

[1] https://github.com/welch-lab/liger/issues/262
Competing Interests: None

Version 3

VERSION 3 PUBLISHED 23 May 2022

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2
Version 3 (revision) 02 May 23	read	read
Version 2 (revision) 04 Nov 22	read	read
Version 1 23 May 22	read

Tallulah Andrews, Western University of Ontario, London, Canada
Xiang Chen, St. Jude Children's Research Hospital, Memphis, USA

Reviewer Report

10 Oct 2023 | for Version 3

Tallulah Andrews, Western University of Ontario, London, Canada

Approved

I thank the authors for their clear and thorough responses to my comments and believe the paper is much improved.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics, Single-cell RNAseq analysis and tool development.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

06 Jun 2023 | for Version 3

Xiang Chen, Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA

Approved With Reservations

In this revision, the authors clarified several concerns, but unfortunately some concerns remain unresolved.

Since the authors did not perform experiments to validate their claims, they have to establish numerically the claimed value of their proposed approach:

1) To aid cell population annotation. There are many published scRNA-seq-based cell type annotation pipelines, including both unsupervised and supervised approaches. Please demonstrate that the annotation derived from the proposed approach is significantly better than the current state-of-the-art scRNA-seq-based annotation algorithms.
2) To identify rare subpopulations. Again, most scRNA-seq analyses do not group individual cells by a small set of genes (due to the low capture efficiency). Instead, the analysis/inference is mostly performed on groups of cells (or clusters). In the case of CD11c+ B cells with the protein markers CD11c, CD27 and IgD, considering that the expression of CD11c was used as prior biological knowledge, we can first run scRNA-seq clustering analysis for the COMBAT and Su data, followed by identification of subpopulations that enrich/over-express CD11c. Please demonstrate the value of the proposed approach compared to this standard scRNA-seq-based analysis.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Cancer multiOMICS, single cell analysis, computational method development, machine learning

Reviewer Report

26 Jan 2023 | for Version 2

Xiang Chen, Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA

Approved With Reservations

1) What is the value of this approach in real data analysis? Single cell analyses, such as clustering and pseudotemporal analysis of scRNA-seq data, rarely rely on a small set of known markers. Instead, the whole transcriptome (or at least the top variably expressed genes) is used in the analysis. Although the relative expression of known biomarkers is useful in annotation, there are many existing algorithms to annotate the cell type based on the single cell transcriptome. Therefore, what is the additional value of imputed protein levels of a limited set of biomarkers for a given scRNA-seq dataset? If the purpose is to validate the annotation, I don’t think an imputed biomarker expression is sufficient to replace the actual protein-level measurement, given the widely available FACS sorting as well as newly developed single cell analysis platforms. The authors did provide an example of characterizing the CD11c+ B cells in the manuscript. However, the authors’ approach (using an arbitrary threshold to separate the B cells into CD11c+ and CD11c- cells) is very different from their claim in the abstract that they “identify and transcriptionally characterize a rare subpopulation of CD11c positive B cells”. Specifically, they have not established that these CD11c strong cells form a unique rare subpopulation using either scRNA-seq data or their imputed protein levels. If this subset of B cells does form a unique rare subpopulation, is it possible to identify it using standard clustering of the scRNA-seq data only? In any case, Figure 4A did suggest a high level of concordance between the RNA level and the measured protein level for CD11c, and I am not sure what the imputed protein level added to the identification of such a rare subpopulation.
2) The design of the experiment needs justification. This manuscript applied standard approaches to existing datasets and did not run any experimental validation of the findings. Therefore, an appropriate experimental design is critical to justify the findings. For example, although the authors claimed that the published annotations were retained as the gold standard, there are apparently more cell types reported in the COMBAT Cell paper than were included in this manuscript. Why did the authors decide to merge several clusters (i.e., different B cells, different CD4+ T, different CD8+ T) into a single identity while not merging others (i.e., CD14+ monocytes and CD16+ monocytes)? If the existing imputation approach only works at a coarser level of subpopulation finding than existing scRNA-seq data can achieve, why do we need the imputed data?
3) The selection of algorithm parameters needs justification. Similarly, the selection of the parameters is important for scRNA-seq analysis. Please describe the rationale behind the selected parameters (where they differ from the defaults). For example, why was a different k.weight used for the COMBAT and the COVID-19 scRNA-seq data? If the parameters were selected to give the best F1/ARI score on the datasets reported here, is it possible that they overfit these specific datasets and are not generalizable?

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Cancer multiOMICS, single cell analysis, computational method development, machine learning

Author Response

02 May 2023

Εμμανουέλα Ρεπαπή, Medical Research Council Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK

Firstly, we would like to thank you for your detailed review of our manuscript and your helpful suggestions for improving it. Below are our responses to the individual comments.

1) We apologise that the aims of this methodology may not have been clearly outlined in the manuscript. The integration of CyTOF and scRNA-Seq can provide an affordable alternative to a unified view across modalities within the same cells, with similar advantages to the methods that measure proteomic and transcriptomic modalities in single cells, which have been developed over the last few years (such as CITE-Seq, PLAYR, REAP-seq, ECCITE-Seq and SPARC) (Frei et al., 2016; Hao et al., 2021; Mimitou et al., 2019; Peterson et al., 2017; Reimegård et al., 2021). More specifically, in our manuscript we demonstrate the value of our approach in real data analysis in two ways. Firstly, by aiding the annotation of populations that are difficult to disentangle from RNA data alone (such as CD4 populations). Secondly, by identifying rare subpopulations that have not been studied in enough detail for a transcriptomic signature to exist, such as the subpopulations of the CD11c positive B cells, namely the naïve-like CD11c+, the switched memory CD11c+ and the DN CD11c+. As a result, we are consolidating the annotations between the two modalities, which are currently largely disjoint. This is done by transferring the annotations of the CD11c+ subpopulations that are ‘protein driven’ (based on the markers CD11c, CD27 and IgD) to the scRNA-Seq dataset and then identifying their transcriptional signature, with the caveat, however, that these signatures have not been experimentally validated. In this way the integration of modalities can lead to the understanding of the causal relationships between -omics data, a point that has been noted as an important milestone in cellular biology that can have implications in medical diagnostics and treatments (Colomé-Tatché & Theis, 2018). We have amended the Introduction of our manuscript to clearly state our aims.
Even though using reference datasets can contribute to the annotation of well-established populations (like CD4), the remaining outcomes are not something that the existing algorithms that annotate cell types based on single cell transcriptomes can do. We have amended the Discussion to highlight these important differences to the already available algorithms for annotation based on single cell transcriptomes. We also agree with the Reviewer that the integration of CyTOF with scRNA-Seq datasets is not intended to substitute the invaluable information gained from any actual protein level measurements, instead it is aimed at creating hypotheses that can then be validated using actual protein level measurements. We have revised the Discussion of the manuscript to emphasise this critical point.

The Reviewer also commented on the threshold that was used for selecting the positive CD11c cells. The thinking behind using this specific threshold for annotating the CD11c+ cells was that the distribution of the imputed values for CD11c followed the CyTOF distribution of that marker since the imputation of the markers is a weighted average of the original values and therefore, we could use this information to select the threshold. We appreciate, nonetheless, that using this rather arbitrary threshold on the imputed CyTOF may not be a convincing enough strategy for selecting the CD11c+ B cells. For this reason, we proceeded by validating the annotation of this population firstly by showing an overlap with the measured ADT levels (Figure 4A); secondly by showing a common transcriptional signature between two independent integrations of the COMBAT and the Su et al. datasets using the same threshold (Figure 4B) and thirdly by overlaying this transcriptional signature with published datasets using a gene set enrichment analysis (GSEA) and showing a very significant enrichment (Figure 4C). We would like to highlight here that the published signatures of CD11c+ cells were the result of sorting the CD11c+ populations in order to enrich for these rare phenotypes and therefore can be considered as the ‘ground truth’ of this transcriptional signature.

Although we observed a concordance between the expression levels and the measured protein levels (ADT) of CD11c in B cells (Figure 4A), the expression levels of CD11c were found to be very low and therefore unable to separate these subpopulations. To address this more concretely, we have included Supplementary Figure 13 with the difference in expression between CD11c+ and CD11c- cells, showing that a large proportion of the cells predicted to be CD11c+ have no expression of CD11c (ITGAX). In more detail, only 48.4% of the predicted CD11c+ cells had any CD11c expression at all (expression levels greater than 0) in the COMBAT dataset and only 40.4% of the predicted CD11c+ cells had any expression at all in the Su et al. dataset. We have amended our Results to include these comparisons.
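The percentages quoted above correspond to a simple calculation: the fraction of predicted CD11c+ cells with an ITGAX count greater than zero. The sketch below shows that calculation on made-up numbers. The function name `fraction_expressing` and the toy counts are hypothetical and are not taken from either dataset.

```python
def fraction_expressing(expression, is_positive):
    """Fraction of cells flagged as positive whose expression value is > 0."""
    selected = [x for x, pos in zip(expression, is_positive) if pos]
    if not selected:
        return float("nan")  # no predicted-positive cells to evaluate
    return sum(x > 0 for x in selected) / len(selected)


# Hypothetical ITGAX counts for eight B cells and their predicted CD11c status
itgax_counts = [0, 2, 0, 1, 0, 0, 3, 0]
predicted_cd11c_pos = [True, True, True, True, False, False, False, False]

frac = fraction_expressing(itgax_counts, predicted_cd11c_pos)
print(f"{frac:.1%} of predicted CD11c+ cells have detectable ITGAX")
```

A value well below 100% on real data, as reported above, would indicate that thresholding ITGAX counts alone misses many of the cells that the protein-driven annotation calls CD11c+.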

Furthermore, we would like to clarify that this B cell subpopulation was not identified in the original COMBAT paper from scRNA-Seq data only (Ahern et al., 2022). Even though it was noted that subpopulations of the CD11c+ B cells in mass cytometry were found to be significantly increased in community and convalescent COVID-19 samples, this was not verified from the transcriptional point of view, most likely due to the differences in annotation between the mass cytometry and scRNA-Seq datasets. This further highlights the need to consolidate the annotations between modalities to facilitate bridging the gap between different approaches of cellular biology.

Finally, we would like to emphasise here that we do not postulate that the CD11c+ cells form a homogeneous rare subpopulation. On the contrary, we are claiming that even though this subpopulation has a transcriptional signature which separates it from the remaining B cells (CD11c- cells) (Figure 4B), it is very likely that there is further heterogeneity within the CD11c+ cells. This has been explored in Figure 5 but further validation is necessary to confirm this hypothesis.

2) The Reviewer is correct in noting that in this manuscript there is no experimental validation of the findings. Instead, for the first two sections of the Results, we used: a) the published annotations as the gold standard to compare to the transferred labels and b) datasets with measured ADT protein levels, which we compared to the imputed protein levels, to validate the recommended methodology. Both these comparisons confirmed high agreement with the ‘ground truth’ results in most cases. However, there are instances where our methodology did not perform well, for example in identifying the GDT cells (in all scenarios) and the PB cells (in the unmatched scenario). These cases are examined at length in our Results and the caveats of this methodology are discussed in the Discussion.

The Reviewer is also correct in that there were more COMBAT cell types than the ones included in this manuscript. The reasoning behind the merging of the cell types was based on the granularity of the cell types used in Supplementary Figure 1 (A-B) of the Ahern et al. paper, in which the ADT is integrated with the CyTOF to validate the cell composition between the two datasets. The intention of this manuscript is to provide an affordable alternative to the costly ADT measurements and therefore we show an equivalent integration of the RNA with the CyTOF datasets, using the same groupings. Similarly, for the Human PBMC Atlas datasets, the annotations were merged to match the COMBAT ones, to have comparable results in terms of the ARI and F1 metrics. We realise that this is not explained properly in the manuscript, which was a clear omission on our part. Therefore, we have now added this explanation in the Methods section.

Finally, in the Discussion we highlight that the integration is heavily dependent on the quality and quantity of the common CyTOF-RNA features. If there are no available markers to separate specific subpopulations in the CyTOF dataset or if the efficiency of the markers is suboptimal or if there is no RNA-CyTOF corresponding pair, then these subpopulations will not be identifiable in the integration either. This is specifically discussed for the case of the GDT cells in which the TCRgd antibody was added in the same channel as IgD, and thus there was no one to one correspondence between the genes and the protein channels anymore. This caused a suboptimal annotation of the GDT populations. We have changed the Discussion to clarify these cases where the integration would be suboptimal. Nonetheless, in the manuscript we demonstrate that when there is a good availability of markers, the subpopulations of interest can also be found for more detailed annotations, as we have done with the CD11c+ B cell subpopulation. With this example we have shown that it is possible to achieve good integration in finer subpopulations and that the imputation is also possible for more detailed population definitions.

3) We thank the Reviewer for highlighting this important detail of the manuscript. We completely agree that the selection of parameters is often a critical step in assessing the validity of a methodology. This point was also raised by Dr. Andrews, who highlighted that it would be important to mention the alternative options explored for the dimensionality reduction. To this end, we included Supplementary Table 1 with five alternative options and how they compare to the selected choice. These comparisons were performed using the ARI and F1 scores so the Reviewer has sound reasons for speculating that this might have led to the data being overfitted for this dataset. However, the results were then replicated in a second dataset, the Human PBMC Atlas dataset (Hao et al., 2021), where no comparisons were done between the different dimensionality reduction methods for optimisation. Instead, we used the same choice of dimensionality reduction and demonstrated a high agreement of the predicted annotations with the published ‘ground truth’ ones (ARI 0.87).

In agreement with the Reviewer’s comments, we tried to retain the default parameters wherever possible so that the results are generalisable. Nevertheless, there were two parameters that we decreased for the section ‘Characterisation of a rare B cell subpopulation of CD11c+ cells in COVID-19’. Firstly, it was necessary to decrease the number of dimensions for the CCA because we only had a reduced set of 26 B cell and Plasma markers as common features (using the marker set of 28 proteins that was used to annotate these subpopulations in Ahern et al., 2022). The number of dimensions for the CCA needs to be less than the number of features used in the FindTransferAnchors function and, therefore, we reduced that number from 30, which is the default, to 15. Since there is no ground truth for the annotations in this application, this number was not optimised in any way and no ARI or F1 was calculated for these comparisons. Consequently, in this instance there is no risk of overfitting the data. Secondly, we felt that the k.weight parameter needed further tuning for the identification of rare subpopulations such as the subpopulations of CD11c+ cells. This was decided based on the discussion on the GitHub repository[1], where it is suggested that reducing k.weight can help in the identification of rare cell types. Indeed, for very rare subpopulations, the default number of neighbours considered when weighting anchors (k.weight=50) can bias the predicted annotations towards cells that have a similar transcriptome but come from distinct subpopulations. In line with the suggestion from the package developers, we observed that reducing the k.weight parameter increased the number of cells that were predicted to belong to the CD11c+ subpopulations.
In more detail, for the integration of the COMBAT datasets, we observed that changing the k.weight from 50 to 30 increased the number of cells from 3 to 10 for the Naïve-like CD11c+ and from 29 to 58 for the Switched memory CD11c+ cells, and decreased it from 685 to 655 for the DN CD11c+ cells. Moreover, for the integration with Su et al. the numbers of cells increased from 21 to 54 for the Naïve-like CD11c+, from 25 to 42 for the Switched memory CD11c+ and from 452 to 475 for the DN CD11c+ cells. Further reduction of this parameter did not seem to change these predictions significantly. Finally, reducing the k.weight from 50 to 30 for the integration with the 12 COVID-19 scRNA-Seq datasets increased the subpopulations from 20 to 69 for the Naïve-like CD11c+, from 267 to 332 for the Switched memory CD11c+ and from 346 to 404 for the DN CD11c+ cells. Selecting a k.weight=20 further increased the Naïve-like CD11c+ to 153 and the Switched memory CD11c+ to 440 (the DN CD11c+ cells remained stable at 400), justifying decreasing this parameter further. We appreciate that this line of thought was not clearly explained in the manuscript and we have included the reasoning for the reduced k.weight in the Methods section. We would like to highlight here that this section does not include any ‘ground truth’ from ADT values and therefore the transcriptional characterisation of these subpopulations was not optimised in any way. Additionally, it does not substitute for the experimental validation that would need to be performed to better understand these subpopulations. The aim of this section, instead, is to demonstrate how this methodology can aid in understanding heterogeneity in very rare subpopulations and can generate interesting hypotheses that can be taken forward for further validation.
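
The intuition behind reducing k.weight can be illustrated with a deliberately simplified toy example. The sketch below is not Seurat's anchor-weighting implementation: it assumes cells and anchors sit on a one-dimensional axis, uses a hypothetical exponential distance weight, and the names `predict_label` and `k_weight` are chosen only for illustration. It shows how a query cell lying among a handful of rare anchors can be outvoted once many more distant anchors from an abundant population are included in the weighting.

```python
import math


def predict_label(query_pos, anchors, k_weight):
    """Distance-weighted vote over the k_weight anchors nearest to the query.

    anchors: list of (position, label) tuples on a toy one-dimensional axis.
    """
    nearest = sorted(anchors, key=lambda a: abs(a[0] - query_pos))[:k_weight]
    votes = {}
    for pos, label in nearest:
        weight = math.exp(-abs(pos - query_pos))  # closer anchors weigh more
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


# Two anchors from a rare population and ten from a nearby abundant one.
rare_anchors = [(0.0, "rare"), (0.1, "rare")]
common_anchors = [(0.6 + 0.02 * i, "common") for i in range(10)]
anchors = rare_anchors + common_anchors

query = 0.05  # a query cell sitting between the two rare anchors
for k in (3, 12):
    print(k, predict_label(query, anchors, k))
```

With few neighbours the two rare anchors dominate the vote; with all twelve anchors the summed weight of the abundant population exceeds theirs, so the rare label is lost even though the nearest anchors are rare.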

References
Ahern, D. J., Ai, Z., Ainsworth, M., Allan, C., Allcock, A., Angus, B., … Zurke, Y.-X. (2022). A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell, 185(5), 916-938.e58. https://doi.org/10.1016/j.cell.2022.01.012
Colomé-Tatché, M., & Theis, F. J. (2018, February 1). Statistical single cell multi-omics integration. Current Opinion in Systems Biology. Elsevier Ltd. https://doi.org/10.1016/j.coisb.2018.01.003
Frei, A. P., Bava, F.-A., Zunder, E. R., Hsieh, E. W. Y., Chen, S.-Y., Nolan, G. P., & Gherardini, P. F. (2016). Highly multiplexed simultaneous detection of RNAs and proteins in single cells. Nature Methods, 13(3), 269–275. https://doi.org/10.1038/nmeth.3742
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck, W. M., Zheng, S., Butler, A., … Satija, R. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13), 3573-3587.e29. https://doi.org/10.1016/j.cell.2021.04.048
Mimitou, E. P., Cheng, A., Montalbano, A., Hao, S., Stoeckius, M., Legut, M., … Smibert, P. (2019). Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nature Methods, 16(5), 409–412. https://doi.org/10.1038/s41592-019-0392-0
Peterson, V. M., Zhang, K. X., Kumar, N., Wong, J., Li, L., Wilson, D. C., … Klappenbach, J. A. (2017). Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology, 35(10), 936–939. https://doi.org/10.1038/nbt.3973
Reimegård, J., Tarbier, M., Danielsson, M., Schuster, J., Baskaran, S., Panagiotou, S., … Gallant, C. J. (2021). A combined approach for single-cell mRNA and intracellular protein expression analysis. Communications Biology, 4(1). https://doi.org/10.1038/S42003-021-02142-W

[1] https://github.com/satijalab/seurat/issues/1636

Competing Interests

No competing interests to disclose.

Reviewer Report

15 Nov 2022 | for Version 2

Tallulah Andrews, Western University of Ontario, London, Canada

Approved

I thank the authors for addressing all of my questions and concerns and found the comparison of the three different integration paradigms very useful and interesting. I am happy to approve the indexing of this article.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics, Single-cell RNAseq analysis and tool development.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

01 Sep 2022 | for Version 1

Tallulah Andrews, Western University of Ontario, London, Canada

Approved With Reservations

1) The authors mention that "Alternative options for the reduction were assessed but were found to be suboptimal (data not shown)." Which options were explored and how was it determined that they were suboptimal? Did the authors consider alternative integration methods, e.g. LIGER ¹?
2) Annotation and CyTOF intensity values were transferred to the RPCA integrated single-cell object combining many unannotated samples. However, RPCA integration can merge small rare cell populations into bigger clusters, particularly if that cell type is poorly represented in, or absent from, one of the samples. Based on the results for the integration of unmatched samples, it seems that transferring data to individual novel samples rather than integrated maps may be more accurate. Could the authors do a systematic comparison of imputing values for un-integrated novel samples vs integrated novel samples for a single biological condition vs integrated novel samples across all biological conditions? This would provide very valuable information for other researchers attempting to apply this approach to their own data.
3) Re: "For F1, the cells for which there was not an equivalent annotation in both datasets were all grouped as Other." Were the "Other" cells retained for the ARI & F1 calculations? This could lead to misleading results, as one of the known issues with CCA integration is the incorrect merging of cell types that are present in only one dataset, thus incorrectly merging Other with Other in this case despite them being two very different cell types. The authors should ensure that the ARI and F1 scores are calculated after excluding cells labelled as "Other".
4) Why did the authors calculate F1 scores for Naïve CD4 and Naïve CD8 populations only over T cells? Were there a significant number of non-T cells predicted to be Naïve CD4 / Naïve CD8 cells?
5) In the caption of Figure 1D, the authors should specify what is represented by the grey dashed line.
6) In Figure 1D the frequencies of cell types are dependent on each other since they must sum to 1. Is it therefore the case that the outliers from the 1:1 relationship in the CD4, CD8, and NK plots represent the same sample, or are these three different outlier samples?
7) Figure 1 and Figure 3 both show populations that appear clearly distinct in the UMAP but that have very low accuracy of prediction (Basophils and PB respectively). This seems incongruous but may be the result of known issues with UMAPs creating misleading visualizations. The authors should consider using an approach such as PAGA ² to visualize the relationships between the cell types more accurately, as this may clarify that the Basophils and PBs are not as distinct as they appear in the UMAP.
8) Figure 2 could be improved by including plots of imputed CD45RO vs CD45RA, and of each vs the RNA expression of the common gene (PTPRC), to illustrate more clearly the ability of the imputation to determine protein expression from the overall RNA profile of the cells rather than from the respective RNA expression.
9) The very high agreement between CyTOF and imputed ADT values in the unmatched dataset suggests that the failure to correctly infer PB cell-type annotations is not a consequence of poor integration but of poor PB annotation in the original dataset. Could the authors confirm this by examining the accuracy of the protein expression imputation (e.g. Figure 3D) specifically for the falsely-predicted PB cells, as well as examining the PB-specific markers from the COMBAT dataset within the falsely-predicted PB cells vs the correctly predicted PB cells?

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics, Single-cell RNAseq analysis and tool development.

Author Response

04 Nov 2022

Emmanouela Repapi, Medical Research Council Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK

Thank you for reviewing our manuscript and for your helpful suggestions. Below are point-by-point responses to the individual comments.

1) We agree with the Reviewer that this needs further explanation and we have added more details in the Methods section along with a Supplementary Table comparing the options tested using the ARI and F1 metrics. We also tried LIGER with both the iNMF (Welch et al., 2019) and the online iNMF (Gao et al., 2021) algorithms. However, as discussed earlier this year in their GitHub repository[1], LIGER does not currently have the option to directly transfer labels from a reference to a query dataset as part of its workflow. The closest equivalent is to cluster the integrated object and assign labels to the clusters via the majority rule from the reference. With this approach, the integrations did not produce a clustering in high agreement with the published annotation labels (ARI of 0.546 and 0.345 respectively). As a result, we felt that the comparison with the Seurat pipeline would not be on equal terms and could not be formally included in the paper.
2) We thank the Reviewer for highlighting this useful exploration for the benefit of future readers. We agree that this will add value to the paper and provide important information for anyone trying to utilise this approach. We have revised the text to include these important comparisons in the Results and added Table 2 with the ARI and F1 metrics comparing them. For most populations the F1 scores remain relatively consistent between the three scenarios, apart from the DC and PB populations, which showed higher variability. For these subpopulations the “per sample” integration had the lowest scores, indicating the reduced power to detect rare or low frequency populations when cell numbers are small. Furthermore, we have added Supplementary Figure 11 to show that the imputations of the CyTOF features are very similar between the three scenarios.
3) We apologise that these evaluations were not explained clearly. We have revised the Methods section to reflect more accurately that the non-common (Other) cell types were only grouped together for the F1 calculations. The ARI metric compared the projected cell types from the CyTOF to the published cell annotations (including both common and dataset-specific clusters) without any changes to the clusters’ names. Removing the dataset-specific clusters results in higher estimates of ARI in all scenarios of our study, but we felt that including all clusters gives a more representative view of cluster agreement, as it considers dataset-specific cells that have been misannotated with one of the common labels. For the F1 measure, the cells from the cluster Other were retained to calculate the precision and sensitivity for each cluster (again to ensure that misclassified cells are accurately represented), but the F1 measure was not calculated for this cluster as it was a mix of multiple clusters. As with the ARI, the F1 scores increase when the dataset-specific clusters are removed, but the results then do not take into account the misannotated cells. For this reason we have kept all cells for the calculations of both metrics.
4) The F1 scores of the Naïve CD4 and CD8 were calculated over the T cells to reflect the conventional way of calculating the frequencies of those clusters over T cells alone. No predicted Naïve CD8 T cells had a non-T cell identity and only 15 out of the total 10,294 cells that were predicted to be Naïve CD4 T cells were non-T cells.
5) We have added a description of the grey dashed line to the legend of Figure 1D.
6) Indeed the frequencies of cell types are dependent on each other, and this is the case for the outliers in the subsets of CD4 and CD8 cells, which come from the same patient sample. However, the outlier in the NK plot is a distinct sample. Given that this was not a widespread effect, we did not consider removing that sample from the integration.
7) The Reviewer raises an important point regarding known issues with the UMAP algorithm. To address this potential ambiguity, we have added the relevant PAGA figures in the Supplementary Figures. The basophils appear well separated in all graphs, since it is a population with a very distinct signature, even though it is only present in the CyTOF dataset. However, it is clear that although the plasmablasts (PB) separate well in the COMBAT integration (Supplementary Figure 2), this is not always the case in the integration with the refPBMC samples (Supplementary Figure 9). In line with the F1 scores for this population, the Day 0 integration (COMBAT HC and unvaccinated refPBMC) shows the PB separating well and appearing closest to the B cells. On the other hand, in the Mixed integration (all COMBAT and refPBMC) the PBs appear closer to the monocyte populations, highlighting the inaccuracies of the integration, which were also evident in the predicted labels, where some monocytes were wrongly predicted as PBs by the model.
8) We agree with the Reviewer that these figures could further strengthen our argument that the imputation can determine protein expression from the overall RNA profile. We have added them to Supplementary Figure 6 and amended the text to highlight this.
9) We apologise that this section of the results did not evaluate the issue with the PBs’ misannotations more comprehensively, and we thank the Reviewer for highlighting this omission so that we can address it. We have expanded the results section of the text to include another scenario where all twenty-four refPBMC samples are integrated only with the CyTOF samples of healthy controls (from COMBAT). In this integration, we observed a much higher F1 score on the predicted PBs (0.68 vs 0.1 for the Mixed integration), highlighting that the reference dataset is key for the proper annotation of the query dataset and that abnormal frequencies in the reference population can create artefacts in the projected populations. However, that does not fully address the question of the discordance between the F1 score of PBs in the Mixed integration and the high imputation values that we showed in Figure 3C. The reason for this discordance is that the misannotated PBs are a very low proportion of cells (0.016) out of the total refPBMCs on which the correlations are calculated. Therefore, the effect of this subpopulation on the correlations with the ADT values is minuscule. As we showed in the revised version of our manuscript, the failure to correctly infer PB cell-type annotations is a consequence of the inflated PBs in the reference population (due to the COVID-19 samples) and not of poor PB annotation in the original dataset.
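
The handling of the dataset-specific (Other) cells in the F1 calculation, as described in the responses above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the manuscript: the function name `f1_per_common_cluster`, the `common_labels` argument and the toy annotations are hypothetical. Labels outside the common set are pooled into Other, those cells still count towards precision and sensitivity, and no F1 is reported for Other itself.

```python
def f1_per_common_cluster(published, predicted, common_labels, other="Other"):
    """Per-cluster F1 with dataset-specific labels pooled into `other`.

    Cells labelled `other` are kept when counting false positives and false
    negatives, but no F1 score is returned for the `other` group itself.
    """
    def pool(label):
        return label if label in common_labels else other

    truth = [pool(label) for label in published]
    calls = [pool(label) for label in predicted]

    scores = {}
    for cluster in sorted(common_labels):
        tp = sum(t == cluster and c == cluster for t, c in zip(truth, calls))
        fp = sum(t != cluster and c == cluster for t, c in zip(truth, calls))
        fn = sum(t == cluster and c != cluster for t, c in zip(truth, calls))
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        total = precision + sensitivity
        scores[cluster] = 2 * precision * sensitivity / total if total else 0.0
    return scores


# Hypothetical toy data: two dataset-specific cells receive common labels.
published = ["B", "B", "T", "T", "Basophil", "Platelet"]
predicted = ["B", "B", "T", "T", "T", "B"]
scores = f1_per_common_cluster(published, predicted, common_labels={"B", "T"})
print({cluster: round(value, 2) for cluster, value in scores.items()})
```

In this toy case every B and T cell is labelled correctly, yet both F1 scores fall below 1 because the two dataset-specific cells are misannotated with common labels. Dropping those cells beforehand would hide exactly this kind of error.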

References

Gao, C., Liu, J., Kriebel, A. R., Preissl, S., Luo, C., Castanon, R., … Welch, J. D. (2021). Iterative single-cell multi-omic integration using online learning. Nature Biotechnology, 1–8.

Welch, J. D., Kozareva, V., Ferreira, A., Vanderburg, C., Martin, C., & Macosko, E. Z. (2019). Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell, 177(7), 1873-1887.e17.

[1] https://github.com/welch-lab/liger/issues/262

Competing Interests

None

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Regev A, et al.: The human cell atlas. eLife. 2017; 6.

[2] Aldridge S, Teichmann SA: Single cell transcriptomics comes of age. Nat. Commun. 2020; 11: 1–4.

[3] Lähnemann D, et al.: Eleven grand challenges in single-cell data science. Genome Biol. 2020; 21: 53.

[4] Buettner F, et al.: Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 2015; 33: 155–160.

[5] Bodenmiller B, et al.: Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators. Nat. Biotechnol. 2012; 30: 858–867.

[6] Damond N, et al.: A Map of Human Type 1 Diabetes Progression by Imaging Mass Cytometry. Cell Metab. 2019; 29: 755–768.e5.

[7] Kashima Y, et al.: Potentiality of multiple modalities for single-cell analyses to evaluate the tumor microenvironment in clinical specimens. Sci. Reports. 2021; 11: 1–11.

[8] Reimegård J, et al.: A combined approach for single-cell mRNA and intracellular protein expression analysis. Commun. Biol. 2021; 4: 624.

[9] Labib M, Kelley SO: Single-cell analysis targeting the proteome. Nat. Rev. Chem. 2020; 4: 143–158.

[10] Levy E, Slavov N: Single cell protein analysis for systems biology. Essays Biochem. 2018; 62: 595–605.

[11] Adossa N, Khan S, Rytkönen KT, et al.: Computational strategies for single-cell multi-omics integration. Comput. Struct. Biotechnol. J. 2021; 19: 2588–2596.

[12] Ahern DJ, et al.: A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell. 2022; 185: 916–938.e58.

[13] Hao Y, et al.: Integrated analysis of multimodal single-cell data. Cell. 2021; 184: 3573–3587.e29.

[14] Stuart T, et al.: Comprehensive Integration of Single-Cell Data. Cell. 2019; 177: 1888–1902.e21.

[15] Mulè MP, Martins AJ, Tsang JS: Normalizing and denoising protein expression data from droplet-based single cell profiling. Nat. Commun. 2022; 13: 1–12.

[16] Tian Y, et al.: Single-cell immunology of SARS-CoV-2 infection. Nat. Biotechnol. 2021; 40: 30–41.

[17] Stewart A, et al.: Single-Cell Transcriptomic Analyses Define Distinct Peripheral B Cell Subsets and Discrete Development Pathways. Front. Immunol. 2021; 12: 743.

[18] Su Y, et al.: Multi-Omics Resolves a Sharp Disease-State Shift between Mild and Moderate COVID-19. Cell. 2020; 183: 1479–1495.e20.

[19] McInnes L, Healy J, Melville J: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.

[20] Wolf FA, Hamey FK, Plass M, et al.: PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 2019; 20(1): 1–9.

[21] Wolf FA, Angerer P, Theis FJ: SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 2018; 19(1): 1–5.

[22] Hubert L, Arabie P: Comparing partitions. J. Classif. 1985; 21(2): 193–218.

[23] Scrucca L, Fop M, Murphy TB, et al.: Mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016; 8: 289–317.

[24] McCarthy DJ, Campbell KR, Lun ATL, et al.: Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017; 33: btw777–btw1186.

[25] Yu G, Wang LG, Han Y, et al.: clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012; 16: 284–287.

[26] Mereu E, et al.: Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 2020; 38: 747–755.

[27] Ding J, et al.: Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat. Biotechnol. 2020; 38: 737–746.

[28] Elizaga ML, et al.: Safety and tolerability of HIV-1 multiantigen pDNA vaccine given with IL-12 plasmid DNA via electroporation, boosted with a recombinant vesicular stomatitis virus HIV Gag vaccine in healthy volunteers in a randomized, controlled clinical trial. PLoS One. 2018; 13: e0202753.

[29] Li SS, et al.: DNA priming increases frequency of T-cell responses to a vesicular stomatitis virus HIV vaccine with specific enhancement of CD8 T-cell responses by interleukin-12 plasmid DNA. Clin. Vaccine Immunol. 2017; 24.

[30] Golinski ML, et al.: CD11c+ B Cells Are Mainly Memory Cells, Precursors of Antibody Secreting Cells in Healthy Donors. Front. Immunol. 2020; 11: 32.

[31] Karnell JL, et al.: Role of CD11c+ T-bet+ B cells in human health and disease. Cell. Immunol. 2017; 321: 40–45.

[32] Sanz I, et al.: Challenges and opportunities for consistent classification of human b cell and plasma cell populations. Front. Immunol. 2019; 10: 2458.

[33] Portugal S, et al.: Malaria-associated atypical memory B cells exhibit markedly reduced B cell receptor signaling and effector function. eLife. 2015; 4.

[34] Jenks SA, et al.: Distinct Effector B Cells Induced by Unregulated Toll-like Receptor 7 Contribute to Pathogenic Responses in Systemic Lupus Erythematosus. Immunity. 2018; 49: 725–739.e6.

[35] Holla P, et al.: Shared transcriptional profiles of atypical B cells suggest common drivers of expansion and function in malaria, HIV, and autoimmunity. Sci. Adv. 2021; 7: 8384–8410.

[36] Schulte-Schrepping J, et al.: Severe COVID-19 Is Marked by a Dysregulated Myeloid Cell Compartment. Cell. 2020; 182: 1419–1440.e23.

[37] Oliviero B, et al.: Expansion of atypical memory B cells is a prominent feature of COVID-19. Cell. Mol. Immunol. 2020; 17: 1101–1103.

[38] Wildner NH, et al.: B cell analysis in SARS-CoV-2 versus malaria: Increased frequencies of plasmablasts and atypical memory B cells in COVID-19. J. Leukoc. Biol. 2021; 109: 77–90.

[39] Woodruff MC, et al.: Extrafollicular B cell responses correlate with neutralizing antibodies and morbidity in COVID-19. Nat. Immunol. 2020; 21: 1506–1516.

[40] Sutton HJ, et al.: Atypical B cells are part of an alternative lineage of B cells that participates in responses to vaccination and infection in humans. Cell Rep. 2021; 34: 108684.

[41] He B, et al.: Rapid isolation and immune profiling of SARS-CoV-2 specific memory B cell in convalescent COVID-19 patients via LIBRA-seq. Signal Transduct. Target. Ther. 2021; 6: 1–12.

[42] Repapi E, Agarwal D, Napolitani G, et al.: Supplementary Figures and Table for Repapi et al. 2022. [Dataset]. 2022.

[43] Repapi E, Agarwal D: emmanuelaaaaa/CyTOF_scRNA_integration: v2.0 (v2.0). Zenodo. [Analysis code]. 2022.
