Cell-type classification of cancer single-cell RNA-seq data using the Subsemble ensemble-based machine learning classifier

David Chen; Parisa Shooshtari

doi:10.12688/f1000research.125579.1

Home Browse Cell-type classification of cancer single-cell RNA-seq data using...

ALL Metrics

-

Views

-

Downloads

Get PDF

Get XML

Export

▬

✚

Research Article

Cell-type classification of cancer single-cell RNA-seq data using the Subsemble ensemble-based machine learning classifier

[version 1; peer review: 3 approved with reservations]

David Chen^1-3, Parisa Shooshtari ^1-4

PUBLISHED 14 Apr 2023

Author details Author details

¹ Department of Pathology and Laboratory Medicine, the University of Western Ontario, London, Ontario, ON N6A 5C1, Canada
² Department of Computer Science, the University of Western Ontario, London, Ontario, ON N6A 5C1, Canada
³ The Children’s Health Research Institute—Lawson Health Research Institute, London, Ontario, ON N6C 2R5, Canada
⁴ Ontario Institute for Cancer Research, Toronto, Ontario, ON M5G 0A3, Canada

David Chen
Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Parisa Shooshtari
Roles: Conceptualization, Funding Acquisition, Investigation, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

OPEN PEER REVIEW

REVIEWER STATUS

This article is included in the Bioinformatics gateway.

This article is included in the Cell & Molecular Biology gateway.

This article is included in the Machine Learning in Genomics collection.

This article is included in the Single-Cell RNA-Sequencing collection.

Abstract

Background
The advent of single-cell RNA sequencing (scRNA-seq) has provided a high-resolution overview of the cellular heterogeneity of different tissue types. Manual cell type annotation of gene expression datasets remains a useful but time-intensive task. Ensemble machine learning methods leverage the predictive power of multiple classifiers and can be applied to classify high-dimensional gene expression data. Here, we present a novel application of the Subsemble supervised ensemble machine learning classifier used to classify novel cells with known cell type labels using gene expression data.
Methods
First, we tested the classification performance of different pre-processing steps used to normalize and upsample the training dataset for the Subsemble using a colorectal cancer dataset. Second, we conducted a cross-validated performance benchmark of the Subsemble classifier compared to nine other cell type classification methods across five metrics tested, using an acute myeloid leukemia dataset. Third, we conducted a comparative performance benchmark of the Subsemble classifier using a patient-based leave-one-out cross-validation scheme. Rank normalized scores were calculated for each classifier to aggregate performance across multiple metrics.
Results
The Subsemble classifier performed best when trained on a dataset that was log-transformed then upsampled to generate balanced class distributions. The Subsemble classifier was consistently the top-ranked classifier across five classification performance metrics compared to the nine other baseline classifiers and showed an improvement in performance as the training dataset increased. When tested using the patient-based leave-one-out cross-validation scheme, the Subsemble was the top-ranked classifier based on rank normalized scores.
Conclusions
Our proof-of-concept study showed that the Subsemble classifier can be used to accurately predict known cell type labels from single-cell gene expression data. The top-ranked classification performance of the Subsemble across two validation datasets, two cross-validation schemes, and five performance metrics motivates future development of accurate ensemble classifiers of scRNA-seq datasets.

Keywords

Subsemble, scRNA-seq, machine learning, deep learning, supervised, classification, ensemble, cell-type annotation

Corresponding author: Parisa Shooshtari

Competing interests: No competing interests were disclosed.

Grant information: This project was supported by the Government of Canada through the Natural Sciences and Engineering Research Council (NSERC) (DGECR-2021-00298) (PS), and the Children's Health Research Institute (PS). PS is supported by an Early Investigator Award from the Ontario Institute for Cancer Research (OICR), and DC is supported by NSERC Undergraduate Student Research Award. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2023 Chen D and Shooshtari P. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Chen D and Shooshtari P. Cell-type classification of cancer single-cell RNA-seq data using the Subsemble ensemble-based machine learning classifier [version 1; peer review: 3 approved with reservations]. F1000Research 2023, 12:406 (https://doi.org/10.12688/f1000research.125579.1) First published: 14 Apr 2023, 12:406 (https://doi.org/10.12688/f1000research.125579.1) Latest published: 14 Apr 2023, 12:406 (https://doi.org/10.12688/f1000research.125579.1)

Introduction

Transcriptomic analyses are an increasingly important method to characterize biological processes, cell lineages, and the genetic profiles of cell states by quantifying all of the RNA in a cell population (Wirka et al., 2018). The advent of single-cell RNA sequencing (scRNA-seq) has increased the resolution to which analyses of cell type heterogeneity can be used to identify rare cell populations, reveal gene regulation networks, as well as trace cell lineages and map their fates (Hwang et al., 2018; VanHorn & Morris, 2021). Moreover, the growth in the number of scRNA-seq datasets and the resulting need to annotate cell type labels for each cell in these datasets remains a time and labor-intensive task. Automation of cell type annotation has previously been addressed using supervised (e.g. scPred (Alquicira-Hernandez et al., 2019), ACTINN (Ma & Pellegrini, 2019), LAmbDA (Johnson et al., 2019)), semi-supervised (e.g. CALLR (Wei & Zhang, 2021), semiRNet (Dong et al., 2022), scNym (Kimmel & Kelley, 2021)) and unsupervised (e.g. ScType (Ianevski et al., 2022), UNIFAN (Li et al., 2022), cellVGAE (Buterez et al., 2022)) machine learning methods.

Standalone machine learning classifiers and classical statistical methods used to annotate cell types may not be able to de-noise and fully leverage the complex parameter space from high-dimensional scRNA-seq datasets in order to make accurate, robust predictions (Asada et al., 2021). In comparison, deep learning classifiers are prone to overfit on training data, especially in cases where imbalanced classes lead to increased classification performance when tested on majority classes while trading for reduced classification performance when tested on minority classes (Oller-Moreno et al., 2021). Data augmentation using statistical techniques such as Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002) and more complex generative adversarial network models (Marouf et al., 2020) can be used to generate realistic observations needed to correct for class imbalance that may lead to suboptimal machine learning classification performance (Krawczyk, 2016).

Ensemble learning involves aggregating predictions from multiple individual machine learning classifiers to improve overall classification performance (Dietterich, 2000). Moreover, ensemble classifiers can be optimized by weighting the contribution of individual classifiers to the final prediction based on their historical performance (Akhter et al., 2021). For instance, previous studies have shown that ensemble machine learning models outperform individual machine learning classifiers when tasked with detecting doublets (Xiong et al., 2022), predicting RNA velocity (Wang & Zheng, 2021), clustering cells (Geddes et al., 2019), and impute read counts of dropout events (Lu et al., 2021) using scRNA-seq data as input.

The Subsemble is a subset ensemble prediction method that has been proposed to be scalable to large datasets by training individual classifiers on data subsets in parallel, employs a multi-layer model architecture that re-weights the contribution of individual classifiers to optimize the final prediction, and has been shown to consistently outperform individual classifiers across multiple test datasets (Sapp et al., 2014). The Subsemble improves model fit to the training data by combining subset-specific fits from multiple individual classifiers into a single meta prediction (Sapp et al., 2014).

Here, we introduce a novel proof of concept of the Subsemble supervised classifier coupled with a feature normalization and upsampling pipeline to accurately predict cell type labels from single-cell gene expression data (RRID: SCR_022784). First, we identified a novel feature normalization and upsampling pre-processing pipeline that optimized classification performance of a baseline support vector classifier. Second, we compared the supervised classification performance of the proposed pre-processing pipeline and Subsemble classifier to predict cell type labels from gene expression data against nine other machine learning classifiers, deep learning classifiers, and an existing cell type annotation method using a 10-fold stratified cross-validation scheme and five classification performance metrics. Third, we tested if information loss from the training data when using five-fold, 10-fold and 20-fold stratified cross-validation schemes affected classifier performance. Fourth, we benchmarked classifier performance when trained on gene expression data associated with one subset of patients and trained on a second subset of patients using a leave-one-out cross-validation scheme within the same dataset.

Methods

Datasets

Two scRNA-seq datasets with cell type labels were used to evaluate the classification performance of the proposed Subsemble classifier compared to existing machine learning-based classifiers. Each dataset was formatted as one cell by gene expression matrix and one table that mapped each cell to a known cell type label. The two datasets were downloaded using the TMExplorer tool (Christensen et al., 2022).

The first scRNA-seq dataset from Li et al. (2017) was sourced from 11 primary colorectal tumors. The dataset consisted of the read count matrix associated with 359 cells and 57,241 genes as well as the cell type label of each cell. Cells were labelled with one out of five possible cell type labels. This dataset was used to test if different data pre-processing steps of the model training dataset, such as log-transformation and Synthetic Minority Oversampling Technique (SMOTE) upsampling, would increase Subsemble classification performance compared to training the same model with the raw read counts.

The second dataset from van Galen et al. (2019) was sourced from 40 bone marrow aspirates of 16 acute myeloid leukemia patients. The dataset consisted of the read count matrix associated with 22,284 cells and 27,899 genes as well as the cell type label of each cell. Cells were labelled with one out of six possible cell type labels. First, this dataset was used to evaluate the 10-fold, stratified, cross-validated performance of the proposed Subsemble classifier across five different performance metrics compared to nine different machine learning and deep learning-based classifiers. Second, this dataset was also used to evaluate the leave-one-out cross-validated performance of the proposed Subsemble classifier when trained on a dataset subset associated with all but one patient and tested on a dataset associated with the hold-out patient.

Data preprocessing

Each of the two scRNA-seq expression datasets was reduced to 100 principal components using principal component analysis, log-transformed, and upsampled using SMOTE to generate balanced classes based on cell type labels.

Model architecture

The proposed Subsemble model is a supervised ensemble-based classifier that partitions the training scRNA-seq expression dataset into 10 subsets, outputs the probability that each cell belongs to a known cell type class assigned by one base layer of multiple different machine learning and deep learning classifiers on each subset, and fits one meta layer consisting of one machine learning classifier on the cell type class probabilities (Figure 1) (Chen & Shoostari 2022a). The base layer consists of one XGBoost classifier, one Random Forest classifier, one Multi-layer Perceptron classifier, and three Support Vector classifiers initialized with three different kernels (linear, third-degree polynomial, and radial base function). The meta layer consists of one Support Vector classifier initialized with the radial base function kernel. The trained Subsemble model can then be used to predict the cell type of an unknown cell from expression data (see Figure 1 for model architecture schematic). The Subsemble model does not predict class probabilities associated with each cell.

Figure 1. Schematic of the Subsemble supervised machine learning classifier architecture.

A single cell gene expression matrix with known cell type labels is used to train the Subsemble classifier. First, the training dataset is randomly partitioned into 10 random subsets. Second, the Subsemble classifier is trained individually on each subset and the predictions from each subset are pooled together to make a final cell type prediction when used to predict cell type from the testing dataset. Four types of base learners, including XGBoost (XGB), Random Forest (RF), Multi-layer Perceptron (MLP), and Support Vector Classifier (SVC), are trained on each subset of the training dataset. The output of each base machine learning classifier is the class probability that the cell belongs to each of the known cell type labels. Next, the meta learner Support Vector Classifier (SVC) is trained on the output class probabilities from the base learners for each cell across the 10 random subsets of the training dataset. Third, a single cell gene expression matrix without known cell type labels is used to test the Subsemble classifier. The Subsemble classifier will make a single prediction of a known cell type label observed in the training dataset.

Performance evaluation

Five performance metrics were used to evaluate the classification performance of the proposed Subsemble classifier compared to known machine learning classifiers such as Naïve Bayes (NB), AdaBoost (ADA), Decision Tree (DT), K-nearest Neighbors (KNN), Random Forest (RF), and Support Vector Classifier (SVC), an extreme gradient boosting classifier (XGBoost), a deep learning classifier (Multi-layer Perceptron), and a deep learning-based annotation method (ACTINN). Classification performance metrics associated with classifier were generated for each fold in the dataset cross-validation scheme and repeated five times. The same random seed was used to generate the same folds in each cross-validation scheme. The same dataset pre-processing steps were used to process the input data used to benchmark each classifier and ensure that the difference in classification performance can be attributed to the classifier used. Classification performance metrics are reported as the median value across all repeats and folds of the cross-validation scheme used.

‘Accuracy’ measured the number of correct predictions made by the classifier divided by the total number of predictions made. ‘Precision’ measured the number of true positives divided by the number of true positives and the number of false positives. ‘Recall’ measured the number of true positives divided by the number of true positives and the number of false negatives. ‘F1-score’ measured the product of precision and recall multiplied by two, then divided by the sum of precision and recall. Matthew’s Correlation Coefficient was used to measure classifier performance (Matthews, 1975) and is known to fairly represent classifier performance tested on imbalanced datasets (Chicco et al., 2021).

Rank normalized scores

To rank the classification performance of the nine baseline classifiers and the proposed Subsemble classifier across the five classification performance metrics, we employed a rank normalized scoring approach previously used to benchmark the performance of machine learning classifiers when trained and tested on drug discovery (Korotcov et al., 2017) and raman spectroscopy datasets (Chen, 2021). Rank normalized scores for each classifier were calculated by first ranking each classifier by their performance in each metric, where a classifier with a lower rank represents relatively increased performance compared to a classifier with a higher rank. Next, the ranks for each metric were summed together for each classifier and ordered from lowest to highest based on the rank sum, where a classifier with a lower rank sum represents relatively increased performance across multiple performance metrics compared to a classifier with a higher rank sum.

Results

Optimization of the pre-processing pipeline

To optimize the classification performance of the pre-processing steps used, we tested the effect of using raw read counts, upsampled read counts using SMOTE, to create balanced class distributions, log₂-transformed read counts, and both log₂-transformed and upsampled read counts from the Li et al. colorectal cancer dataset on five metrics used to measure the performance of the Subsemble. Using five repetitions of a 10-fold stratified cross-validation technique to train and test the Subsemble, we found that the median accuracy of 94.44% across all folds and repetitions was highest when training and testing the Subsemble using log-transformed and upsampled read counts (Figure 2). In comparison, the median accuracy when using log-transformed read counts was 93.44%, followed by a drop-off in accuracy when using upsampled read counts (88.89%) and raw read counts (88.57%). Likewise, the Subsemble performed best in precision (94.61%), recall (94.44%), F1 score (93.62%), and Matthew’s Correlation Coefficient (86.41%) when trained and tested using log-transformed and upsampled read counts compared to other read count pre-processing methods. Moreover, log-transformation and SMOTE upsampling of read counts generally resulted in decreased variance in classification performance across each of the five performance metrics compared to log-transformation alone or when using raw read counts. Rank normalized scores of Subsemble classification performance measured across five metrics and tested using four different pre-processing pipelines showed that the log₂-transformation and SMOTE upsampling of raw read counts consistently ranked the highest among other pre-processing pipelines (Supplementary Table 1) (Chen & Shoostari, 2022b).

Figure 2. Subsemble classification performance of colorectal cancer cell types from the Li et al. colorectal cancer dataset using a stratified 10-fold cross-validation scheme repeated for five unique replicates.

The Subsemble classifier shows higher classification performance when trained and tested on single cell expression data that has been pre-processed using log₂-transformation for data normalization and SMOTE upsampling to generate balanced class proportions compared to pre-processing with each step alone and to no pre-processing conducted (raw read counts).

Classification performance benchmark of Subsemble and other cell type annotation methods

First, we benchmarked the classification performance of the proposed Subsemble classifier compared to nine machine learning classifiers, deep learning classifiers, and an existing cell type annotation method, using the van Galen et al. acute myeloid leukemia dataset and a ten-fold stratified cross-validation scheme for classifier training and testing. We used the log_2- transformation for data normalization and SMOTE upsampling to generate balanced class distributions for the training datasets used to train each classifier in the cross-validation benchmarks using the van Galen et al. acute myeloid leukemia dataset.

The Subsemble classifier consistently performed with the highest performance across all metrics, including median accuracy (88.65%), median precision (88.9%), median recall (88.65%), median F1 score (88.64%), and median Matthew’s Correlation Coefficient (85.81%) (Figure 3A). The ACTINN deep learning cell type annotation method was the second highest ranked classifier based on median performance across all five metrics, followed by the XGBoost classifier and Support Vector classifier. The variance of each metric measured from five repetitions of the cross-validation scheme and ten folds in each scheme was comparable between the Subsemble and other classifiers, with the exception of the AdaBoost classifier which showed a marked increase in performance variance across different folds.

Figure 3. Benchmark of classification performance of the Subsemble classifier and nine other cell type classification methods tested on the van Galen et al. acute myeloid leukemia dataset using a 10-fold stratified cross-validation scheme.

A) Classification performance measured using five metrics across all folds of the 10-fold cross-validation scheme. B) Median accuracy of cell type-specific predictions across all folds in the cross-validation scheme. C) Median F1 score of cell type-specific predictions across all folds in the cross-validation scheme.

Rank normalized scores ranked the Subsemble as the top performing cell type classifier, followed by the ACTINN deep learning annotation method, XGBoost classifier, and Support Vector Classifier (Supplementary Table 2) (Chen & Shoostari, 2022b). To test whether different classifiers showed biases in class-specific classification performance, we calculated the accuracy and F1 score metrics associated with each of the six cell type labels included in the van Galen et al. acute myeloid leukemia dataset (Figure 3B and C, respectively). We observed that the Granulocyte Monocyte Progenitor (GMP) cell type was a minority class and consistently showed relatively decreased median accuracy and median F1 score across all classifiers. Moreover, we noted that the Subsemble generally performed comparably or out-performed other classifiers when tasked with classification of all cell types except for the GMP cell type.

Second, we tested if the number of folds used for the stratified cross-validation scheme including five folds, 10 folds, and 20 folds, affected the median accuracy and F1 score performance of the proposed Subsemble classifier and nine other classifiers and annotation methods (Figure 4A and B, respectively). We expected that as the number of folds increased and a larger proportion of the entire dataset was used for training the classifiers, there would be an increase in the classification performance when tested on the hold-out fold of the dataset. We observed that as the number of folds in the stratified cross validation scheme increased from five-fold to 20-fold, there was a marked increase in the median accuracy and median F1 score for the Subsemble (+3.67% accuracy, +3.97% F1 Score), K-nearest neighbors (+5.97% accuracy, +5.21% F1 score), Naïve Bayes (+2.40% accuracy, +2.54% F1 Score) classifiers and AdaBoost (+2.57% accuracy, +2.19% F1 score) classifiers.

Figure 4. Benchmark of classification performance of the Subsemble classifier and nine other cell type classification methods tested on the van Galen et al. acute myeloid leukemia dataset using a five-fold, 10-fold, and 20-fold stratified cross-validation scheme.

A) Median accuracy performance across all folds of the five-fold, 10-fold, and 20-fold cross-validation scheme. B) Median F1- core performance across all folds of the five-fold, 10-fold, and 20-fold cross-validation scheme.

Rank normalized scores of median accuracy, aggregated across accuracy benchmarks using a five-fold, 10-fold, and 20-fold stratified cross-validation scheme, showed that the Subsemble was the top performing cell type classifier when trained and tested using five, 10, or 20 folds, followed by ACTINN, XGBoost, and SVC (Supplementary Table 3) (Chen & Shoostari, 2022b). We noted that ACTINN was the top performing cell type classifier based on median F1 score when tested using a five-fold stratified cross-validation scheme, while the Subsemble was the top performing cell type classifier based on median F1 score when tested using a 10-fold and 20-fold stratified cross-validation scheme. This shows that the ACTINN neural network based classifier may perform well in scenarios where large training datasets are not available, but the Subsemble continues to outperform individual classifiers as the training dataset size increases. Rank normalized scores of median F1 score showed that the Subsemble was the top performing cell type classifier, followed by ACTINN, XGBoost, and Support Vector Classifier (Supplementary Table 4) (Chen & Shoostari, 2022b).

Third, we compared the classification performance of the Subsemble classifier to the nine other classifiers and annotation methods when trained on the van Galen et al. acute myeloid leukemia data associated with all but one patient and tested on data associated with the hold-out patient using a leave-one-out cross-validation scheme based on patient. The Subsemble classifier reported the highest median accuracy (87.00%), median precision (94.27%), median recall (85.72%), median F1 score (87.79%), and median Matthew’s Correlation Coefficient (59.97%) (Figure 5). We noted that Subsemble generally showed the least variance in each performance metric across different folds as well as the smallest range compared to the nine other classifiers and annotation methods. Rank normalized scores aggregated across all five performance metrics for each classifier showed that the Subsemble was the top-ranked cell type classifier, followed by Support Vector Classifier, XGBoost, and ACTINN (Supplementary Table 5) (Chen & Shoostari, 2022b).

Figure 5. Classification performance measured using five metrics across each fold of the patient-based leave-one-out cross-validation scheme, where classifiers were trained on single cell gene expression data associated with all patients but one and tested on single cell gene expression data associated with the hold-out patient.

Discussion

Single cell RNA sequencing is a powerful method to analyze the transcriptome of a tumor sample at single cell resolution, providing an overview of cellular heterogeneity based on cell type and cell state. To automate the time and labor-intensive task of manual cell type annotation, we employed the Subsemble ensemble classifier trained on known cell types to accurately classify cells based on single cell gene expression data as a novel application of the Subsemble. In our proof-of-concept study, we showed that the log-transformation and SMOTE-upsampling pre-processing pipeline coupled with the Subsemble classifier consistently performed as the top-ranked classifier across five classification performance metrics, two different validation datasets, and two different cross-validation schemes.

Data normalization using log-transformation (Lytal et al., 2020) and upsampling of RNA sequencing data using SMOTE (Yap et al., 2021) have previously been reported to individually improve performance in classification tasks. This study uniquely applied both pre-processing steps to transform gene expression data from single cell RNA sequencing and showed that using both pre-processing steps increased Subsemble classification performance compared to each standalone step or using raw read counts as the training dataset input. The improvement of each of the five classification performance metrics when using both pre-processing steps to transform the Li et al. colorectal cancer dataset suggests that log-transformation and SMOTE upsampling may be useful for scaling and generating balanced class proportions recommended for training machine learning classifiers (Johnson & Khoshgoftaar, 2019).

Previous studies have shown that ensemble methods were superior in overall classification performance and tend to perform at least comparably if not better than individual machine learning classifiers (Dietterich, 2000; Zhao et al., 2020). The higher overall classification performance and lower variance of each performance metric suggest that the Subsemble is more consistent in its predictions across the entire test dataset. Moreover, as we increased the number of folds in the stratified cross-validation scheme, we observed that the Subsemble increased in median accuracy and F1 score which suggests that Subsemble performance can be further improved with larger training datasets. Finally, the accurate cell type predictions made by the Subsemble when trained on gene expression data and cell type labels associated with all patients but one and tested on the hold-out patient suggest that the Subsemble can be applied to cell type prediction of unlabelled gene expression data when trained on data generated from the same experiment and technical conditions.

Across the different performance benchmark tests, we noted that the individual Support Vector machine learning classifier and XGBoost boosting classifier were generally top-performing cell type classifiers across each of the performance metrics. The Support Vector machine learning classifier is computationally efficient and generalizable to a variety of classification tasks, including high-dimensional single cell RNA sequencing datasets (Karamizadeh et al., 2014). The XGBoost classifier is an ensemble gradient boosting model that is faster in training time than other implementations of gradient boosting and uses an iterative approach to continuously refine classification predictions (Santhanam et al., 2017). By incorporating both the Support Vector and XGBoost classifier as base learners in the Subsemble, our proposed model takes advantage of both classifiers’ strengths to optimize the final cell type prediction.

The performance of supervised machine learning-based cell type classifiers depends on the number of cell type observations and diversity of cell types within the training dataset. The poor performance of the Subsemble in the classification of the minority GMP cell type in the van Galen et al. acute myeloid leukemia dataset, alike to the other classifiers, requires a larger number of cell type observations with balanced class proportions in the training dataset. Thus, future work will aim to develop a Subsemble classifier pre-trained on an integrated gene expression dataset of multiple different reference cell types, including rare cell types, and sequenced using different technologies (Korsunsky et al., 2019), different tissue subtypes (Travaglini et al., 2020), and different patient clinical conditions to improve classification sensitivity (Mereu et al., 2020). Moreover, further classification performance benchmark studies of SMOTE upsampling compared to random undersampling when applied to high-dimensional data such as scRNA-seq gene expression datasets should be conducted to validate the pre-processing steps used to transform the training datasets for cell type machine learning classifiers (Blagus & Lusa, 2013).

Ensemble methods require multiple individual machine learning classifiers to be trained independently which increases the computational resources and time required to train the ensemble model compared to training standalone classifiers or using statistical methods for cell type classification. We plan to improve the Subsemble classifier by training individual classifiers using multiple different training subsets in parallel and conducting a hyperparameter grid search to optimize both classification performance and training time. Nevertheless, our proof-of-concept study showed that the Subsemble classifier trained on two validation single cell gene expression datasets is an accurate classifier of cell type labels compared to individual machine learning classifiers, deep learning classifiers, and ACTINN. Further benchmark tests comparing the Subsemble classifier to other supervised and unsupervised cell type annotation methods will be conducted.

Conclusions

In conclusion, our proof-of-concept study is a novel application of the Subsemble ensemble classifier to supervised classification of scRNA-seq gene expression data. Data normalization and upsampling coupled with the Subsemble classifier showed overall improved classification performance compared to nine other cell type classification methods when tested using five-fold, 10-fold, and 20-fold stratified cross-validation schemes. The Subsemble classifier also showed superior performance when tested using a patient-based leave-one-out cross-validation scheme. The superior classification performance of the Subsemble classifier across two different scRNA-seq gene expression datasets, two cross-validation schemes, and five performance metrics motivates future development of ultra-fast and accurate ensemble cell type classifiers and larger-scale systematic benchmark tests compared to other cell type annotation methods.

Data availability

Source data

The Li et al. and van Galen et al. single cell RNA sequencing gene expression datasets and respective cell type labels used to train and test the Subsemble cell type supervised classifier can be accessed on Gene Expression Omnibus at accessions GSE81861 and GSE116256 respectively.

Underlying data

Figshare: Figure Data. https://doi.org/10.6084/m9.figshare.20484153.v1 (Chen & Shoostari, 2022a).

This project contains the following underlying data:

- Figure_2 (Folder containing all benchmark metrics calculated after testing the performance of the Subsemble Cell type for figure 2)
- Figure_3 (Folder containing all benchmark metrics calculated after testing the performance of the Subsemble Cell type for figure 3)
- Figure_4AB (Folder containing all benchmark metrics calculated after testing the performance of the Subsemble Cell type for figure 4a and 4b)
- Figure_4C (Folder containing all benchmark metrics calculated after testing the performance of the Subsemble Cell type for figure 4c)

Extended data

Figshare: Supplementary Table 1-5. https://doi.org/10.6084/m9.figshare.20484144.v1 (Chen & Shoostari, 2022b).

This project contains the following extended data:

- ST1. Pre-processing_Pipeline_Rank_Normalized_Scores.csv
- ST2. Ten-Fold_CV_Rank_Normalized_Scores.csv
- ST3. CV_Fold_Accuracy_Rank_Normalized_Scores.csv
- ST4. CV_Fold_F1_Score_Rank_Normalized_Scores.csv
- ST5. Patient_LOOCV_Rank_Normalized_Scores.csv

Data are available under the terms of the Creative Commons Attribution 4.0 International Public License (CC-By 4.0).

Software availability

Software available from SciCrunch: RRID: SCR_022784

Source code available from: https://github.com/shooshtarilab/Subsemble_Cell_Type_Classifier

Archived source code at the time of publication: https://doi.org/10.5281/zenodo.7702391 (Chen, 2023)

License: CC0

References

Akhter MP, Zheng J, Afzal F, et al.: Supervised ensemble learning methods towards automatically filtering Urdu fake news within social media. PeerJ. Comput. Sci. 2021; 7: e425. PubMed Abstract | Publisher Full Text | Free Full Text
Alquicira-Hernandez J, Sathe A, Ji HP, et al.: Scpred: Accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 2019; 20(1): 264. PubMed Abstract | Publisher Full Text | Free Full Text
Asada K, Takasawa K, Machino H, et al.: Single-cell analysis using machine learning techniques and its application to medical research. Biomedicine. 2021; 9(11): 1513. PubMed Abstract | Publisher Full Text | Free Full Text
Blagus R, Lusa L: SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013; 14(1): 106. PubMed Abstract | Publisher Full Text | Free Full Text
Buterez D, Bica I, Tariq I, et al.: CellVGAE: An unsupervised scRNA-seq analysis workflow with graph attention networks. Bioinformatics. 2022; 38(5): 1277–1286. PubMed Abstract | Publisher Full Text | Free Full Text
Chawla NV, Bowyer KW, Hall LO, et al.: SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002; 16(1): 321–357. Publisher Full Text
Chen D: Analysis of machine learning methods for Covid-19 detection using serum raman spectroscopy. Appl. Artif. Intell. 2021; 35(14): 1147–1168. Publisher Full Text
Chen D: Code for Subsemble Cell Type Classifier. Zenodo. [Code].2023. Publisher Full Text
Chen D, Shoostari P: Data used to generate figures. [DATA]. 2022a. Publisher Full Text
Chen D, Shoostari P: Supplementary tables 1-5. [DATA]. 2022b. Publisher Full Text
Chicco D, Warrens MJ, Jurman G: The matthews correlation coefficient (Mcc) is more informative than cohen’s kappa and brier score in binary classification assessment. IEEE Access. 2021; 9: 78368–78381. Publisher Full Text
Christensen E, Naidas A, Chen D, et al.: TMExplorer: A tumour microenvironment single-cell RNAseq database and search tool. PLoS One. 2022; 17: e0272302. PubMed Abstract | Publisher Full Text | Free Full Text
Dietterich TG: Ensemble methods in machine learning. Multiple Classifier Systems. 2000; pp. 1–15. Publisher Full Text
Dong X, Chowdhury S, Victor U, et al.: Semi-supervised deep learning for cell type identification from single-cell transcriptomic data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022; 1–1. PubMed Abstract | Publisher Full Text
Geddes TA, Kim T, Nan L, et al.: Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis. BMC Bioinformatics. 2019; 20(19): 660. PubMed Abstract | Publisher Full Text | Free Full Text
Hwang B, Lee JH, Bang D: Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 2018; 50(8): 1–14. PubMed Abstract | Publisher Full Text | Free Full Text
Ianevski A, Giri AK, Aittokallio T: Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat. Commun. 2022; 13(1): 1246. PubMed Abstract | Publisher Full Text | Free Full Text
Johnson JM, Khoshgoftaar TM: Survey on deep learning with class imbalance. J. Big Data. 2019; 6(1): 27. Publisher Full Text
Johnson TS, Wang T, Huang Z, et al.: LAmbDA: Label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection. Bioinformatics. 2019; 35(22): 4696–4706. PubMed Abstract | Publisher Full Text | Free Full Text
Karamizadeh S, Abdullah SM, Halimi M, et al.: Advantage and drawback of support vector machine functionality. 2014 International Conference on Computer, Communications, and Control Technology (I4CT). 2014; pp. 63–65. Publisher Full Text
Kimmel JC, Kelley DR: Semisupervised adversarial neural networks for single-cell classification. Genome Res. 2021; 31(10): 1781–1793. PubMed Abstract | Publisher Full Text | Free Full Text
Korotcov A, Tkachenko V, Russo DP, et al.: Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. Mol. Pharm. 2017; 14(12): 4462–4475. PubMed Abstract | Publisher Full Text | Free Full Text
Korsunsky I, Millard N, Fan J, et al.: Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods. 2019; 16(12): 1289–1296. PubMed Abstract | Publisher Full Text | Free Full Text
Krawczyk B: Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016; 5(4): 221–232. Publisher Full Text
Li D, Ding J, Bar-Joseph Z: Unsupervised cell functional annotation for single-cell RNA-seq. Genome Res. 2022; 32: 1765–1775. PubMed Abstract | Publisher Full Text | Free Full Text
Li H, Courtois ET, Sengupta D, et al.: Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 2017; 49(5): 708–718. PubMed Abstract | Publisher Full Text
Lu F, Lin Y, Yuan C, et al.: Entssr: A weighted ensemble learning method to impute single-cell RNA sequencing data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021; 18(6): 2781–2787. PubMed Abstract | Publisher Full Text
Lytal N, Ran D, An L: Normalization methods on single-cell RNA-seq data: An empirical survey. Front. Genet. 2020; 11. PubMed Abstract | Publisher Full Text | Free Full Text
Ma F, Pellegrini M: ACTINN: Automated identification of cell types in single cell RNA sequencing. Bioinformatics. 2019; 36: 533–538. PubMed Abstract | Publisher Full Text
Marouf M, Machart P, Bansal V, et al.: Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 2020; 11(1): 166. PubMed Abstract | Publisher Full Text | Free Full Text
Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure. 1975; 405(2): 442–451. PubMed Abstract | Publisher Full Text
Mereu E, Lafzi A, Moutinho C, et al.: Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 2020; 38(6): 747–755. PubMed Abstract | Publisher Full Text
Oller-Moreno S, Kloiber K, Machart P, et al.: Algorithmic advances in machine learning for single-cell expression analysis. Curr. Opin. Syst. Biol. 2021; 25: 27–33. Publisher Full Text
Santhanam R, Nishant U, Raman S, et al.: Experimenting XGBoost Algorithm for Prediction and Classification of Different Datasets. Int. J. Control. Theory Appl. 2017; 9.
Sapp S, van der Laan MJ , Canny J: Subsemble: An ensemble method for combining subset-specific algorithm fits. J. Appl. Stat. 2014; 41(6): 1247–1259. PubMed Abstract | Publisher Full Text | Free Full Text
Travaglini KJ, Nabhan AN, Penland L, et al.: A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature. 2020; 587(7835): 619–625. PubMed Abstract | Publisher Full Text | Free Full Text
van Galen P , Hovestadt V, Wadsworth MH, et al.: Single-cell RNA-seq reveals aml hierarchies relevant to disease progression and immunity. Cell. 2019; 176(6): 1265–1281.e24. PubMed Abstract | Publisher Full Text | Free Full Text
VanHorn S, Morris SA: Next-generation lineage tracing and fate mapping to interrogate development. Dev. Cell. 2021; 56(1): 7–21. PubMed Abstract | Publisher Full Text
Wang X, Zheng J: Velo-Predictor: An ensemble learning pipeline for RNA velocity prediction. BMC Bioinformatics. 2021; 22(10): 419. PubMed Abstract | Publisher Full Text | Free Full Text
Wei Z, Zhang S: CALLR: A semi-supervised cell-type annotation method for single-cell RNA sequencing data. Bioinformatics. 2021; 37(Supplement_1): i51–i58. PubMed Abstract | Publisher Full Text | Free Full Text
Wirka RC, Pjanic M, Quertermous T: Advances in transcriptomics. Circ. Res. 2018; 122(9): 1200–1220. PubMed Abstract | Publisher Full Text | Free Full Text
Xiong K-X, Zhou H-L, Lin C, et al.: Chord: An ensemble machine learning algorithm to identify doublets in single-cell RNA sequencing data. Commun. Biol. 2022; 5(1): 1–11. PubMed Abstract | Publisher Full Text | Free Full Text
Yap M, Johnston RL, Foley H, et al.: Verifying explainability of a deep learning tissue classifier trained on RNA-seq data. Sci. Rep. 2021; 11: 2641. PubMed Abstract | Publisher Full Text | Free Full Text
Zhao X, Wu S, Fang N, et al.: Evaluation of single-cell classifiers for single-cell RNA sequencing data sets. Brief. Bioinform. 2020; 21(5): 1581–1595. PubMed Abstract | Publisher Full Text | Free Full Text

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 14 Apr 2023

Author details Author details

¹ Department of Pathology and Laboratory Medicine, the University of Western Ontario, London, Ontario, ON N6A 5C1, Canada
² Department of Computer Science, the University of Western Ontario, London, Ontario, ON N6A 5C1, Canada
³ The Children’s Health Research Institute—Lawson Health Research Institute, London, Ontario, ON N6C 2R5, Canada
⁴ Ontario Institute for Cancer Research, Toronto, Ontario, ON M5G 0A3, Canada

David Chen
Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Parisa Shooshtari
Roles: Conceptualization, Funding Acquisition, Investigation, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Competing interests

No competing interests were disclosed.

Grant information

This project was supported by the Government of Canada through the Natural Sciences and Engineering Research Council (NSERC) (DGECR-2021-00298) (PS), and the Children's Health Research Institute (PS). PS is supported by an Early Investigator Award from the Ontario Institute for Cancer Research (OICR), and DC is supported by NSERC Undergraduate Student Research Award. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (1)

version 1

Published: 14 Apr 2023, 12:406

https://doi.org/10.12688/f1000research.125579.1

Copyright

© 2023 Chen D and Shooshtari P. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download

Export To

metrics

	Views	Downloads
F1000Research	-	-
PubMed Central Data from PMC are received and updated monthly.	-	-

Citations

0

SEE MORE DETAILS

CITE

how to cite this article

Chen D and Shooshtari P. Cell-type classification of cancer single-cell RNA-seq data using the Subsemble ensemble-based machine learning classifier [version 1; peer review: 3 approved with reservations]. F1000Research 2023, 12:406 (https://doi.org/10.12688/f1000research.125579.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

track

receive updates on this article

Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?

Key to Reviewer Statuses VIEW HIDE

ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.

Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions

Version 1

VERSION 1

PUBLISHED 14 Apr 2023

Views

3

Reviewer Report 24 Jan 2024

Zhixiang Ren, Peng Cheng Laboratory, Shenzhen, China

Approved with Reservations

https://doi.org/10.5256/f1000research.137903.r233658

I think this is an interesting work that presents a compelling approach utilizing a Subsemble-based ensemble learning method for cell annotation in scRNA seq data, complemented by SMOTE up-sampling to balance data categories. The manuscript establishes a reliable benchmark to ... Continue reading

I think this is an interesting work that presents a compelling approach utilizing a Subsemble-based ensemble learning method for cell annotation in scRNA seq data, complemented by SMOTE up-sampling to balance data categories. The manuscript establishes a reliable benchmark to assess the performance differences between the proposed ensemble learning approach and traditional machine learning classification models, convincingly demonstrating the feasibility of your method. However, there are still some aspects that can be improved:

The manuscript needs to provide more details in the Method section. For the data preprocessing, it would be beneficial to provide parameters and necessary formulas for each step to enhance understanding. The current statement is somewhat brief. The manuscript should also clarify whether the data has been processed with quality control, normalization before up-sampling. Moreover, batch effect is a significant concern in multi-sample analysis, and an evaluation should be included to demonstrate the impact of batch effects on the data.

For the model architecture, more detailed algorithmic information is required, including essential formulas and details on hyper-parameter tuning. I also suggest adding ablation experiments to validate that the base learners in the current setup offer optimal performance. Testing the performance impact of removing or replacing models in the base learner could further strengthen your findings.

Comparisons with state-of-the-art methods in cell annotation, such as SingleR and scBERT, are currently missing. It is recommended to include performance comparisons with these methods to highlight the novelty and practical value of your approach.
The figure resolution requires enhancement for better clarity and understanding.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: deep learning, bioinformatics, single cell RNA sequencing data analysis, drug discovery

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Views

4

Reviewer Report 24 Jan 2024

Yue Cao, School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia; Charles Perkins Centre, The University of Sydney (Ringgold ID: 4334), Sydney, New South Wales, Australia; Sydney Precision Data Science Centre, The University of Sydney (Ringgold ID: 4334), Sydney, New South Wales, Australia

Approved with Reservations

https://doi.org/10.5256/f1000research.137903.r233659

In this study, Chen et al. demonstrated a proof-of-concept using the Subsemble model to perform cell type classification on scRNA-seq data. It is shown to outperform nine other common classifiers such as random forest. The authors further showed that upsampling ... Continue reading

In this study, Chen et al. demonstrated a proof-of-concept using the Subsemble model to perform cell type classification on scRNA-seq data. It is shown to outperform nine other common classifiers such as random forest. The authors further showed that upsampling using SMOTE achieved good performance. While the concept is interesting, the study is currently limited to two relatively small and simple datasets and does not have sufficient evidence to robustly support the performance of Subsemble and SMOTE. It is suggested that the authors incorporate a greater number of datasets, more complicated datasets, as well as more imbalance cases to illustrate the power of Subsemble and SMOTE.

SMOTE is a popular technique that deals with data imbalance by upsampling minority classes. In this study, the motivation behind using SMOTE is not clearly articulated. Did the two datasets have clear data imbalance or rare cell types? If so, this needs to be stated in the Methods and Results section.
The authors stated that “The Subsemble is a subset ensemble prediction method that has been proposed to be scalable to large datasets”. However, the sizes of the two datasets, one with 359 cells and one with 22, 284 cells, are not very large by current standards. Nowadays it is common to see scRNA-seq datasets with millions of cells. Data with a few hundred cells are rarely generated or analysed these days. The authors should consider testing on larger datasets, for example, the Tabula Muris dataset with 1 million cells.
Similarly, the number of cell types in the two datasets is relatively small, with only 5 and 6 cell types. The author should consider testing on more complicated datasets with a greater number of cell types.
The advantage of using SMOTE is not clear. In Figure 2, there is no visible difference between the log-transformed count and the log-transformed up sampled count across all five evaluation metrics. One potential reason could be the characteristic of datasets, that they are not imbalanced enough to benefit from up sampling. The authors should consider using alternative cases to better illustrate the advantage of SMOTE. For example, subsampling to create increasingly imbalanced datasets (eg, two cell types at the ratio of 1:10, 2:10, 3:10 etc.) and investigate at what data imbalance ratio does SMOTE significantly increase model performance.
The authors mentioned the datasets were “reduced to 100 principal components using principal component analysis, log-transformed, and upsampled using SMOTE”. Could the authors explain the rationale for using PCs over using the count matrix? What is the rationale for selecting 100 PCs? Also, why is log transformation written after PCA? The log normalisation should be performed before PCA.
Apart from model performance, the cost of running a model is also a key consideration in practice. Would an ensemble model significantly increase the computational resource requirement? The authors should report the running time and memory usage of Subsemble with other models.

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: single-cell, spatial omics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Views

10

Reviewer Report 25 Apr 2023

Saskia Freytag, Personalised Oncology, Walter and Eliza Hall Institute (WEHI), Department of Medical Biology, University of Melbourne, Parkville, Vic, Australia

Approved with Reservations

https://doi.org/10.5256/f1000research.137903.r169736

Chen and Shooshtari present a benchmarking study of automatic cell classifiers. The benchmarking study is technically sound but could be improved in a number of places to increase the study's value to end-users:

I think for

Chen and Shooshtari present a benchmarking study of automatic cell classifiers. The benchmarking study is technically sound but could be improved in a number of places to increase the study's value to end-users:

I think for this type of study it would be important to compare the computational requirements of each of the methods. This would allow users to get a feeling for how much extra compute is required to get potentially modest gains in performance.
One common way of using automatic classifiers is to annotate cells with the most common label assigned by the classifier for other cells belonging to the same cluster. It would be interesting to see how this approach changes the performance.
It was unclear on how historical performance was used to weight the different methods used in the Subsemble classifier. Where did these come from?
I really liked the last investigation of changing the reference dataset, as this is more closely related to the situation we face in reality. In fact, I would recommend an expansion of this type of investigation and also use reference datasets that are unrelated to dataset of interest (i.e. different study investigating the same disease). This would be much closer to the situation that practitioners face in real life. The drop observed in performance in the test with the hold-out patient points to the effect of the reference dataset on performance and should therefore be discussed and further investigated.
In general, all datasets used in this benchmarking study are on the small side and do no longer reflect the standard in the field.
It should be clarified that the original labels in the study where taken as the ground truth and the limitations of this assumption should be discussed.
It was unclear how exactly the data was processed prior to PCA (i.e. log-transformation of the counts or PCA). I would therefore recommend adding additional clarification.

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Single cell and spatial transcriptomics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 14 Apr 2023

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2	3
Version 1 14 Apr 23	read	read	read

Saskia Freytag, University of Melbourne, Parkville, Australia
Yue Cao, The University of Sydney, Sydney, Australia; The University of Sydney (Ringgold ID: 4334), Sydney, Australia; The University of Sydney (Ringgold ID: 4334), Sydney, Australia
Zhixiang Ren, Peng Cheng Laboratory, Shenzhen, China

Comments on this article

All Comments(0)

Add a comment

Sign up for content alerts

Browse by related subjects

Back to all reports

Reviewer Report

3 Views

24 Jan 2024 | for Version 1

Zhixiang Ren, Peng Cheng Laboratory, Shenzhen, China

3 Views Cite this report Responses(0)

Approved With Reservations

I think this is an interesting work that presents a compelling approach utilizing a Subsemble-based ensemble learning method for cell annotation in scRNA seq data, complemented by SMOTE up-sampling to balance data categories. The manuscript establishes a reliable benchmark to assess the performance differences between the proposed ensemble learning approach and traditional machine learning classification models, convincingly demonstrating the feasibility of your method. However, there are still some aspects that can be improved:

The manuscript needs to provide more details in the Method section. For the data preprocessing, it would be beneficial to provide parameters and necessary formulas for each step to enhance understanding. The current statement is somewhat brief. The manuscript should also clarify whether the data has been processed with quality control, normalization before up-sampling. Moreover, batch effect is a significant concern in multi-sample analysis, and an evaluation should be included to demonstrate the impact of batch effects on the data.

For the model architecture, more detailed algorithmic information is required, including essential formulas and details on hyper-parameter tuning. I also suggest adding ablation experiments to validate that the base learners in the current setup offer optimal performance. Testing the performance impact of removing or replacing models in the base learner could further strengthen your findings.

Comparisons with state-of-the-art methods in cell annotation, such as SingleR and scBERT, are currently missing. It is recommended to include performance comparisons with these methods to highlight the novelty and practical value of your approach.
The figure resolution requires enhancement for better clarity and understanding.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

deep learning, bioinformatics, single cell RNA sequencing data analysis, drug discovery

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

4 Views

24 Jan 2024 | for Version 1

Yue Cao, School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia; Charles Perkins Centre, The University of Sydney (Ringgold ID: 4334), Sydney, New South Wales, Australia; Sydney Precision Data Science Centre, The University of Sydney (Ringgold ID: 4334), Sydney, New South Wales, Australia

4 Views Cite this report Responses(0)

Approved With Reservations

In this study, Chen et al. demonstrated a proof-of-concept using the Subsemble model to perform cell type classification on scRNA-seq data. It is shown to outperform nine other common classifiers such as random forest. The authors further showed that upsampling using SMOTE achieved good performance. While the concept is interesting, the study is currently limited to two relatively small and simple datasets and does not have sufficient evidence to robustly support the performance of Subsemble and SMOTE. It is suggested that the authors incorporate a greater number of datasets, more complicated datasets, as well as more imbalance cases to illustrate the power of Subsemble and SMOTE.

SMOTE is a popular technique that deals with data imbalance by upsampling minority classes. In this study, the motivation behind using SMOTE is not clearly articulated. Did the two datasets have clear data imbalance or rare cell types? If so, this needs to be stated in the Methods and Results section.
The authors stated that “The Subsemble is a subset ensemble prediction method that has been proposed to be scalable to large datasets”. However, the sizes of the two datasets, one with 359 cells and one with 22, 284 cells, are not very large by current standards. Nowadays it is common to see scRNA-seq datasets with millions of cells. Data with a few hundred cells are rarely generated or analysed these days. The authors should consider testing on larger datasets, for example, the Tabula Muris dataset with 1 million cells.
Similarly, the number of cell types in the two datasets is relatively small, with only 5 and 6 cell types. The author should consider testing on more complicated datasets with a greater number of cell types.
The advantage of using SMOTE is not clear. In Figure 2, there is no visible difference between the log-transformed count and the log-transformed up sampled count across all five evaluation metrics. One potential reason could be the characteristic of datasets, that they are not imbalanced enough to benefit from up sampling. The authors should consider using alternative cases to better illustrate the advantage of SMOTE. For example, subsampling to create increasingly imbalanced datasets (eg, two cell types at the ratio of 1:10, 2:10, 3:10 etc.) and investigate at what data imbalance ratio does SMOTE significantly increase model performance.
The authors mentioned the datasets were “reduced to 100 principal components using principal component analysis, log-transformed, and upsampled using SMOTE”. Could the authors explain the rationale for using PCs over using the count matrix? What is the rationale for selecting 100 PCs? Also, why is log transformation written after PCA? The log normalisation should be performed before PCA.
Apart from model performance, the cost of running a model is also a key consideration in practice. Would an ensemble model significantly increase the computational resource requirement? The authors should report the running time and memory usage of Subsemble with other models.

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

single-cell, spatial omics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

10 Views

25 Apr 2023 | for Version 1

Saskia Freytag, Personalised Oncology, Walter and Eliza Hall Institute (WEHI), Department of Medical Biology, University of Melbourne, Parkville, Vic, Australia

10 Views Cite this report Responses(0)

Approved With Reservations

Chen and Shooshtari present a benchmarking study of automatic cell classifiers. The benchmarking study is technically sound but could be improved in a number of places to increase the study's value to end-users:

I think for this type of study it would be important to compare the computational requirements of each of the methods. This would allow users to get a feeling for how much extra compute is required to get potentially modest gains in performance.
One common way of using automatic classifiers is to annotate cells with the most common label assigned by the classifier for other cells belonging to the same cluster. It would be interesting to see how this approach changes the performance.
It was unclear on how historical performance was used to weight the different methods used in the Subsemble classifier. Where did these come from?
I really liked the last investigation of changing the reference dataset, as this is more closely related to the situation we face in reality. In fact, I would recommend an expansion of this type of investigation and also use reference datasets that are unrelated to dataset of interest (i.e. different study investigating the same disease). This would be much closer to the situation that practitioners face in real life. The drop observed in performance in the test with the hold-out patient points to the effect of the reference dataset on performance and should therefore be discussed and further investigated.
In general, all datasets used in this benchmarking study are on the small side and do no longer reflect the standard in the field.
It should be clarified that the original labels in the study where taken as the ground truth and the limitations of this assumption should be discussed.
It was unclear how exactly the data was processed prior to PCA (i.e. log-transformation of the counts or PCA). I would therefore recommend adding additional clarification.

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Single cell and spatial transcriptomics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

[1] Akhter MP, Zheng J, Afzal F, et al.: Supervised ensemble learning methods towards automatically filtering Urdu fake news within social media. PeerJ. Comput. Sci. 2021; 7: e425. PubMed Abstract | Publisher Full Text | Free Full Text

[2] Alquicira-Hernandez J, Sathe A, Ji HP, et al.: Scpred: Accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 2019; 20(1): 264. PubMed Abstract | Publisher Full Text | Free Full Text

[3] Asada K, Takasawa K, Machino H, et al.: Single-cell analysis using machine learning techniques and its application to medical research. Biomedicine. 2021; 9(11): 1513. PubMed Abstract | Publisher Full Text | Free Full Text

[4] Blagus R, Lusa L: SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013; 14(1): 106. PubMed Abstract | Publisher Full Text | Free Full Text

[5] Buterez D, Bica I, Tariq I, et al.: CellVGAE: An unsupervised scRNA-seq analysis workflow with graph attention networks. Bioinformatics. 2022; 38(5): 1277–1286. PubMed Abstract | Publisher Full Text | Free Full Text

[6] Chawla NV, Bowyer KW, Hall LO, et al.: SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002; 16(1): 321–357. Publisher Full Text

[7] Chen D: Analysis of machine learning methods for Covid-19 detection using serum raman spectroscopy. Appl. Artif. Intell. 2021; 35(14): 1147–1168. Publisher Full Text

[8] Chen D: Code for Subsemble Cell Type Classifier. Zenodo. [Code].2023. Publisher Full Text

[9] Chen D, Shoostari P: Data used to generate figures. [DATA]. 2022a. Publisher Full Text

[10] Chen D, Shoostari P: Supplementary tables 1-5. [DATA]. 2022b. Publisher Full Text

[11] Chicco D, Warrens MJ, Jurman G: The matthews correlation coefficient (Mcc) is more informative than cohen’s kappa and brier score in binary classification assessment. IEEE Access. 2021; 9: 78368–78381. Publisher Full Text

[12] Christensen E, Naidas A, Chen D, et al.: TMExplorer: A tumour microenvironment single-cell RNAseq database and search tool. PLoS One. 2022; 17: e0272302. PubMed Abstract | Publisher Full Text | Free Full Text

[13] Dietterich TG: Ensemble methods in machine learning. Multiple Classifier Systems. 2000; pp. 1–15. Publisher Full Text

[14] Dong X, Chowdhury S, Victor U, et al.: Semi-supervised deep learning for cell type identification from single-cell transcriptomic data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022; 1–1. PubMed Abstract | Publisher Full Text

[15] Geddes TA, Kim T, Nan L, et al.: Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis. BMC Bioinformatics. 2019; 20(19): 660. PubMed Abstract | Publisher Full Text | Free Full Text

[16] Hwang B, Lee JH, Bang D: Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 2018; 50(8): 1–14. PubMed Abstract | Publisher Full Text | Free Full Text

[17] Ianevski A, Giri AK, Aittokallio T: Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat. Commun. 2022; 13(1): 1246. PubMed Abstract | Publisher Full Text | Free Full Text

[18] Johnson JM, Khoshgoftaar TM: Survey on deep learning with class imbalance. J. Big Data. 2019; 6(1): 27. Publisher Full Text

[19] Johnson TS, Wang T, Huang Z, et al.: LAmbDA: Label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection. Bioinformatics. 2019; 35(22): 4696–4706. PubMed Abstract | Publisher Full Text | Free Full Text

[20] Karamizadeh S, Abdullah SM, Halimi M, et al.: Advantage and drawback of support vector machine functionality. 2014 International Conference on Computer, Communications, and Control Technology (I4CT). 2014; pp. 63–65. Publisher Full Text

[21] Kimmel JC, Kelley DR: Semisupervised adversarial neural networks for single-cell classification. Genome Res. 2021; 31(10): 1781–1793. PubMed Abstract | Publisher Full Text | Free Full Text

[22] Korotcov A, Tkachenko V, Russo DP, et al.: Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. Mol. Pharm. 2017; 14(12): 4462–4475. PubMed Abstract | Publisher Full Text | Free Full Text

[23] Korsunsky I, Millard N, Fan J, et al.: Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods. 2019; 16(12): 1289–1296. PubMed Abstract | Publisher Full Text | Free Full Text

[24] Krawczyk B: Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016; 5(4): 221–232. Publisher Full Text

[25] Li D, Ding J, Bar-Joseph Z: Unsupervised cell functional annotation for single-cell RNA-seq. Genome Res. 2022; 32: 1765–1775. PubMed Abstract | Publisher Full Text | Free Full Text

[26] Li H, Courtois ET, Sengupta D, et al.: Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 2017; 49(5): 708–718. PubMed Abstract | Publisher Full Text

[27] Lu F, Lin Y, Yuan C, et al.: Entssr: A weighted ensemble learning method to impute single-cell RNA sequencing data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021; 18(6): 2781–2787. PubMed Abstract | Publisher Full Text

[28] Lytal N, Ran D, An L: Normalization methods on single-cell RNA-seq data: An empirical survey. Front. Genet. 2020; 11. PubMed Abstract | Publisher Full Text | Free Full Text

[29] Ma F, Pellegrini M: ACTINN: Automated identification of cell types in single cell RNA sequencing. Bioinformatics. 2019; 36: 533–538. PubMed Abstract | Publisher Full Text

[30] Marouf M, Machart P, Bansal V, et al.: Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 2020; 11(1): 166. PubMed Abstract | Publisher Full Text | Free Full Text

[31] Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure. 1975; 405(2): 442–451. PubMed Abstract | Publisher Full Text

[32] Mereu E, Lafzi A, Moutinho C, et al.: Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 2020; 38(6): 747–755. PubMed Abstract | Publisher Full Text

[33] Oller-Moreno S, Kloiber K, Machart P, et al.: Algorithmic advances in machine learning for single-cell expression analysis. Curr. Opin. Syst. Biol. 2021; 25: 27–33. Publisher Full Text

[34] Santhanam R, Nishant U, Raman S, et al.: Experimenting XGBoost Algorithm for Prediction and Classification of Different Datasets. Int. J. Control. Theory Appl. 2017; 9.

[35] Sapp S, van der Laan MJ , Canny J: Subsemble: An ensemble method for combining subset-specific algorithm fits. J. Appl. Stat. 2014; 41(6): 1247–1259. PubMed Abstract | Publisher Full Text | Free Full Text

[36] Travaglini KJ, Nabhan AN, Penland L, et al.: A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature. 2020; 587(7835): 619–625. PubMed Abstract | Publisher Full Text | Free Full Text

[37] van Galen P , Hovestadt V, Wadsworth MH, et al.: Single-cell RNA-seq reveals aml hierarchies relevant to disease progression and immunity. Cell. 2019; 176(6): 1265–1281.e24. PubMed Abstract | Publisher Full Text | Free Full Text

[38] VanHorn S, Morris SA: Next-generation lineage tracing and fate mapping to interrogate development. Dev. Cell. 2021; 56(1): 7–21. PubMed Abstract | Publisher Full Text

[39] Wang X, Zheng J: Velo-Predictor: An ensemble learning pipeline for RNA velocity prediction. BMC Bioinformatics. 2021; 22(10): 419. PubMed Abstract | Publisher Full Text | Free Full Text

[40] Wei Z, Zhang S: CALLR: A semi-supervised cell-type annotation method for single-cell RNA sequencing data. Bioinformatics. 2021; 37(Supplement_1): i51–i58. PubMed Abstract | Publisher Full Text | Free Full Text

[41] Wirka RC, Pjanic M, Quertermous T: Advances in transcriptomics. Circ. Res. 2018; 122(9): 1200–1220. PubMed Abstract | Publisher Full Text | Free Full Text

[42] Xiong K-X, Zhou H-L, Lin C, et al.: Chord: An ensemble machine learning algorithm to identify doublets in single-cell RNA sequencing data. Commun. Biol. 2022; 5(1): 1–11. PubMed Abstract | Publisher Full Text | Free Full Text

[43] Yap M, Johnston RL, Foley H, et al.: Verifying explainability of a deep learning tissue classifier trained on RNA-seq data. Sci. Rep. 2021; 11: 2641. PubMed Abstract | Publisher Full Text | Free Full Text

[44] Zhao X, Wu S, Fang N, et al.: Evaluation of single-cell classifiers for single-cell RNA sequencing data sets. Brief. Bioinform. 2020; 21(5): 1581–1595. PubMed Abstract | Publisher Full Text | Free Full Text

Cell-type classification of cancer single-cell RNA-seq data using the Subsemble ensemble-based machine learning classifier

Abstract

Keywords

Introduction

Methods

Datasets

Data preprocessing

Model architecture

Figure 1. Schematic of the Subsemble supervised machine learning classifier architecture.

Performance evaluation

Rank normalized scores

Results

Optimization of the pre-processing pipeline

Figure 2. Subsemble classification performance of colorectal cancer cell types from the Li et al. colorectal cancer dataset using a stratified 10-fold cross-validation scheme repeated for five unique replicates.

Classification performance benchmark of Subsemble and other cell type annotation methods

Figure 3. Benchmark of classification performance of the Subsemble classifier and nine other cell type classification methods tested on the van Galen et al. acute myeloid leukemia dataset using a 10-fold stratified cross-validation scheme.

Figure 4. Benchmark of classification performance of the Subsemble classifier and nine other cell type classification methods tested on the van Galen et al. acute myeloid leukemia dataset using a five-fold, 10-fold, and 20-fold stratified cross-validation scheme.

Discussion

Conclusions

Data availability

Source data

Underlying data

Extended data

Software availability

References

Comments on this article Comments (0)

Open Peer Review

Comments on this article Comments (0)

Open Peer Review

Reviewer Status

Reviewer Reports

Comments on this article

Browse by related subjects

Competing Interests Policy

Stay Updated