Keywords
Spatial organization, single cell RNA-seq, Drosophila embryo, clustering, DREAM challenge
Spatial organization, single cell RNA-seq, Drosophila embryo, clustering, DREAM challenge
The title is modified to avoid ambiguity.
The abstract is modified to improve the clarity of method summary.
More citations and description of existing methods are added.
All figures and legends are modified to describe the results more clearly.
More details are added on binarization of scRNA-seq data and prediction of spatial gene expression.
More discussions are added on the interpretation of the results as well as the advantage and limitation of the method.
See the authors' detailed response to the review by Xianwen Ren
See the authors' detailed response to the review by Mark S. Cembrowski and Larissa Kraus
The development of single cell RNA sequencing (scRNA-seq) has provided a powerful solution for building a global transcription landscape of all cell types in tissues, finding new cell types, cell lineage tracing, spatial reconstruction, and combining with other omics1–9. Single cells are originally made from dissociated tissues without spatial information, and spatial gene expression pattern is unknown. In situ hybridization (ISH) and its variants can detect the spatial information of mRNA transcripts and produce gene expression reference atlas. Using enough marker genes, users can reconstruct the spatial position of single cell RNA-seq data by combing the ISH reference atlas10–15. Some works have also combined sequential fluorescence in situ hybridization (seqFISH) and multiplexed error robust fluorescence in situ hybridization (MERFISH) with scRNA-seq data to map cell types to the reference atlas16–18.
Recent methods have successfully mapped cells from scRNA-seq data to the spatial positions using dozens of landmark genes10–15. Nikos et al. developed a DistMap method for mapping the ~1,300 Drosophila embryo cells into ~3,000 bins in the spatial position using 84 marker genes by computing the Matthews Correlation Coefficient (MCC) score between in situ gene expression and binarized scRNA-seq data10. Satija et al. mapped 851 cells of zebrafish embryo into 64 bins of the spatial embryo with 47 genes. They used a bimodal mixture model to binarize the scRNA-seq data and predict a cell originated from the bin by posterior probability11. Kaia et al. computed the correspondence score and mapped 139 cells into a spatial Platynereis dumerilii brain using a set of 98 genes by whole-mount in situ hybridization (WMISH). They binarized the scRNA-seq data and defined a special score to characterize the relation between every cell-voxel combination. They suggested that 50–100 in situ genes were needed to map cells to spatial position with high confidence using simulated spatial reference atlases12. Andreas et al. also reconstructed spatial single enterocytes along the villus axis in 1-D space using 102 bottom and top landmark genes from single-molecule fluorescence in situ hybridization (smFISH19) for 1,383 single cells. They used the summed expression of landmark genes to infer the locations along the 1-D coordinates and tried 50 and 20 landmark genes for prediction. The reconstruction errors were relatively low for 50 genes, but the prediction errors increased remarkably using 20 landmark genes13. Nitzan et al. also proposed a new method named novoSpaRc for spatially mapping the scRNA-seq cells into an existing reference atlas. Based on the optimal transport method, it reported the gene expression and spatial position with at most 2 landmark genes for virtual scRNA-seq data and virtual Drosophila embryo14. The performance of novoSpaRc is still not good for real Drosophila embryo when the marker gene number is smaller than 3014,15. In the methods mentioned above, the dimension and resolution of the spatial region, as well as the number of marker genes, are key factors that affect the recovery of the spatial position. Although these methods demonstrated good performance, they do not focus on markers genes selection in which biological information were used to determine the spatial gene expression patterns.
To explore new algorithms that include biological information contained in the genes to determine embryonic patterning, the DREAM Single Cell Transcriptomics Challenge asked participating teams to predict the positions in the embryo of 1,297 cells using the expression patterns of 60 (sub-challenge 1), 40 (sub-challenge 2), and 20 (sub-challenge 3) driver genes from the in situ hybridization data20. The challenge endeavors to use less marker genes to infer the spatial locations of cells.
In this article, we assume that cells with similar gene expression will have close spatial distances. By clustering the possible spatial bins for a cell, we select the most possible cluster as the predicted spatial position. The details of our method are the following: First, we use hierarchical clustering for gene selection. Then, we map the cells to spatial positions by computing the mapping similarity between the binarized scRNA-seq and in situ hybridization data and cluster the bins. Finally, we select the optimal thresholds by computing the maximum MCC score for spatial gene expression prediction. We detail our solution for the DREAM challenge, and validate the results using the “silver standard” derived from DistMap. The paper is organized as follows: in Methods, we briefly describe the solutions for all the three sub-challenges; in Results, we present the results of the three sub-challenges in the DREAM Single Cell Transcriptomics Challenge; finally, we discuss and summarize our work.
The scRNA-seq dataset is from ~1,000 handpicked stage 6 fly embryos using Drop-seq21. It contains both raw and normalized UMI counts with 1,297 cells and 8,924 genes per cell. In situ hybridization expression patterns of 84 driver genes are from the Berkeley Drosophila Transcription Network Project (BDNTP). The BDNTP reference atlas are binarized. The bin number of one half of the symmetric Drosophila melanogaster embryo at stage 6-pre-gastrulation (which is used in this study) is 3,039. The spatial coordinates of these bins are also specified. The dataset files can be downloaded from the DREAM Single Cell Transcriptomics Challenge after registration with Synapse free of charge (https://www.synapse.org/#!Synapse:syn16782360).
We directly use the normalized scRNA-seq data, the in situ matrix and the geometry of the embryo. The gene names “ E.spl.m5.HLH” and “ Blimp.1” are replaced by “ E(spl)m5-HLH” and “ Blimp-1”.
The DistMap “silver standard” is the best available cell position reference, which is determined by calculating the maximum similarity for cells and spatial bins using all the 84 in situ driver genes in BDNTP. It is shown that the 84 driver gens are sufficient to uniquely and individually label most of the 1,297 cells10.
We use a hierarchical clustering method to select 60, 40, and 20 driver genes from the 84 genes based on the normalized scRNA-seq data (Figure 1). Specifically, based on the belief that the scRNA-seq gene expression pattern is driven by the driver genes’ expression pattern, we propose to select the essential driver genes based on the information provided by scRNA-seq data. If two genes have high correlation in the scRNA-seq data, we choose only one of them without losing too much of the information. To find the correlated genes, we perform hierarchical clustering on the normalized scRNA-seq data to separate all 84 genes into 60 clusters (with the Euclidean distance and the Mcquitty linkage). The Mcquitty linkage gives more weights for objects in small clusters than those in large clusters in calculating the distance between two clusters. Thus, it is suitable for situations with many small clusters. Since the numbers of clusters are fairly large in the sub-challenge 1 and 2, we opt to use the Mcquitty linkage for distance calculation. In sub-challenge 3, since the total number of clusters is shrunk to 20, which is smaller than sub-challenge 1 and 2, we choose to use the ward linkage in the hierarchical clustering part to obtain larger-sized clusters from the data. After this step, the gene selection process remains the same as sub-challenge 1 and 2.
(1) Use hierarchical clustering (HC) to select 60/40/20 driver genes from the given 84 driver genes based on scRNA-seq data with the Euclidean distance and the Mcquitty (for 60 and 40 genes) and ward linkage (for 20 genes). (2) Apply the binarization method for the selected genes by minimizing root-mean-square error between of Pearson correlation matrix between the binarized scRNA-seq and the in situ matrix, and use the same criteria to binarize the remaining 24/44/64 predicted genes. (3) Calculate mapping similarity matrix from single cells to spatial bins based on the proportion of matched gene expression between each cell and each bin. (4) For each cell, select top bins based on mapping similarity values by quantile thresholding. Then do HC on the spatial distance of these top bins and select the number of clusters based on the Silhouette score. The top 10 predicted bins are the ones surrounding the largest cluster center. (5) For each gene, we compute the MCC score between predicted gene expression and binarize gene expression given different thresholds. The optimal threshold is corresponding to the maximum MCC. Finally, we apply the optimal threshold for the computeVISH() method in DistMap to predict gene expression. The black points in the scatter plot are the predicted MCC. The red points are Pearson correlation between predicted in situ gene expression with in situ gene expression.
After getting the clusters, we pick the most representative gene of each cluster by calculating the distance between each member gene and the cluster center based on the Euclidean distance and selecting the closest one.
We perform binarization on the normalized scRNA-seq data for the selected genes based on the “binarizeSingleCellData()” function in DistMap (https://github.com/rajewsky-lab/distmap). The details of binarization is as follows: for each quantiles from 0.15 to 0.5 with 0.01 as interval, we perform binarization on the scRNA-seq data for 60, 40, and 20 driver genes in sub-challenge 1, 2, 3, respectively. If the gene expression value is larger than the quantile gene expression value, it will be set as 1, otherwise it is 0. Then we compute the difference between the correlation matrix of the binarized scRNA-seq data and that of the in situ matrix based on the root-mean square error. Finally, we select the quantile threshold which has the smallest root mean square error.
Next, to predict the spatial gene expression for all the 84 in situ genes, we used a similar method to binarize the remaining 24, 44, and 64 in situ genes in sub-challenges with the binarized 60, 40, and 20 binarized gene expression. First, we compute the Pearson correlation matrix C1 between the selected genes (60, 40, and 20) and the remaining genes (24, 44, and 64) using the normalized scRNA-seq data. Second, we binarize the remaining genes based on a series of quantiles from 0.15 to 0.5 with 0.01 as interval. Furthermore, we calculate the Pearson correlation matrix C2 between the binarized selected genes and the remaining genes. Finally, we select the best quantile for binarizing the remaining genes by minimizing the root mean square error between the Pearson correlation matrices C1 and C2.
Given the binarized scRNA-seq data and the in situ matrix, we calculate the similarity matrix between the cells and bins based on the selected driver genes. Here, we denote the number of selected driver genes is ng. The similarity between a cell ci (i ∈ [1,1,297]) and a bin bj (j ∈ [1,3,039]) is calculated as follows:
where is the number of the same gene expression values (0 or 1) in the two binarized vectors corresponding to cell ci and bin bj, and ng is the in situ matrix for cell ci and bin bj.
Select top bins. The likelihood of a cell mapped to a bin is determined by the gene expression in the bin and the cell. The bins with the higher similarity are possibly the potential cell position. To generate stable results, we select enough bins (see below) based on the similarity values. Then we use clustering to determine a more refined cell position.
To select the potential bins for predicting cell position, we check the number of bins with high similarity values (> 0.8) in the similarity matrix for each cell from sub-challenge 1, 2, and 3 (Figure 2). The average number of bins with mapping similarity > 0.8 increases as the number of driver genes decreases. In sub-challenge 3, there are more bins with high similarity values produced. The small similarity difference among these bins makes it harder to determine real spatial bins where the cell originated. Thus, as the number of driver genes decreases, the accuracy of bins with high similarity values also decreases. Here, we use the third quartile of the similarity values to select the top bins in sub-challenge 1 with 60 driver genes. We use the first quartile of similarity values in sub-challenge 2 with 40 driver genes. And we use all bins in sub-challenge 3 with 20 driver genes. If the number of selected bins is 0, the bin which has the maximum similarity will be the predicted position. If the number of the selected bins is larger than 100, then only the top 100 bins will be kept based on the similarity values.
For bins with high similarity values, we need to perform clustering to select the cluster which has the maximum sum of similarity as the predicted cell position. Here, we use hierarchical clustering on the selected bins based on their three-dimensional distance. The cluster number is determined by the silhouette score, which measures the average distance of a point to other points in its cluster compared to the smallest average distance to other clusters. The silhouette score ranges from -1 to +1. The higher the silhouette score, the closer the point is closer to its own cluster and the farther it is away from other clusters. We use the average silhouette score across all selected bins to select the optimal clustering number. Here, we use the NbClust package22 to perform hierarchical clustering with the “centroid” method. Based on the silhouette score, we obtain the number of clusters.
We compute the sum of similarities for each cluster and select the cluster with the maximum value. Finally, we assign 10 nearest bins to the selected cluster center as the top 10 most possible cell positions.
To predict the final spatial in situ gene expression, we use the method computeVISH() in DistMap. It multiplies the normalized gene expression matrix (gene X cell) with the mapping similarity matrix (cell X bin) and sets a threshold to binarize final spatial gene expression. For sub-challenge 1, 2, 3, we select an optimal threshold. First, we generate a series of thresholds between 0 and 1. For each threshold, we compute the predicted spatial gene expression using computeVISH(), then compute the MCC score between the predicted gene expression and the binarized expression of the driver genes. Finally, we select the best threshold based on the largest MCC score to compute the final spatial gene expression.
To estimate the performance of our method to predict spatial gene expression, we compute the Pearson correlation between the in situ gene expression and the predicted gene expression.
To evaluate the performance of our method, we use the three performance scores in the DREAM challenge (https://www.synapse.org/#!Synapse:syn17091286, https://github.com/dream-sctc/Scoring/blob/master/dream_scoring_clean.R). The first score (score 1) is the primary score to evaluate the precision of the assignment for the single cells. The second score (score 2) is the average of the relative assignment precision over all the single cells which is used when the first scores are equal for two methods. The third score (score 3) is to compare predictions of gene patterns.
Ambiguous cells: If the predicted top 1 and top 2 positions are the same in the DistMap results, the prediction position will be ambiguous, and the cell will not be computed in the score 1–3. In this challenge, the number of ambiguous cells derived from DistMap are 287.
We calculated the sums of gene expression values in the in situ matrix for each bin of the 3,039 bins. Each bin has at least one gene expressed in the in situ matrix. It suggests that our selected driver genes cover all bins as at least one driver gene is expressed in each bin. As the number of expressed driver genes decreases when the number of selected driver genes decreases (from 60 to 40 to 20), the prediction accuracy of cell spatial positions also decreases (Figure 3(a)). We also compared the overlaps when varying the number of the driver genes in Figure 3(b). Among the 40 driver genes of the sub-challenge 2, only one driver gene is not in the selected 60 driver genes of the sub-challenge 1. There are 14 overlapped driver genes between the sub-challenge 1, 2 and 3. It suggests our method is consistent on selecting different number of driver genes.
(a) Compare number of expressed driver genes in each bin in the in situ matrix in sub-challenge 1, 2, and 3 under different selected driver genes scenarios. (b) The Venn diagram of the selected 60, 40, and 20 gene in the three sub-challenges 1, 2, 3. There are 14 common genes in three sub-challenges 1, 2 and 3 (labeled as gene60, gene40, and gene20).
We used the score 1–3 to evaluate our method for different number of selected driver genes. Figure 4(a) shows the scores of our submitted results for the sub-challenge 1, 2, 3. The blue bar is the score for the silver standard method using 84 driver genes from DistMap. For score 1, the primary score for cell position prediction, the performance of our method is close to the silver standard when using 60 driver genes. The performance decreases when using 40 and 20 driver genes. For score 2, our method shows high scores when using 60 and 40 driver genes and lower score when using 20 driver genes. As Score 2 is the average relative precision for all cells, it suggests that our method is robust for predicting the right position using 60 and 40 driver genes, probably because our algorithm uses the top 10 bins around the largest cluster centers. The score 3 estimates the predicted gene expression patterns. It shows a small difference in our method when using 60 and 40 driver genes. To check the effect of threshold on the prediction results, we test our method under different thresholds. Figure 4(b)–(d) shows the consistency of the score 1–3 over a range of thresholds in the scenarios of different numbers of driver genes. Overall, the results show that our method performs well when using 60 and 40 genes.
(a) Comparing the score 1, 2, 3 for sub-challenge 1 (60 driver genes), 2 (40 driver genes) and 3 (20 driver genes) with the silver standard (84 driver genes). (b-d) Comparing the score 1, 2, 3 using different thresholds (min, 1st quantile, median, mean, 3rd quantile, max value) for (b) sub-challenge 1 with 60 genes; (c) sub challenge 2 with 40 genes; (d) sub-challenge 3 with 20 genes.
As shown in Figure 5, we used boxplots to present the spatial gene expression prediction accuracy of the 84 driver genes by computing the Pearson correlation between the predicted and in situ gene expression. Consistent with Figure 4(a), the average Pearson correlation of our prediction (60 or 40 driver genes) is higher than that of the silver standard (84 driver genes by DisMap). When the number of driver genes decreases to 20, the average Pearson correlation of our prediction becomes smaller but close to that of the silver standard, which used 84 driver genes. From the results, it suggests that our optimal threshold selection may improve the spatial gene expression prediction. It is feasible to use less genes for spatial gene expression prediction.
For each gene, we calculate the Pearson correlation between the predicted spatial gene expression and the in situ spatial gene expression. The average Pearson correlation value of DistMap (84 genes), sub-challenge 1, sub-challenge 2, and sub-challenge 3 are 0.532, 0.622, 0.561, and 0.422, respectively.
With high quality Drosophila dataset, Karaiskos et al. mapped single cells to the 3D embryo positions using the DistMap method10. It assumed that transcriptomes of neighboring cells should show similar patterns in spatial tissues. We further assumed that cells with similar gene expression will have close spatial distance. We applied hierarchical clustering to select a small number of driver genes and group similar cells to be mapped to the 3D embryo. After selecting the largest cluster, we determined the most possible predicted spatial cell position around the cluster center.
We assessed the performance of our method on the selected 60, 40 or 20 driver genes using score 1–3 and compared it with the silver standard (DistMap results). For 60 driver genes, our results show close performance to the silver standard in score 1 and 3, and higher performance in score 2. It suggests that our method applied to 60 driver genes can achieve the same or higher level of performance as the silver standard to predict cell position. Our method obtains better performance for gene expression pattern prediction with 60 and 40 genes. When using 40 driver genes, the score 1 of our method decreases a little but the score 3 is very close to the silver standard. It suggests that our method can still predict gene expression pattern reasonably well using 40 genes. Fore 20 driver genes, the performance of our method is lower than but still close to the silver standard in both cell position and gene expression prediction.
Overall, our method achieves good performance to predict spatial cell position and gene expression with 60 and 40 driver genes. The primary reason may be that our method selects the most representing genes in the gene clusters for prediction. Also, for potential spatial bins, our method utilizes bin clusters and selects the cluster center as predicted position. The good performance of our method is consistent to our assumption that cells with close spatial position show similar gene expression patterns. The performance of our method decreases when the number of driver genes reduces to 20. This is probably because the similarities between cells and spatial bins have larger variations when measured using smaller number of genes. In this case, it becomes harder to find the most possible bin position based on cell-bin similarities. In Nitzan’s work14,15, they successfully reconstructed the virtual Drosophila embryo with one marker gene and the structural information between gene expression space and physical space. It suggested that gene-gene correlation and spatial correlation contain useful information for gene expression and spatial cell position prediction. In addition, our method can be extended using an iterative approach by updating the mapping similarity matrix at each step.
The dataset associated with the DREAM Single Cell Transcriptomics Challenge is available at https://www.synapse.org/#!Synapse:syn16782360 and https://www.synapse.org/#!Synapse:syn18632189. Limited by the sharing settings of Synapse, users should register in Synapse (free of charge; https://www.synapse.org/) using their email address, and agree to the dataset conditions of use (https://www.synapse.org/#!Synapse:syn15665609/wiki/583240).
Source code implementation for the method presented in this article and used in the DREAM Single Cell Transcriptomics Challenge is available at: https://github.com/ouyang-lab/SCTC-Challenge-zho_team. The scoring scripts are available at: https://github.com/dream-sctc/Scoring/blob/master/dream_scoring_clean.R.
Archived source code as at time of publication can be accessed at: https://doi.org/10.5281/zenodo.3592532
License: GLP 3.0
All authors actively contributed to the results and discussed the procedures. They have been involved in the test, design and evaluation of various approaches; many of the explored techniques were not used in the final workflow due to strong overfitting. DM: gene selection and coding. YC: data pre-processing, clustering, and writeup. ZO: supervision, realization of workflows, and writeup. YZ: supervision, realization of workflows, and writeup.
Data used in this publication were generated by Prof. Dr. Nikolaus Rajewsky, Max Delbrück at the Center for Molecular Medicine, and these results were obtained as part of the DREAM Single Cell Transcriptomics Challenge project through Synapse ID (syn15665609). We thank Pablo Meyer and Jovan Tanevski for providing the documents, code and suggestions for score 1–3.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Single cell omics, Genomics and Bioinformatics
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Drosophila Neurogenetics, Single-cell RNA Transcriptomics, Confocal and Super-resolution Microscopy
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Partly
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly
References
1. Tanevski J, Nguyen T, Truong B, Karaiskos N, et al.: Gene selection for optimal prediction of cell position in tissues from single-cell transcriptomics data.Life Sci Alliance. 2020; 3 (11). PubMed Abstract | Publisher Full TextCompeting Interests: No competing interests were disclosed.
Reviewer Expertise: Genomics, developmental biology, biostatistics
Competing Interests: No competing interests were disclosed.
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Transcriptomics
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||||
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | |
Version 2 (revision) 07 Jun 21 |
read | read | read | read | |
Version 1 19 Feb 20 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)