Keywords
Spatial Transcriptomics, Wavelet Transformation, Empirical Bayes Matrix Factorization, Factor Gene
This article is included in the Bioinformatics gateway.
This article is included in the Artificial Intelligence and Machine Learning gateway.
This article is included in the Machine Learning in Genomics collection.
Spatial Transcriptomics, Wavelet Transformation, Empirical Bayes Matrix Factorization, Factor Gene
Spatial resolved transcriptomics (ST) is a technology measuring spatial variation in gene expression. Different technological platforms support different genome coverage and spatial resolution (tissue-level measurement to subcellular measurement). Examples of ST platforms include Spatial Transcriptomics (Ståhl et al., 2016; Xia et al., 2019), 10x Genomics Visium, Slide-seq (Rodriques et al., 2019), sci-Space (Srivatsan et al., 2021), and MERFISH (Chen et al., 2015). Spatial transcriptomics allows visualization and quantitative analysis of the transcriptome with spatial resolution in individual tissue sections. A series of studies combining gene expression and spatial information has been brought to generate new insight in biological analysis (Kuppe et al., 2020; Shah et al., 2016; Berglund et al., 2018). Quantification of gene expression has wide applications in transcriptomics. Understanding the spatial distribution of gene expression has helped to answer fundamental questions in developmental biology (Asp et al., 2019; Rödelsperger et al., 2021), cancer (Thrane et al., 2018; Moncada et al., 2020), and neuroscience (Moffitt et al., 2016; Close et al., 2021). Two widely-used methods for gene expression quantification are fluorescent in situ hybridization (FISH) and next-generation sequencing.
While these approaches have been made to measure gene expression while preserving spatial information, there are statistical challenges in analyses combining the gene and spatial information. Specifically, gene dimensionality reduction guided by spatial information is still an active area of study (Abu-Jamous and Kelly, 2018; Kiselev et al., 2017, 2018; Zhu and Sabatti, 2020; Shang and Zhou, 2022; Velten et al., 2022; Townes and Engelhardt, 2021). In this work, we conduct a dimensionality reduction on spatially varying gene data. After transformation, gene expression over locations is a square matrix, which allows us to view it as an image. Therefore, tools from image processing can be adapted – our method incorporates spatial information by wavelet transformation, a multi-scale analysis decompose sequence into orthonormal series. We apply wavelet transformation for each gene expression over locations. This is a common technique in denoising the image. We thus apply techniques analogous to image analysis. We evaluate performance using reconstruction error and comparing resulting visualizations. Our analysis pipeline builds from off-the-shelf tools, and code is available for reproducing our results. We also attach visualizations in our pipeline, giving example interpretations of intermediate result.
A natural idea in spatial transcriptomic analysis is to identify gene types across spatial locations (Abu-Jamous and Kelly, 2018). For example, in clustering the representations in each cluster achieve dimensionality reduction under the assumption that genes within one cluster have the same type. A parallel technique is clustering based on cell type instead. General clustering methods can be combined into more sophisticated pipelines tailored toward spatial single-cell analysis. SC3 (single-cell consensus clustering 3) (Kiselev et al., 2017) is an ensemble clustering method. It calculates distance matrices across cell locations using the Euclidean distance, then applies spectral clustering, and assigns membership (Kiselev et al., 2018). To identify the cell type in each cluster, one can perform differential expression analysis between all pairs of clusters. scGeneFit uses a label-aware compression method to find marker genes (Dumitrascu et al., 2021). Given the cell-by-gene expression matrix and cell clustering membership, scGeneFit maps cells to lower-dimensional space where cells within the same cluster are closer. For gene dimensionality reduction, (Zhu and Sabatti, 2020) constructs a neighborhood graph from the spatial coordinates, then applies a graph-based feature selection procedure to determine spatially varying genes. They also provide the option to infer a latent graph embedding for cells based on selected genes, applying spline models to fit the gene’s expression on the latent embedding. Then they leverage the fitted coefficients to reduce the dimensionality of each gene. (Svensson et al., 2018) proposed pipelines using mixed-effect models incorporating spatial information. The model contains two random effect terms: a spatial variance term that parametrizes gene expression covariance by pairwise distance between samples and a noise term that models nonspatial variability. The model leverages efficient inference methods previously developed for linear mixed models, and it is computationally efficient.
Our setting is closely aligned with the following recent works: (Shang and Zhou, 2022) developed SpatialPCA, applying probabilistic principal component analysis (PCA) on ST data for dimensionality reduction. They assume data are given as a location-by-genes matrix and construct a regression model similar to factor analysis, where the prior covariance matrix of factor genes is a distance matrix constructed with a Gaussian kernel. (Velten et al., 2022) proposed MEFISTO, combining factor analysis with the non-parametric framework of Gaussian processes to model spatio-temporal dependencies in the latent space. (Townes and Engelhardt, 2021) developed nonnegative spatial factorization (NSF), combining a Gaussian process prior over spatial locations and a Poisson or negative binomial likelihood for count data, identifying generalizable spatial patterns of gene expression. All these works impose spatial structure on the prior of the factor genes. While these methods offer new dimensionality reduction techniques to cluster the genes, the complex model structure and a large number of hyperparameters introduce uncertainty and noise. Instead of imposing structural assumptions on the prior of the factor genes, we impose structure on the factor gene itself. Our contributions are the following:
• We propose an approach, based on techniques from matrix decomposition and image signal processing to perform gene dimensionality reduction that retains inferred spatial structure.
• We run simulations showing that wavelet-guided dimensionality reduction performs better estimation than the singular value decomposition (SVD) under low signal-to-noise (SNR) regime.
• We perform real data experiments to identify the connection between wavelet techniques and fluctuation of the gene expression, which would be useful in selecting spatially related genes based on reconstruction error.
• We provide a gene extraction pipeline capturing the global information of spatially related genes. We provide smoother visualization of gene factors via wavelet methods. We develop an R package waveST and share the workflow generating reproducible quantitative results and gene visualization.
The diagram for workflow can be seen in Figure 1. The paper is organized in following: In Section Background we introduce the background on required techniques, including the wavelet transformation and matrix decompositions. In Section Problem Setup, we formally define our problem under this setting, and Section Methods introduces our algorithms and analysis pipeline. In Section Simulation, we implement simulations showing the effect of wavelet transformation in reducing error. In Section Real Data Experiment we conduct our method on data from Weber (2021), showing the reconstruction error in dimension reduction and visualization of lower-dimensional representations.
and are the estimators for C and Z. F and L are the factor matrix and loading matrix construct as Z = FL, Fc and Lc are the factor matrix and loading matrix construct as C = FcLc. and are the estimators for F and Fc, similiar for Z and . We will specify the details in Section Background and Problem Setup.
In this section, we will cover background of methods we used in data pre-processing and analysis.
Density estimation and function approximation is a fundamental problem in statistics and machine learning. Non-parametric methods such as spline regression (Perperoglou et al., 2019), Fourier transformation (Cochran et al., 1967) and Wavelet transformation (Nason, 2008) have been used in such scenarios.
Consider a model with one predictor: y = f(x) + ϵ, where . We want to estimate f, the trend of the function. We assume the predictors are ordered x1 < x2 < … < xn. We have a signal or frequency to estimate.
Consider the Haar mother wavelet:
satisfying . Unlike the Fourier basis, wavelets oscillate and decay fast, only contributing to a certain local area and zero elsewhere. We can generate wavelets from the Haar mother wavelet for integer j, k. These wavelets form orthonormal sets. We can decompose the trend as , where are the wavelet coefficients.Wavelet based methods have several advantages among other non-parametric methods, especially dealing with sparse data. One property wavelet has is localization. If a sequence has a discontinuity, this will only influence the wavelet basis around it. In contrast, for a Fourier basis consisting of sine and cosine functions at different frequencies, every basis element will interact with this discontinuity, hence influencing every Fourier coefficient.
The simplest discrete wavelet transformation calculates the difference and sums between each adjacent pair. Suppose we have one vector with length n, where n is dyadic (n = 2J). We computed dJ−1, k = y2k − y2k−1 as finest-level detail and cJ−1, k = y2k + y2k−1 as the finest-level averages. We have containing n/2 = 2J−1 as our finest level coefficients. To obtain the next coarsest coeffients we set:
We continue this procedure until j reaches one. The set of details across levels are our wavelet coefficients.
We use J to denote the scale of wavelets, with larger J we have finer scale and approximation. However, finer scale sometimes introduces more parameters to capture minor details of the sequence (overfitting). The trade-off in choosing J is crucial in the wavelet transformation.
Regularization and smoothing can be used to prevent overfitting. Regularization is usually conducted by shrinking wavelet coefficients. The concept of wavelet shrinkage was first proposed by Donoho and Johnstone (1994). The motivation behind wavelet shrinkage is straightforward. Consider empirical wavelet coefficients, the large coefficients usually contain true signal and noise, whereas the small coefficients only contain noise. The shrinkage is often used when the wavelet coefficients are assumed to be a sparse vector.
Wavelet shrinkage is often conducted by setting a threshold (denoted by δ) and only keeping the coefficients above the threshold. To choose a threshold, a natural metric is the squared error between estimated function and the truth: and . Donoho et al. (1994) proposed a universal threshold , which induced . Donoho and Johnstone (1995) also proposed Stein’s unbiased risk estimation (SURE) threshold based on Stein’s (1981) unbiased risk estimator. The optimal SURE threshold can be obtained in operations. Donoho and Johnstone (1995) also noted that SURE sometimes failed when the true signal coefficients are highly sparse, they proposed a hybrid scheme called SureShrink, combining the SURE and universal thresholds, using them depending on certain situations.
The extension of wavelet methods to 2D regularly spaced data (images) and such data in higher dimensions was proposed by Mallat (1989). We only consider 2-D wavelet transformation since we decompose 2-D spatial gene expression data. Suppose we have n × n matrix A where n = 2J is dyadic. A simple discrete wavelet transformation on A first applies procedure (1) and (2) to the rows of the matrix. We then have two matrices of size , called H and G. Then we apply the same procedures to both the columns of H and G, resulting in four matrices HH, GH, HG, and GG each of size . These are our finest level coefficients. HH is the local average of the original matrix used for the next level procedure.
Clustering and dimensionality reduction are widely used in genomics. In a gene-by-sample matrix, genes are often grouped into profiles, where genes from the same profile have a similar function. Statistically, we can treat them as correlated variables. We use the term factor genes as the principal component in our gene-by-sample data. Factor genes compose the linear combinations of genes. We consider each gene as a variable, and we aim to find variables (f1, …, fK) such that each gene gi has form gi = a1f1 + a2f2 + ⋯aKfK, where aj are coefficients.
Suppose we have gene(P)-by-sample(N) matrix with sample covariance matrix , where à is the column centered A. Consider the SVD on à = UΛVT. Then we have SV = VΛ2. The columns of V are called eigenarrays (Alter et al., 2000). The first K columns of Z = AV are the factor genes. The factor genes capture the mutual underlying information of genes.
Matrix Factorization is often used in capture factor genes. We have a formulation similar to Wang and Stephens (2021), consider the factorization model on observed samples by gene data
where denotes the factors, denotes the loadings for each factor, and denotes Gaussian noise with zero mean. Typical formulations would treat factors and loadings as fixed effects and use Maximum Likelihood Estimation (MLE). In our high-dimensional setting, penalty-based regularizers are often considered. Considering a prior for L and F under Bayesian setting has a similar effect and several advantages over adding a regularizer alone, such as simplifying hyperparameter search and selection of the number of factors K.One feature of empirical Bayes approaches in matrix factorization proposed by Bishop (1999) is that the methods automatically select the number of factors K. In Wang and Stephens (2021), the author add factors with prior one at a time, estimating priors at each step until convergence. If the computed prior of the newly added factor is almost point mass on 0, the algorithm eliminates this factor and returns.
Following Wang and Stephens (2021), we consider Empirical Bayes in our setting, where we set a prior with unknown parameters of L and F. Empirical Bayes is not strictly Bayes, since the prior parameters are directly estimated by the MLE of the data. One example would be normal distribution of F and of L where coordinates are independent, which is conjugate prior. We estimate parameters in and by the maximize the marginal likelihood calculated by integral out L and F. Then we computed the posterior distribution of L and F.
Consider , where each row represents a sample and each column represents a gene. Further, let store spatial coordinates of each sample. We assume there are spatially-related genes among all genes. We assume the genes fall into several profiles. The genes in the same profile have the same expression over the sampled spatial context. For each profile, we can summarize the gene expression for such group by one representative factor. Consider there are K profiles, we can then form our model as:
where is the factor gene in profile k capturing the gene expression pattern in that group, indicates the loading coefficients of each factor. We assume there is random noise in observations and . The goal is to select spatially-related genes based on spatial information in S. We also extract the gene factors which capture information across all genes based on feature extraction. We aim to find a latent gene space that respects spatial structure.In this section, we build up pipelines and models to achieve gene dimentionality reduction while retaining spatial structure. We adapt matrix decomposition methods to incorporate spatial information. R is a statistical programming language and will be used for the analysis. We first preprocess gene expression to make them amenable for wavelet filtering. We then use the Daubechies D4 Wavelet Transform (Daubechies, 1992) as wavelet filter with scale of wavelet J = 5. We develop the package and share the workflow generating reproducible quantitative results and gene visualization. The package is available from Software availability.
We leverage a pre-processing step, which we call input generation, to combine spatial and expression data. We have both gene expression measurements and spatial coordinates for each sample. Each gene has an expression over each sample, and we can draw the sample over a 2D map by their coordinates.
As shown in Figure 2, the gene with the least sparsity in expression shows varying levels of expression across different spatial regions. Intuitively, we may consider the spatial expression pattern from this gene to be spatially related. However, most genes are sparse and do not show fluctuations in gene expression. Therefore, we first filter out those genes using a kOverA filter: we only keep genes that have an expression measure above A in at least k samples (Gentleman et al., 2021). This has the effect of removing genes that are rarely active, though it is possible that strongly expressed genes still show no spatial relationships.
We want gene expression data to have a neat grid structure that can be expressed as a matrix form. However, at first, expression measurements are roughly staggered over the spatial locations, as shown in Figure 3. This is a consequence of the sampling strategy adopted by the 10x genomics Visium platform. To obtain an evenly resampled version of the expression pattern, we divide the two-dimensional space into several partitions and compute the local average of each partition. Detailes are shown in Algorithm 1. Each gene has an expression over the grid and we now can use a matrix to represent it. The matrix Z becomes the new input for analysis. We then obtain the same formulation as in Section Problem Setup, where we take n = D2 and each location corresponds to a new observation. We have as the factor genes, computed by vectorizing the gene expression matrix.
The choice of D depends on the wavelet scale used in the next section. We choose D to be dyadic to prevent handling edge cases in the wavelet transformation (recalling Subsection Wavelet Transformation). The dyadic size is also a natural choice in image analysis. In particular, we choose D = 2J, where J is the level of scale in the wavelet transformation. Larger J allows finer recovery, but also requires more parameters and may result in overfitting.
We apply a wavelet transformation with shrinkage to denoise the gene expression matrix and smooth observed spatial expression patterns. In simulations, we find that this technique gives more accurate recovery as the signal-to-noise ratio decreases and yields less noisy visualizations for the spatially-related genes.
We implement wavelet transformation and shrinkage on the processed data matrix . Each column of Z is a vectorized expression matrix. For each column, we first reshape it into a D × D matrix and apply the wavelet transformation. We have a coefficient list with W coefficients associated with each gene. Then we conduct wavelet shrinkage on the coefficient list using the threshold strategies described in Subsection Wavelet Transformation. Finally, we vectorize the coefficient list of each gene into a length W vector for each gene. Stacking all vectors together, we have a coefficient matrix . Note that the wavelet scale is specified by the size D of the input matrix, i.e., we have D = 2J, where J is the scale. The details are specified in Algorithm 2.
The matrix of the shrunk wavelet coefficients C gives a summary of the denoised matrix. This allows improved reconstruction of gene expression data. To obtain a low-dimensional approximation, we apply matrix factorization on the coefficient matrix C after transformation. We use the same notation in Section Problem Setup, with the subscript c to denote the decomposition on coefficient matrix C.
The resulting singular vectors can be used to estimate spatially structured factor genes (Subsection Factor Gene). We use SVD as a frequentist approach to estimating and . We also conduct EBMF (Section Empirical Bayes Matrix Factorization), using the posterior expectation of Fc and Lc to estimate and . EBMF can select the number of factors K by itself, whereas K must be manually specified in the SVD. The choice of K is informed by inspecting the scree plot, as in spectral clustering and PCA. We choose the number of factors when current factors explain sufficient information and the diminishing returns of additional factors are no longer worth the additional cost. Given and , we can compute the estimated coefficient matrix as .
To transfer the coefficient matrix back to the estimated location-by-gene matrix , we apply the inverse wavelet transformation on columns of , each of which is a vectorized coefficient list for one gene. This results in a D2 expression matrix for each gene. Then we vectorize the matrix and stack all vectors together. This yields the reconstructed matrix . Details are given in Algorithm 3. For visualization, we also conduct a similar process for each column of . In this case, each column is associated with an factor gene. By applying the inverse wavelet transformation to each factor gene, we can build spatial gene expression matrices M1, …, MK representing gene factors.
Require Sample by gene matrix , spatial matrix , size of gene expression matrix D
1: Compute the range of x, y coordinates from S, compute the coordinates of vertices of big rectangle map B cover all N samples spatially.
2: Partition interval x and interval y into D equal length interval, together get D2 partitions over rectangle B.
3: while i ≤ P do ⊳ Consider gene i expression over map B
4: Select i-th column of gene matrix Y
5: Compute the local average of gene expression of gene i in each partition
6: Get D2 matrix Gi as gene expression of gene i
7: Vectorize Gi into a vector gi with length D2
8: end while
9: Stacking all together into matrix
10: return matrix Z ⊳ A transformed gene expression matrix
Require Location by gene matrix , threshold method, optional threshold parameter τ. ⊳ Apply threshold on wavelet coefficient if threshold method is specified
1: while i ≤ P do ⊳ Consider gene i expression as matrix Gi
2: Select ith gi column of gene matrix Z
3: Form gi into expression matrix Gi with size D2
4: Apply 2-D discrete wavelet transformation over Gi, get coefficient list Ci, with number of coefficients W
5: Apply wavelet shrinkage over Ci with optional parameter τ
6: Vectorize Ci into a long vector ci with length W
7: end while
8: Stacking all together into matrix
9: return coefficient matrix C ⊳ Column i of C store the coefficient of from gene i
Require: (reconstructed) Coefficient matrix C
1: while i ≤ P do
2: Select ith ci column of gene matrix C
3: Form ci into coefficient list Ci with number of coefficients W
4: Apply 2-D inverse wavelet transformation over Ci, get post-wavelet expression matrix
5: Vectorize into a long vector with length D2
6: end while
7: Stacking all together into matrix
8: return matrix ⊳ Post processing reconstructed gene expression matrix
We evaluate our proposal both quantitatively and qualitatively. Our quantitative results measures reconstruction error between the estimated and Z using the Frobenius norm .
This evaluation is also used for hyperparameter tuning, such as the scale J of the wavelet and the choice of wavelet thresholding method. Our qualitative result is given by visualization of the estimated factor gene expression matrices, with emphasis on capturing global spatial structure.
We use cross-validation in computing reconstruction error and calculating gene-wise errors. This evaluation is helpful in selecting spatially related genes. We also found a simple connection between the gradient of gene expression and spatial contribution, discussed in Section Real Data Experiment. This discussion requires a measure of spatial expression smoothness. To this end, a simple step computing the fluctuation of gene expression would give a spatial gene selection that coincides with the reconstruction error selection. For calculating successive differences, consider Z as the input matrix and let δjk = (zj, k+1−zjk)2, k = 1, …, D2. Then we define the gradient by .
We setup simulations to see whether representing spatial structure with a wavelet basis supports denoising and visualization of gene expression. We will see improved recovery in the low signal-to-noise ratio regime. Qualitatively, we also find factor gene visualizations to be more spatially consistent. We use use the Daubechies D4 Wavelet Transform (Daubechies, 1992) as wavelet filter with scale of wavelet J = 5. We develop the package and share the workflow generating reproducible quantitative results and gene visualization. The package is available from Software availability.
We first describe the simulation mechanism. We the set number of factors to K = 9. We generate K gene expression matrices , each representing a factor gene. These factor gene expression patterns are shown in Figure 4. We vectorize the patterns Mk into factor genes fk, which are then scaled so ∥fk∥ = 1. We obtain by stacking all the fk together. We generate loadings by drawing coordinates independently from N(0, 1). We set D = 32 and wavelet scale J = 5. We similarly stack lk to obtain . We use Z = FLT as the ground-truth signal matrix and add noise E to yield data matrix Zd = FLT + E, where entries in E are zero-mean normal noise that corrupt the underlying signal. In our analysis, we use a Daubechies D4 Wavelet as the wavelet filter, and for coefficient shrinkage we use hybrid thresholding.
We use the pipeline from Section Methods. The generated data Zd already reflects the structure produced by the processing of Subsection Gene Expression Over Location. For comparison, we also directly decompose the matrix Zd with the SVD without any wavelet transformation. We call the resulting factors and reconstruction . We also have the same quantitative reconstruction error and qualitative visualization of factor gene . We also measure the size of the gradients across the estimated gene expression matrix . The gradient is computed by successive difference of neighboring image pixels. We calculated the sum of the squares of the gradient. This measurement shows whether the gene expression matrix has been smoothed. This property is of interest, since smoother estimates are often more visually appealing.
We denote the signal-to-noise ratio as , where sd stands for standard deviation. Let . We specify 19 evenly spaced settings of r from 1 to 10. For each setting, we run 100 replicates with different simulated Zd and apply wavelet and SVD-based dimensionality reduction. The resulting average errors across r are shown in the Figure 5. The wavelet and shrinkage technique has better performance when r is larger than 5, i.e., the low signal-to-noise ratio regime. The gradient of the gene expression matrix under two methods is shown in Figure 5. The wavelet and shrinkage approach smooths edges in the factor gene expression image, giving a more interpretable visualization and lower error in this low SNR setting.
The x-axis shows the rate of The left subplot shows the gradient changes. The gradient of gene expression is always lower with the wavelet technique, a consequence of its smoothing property. The right subplot shows the error – the wavelet method has lower error as the magnitude of noise increase. The underlying data can be found as simulation_data in package from Software availability. SVD, singular value decomposition.
The factor genes are shown in Figure 6. The SVD without wavelet technique has erratic, outlying pixels, especially in the last three genes. The visualization is sensitive to outliers. In contrast, the SVD combined with wavelet has smoother patterns. Like the SVD-based method, it appears to have mixed several of the true underlying factors in each of the recovered ones. Moreover, the sharp boundaries visible in the SVD factors become smoothed over in the wavelet-decomposition. The wavelet method applies a decomposition on coefficients space after thresholding, while SVD operates on individual pixels. The SVD capture more information, but also emphasize nuisance information induced by errors. Wavelet method also reduces model complexity, improving estimation accuracy. Other non-parametric methods, such as Fourier transformation and Gaussian Process-based methods also operate on coefficients space. Still, they would struggle to capture sharp transitions, since their bases are smooth functions.
(a) Implements SVD on Zd, the visualization of column of . (b) Implements SVD on coefficient matrix, the visualization of . We use Z = FLT as the ground-truth signal matrix and add noise E to yield data matrix Zd = FLT + E, where entries in E are zero-mean normal noise that corrupt the underlying signal. For comparison, we also directly decompose the matrix Zd with the SVD without any wavelet transformation. We call the resulting factors and reconstruction . We also have the same quantitative reconstruction error and qualitative visualization of factor gene . (authors own visualization using ggplot2 package in R).
Nonetheless, both the SVD and wavelet-based visualizations reflect spatial trends in the true factor genes (in Figure 4).
In this section, we show that wavelet and shrinkage technique reduces reconstruction error quantitatively. We ran our method on a public spatially resolved transcriptomics data (Weber, 2021). The dataset can be accessed in R package STexampleData with version 3.15. The dataset represents a single biological sample from the human brain dorsolateral prefrontal cortex (DLPFC) region, measured with the 10x Genomics Visium platform. Further, we identify a simple connection between the gradient of gene expression and quantitative error. A simple step computing the fluctuation of gene expression alone (calculating successive difference of the gene expression image) selects genes that have reduced reconstruction error when using wavelet-guided dimensionality reduction (Zhuoyan, 2022).
We first process the ST data through our pipeline. The ST data contains 4992 observations with 33538 genes. Expressions from most genes are sparse. We implement the pipeline from Section Methods. We pre-process as in Subsection Gene Expression Over Location and then apply a kOverA filter to select genes. We find k = 3, A = 7 gives us 721 genes with average expression as 2.71. Then we apply Algorithm 1 to transfer the sample-by-gene matrix to a grid-by-gene matrix. We set D = 64 and wavelet scale J = 6. We obtain an image heatmap for each gene like in Figure 7.
We then vectorize P = 721 image expression and stacking vectors together, we have generated input data . We run our pipelines with and without wavelet transformation for evaluation. We first implement SVD and EBMF on Z or on coefficient matrix C after applying wavelet transformation and thresholding each column. We choose the number of factors K by examining singular values. If the singular value is larger than 500, we keep it as a factor. In EBMF, we set upper bound that K to be smaller than the corresponding K in SVD, then let algorithm choose K itself.
Quantitative evaluation metric is conducted by cross-validation on non-zero entries. We set 5 folds cross-validation with replacement. In particular, we select random non-zero entries in matrix Z and set them to zero, then we save matrix as the masked matrix (train data) Ztrain. We store the values and position of the masked entries as test data Ztest. We then ran methods on Ztrain. The result shows a difference in whether to use the wavelet technique. The result from SVD and EBMF are close to each other coordinate-wise, hence we only show one of them in some results below. We include the comparison between SVD and EBMF in the Section Comparison between SVD and EBMF. As in Section Simulation, we have and from the wavelet approach and and on the original data.
We compute the reconstruction loss,
where N is the number of test entries. We evaluate the loss only on test entries. We first tune parameters. We set up three settings: decompose the raw data with SVD or EBMF; wavelet transformation with hybrid thresholding; wavelet transformation with manual thresholding with threshold τ = 10, 20, …, 100. In each setting, we ran 100 replicates. The reconstruction error shown in Figure 8.The upper plot contains the result from all settings, the leftmost τ = −1 result is the decomposition without wavelet transformation. τ = 0 indicates the hybrid thresholding. The bottom plot zooms into the experiment under manual threshold for . SVD, singular value decomposition; EBMF, Empirical Bayes Matrix Factorization.
As shown in the upper panel of Figure 8, wavelet thresholding with manually set τ reduces error compared to decomposing raw data. The wavelet has the most positive effect when τ = 40, as shown in the bottom panel. We use τ = 40 in our following analysis.
We then evaluate how method performance varies for each gene by calculating the genewise reconstruction error. This reveals genes whose strong spatial expression structure leads to improved performance when using a wavelet basis. We still hold out of the entries at random as a test set. We compute the reconstruction error of test entries on each column of . We calculate entry-wise loss across 100 replicates and estimate the average loss. We compare genewise errors with or without wavelet transformation. The SVD and EBMF show the same result regarding the decomposition method. We only show result of EBMF here in Figure 9. We show the result of SVD in Figure 13b.
As we can see, most of the genes have lower reconstruction error when directly applying the SVD or EBMF. However, for some genes, wavelet smoothing reduces error. For example, this is seen in genes 712, 713, 716, 719, 715, 710, 711, 721, 717, 596, 373, and 289. We conjecture that these genes have acute fluctuations as well as clear spatial patterns. The expression matrix of these genes would have a larger gradient and distinct edges. To verify our conjecture, we calculated the sum of squares of the gradient of each gene expression matrix. We show the result in Figure 10.
The y-axis is the index of each gene. Each bar’s color shows whether that gene has better performance when using a wavelet basis. The M on the x-axis stands for “millions”. EBMF, Empirical Bayes Matrix Factorization.
The result verifies our conjecture: genes with larger gradients have better reconstruction under the wavelet-guided decomposition. This suggests a pre-processing step for selecting wavelet-suited genes by calculating the gradient of each gene. The spatially related genes would have a lower reconstruction error. Among these spatially related genes, one possibility is that we can divide them into two groups, one for decomposition directly and the other for the wavelet technique.
Now we visualize the top factor genes and genes with high loadings on these factor genes. In Figure 11a, we decompose Z using the SVD and plot the first factor gene and matrix slides of genes with the largest 5 loadings on that factor. The factor gene captured the same patterns as the original genes. To improve visualization, we find the analogous wavelet-based factors (we use manual thresholding with τ = 40), shown in Figure 11b.
The wavelet thresholding approach smooths over edges in the original visualization. The first factor gene captures the global spatial expression. We have a similar result for the second factor gene, as shown in Figure 12.
The second factor gene is orthogonal to the first one, capturing different spatial structures in gene expression. Different factor genes capture global and local structure, and using a wavelet decomposition provides denoised spatial expression visualization.
In conclusion, we can use this dimensionality reduction technique for spatial gene selection and extraction. We select wavelet-suited genes based on the calculated gradient. Then, we can select spatially related genes based on reconstruction error via cross-validation. Alternatively, we can extract factor gene to capture spatial information and use factor genes for visualization and further analysis.
We provide a small code block to show vignette of generating Figure 11 and Figure 12, shown in code block 1.
kOverA_ST transforms the original spatial expression context into an image. This is processed by waveST. The final dimensionality reduction step is performed by decompose.
We have made a new package waveST containing the workflow we developed. The package is available at GitHub (see Software availability). The kOverA_ST function reduces data from an original (Weber, 2021) class input using the kOverA technique. Then we use waveST function to construct a S4 class waveST, containing input generated by Algorithm 1. This object-oriented approach stores properties of the original spatial experiment and simplifies downstream calls, like decomposition and visualization. We use the decompose function to apply all decomposition methods. In line 6, we apply the SVD to our original data, setting the number of factors to 5. In line 11, we apply the wavelet-based reduction and apply a manual threshold with τ = 40. We use plot and k=1 to visualize the first factor gene and matrix slides of the genes with the largest 5 loadings on that factor.
This section we provide a comparison between the reconstruction results when using SVD and EBMF. In general, we find little difference between reduction using the two methods. We first provide an element-wise comparison between the reconstruction of SVD and of EBMF with wavelet technique in Figure 13a.
SVD, singular value decomposition; EBMF, Empirical Bayes Matrix Factorization.
The results are similar when applied to real data. In particular, Figure 13b shows the reconstuction error per gene result for SVD. Our findings are similar to those for Figure 9. Similar findings are visible when using SVD and EBMF with and without an initial wavelet decomposition Figure 14. As we can see, SVD and Empirical Bayes Matrix Factorization (EBMF) have the very similar result. This is perhaps a consequence of EMBF being initialized usign the SVD.
We have proposed a pipeline for dimensionality reduction that respects spatial structure. Both simulations and real data experiments demonstrate that wavelet and shrinkage techniques show positive results in spatially resolved transcriptomics data. We highlight the idea of combining image processing techniques and statistical methods for application in a spatial genomics context. One future direction is splitting genes into a groups suited or not for wavelet-based decomposition, and implementing decomposition with or without wavelet. Another direction is to focus input generation on only those genes that are thought to be spatially related. For genes not related to spatial information, we may perform regular decomposition on original data , abandoning the spatial information of those genes. We expect this to improve reconstruction performance. In further analysis, it is worth considering other wavelet smoothing techniques and wavelet filters. The current methods incorporate little biological information. Bringing more domain knowledge will require further techniques, but is expected to yield better results. The input generation computes local average over even grids; however, it is possible to apply the wavelet method for irregularly spaced data (Nason, 2008). We hope wavelet methods will be useful in adapting existing methods for statistical genomics to the spatial setting.
We ran our method on a public spatially resolved transcriptomics data (Weber, 2021). The dataset can be accessed in R package STexampleData with version 3.15. The dataset represents a single biological sample from the human brain dorsolateral prefrontal cortex (DLPFC) region, measured with the 10x Genomics Visium platform. The data used for this study can be accessed though our R package available at Zhuoyan (2022). The data in Simulation is labelled simulation_data and, the data in Real Data Experiment is labelled raws. The data source for generating raws can be accessed though function Visium_humanDLPFC in R package STexampleData with version 3.15.
Zenodo: OliverXUZY/waveST: waveST. https://doi.org/10.5281/zenodo.6983923 (Zhuoyan, 2022)
This project contains the following underlying data:
raws.rda
simulation_data.rda
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Source code available from: https://github.com/OliverXUZY/waveST.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.6983923.
License: The software is licensed under MIT
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Partly
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Single-cell multiomics analysis
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Partly
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bioinformatics; genomics analysis; artificial intelligence
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
No
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational Biology, Single-cell technologies, Spatial-omics, Multiplexed imaging
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Partly
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
References
1. Righelli D, Weber LM, Crowell HL, Pardo B, et al.: SpatialExperiment: infrastructure for spatially-resolved transcriptomics data in R using Bioconductor.Bioinformatics. 2022; 38 (11): 3128-3131 PubMed Abstract | Publisher Full TextCompeting Interests: No competing interests were disclosed.
Reviewer Expertise: Analysis of gene transcriptomics data
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||||
---|---|---|---|---|
1 | 2 | 3 | 4 | |
Version 1 12 Sep 22 |
read | read | read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)