ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Software Tool Article

SASCRiP: A Python workflow for preprocessing UMI count-based scRNA-seq data

[version 1; peer review: 2 approved with reservations]
PUBLISHED 15 Feb 2022
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Cell & Molecular Biology gateway.

This article is included in the Bioinformatics gateway.

This article is included in the Python collection.

Abstract

In order to reduce the impact of technical variation inherent in single-cell RNA sequencing (scRNA-seq) technologies on biological interpretation of experiments, rigorous preprocessing and quality control is required to transform raw sequencing reads into high-quality, gene and transcript counts. While hundreds of tools have been developed for this purpose, the vast majority of the most widely used tools are built for the R software environment. With an increasing number of new tools now being developed using Python, it is necessary to develop integrative workflows that leverage tools from both platforms. We have therefore developed, SASCRiP (Sequencing Analysis of Single-Cell RNA in Python), a modular single-cell preprocessing workflow that integrates functionality from existing, widely used R and Python packages, and additional custom features and visualizations, to enable preprocessing of scRNA-seq data derived from technologies that use unique molecular identifier (UMI) sequences in a single Python analysis workflow. We describe the utility of SASCRiP using datasets derived from peripheral blood mononuclear cells sequenced using droplet-based, 3′-end sequencing technology. We highlight SASCRiP’s diagnostic visualizations and fully customizable functions, and demonstrate how SASCRiP provides a highly flexible, integrative Python workflow for preparing unprocessed UMI count-based scRNA-seq data for subsequent downstream analyses. SASCRiP is freely available through PyPi or from the GitHub page.

Keywords

single-cell RNA-seq, gene expression, workflow, Python, R

Introduction

Since a method for single-cell RNA sequencing (scRNA-seq) was first proposed (Tang et al., 2009), a number of diverse technologies have emerged for studying gene expression at single-cell resolution. Among these, droplet-based, 3′-end sequencing technologies such as Drop-seq (Macosko et al., 2015), inDrop (Klein et al., 2015), and in particular, 10X Genomics Chromium (Zheng et al., 2017) have become increasingly popular for studying gene expression across different cell types and cellular states due to their lower sequencing cost per cell, and higher throughput relative to other available technologies. These technologies rely on the use of short unique molecular identifier (UMI) sequences that are used to both quantify gene expression and reduce technical variation inherently present in scRNA-seq datasets (Islam et al., 2014; Kivioja et al., 2011). However, datasets derived using these UMI count-based technologies need to undergo several preprocessing and quality control steps before they can be used to address biological hypotheses.

As a result, hundreds of software tools have been developed to perform these preprocessing and quality control steps, with the vast majority of these software tools having historically been built for the R software environment (Zappia et al., 2018). One of the most popular R packages is undoubtedly Seurat (Stuart et al., 2019), which has become an integral component of many scRNA-seq workflows due to the wide range of analysis methods it offers and the utility of its custom data structures. However recently, an increasing number of new tools are being developed using the Python programming language (Zappia & Theis, 2021). One such tool is kb-python (Bray et al., 2016; Melsted et al., 2021), a wrapper for the kallisto | bustools scRNA-seq workflow (Melsted et al., 2021). Therefore there is a growing need to develop tools that enable integration of widely used existing R-based tools like Seurat and newer Python-based tools and analysis workflows like kb-python.

We therefore introduce SASCRiP (Sequencing Analysis of Single-Cell RNA in Python), a flexible and modular Python package designed to integrate kb-python, selected Seurat features, and additional custom data processing and visualization functions in a way that simplifies and streamlines preprocessing and quality control of UMI count-based scRNA-seq data in preparation for downstream analyses like clustering, data integration, and differential gene expression analysis.

Methods

Implementation

SASCRiP (Moonsamy, 2021a) implements preprocessing of scRNA-seq datasets through four main functions, namely kallisto_bustools_count, seurat_matrix, run_cqc, and sctransform_normalize (Figure 1).

bbaaac6a-e187-4494-8028-566a9cf6415c_figure1.gif

Figure 1. Overview of the Sequencing Analysis of Single-Cell RNA in Python (SASCRiP) workflow.

The SASCRiP workflow begins with the input of FASTQ files, and a Seurat object and mtx matrix containing gene expression values are produced as output. All SASCRiP functions, namely kallisto_bustools_count, seurat_matrix, run_cqc, and sctransform_normalize, produce intermediate files that are used as input into the following function.

The kallisto_bustools_count function wraps functions from the kb-python package, which itself wraps functions from the kallisto | bustools workflow. This function takes as input unprocessed paired-end FASTQ files derived from UMI-based scRNA-seq experiments. These are then, by default, pseudoaligned to a user-specified indexed transcriptome. The user also has the option to generate a new index and transcript-to-genes mapping file if these are not available. All aligned sequences are stored within a barcode UMI set (BUS) file that contains the barcode and UMI sequences, as well as the aligned transcripts. The kallisto_bustools_count function uses BUStools (Melsted et al., 2019) to remove polymerase chain reaction (PCR) duplicates and correct the barcode sequences that differ by 1 hamming distance from a barcode whitelist. The kallisto_bustools_count function includes barcode whitelists for 10x Genomics Chromium v1, v2 and v3, as well as for InDrops v3 data. For other UMI-based technologies, kallisto_bustools_count includes parameters that allow for the use of user-supplied barcode whitelists. By default, the BUS file is filtered to remove barcodes with no corresponding transcript information. Gene-level count matrices are then returned in the matrix market (mtx) format, where cells and genes are represented in rows and columns, respectively.

import sascrip
from sascrip import sascrip_functions

sascrip_functions.kallisto_bustools_count(
list_of_fastq_files, # list of all the input FASTQ files in the required order
single_cell_technology, # type of single-cell tech used to generate the data
output_directory_path, # path where all output files will be saved
species_index, # path to the kallisto index for the species of interest
species_t2g # path to the transcripts-to-genes mapping file for the species of interest
)

SASCRiP also includes an additional, optional function, edit_10xv1_fastq, that transforms scRNA-seq data obtained from experiments using 10x Genomics Chromium v1 technology into a format compatible with the kallisto | bustools workflow. Although these types of files are already supported by the kallisto | bustools workflow, most FASTQ files stored in public sequencing data repositories/databases are not provided in the required format. In situations where only two FASTQ files (one containing the transcript and UMI sequences, and the other containing the barcode sequences) are provided, the edit_10vx1_fastq function separates the UMI and transcript sequences into two new FastQ files, so that the three files required (containing the transcript, UMI, and barcode sequences) are available as input for the kallisto | bustools workflow.

import sascrip
from sascrip import sascrip_functions

sascrip_functions.edit_10xv1_fastq(
directory_with_fastqs, # path to directory containing 10xv1 FastQ files
output_directory # path where edited FastQ files will be stored
)

Count matrices obtained using the kallisto | bustools workflow cannot be directly imported into Seurat. Therefore the SASCRiP function, seurat_matrix, converts the output obtained from BUStools (where genes are represented as rows and cells as columns) into a format compatible with Seurat, where cells are represented as rows and genes as columns. The seurat_matrix function also converts gene identifiers in the transcript-to-gene index file from the kallisto | bustools workflow from ENSEMBL IDs to their corresponding HGNC gene symbols, as required by Seurat.

import sascrip
from sascrip import sascrip_functions

sascrip_functions.seurat_matrix(
mtx_bustools_matrix, # mtx matrix generated by kallisto_bustools_count
bustools_gene_index, # genes index file generated with the mtx matrix
bustools_barcode_index # barcode index file generated with the mtx matrix
transcript_to_genes_file # path to transcript-to-genes files used/generated
                           # with kallisto_bustools_count
output_directory # path where Seurat-compatible files will be saved
)

Gene-level count matrices obtained from either Seurat or SASCRiP’s seurat_matrix function can then be used as input for SASCRiP’s run_cqc function, which leverages Seurat’s custom data structures to identify low quality cells. The run_cqc function classifies cells as being of low quality based on two per-cell metrics: the total number of genes detected, and the percentage of sequencing reads that map to mitochondrial genes. These metrics are calculated automatically when a Seurat object is created. By default, run_cqc first removes cells with fewer than 200 genes (Ilicic et al., 2016), then cells where more than 10% of sequencing reads map to mitochondrial genes Osorio & Cai (2021). However, all parameters used to distinguish low quality cells from high quality cells can be defined by the user.

The run_cqc function also calculates the median absolute deviation (MAD) of the total number of genes per cell, allowing for the identification and removal of cell doublets. SASCRiP prioritises the removal of heterotypic doublets, such that, by default, cells with a total gene count higher than six MAD values from the median are classified as outliers and removed. The MAD values and outlier thresholds are calculated using the following equation:

x˜=median(x)

MAD=median(|xix˜|)+1.483

nMAD=n*(MAD)

flag as outlier if: total gene count>x˜+nMAD

Here, x represents the total gene count for all cells, xi is the gene count for a given cell, and n is the number of MAD values used to classify a cell as an outlier. The MAD value cut-off threshold can be manually defined by the user. Alternatively, other outlier detection methods, such as standard deviation, are also included within SASCRiP and can be selected as a substitute to MAD.

import sascrip
from sascrip import sascrip_functions

sascrip_functions.run_cqc(
input_files, # directory containing Seurat-compatible files
"sample", # user-given name of sample
output_directory # path where all output files will be saved
)

Multiple output files are produced by the run_cqc function. Cell quality control metrics and UMI count data are returned as Seurat objects, in rds format, and a log file is created. The cell quality control metrics can also be returned in tsv format, to facilitate data visualization; and UMI counts can be returned as an mtx matrix, for use with alternative analysis tools. In this way, output from SASCRiP can be easily integrated into alternative workflows and analysis pipelines that require different file formats.

SASCRiP performs normalization and variance stabilization through the sctransform_normalize function, which serves as a wrapper for sctransform (Hafemeister & Satija, 2019). In this way, technical variation present in the dataset due to differences in sequencing depth between cells is reduced. All parameters incorporated within the original sctransform::vst function can be modified through the sctransform_normalize function. Corrected UMI counts are log-normalized in order to obtain gene expression values and the 2000 most variable genes in the dataset are identified and returned. Normalized gene expression values or UMI counts may be returned, either as a Seurat object or as an mtx file.

import sascrip
from sascrip import sascrip_functions

sascrip_functions.sctransform_normalize(
seurat_object, # path to saved subset Seurat object
“sample”, # user-given name of sample
output_directory # path where normalised data are saved
)

The modularity of SASCRiP allows all functions to be implemented independently, but the package also includes an all-in-one function, sascrip_preprocess, that implements the entire workflow from start to finish. In order to ensure that appropriate quality control measures are applied at each stage of the workflow, quality control metrics are provided at multiple checkpoints in the SASCRiP workflow when the sascrip_preprocess function is executed. These metrics can be printed to the screen while running the workflow and/or written to file for later use.

Operation

All SASCRiP functions are designed and implemented in Python 3 (v3.7 or higher), and the package is available on PyPi. As SASCRiP also incorporates a number of R packages (including Seurat), R (v3.6 or higher) is also required. Any additional R packages required can then be installed through SASCRiP, if needed. SASCRiP was developed primarily for use on Unix-based operating systems, however the source code can be adapted for use on Windows platforms (as described in the SASCRiP documentation). SASCRiP can be used to process scRNA-seq datasets consisting of up to 10 000 cells on a standard laptop with 8 Gb RAM. At least 16 Gb of RAM is recommended for larger datasets.

Use cases

In order to demonstrate the utility of the SASCRiP workflow, we reanalyzed three scRNA-seq datasets derived from peripheral blood mononuclear cells (PBMCs) obtained from healthy individuals (Zheng et al., 2017). Cell counts range from 2000 to 10 000 cells per donor. SASCRiP’s sascrip_preprocess function was used to perform all preprocessing steps required to generate a high-quality gene expression dataset that could then be readily input into Seurat for cell clustering and annotation. All SASCRiP functions were implemented using the default parameters.

As these datasets were obtained using 10X Genomics Chromium v1 technology, SASCRiP’s edit_10xv1_fastq function was first used to generate the three FASTQ files required by the kallisto_bustools_count function. Gene-level count matrices were then obtained using the kallisto_bustools_count function, and cell quality control metrics were calculated using the run_cqc function (Moonsamy, 2021b). Visualization of these metrics allowed low quality cells (Figure 2) and cell doublets (Figure 3) to be easily identified and removed. These metrics, together with the UMI count data, were then input as Seurat objects, into the sctransform_normalize function. This allowed for normalization and variance stabilization of these UMI counts (Figure 4), so that gene expression could be quantified. In addition, the 2000 most highly variable genes, based on standardized variance, were identified (Figure 5). The normalization data was then output as a Seurat object, in preparation for clustering.

bbaaac6a-e187-4494-8028-566a9cf6415c_figure2.gif

Figure 2. Sequencing Analysis of Single-Cell RNA in Python allows for calculation and visualization of cell quality control metrics, including the total number of genes detected per cell, and the total number of sequencing reads that align to mitochondrial genes per cell.

Cells that fall (A) below the given threshold for gene count (in this case, 200) and (B) above the maximum threshold for mitochondrial percentage (in this case, 10%) are shown in red.

bbaaac6a-e187-4494-8028-566a9cf6415c_figure3.gif

Figure 3. Sequencing Analysis of Single-Cell RNA in Python allows for the detection and visualization of outliers using median absolute deviation (MAD).

Potential cell doublets (black dots) are identified relative to the median total gene count per cell. Cells with a total gene count more than six MADs from the median are flagged for removal.

bbaaac6a-e187-4494-8028-566a9cf6415c_figure4.gif

Figure 4. Sequencing Analysis of Single-Cell RNA in Python allows for normalization and variance stabilization of UMI-based count data.

The total number of UMIs relative to the total number of unique genes detected is visualized before (blue) and after (pink) normalization and variance stabilization using regularized negative binomial regression, as implemented in sctransform (Hafemeister & Satija, 2019). Each point represents a single cell.

bbaaac6a-e187-4494-8028-566a9cf6415c_figure5.gif

Figure 5. Sequencing Analysis of Single-Cell RNA in Python identifies the top 2000 most highly variable genes.

The top 2000 most highly variable genes (purple) are identified based on standardized variance. Each dot represents a gene with its standardized variance shown on the y-axis and the average gene expression (calculated using the normalized gene counts) on the x-axis. The 10 genes with the greatest variance across all cells are labelled.

Finally, to demonstrate the quality of the preprocessed data obtained from SASCRiP, Seurat v3.2.0 was used to cluster the two datasets identified as being of high quality (donors B and C) following processing with SASCRiP. The RunPCA function was used to perform dimension reduction (Stuart et al., 2019), and graph-based clustering was performed using the FindNeighbours function (Xu & Su, 2015). In both datasets, the preprocessed data obtained from SASCRiP produced distinct clusters (Figure 6A), corresponding to major PBMC cell types (Figure 6B), suggesting that SASCRiP produces high quality, preprocessed data, suitable for downstream scRNA-seq analyses and applications.

bbaaac6a-e187-4494-8028-566a9cf6415c_figure6.gif

Figure 6. Data preprocessed using Sequencing Analysis of Single-Cell RNA in Python can be used to identify clusters corresponding to distinct cell types in peripheral blood mononuclear cells.

Uniform manifold approximation and projection of ~8000 cells derived from a single donor is shown, coloured according to (A) the clusters identified in the dataset, and (B) the expression of known cell type markers in each cluster. CD3D indicates T cell clusters, MS4A1 B cell clusters, CD68 monocyte clusters, and FCER1A dendritic cell clusters.

Conclusions

SASCRiP is a Python package that provides a simple, streamlined workflow for preprocessing UMI count-based scRNA-seq data through a series of parameterized functions. To ensure flexibility, these functions allow users to either adopt a set of clearly defined default parameters or to modify any or all of these parameters as they see fit. Also, due to its modular design, SASCRiP’s four major functions (kallisto_bustools_count, seurat_matrix, cell_cqc, sctransform_normalize) can be executed either independently to produce custom visualizations and/or intermediary files that can be used as input for other scRNA-seq tools, or all-in-one to produce high-quality, normalized gene expression datasets that are ready to use for downstream analyses. Collectively, these functions seamlessly integrate the functionality of the widely used R packages, Seurat and sctransform, into a custom Python workflow built around kb-python’s implementation of the kallisto | bustools workflow. In this way, pseudoalignment, quantification of gene expression, removal of low quality cells and cell doublets, and normalization and variance stabilization can be performed within a single scRNA-seq analysis pipeline.

Data availability

Extended data

Zenodo: SASCRiP Supporting Data https://doi.org/10.5281/zenodo.5899870 (Moonsamy, 2021b)

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

Software available from: https://pypi.org/project/SASCRiP

Source code available from: https://github.com/Darisia/SASCRiP

Archived source code at time of publication: https://doi.org/10.5281/zenodo.5554770 (Moonsamy, 2021a)

License: GNU GPLv3

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 15 Feb 2022
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Moonsamy D and Gentle N. SASCRiP: A Python workflow for preprocessing UMI count-based scRNA-seq data [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:190 (https://doi.org/10.12688/f1000research.75243.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 15 Feb 2022
Views
22
Cite
Reviewer Report 14 Mar 2022
Fangming Xie, University of California Los Angeles, California, CA, USA 
Approved with Reservations
VIEWS 22
Moonsamy and Gentle presented Sequencing Analysis of Single-Cell RNA in Python (SASCRiP), a modular analysis pipeline that processes UMI-based scRNA-seq data from raw sequencing reads to transcript counts, normalization, and visualizations. The software integrated existing tools, including kallisto-bustools (kb python ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Xie F. Reviewer Report For: SASCRiP: A Python workflow for preprocessing UMI count-based scRNA-seq data [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:190 (https://doi.org/10.5256/f1000research.79090.r125623)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
24
Cite
Reviewer Report 22 Feb 2022
Fabiola Curion, Helmholtz Munich, Technische Universität München, Munich, Germany 
Luke Zappia, Helmholtz Munich, Helmholtz Zentrum München, Munich, Germany 
Approved with Reservations
VIEWS 24
The authors have presented a workflow/package to deal with common steps of the pre-processing of single-cell RNA-seq data from raw reads until the normalization step, producing data that can be readily used by downstream tools. Whilst we appreciate it is ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Curion F and Zappia L. Reviewer Report For: SASCRiP: A Python workflow for preprocessing UMI count-based scRNA-seq data [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:190 (https://doi.org/10.5256/f1000research.79090.r123750)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 15 Feb 2022
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.