Keywords
bioinformatics, data analysis, high-throughput, transcriptomics, RNA-seq, computational workflow, reproducibility, usability
RNA sequencing (RNA-seq) is a widely used technique in many scientific studies. Given the plethora of models and software packages that have been developed for processing and analyzing RNA-seq datasets, choosing the most appropriate ones is a time-consuming process that requires an in-depth understanding of the data, as well as of the principles and parameters of each tool. In addition, packages designed for individual tasks are developed in different programming languages and have dependencies of various degrees of complexity, which renders their installation and execution challenging for users with limited computational expertise. Workflow languages and execution engines with support for virtualization and encapsulation options such as containers and Conda environments facilitate these tasks considerably. The resulting computational workflows can then be reliably shared with the scientific community, enhancing reusability and the reproducibility of results as individual analysis steps are becoming more transparent and portable.
Here we present ZARP, a general purpose RNA-seq analysis workflow that builds on state-of-the-art software in the field to facilitate the analysis of RNA-seq datasets. ZARP is developed in the Snakemake workflow language and can run locally or in a cluster environment, generating extensive reports not only of the data but also of the options utilized. It is built using modern technologies with the ultimate goal of reducing the hands-on time for bioinformaticians and non-expert users and of serving as a template for future workflow development. To this end, we also provide ZARP-cli, a dedicated command-line interface that can make running ZARP on an RNA-seq library of interest as easy as executing a single two-word command.
ZARP is a powerful RNA-seq analysis workflow that is easy to use even for beginners, built using best software development practices, available under a permissive Open Source license and open to contributions by the scientific community.
Recent years have seen an exponential growth in new bioinformatics tools,1 a large proportion of which are dedicated to High Throughput Sequencing (HTS) data analysis. For example, there are tools to quantify the expression level of transcripts and genes from RNA-seq data,2,3 identify RNA-binding protein (RBP) binding sites from crosslinking and immunoprecipitation (CLIP) data,4 identify alternative polyadenylation sites and/or quantify their usage,5,6 or analyze single cell RNA-seq data.7 Such tools are written in different programming languages (e.g., Python, R, C, Rust) and have distinct library requirements and dependencies. In most cases, the tools expect the input to be in one of the widely used genomics file formats (e.g., FASTQ,8 BAM9), but custom formats are also frequently used. Combining such tools into an analysis protocol is a time-consuming and error-prone process. But as the analysis of HTS data (and other Big Data in the life sciences) has become such a common problem,10,11 and as the datasets continue to increase in size and complexity, there is an urgent need for expertly curated, well-tested, maintained and easy-to-use reusable computational workflows and registries to share and access them.12–14
A number of modern, feature-rich workflow specification languages and corresponding management systems,15,16 like Snakemake17 and Nextflow,18 are now gaining popularity in the life sciences, as they make analyses easier to develop, test, share, deploy and execute. To facilitate the installation and execution of workflows across different hardware architectures and host operating systems, modern workflow management systems make use of virtualization and encapsulation techniques relying on containers (e.g., Docker19 and Singularity/Apptainer20) and/or package managers, Conda in particular. An added advantage of using workflow specification languages is that metadata and general provenance information is stored along with the main outcomes. Such information may be invaluable for a more fine-grained attribution of researcher contributions, for the reproducibility of analysis results and insights, and for the cost optimization of computing resources. Ongoing work at formalizing these metadata artifacts and adding support for such standardized schemas in workflow management systems21 is very much in line with current efforts to FAIRify research data and software.22,23 All of these advantages ultimately lead to more reusable code and reproducible results, while fostering cooperativity and collaboration, both on scientific projects and in the development of Open Source Software.
A number of workflows for the analysis of bulk RNA-seq data have been developed by the community.24–30 However, out of these, some are too complex for many users to deploy, run and/or interpret, while others do not easily allow customizations and/or the analysis of large numbers of samples. Here we present ZARP, a flexible, easy-to-use workflow for bulk RNA-seq data processing that enables rapid initial insights into the data. The inclusion of the most widely used and best performing tools for the various processing steps minimizes the time users spend on making tool choices. The implementation in a widely used workflow language ensures reproducibility and reliable execution of each analysis and facilitates (meta) data management and reporting. A dedicated command-line interface tool (ZARP-cli) allows for even easier execution of the workflow and management of ZARP runs. ZARP is useful to experimental biologists who want to rapidly assess the results of their high-throughput sequencing experiments and to bioinformaticians who can not only use the preliminary results in their downstream analyses, but also adapt the analysis workflow according to their needs. Note that a preprint of this article is deposited at bioRxiv.31
ZARP (Zavolan-Lab’s Automated RNA-seq Pipeline) is a general purpose RNA-seq analysis workflow that allows users to carry out the most common steps in the analysis of Illumina RNA-seq libraries with minimum effort. The workflow is developed in Snakemake,17 a widely used workflow language,15 and follows best practice recommendations for workflow development.32 Importantly, ZARP relies on publicly available bioinformatics tools that each can be considered state-of-the-art in their specific tool class,33 and it handles bulk, stranded, single- or paired-ended RNA-seq data.
ZARP requires two distinct inputs: (i) A “sample table”, a tab-delimited file with sample-specific information, such as paths to the sequencing data (FASTQ format), reference genome sequence (FASTA format), transcriptome annotation (GTF format) and additional experiment protocol- and library-preparation specifications like adapter sequences or fragment size; (ii) A configuration file in YAML format containing workflow-related parameters, such as results and log directory paths, as well as user-related information, like a contact email address. Advanced users can take advantage of ZARP’s flexible design to provide tool-specific configuration parameters via an optional third input file, which allows for adjusting the behavior of the workflow to their specific needs. Detailed information on the input files can be found in the documentation available in the ZARP repository on GitHub.
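To make these two inputs concrete, the sketch below creates a minimal sample table and configuration file. All column names and configuration keys shown are illustrative assumptions rather than ZARP’s exact schema; the documentation in the ZARP repository defines the authoritative format.

```bash
# (i) Tab-delimited sample table, one row per sample; columns are
# illustrative (sample name, FASTQ path, genome, annotation, 3' adapter)
printf '%s\t%s\t%s\t%s\t%s\n' \
    sample fq1 genome gtf adapter_3p \
    > samples.tsv
printf '%s\t%s\t%s\t%s\t%s\n' \
    mysample data/mysample.fastq.gz genome.fa annotation.gtf AGATCGGAAGAGC \
    >> samples.tsv

# (ii) Workflow configuration in YAML; keys are illustrative
cat > config.yaml <<'EOF'
samples: samples.tsv          # path to the sample table
output_dir: results           # where results are written
log_dir: logs                 # where per-rule logs are written
user_email: user@example.org  # contact information for reports
EOF
```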
A general schema of the ZARP workflow in its current version is presented in Figure 1. For a more comprehensive representation that includes all steps, see Figure 2. Table 1 below lists the main functionalities of ZARP and the tools that provide these functionalities, roughly in the order in which they are executed:
Given the input files specified in a sample table, ZARP preprocesses and maps RNA-seq reads, quantifies RNA expression levels, assesses the quality, and returns various outputs. Color legend: inputs - yellow, outputs - blue, analysis tools - gray, tools for quality control - turquoise.
Graph-based representation of the ZARP workflow, including all of its steps (“rules”), as produced by running Snakemake with the --rulegraph option. Steps for both the single- and the paired-ended workflows are shown.
See the main text for more information on the use cases for each tool and on why we chose to include these tools in ZARP and ZARP-cli. Versions at the time of writing are indicated.
Tool | Description | Reference |
---|---|---|
FastQC 0.12.1 | Generates quality control metrics from raw FASTQ data. | GitHub |
Cutadapt 4.6 | Trims 5’ and 3’ adapter sequences, as well as poly(A) and/or poly(T) tails/stretches. | 35 |
STAR 2.7.11b | Aligns reads to the reference genome. | 36 |
tin-score-calculation 0.6.3 | Calculates a transcript integrity number (TIN) for each transcript, a measure reflecting the degree of RNA degradation, based on aligned reads. | GitHub |
ALFA 1.1.1 | Provides functional annotation information for the sample based on read alignments and gene/transcript annotations. | 41 |
kallisto 0.48.0 | Estimates gene/transcript expression levels. | 43 |
Salmon 1.10.2 | Estimates gene/transcript expression levels. | 44 |
zpca 0.8.3.post1 | Performs principal component analyses of gene/transcript expression level estimates across samples included in a given workflow run. | GitHub |
MultiQC 1.10.1 | Aggregates tool results and generates interactive reports. | 46 |
SRA Toolkit 3.0.10 (ZARP-cli) | Fetches sequencing libraries from the Sequence Read Archive47 and converts them to FASTQ. | GitHub |
HTSinfer 0.11.0 (ZARP-cli) | Infers sample metadata from RNA-seq libraries. | GitHub |
genomepy 0.16.1 (ZARP-cli) | Fetches genomes and gene model annotations. | 49 |
Per-sample statistics obtained by applying FastQC directly to the input files (FASTQ format) provide a quick assessment of the overall sample quality. Statistics include GC content, overrepresented sequences, and adapter content. An excessive bias in GC content may affect downstream analyses and may have to be corrected for.34 Overrepresented sequences may reflect contamination or PCR duplication artifacts, which, if excessive, could lead to sparse coverage and skewed expression estimates of the transcripts of interest. Information about adapter content can help identify issues that may occur during sample preparation. For more information on the metrics that FastQC reports and how they can be interpreted, please refer to the FastQC documentation. Trimming of 5’ and/or 3’ adapters as well as poly(A/T) stretches with Cutadapt35 helps determine the proportion of informative sequences in the library and ensures a more reliable alignment of reads to a set of reference sequences to determine their biological origin. The adapters and poly(A/T) stretches to be removed need to be provided by the user as part of the sample table.
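As a minimal sketch of this trimming step, the Cutadapt call below removes a 3’ adapter and poly(A) stretches from a single-ended library; the adapter sequence, core count and length cutoff are example values, whereas ZARP takes the actual per-sample parameters from the sample table.

```bash
# -j: cores; -a: a 3' adapter and a poly(A) stretch to remove;
# -m: discard reads shorter than 10 nt after trimming
cutadapt -j 4 -a AGATCGGAAGAGC -a "A{100}" -m 10 \
    -o trimmed.fastq.gz input.fastq.gz
```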
STAR36 infers the genomic origin of the reads by aligning them against an appropriate reference genome and a database of splice junctions that STAR prepares based on the corresponding gene model annotations. The reference genome and gene annotations need to be provided by the user as part of the sample table. STAR has been chosen for its very competitive performance in terms of accuracy, compared to other aligners,37 as well as for its extensive feature set and documentation. For enabling fast random access of the resulting Binary Alignment Map (BAM) alignment files, e.g., to explore them in genome browsers, ZARP then uses SAMtools9 to sort alignments by their genomic position and index them.
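A minimal sketch of this alignment stage, assuming example file names, follows; ZARP derives the actual parameters from the sample table.

```bash
# Build a STAR index with splice junctions from the annotation
# (--sjdbOverhang is typically read length - 1; 100 is an example value)
STAR --runMode genomeGenerate --genomeDir star_index \
    --genomeFastaFiles genome.fa --sjdbGTFfile annotation.gtf \
    --sjdbOverhang 100 --runThreadN 8

# Align reads, then coordinate-sort and index the BAM file with SAMtools
STAR --genomeDir star_index --readFilesIn trimmed.fastq.gz \
    --readFilesCommand zcat --outSAMtype BAM Unsorted --runThreadN 8
samtools sort -@ 4 -o sample.sorted.bam Aligned.out.bam
samtools index sample.sorted.bam
```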
The sorted, indexed BAM files are further converted into the BigWig format with bedGraphToBigWig from the UCSC tools suite.38 BigWig files allow for easy library normalization and are considerably smaller than the corresponding BAM files, thus making them convenient for visualizing and comparing coverages across multiple samples. The aligned reads are also used to calculate transcript integrity numbers (TIN),39 a per-transcript metric that reflects the degree of RNA degradation in the sample. This is done with tin-score-calculation, which is based on a script originally included in the RSeQC package,40 but modified by us to enable multiprocessing for increased performance. To provide a high-level topographical and functional annotation of the gene segments (e.g., coding regions, untranslated regions, intergenic regions) and biotypes (e.g., protein-coding genes, rRNA) that are represented by the reads in a given sample, ZARP includes ALFA.41
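One common route from a sorted BAM file to a BigWig coverage track is sketched below; ZARP’s exact intermediate steps and normalization settings may differ. The chrom.sizes file is a two-column listing of chromosome names and lengths.

```bash
# Per-base coverage as bedGraph, sorted by position, then converted to BigWig
bedtools genomecov -ibam sample.sorted.bam -bg > sample.bg
sort -k1,1 -k2,2n sample.bg > sample.sorted.bg
bedGraphToBigWig sample.sorted.bg chrom.sizes sample.bw
```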
Estimating transcript abundances from RNA-seq data not only allows for subsequent differential analysis on the transcript level but also improves gene level inferences42 and is preferable to counting read alignments.2 Out of the various tools developed for that purpose,2,3 the alignment-free or “pseudoalignment”-based tools kallisto43 and Salmon44 have emerged as the state of the art for the inference of transcript abundance, due to their similarly compelling performance, ease of use, fast runtimes and memory efficiency.3 As both tools are widely used and have fairly low resource requirements, we have decided to include both for the quantification of transcript and gene level expression. The main output metrics provided by either tool are estimates of normalized transcript/gene expression, in Transcripts Per Million (TPM),45 as well as raw read counts. For convenience, ZARP aggregates transcript and gene expression estimates across all samples with the aid of Salmon and merge_kallisto, generating summary tables that can be plugged into a variety of available tools, e.g., for differential gene/transcript expression, differential transcript usage or gene set enrichment analyses.
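The following sketch illustrates typical quantification calls for both tools, with example file names; single-ended kallisto runs additionally require fragment length statistics, for which placeholder values are used here.

```bash
# Salmon: index the transcriptome, then quantify
# (-l A lets Salmon detect the library type automatically)
salmon index -t transcripts.fa -i salmon_index
salmon quant -i salmon_index -l A -r trimmed.fastq.gz -o salmon_out --threads 8

# kallisto: index, then quantify; --single requires the fragment length
# mean (-l) and standard deviation (-s), set to placeholder values here
kallisto index -i kallisto.idx transcripts.fa
kallisto quant -i kallisto.idx -o kallisto_out --single -l 200 -s 20 trimmed.fastq.gz
```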
Within ZARP, TPM estimates are used for performing principal component analyses (PCA) with the help of zpca, a tool we created for use in ZARP but packaged separately so that it can easily be used on its own or as part of other workflows. PCAs on gene/transcript expression levels can help users assess the relationships between samples, detect batch effects and inform the way in which samples should be compared in downstream analyses.
Quality assessments, read alignments, genome coverages, and transcript and gene abundances can be used in downstream applications in myriad ways, particularly when different cohorts are compared, e.g., in differential gene expression analyses. However, given that such analyses require more complex inputs (e.g., experiment design tables) and are difficult to configure generically for a wide range of experiments, we deliberately decided not to include them in ZARP.
ZARP produces two user-friendly, browser-based, interactive reports: one with a summary of sample-related information generated by MultiQC,46 the other with estimates of utilized computational resources generated by Snakemake itself. Note that for the tin-score-calculation, ALFA and zpca tools, we have created plugins that enable the interactive exploration of their respective results through MultiQC.
ZARP comes with ZARP-cli, a dedicated command line interface to further simplify its usage by giving it the look and feel of a single executable.47 See Figure 3 for a general schematic outline of its functionalities and Table 1 for a list of the core tools providing them.
ZARP-cli consolidates different input references and compiles a sample table and a configuration file, i.e., the inputs to ZARP. Along the way, where necessary, it fetches remote libraries, attempts to infer sample metadata, including the sample source organism, and downloads the corresponding genome resources.
At the core of ZARP-cli lies its cascading configuration management: Upon first use, users set up default parameters that include sample metadata to be used if no better source is provided (or can be inferred). Configuration parameters can be overridden, on a per-run basis, with command line arguments, which in turn can be overridden, on a per-sample basis, by entries in a sample table. This design ensures that the most accurate information can be easily passed and consumed.
Once configured, ZARP-cli compiles the set of samples to analyze from the positional command line arguments specified by the user. These “sample references” can point to sequencing libraries available either locally or on the Sequence Read Archive,48 or they can point to partially or fully complete ZARP sample tables. Figure 4 illustrates all of the different ways that sample references can be passed to ZARP-cli.
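The first call below is confirmed by the main text; the remaining invocations are hypothetical illustrations of mixing local files, SRA identifiers and sample tables, and the exact reference syntax is defined in the ZARP-cli documentation.

```bash
# A single SRA run identifier (see main text)
zarp SRR23590181

# Hypothetical examples: a local library, mixed references, a sample table
zarp path/to/library.fastq.gz
zarp SRR23590181 path/to/library.fastq.gz
zarp path/to/sample_table.tsv
```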
Once the sample set has been assembled, ZARP-cli iteratively completes a sample table by filling in any missing information. To that end, ZARP-cli first makes use of the SRA Toolkit to fetch any remote sequencing libraries from the Sequence Read Archive48 and convert them to FASTQ, yielding the local paths for both single- and paired-ended libraries.
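Performed manually, this fetch-and-convert step corresponds roughly to the SRA Toolkit calls below; ZARP-cli runs the equivalent logic through its bundled Snakemake workflow.

```bash
# Download the run, then convert it to FASTQ
# (paired-ended runs yield one file per mate)
prefetch SRR23590181
fasterq-dump SRR23590181 --split-files --outdir fastq/
```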
Next, ZARP-cli triggers HTSinfer, a tool we developed that infers, from the sequencing data itself, RNA-seq metadata that is important for downstream analyses. While still in an experimental/pre-release stage, HTSinfer is often able to extract, infer or assert the following: (i) whether two FASTQ files contain “mate pairs” of the same paired-ended sequencing library, (ii) the source organism from which the library is derived (out of currently more than 400 supported organisms), (iii) the relative orientation of reads with respect to the RNA transcript from which they originate and, for paired-ended libraries, to one another, (iv) the frequency of commonly encountered 3’ adapters, and (v) read length statistics. Importantly, ZARP-cli uses the derived information only where such metadata is not provided by the user as part of the configuration or sample table.
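A sketch of a standalone HTSinfer call is shown below, assuming that the tool accepts one or two FASTQ files and emits its inferences as JSON; consult the HTSinfer documentation for the exact interface.

```bash
# Infer metadata (source organism, read orientation, adapters, ...) from the
# library itself; file paths and output redirection are illustrative
htsinfer library_R1.fastq.gz library_R2.fastq.gz > metadata.json
```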
Of note, both the SRA download and metadata inference steps are packaged as independent Snakemake workflows that are currently available along with the main ZARP workflow in the same GitHub repository.
After inferring sample metadata, ZARP-cli uses genomepy49 to fetch the genomes and gene model annotations from Ensembl.50 The genome version to be fetched for a given organism is pre-configured in a data file shipped with ZARP-cli; it can easily be updated or amended manually with a text editor. To ensure consistent use of the same gene annotation version in all analyses of RNA-seq libraries within a given group, time frame or project, a default gene model annotation version can be pre-configured during ZARP-cli initialization. If no default version is set, genome resources (both annotations and genomes) are always fetched anew and never reused across runs. Gene model annotation versions can also be set via a dedicated command line argument, overriding any default version set in the user configuration.
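Run standalone, fetching a genome and its annotation with genomepy looks roughly like the sketch below; the assembly name matches the use case described later, while the target directory is an example.

```bash
# Fetch the GRCm39 assembly and matching gene annotation from Ensembl
genomepy install GRCm39 --provider Ensembl --annotation --genomes-dir genomes/
```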
Finally, ZARP-cli assembles a sample table and a configuration file and triggers the ZARP run. In extreme cases, a ZARP-cli command can be as short as two words: zarp (the name of the program) followed by a single sample reference. However, making it easier to start ZARP runs is not the only benefit of using the command line interface: ZARP-cli also enforces a file and directory structure that makes it easy to find information from older ZARP runs and prevents accidental workflow re-runs for samples that have already been analyzed. Likewise, genome resources and remote libraries can be configured to be fetched only once. Finally, group- or project-specific ZARP configurations can easily be enforced, e.g., to make sure that sequencing libraries are consistently analyzed with the same set of genome resources.
ZARP development follows best practices for scientific analysis workflows.32 First of all, we chose to develop the workflow in Snakemake,17 a widely adopted workflow management system16 that allows for robust and scalable execution of computational analyses. To further enhance the reusability and portability of the workflow, i.e., its ability to be deployed in various scenarios such as “on premise” High Performance/Throughput Computing (HPC/HTC) clusters or the cloud,51 we have defined both a rule-specific Conda environment recipe file and a container image for each step (referred to as a “rule” in Snakemake). The corresponding tool binaries and images are primarily hosted by the Bioconda52 and BioContainers53 registries, which offer broad accessibility and long-term preservation. For container images, Snakemake converts the indicated Docker images to the Singularity Image Format20 on the fly, where needed, enabling seamless execution of the workflow in environments with limited privileges (e.g., HPC/HTC clusters). Users can choose between Conda- and container-based54 execution by selecting or preparing an appropriate profile when/before running a workflow, as sketched below. In the current version, we include profiles only for the Slurm job scheduler. However, we will gladly accept new profiles provided by the community, and we therefore encourage users to contribute their own profiles back to the original ZARP repository by raising pull requests. To reduce the storage footprint of data generated by ZARP, intermediate files can optionally be cleaned up. At the same time, as Snakemake workflow runs, ZARP runs generate extensive log files and workflow run metadata. This information is relevant for accounting, telemetry and provenance purposes, and efforts are ongoing to standardize such information for improved reproducibility of workflow runs.21
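The two invocations below sketch what Conda- and container-based executions might look like; the file and profile paths are illustrative, and the current ZARP version requires Snakemake < 8.

```bash
# Conda: every rule runs in its own rule-specific Conda environment
snakemake --configfile config.yaml --use-conda --cores 8

# Containers, via a Slurm profile (Docker images are converted to the
# Singularity Image Format on the fly where needed)
snakemake --configfile config.yaml --use-singularity --profile profiles/slurm
```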
The software described here follows the 4OSS Recommendations55: ZARP and ZARP-cli are hosted in their own GitHub repositories, are published under the permissive Apache License, Version 2.0 free and open source software license, and follow Semantic Versioning, with each release accompanied by extensive, up-to-date documentation. We have deposited ZARP at WorkflowHub, a registry specialized in hosting computational analysis workflows.13 It is also available in the “Standardized Usage” section of the Snakemake Workflow Catalog. ZARP-cli and any custom tools/scripts that may be useful beyond the scope of ZARP have largely been developed according to best practice recommendations56 and deposited at Bioconda,52 which ensures that they are also available on BioContainers.53 We embrace community contributions to all of our software and try our best to ensure that all contributions are duly acknowledged in our communication. Finally, to facilitate attribution and guard against software regression and degradation, we are committed to software development best practices, such as the use of version control systems and automated continuous integration, testing and deployment workflows. In particular, for both ZARP-cli and ZARP, we provide and routinely execute extensive sets of unit and integration tests.
ZARP and ZARP-cli require a Linux operating system (tested with CentOS 8; Debian 11 and 12; Ubuntu 20.04 and 22.04). To install dependencies, we strongly recommend installing a flavor of Conda, e.g., Miniforge. See the installation instructions for ZARP and ZARP-cli for further details. Of note, the current ZARP version is not compatible with Snakemake 8+, because the latest Snakemake major version introduced a number of breaking changes that we do not yet support.
For running individual workflow steps with Conda, we further recommend installing Mamba, in line with Snakemake recommendations. Alternatively, for running workflow steps inside containers, SingularityCE or Apptainer is required. A hypothetical installation sketch follows.
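In the sketch below, the environment file name and repository layout are assumptions, so the official installation instructions remain authoritative.

```bash
# Clone the workflow and create a Conda environment with its dependencies
git clone https://github.com/zavolanlab/zarp.git
cd zarp
conda env create -f install/environment.yml   # assumed path to the env recipe
conda activate zarp
```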
For analyzing large numbers of samples, of any size and from any source, we strongly recommend running ZARP on an HPC/HTC cluster, as horizontal scaling and extensive multithreading will significantly reduce the wall clock time. A standard laptop or desktop machine (8 GB RAM, 4-core CPU) is sufficient for small to medium runs (<100 samples) if used only as a Snakemake “head node”, i.e., the machine that coordinates the execution of workflow runs but leaves demanding computations to an HPC/HTC cluster or the cloud (untested). If runs are executed entirely on a single machine, we recommend at least 32 GB of RAM (processing samples derived from certain organisms may not be possible), and preferably 128+ GB. Even then, throughput will be limited due to the reduced potential for horizontal scaling and multithreading. See section “Use Cases” below for further details on performance.
ZARP is very well suited to analyzing large RNA-seq experiments or even running meta-analyses across multiple experiments. To demonstrate how ZARP can be used to gain meaningful insights into typical RNA-seq experiments, we tested it on an RNA-seq dataset generated by Ham et al. (GEO accession number GSE139213) in their analysis of the role of mTORC1 signaling in the age-related loss of muscle mass and function in mice.57 The dataset consists of 20 single-ended RNA-seq libraries (read length: 101 nt; gzipped FASTQ file sizes ranging from 0.8 to 3.2 GB; library sizes ranging from 18.5 to 75.3 million reads), corresponding to four cohorts of 5 biological replicates each of muscle tissue from 3-month-old mice: (i) wild-type, (ii) rapamycin-treated, (iii) tuberous sclerosis complex 1 (TSC1) knockout and (iv) rapamycin-treated TSC1 knockout. The reads were mapped against Ensembl’s50 GRCm39 genome primary assembly and corresponding gene annotations (release 111) for standard mouse chromosomes. Other parameters for populating the ZARP sample table were mined from the metadata provided in the GEO accession entry for the sample set. Sample tables, instructions on how to start the ZARP run either via a Snakemake call or via the ZARP-cli command line interface, and selected results for the test run are publicly available.58
In Figure 5, we present a subset of the outputs that ZARP generated for this dataset. One observation is the presence of slightly AU-rich reads (Figure 5A), although all samples pass the FastQC-defined threshold for GC bias, and the GC content does not vary strongly across samples. Transcript integrity across samples is also uniformly high (Figure 5B), with the highest density of expressed transcripts at TIN scores of 75 to 85. There is also no evidence of extensive sequencing of residual adapters (“adapter contamination”; Figure 5C), as less than 1% of reads in each sample were discarded for being too short after adapter trimming. Similarly, alignment statistics as reported by STAR are consistently high (Figure 5D), with unique mapping rates against the mouse genome of more than 72% for all samples (<4% unmapped), irrespective of sequencing depth. As expected, ALFA analysis of transcript categories shows that uniquely mapped reads overwhelmingly originated from protein-coding genes (over 86% for all samples) (Figure 5E). Taken together, these metrics indicate that all samples are of sufficiently high quality for downstream analysis.
Shown are (A) GC content, (B) transcript integrity numbers (TIN), (C) the adapter removal report, (D) alignment statistics, and (E) biotypes for the test run described in the main text. Figures have been slightly edited for visibility and accessibility: some labels have been enlarged, cohort names have been simplified, samples have been grouped by cohort, and cohorts are highlighted in different colors via the MultiQC “Highlight” functionality. Additionally, some biotypes have been omitted from (E) as they are not meaningfully represented. Note that in (B), transcripts that are not expressed are assigned a TIN score of 0. The complete raw HTML report can be found at Zenodo.58 Cohort names correspond to the mouse sample cohorts described above as follows: CTRL: wild-type; CTRL RAPA: rapamycin-treated; TSCmKO: tuberous sclerosis complex 1 (TSC1) knockout; TSCmKO RAPA: rapamycin-treated TSC1 knockout.
In addition to sample-specific metrics, ZARP also performs principal component analyses across samples (Figure 6). For the test run, the distribution of samples in the space of the first two principal components shows a clustering by condition, with a clear separation between knockout and wild type, as well as between the untreated and rapamycin-treated TSC1 knockout mice. This separation is more pronounced at the gene expression level (Figure 6A), but is also present at the transcript level (Figure 6B). This shows that the differences across conditions are more pronounced than any replicate biases (multiplicative noise, sequencing errors), which strongly increases the likelihood that any subsequent analyses (e.g., differential gene/transcript expression analysis) will provide targets of biological importance.
PCA at the (A) gene and (B) transcript level. PC1 and PC2 correspond to the first and second principal components, respectively. The variance explained by each is stated in parentheses in the corresponding axis labels. Expression levels used in this figure are those reported by kallisto, but ZARP also generates corresponding PCA plots for Salmon-based quantifications. The figure has been slightly edited for visibility and accessibility: some labels have been enlarged, cohort names have been simplified, and different colors for cohorts have been chosen via MultiQC’s “Highlight” functionality.
The total wall clock time to execute the entire test run for all 20 samples was just over one hour (1 h 12 min; see Figure 7) on our institution’s Slurm-managed HPC cluster, where we could make heavy use of ZARP’s parallelization capabilities. This translates to a total CPU time of 71.65 h, of which 8.32 h were not sample-specific, i.e., spent on jobs that had to be executed only once for all samples. The accumulated sample-specific CPU time varied between 1.6 h and 5 h per sample (mean: 3.2 h). While actual runtimes may differ considerably across compute environments, we project that most users would be able to run even large-scale analyses with dozens to hundreds of samples in less than a day on an HPC/HTC cluster or in the cloud, with very little hands-on time. Maximum memory usage for any step, across all samples, was 30.2 GB for rule map_genome_star, which is in accordance with STAR requirements.36 As this stays below 32 GB, ZARP is suitable for execution on state-of-the-art desktop and even laptop computers, albeit at considerably higher runtimes due to the need for serial execution of individual steps and limited use of multithreading. Also, steps requiring a lot of memory are typically those that index or otherwise process genomes and/or gene annotations, so that memory requirements are higher for samples sourced from organisms with larger genomes or higher-resolution annotations. None of the jobs took longer than ~35 min (wall clock time) for any of the samples (Figure 7). Among the most time-consuming steps is the creation of indices for STAR, Salmon, and kallisto (from 10 to 35 min), which, however, typically has to be performed only once per set of genome resources and can be reused across additional runs if the same working directory is used (ZARP-cli can enforce the consistent use of the same directory across ZARP runs). Among the sample-specific steps, the calculation of the transcript integrity number (TIN) was the most time-consuming; however, we had already considerably reduced its runtime by adding parallelization capabilities to the original script (see Methods for further details).
Left: Creation date of the output files of each rule, as generated by Snakemake’s reporting functionality. Right: Runtime (in seconds; wall clock time) of the different steps (“rules”) of the workflow run, depicted for each sample, as generated by Snakemake’s reporting functionality. Axis sizes were modified using the Vega Editor, a functionality available directly from the Snakemake report. Differences observed across samples are a function of sample size, but also reflect other sources of variation, such as the time individual jobs spent queuing on our Slurm-managed HPC cluster and the hardware specifications of the cluster nodes on which these jobs eventually ran.
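The statistics in Figure 7 come from Snakemake’s built-in reporting; after a completed run, such a report can be generated with a call along the lines of the sketch below (Snakemake < 8 flags).

```bash
# Render the interactive HTML report with per-rule runtimes and file dates
snakemake --configfile config.yaml --report report.html
```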
Triggering the test case run with an appropriate Snakemake call is easy enough, yet it still requires users to fetch sample libraries and genome resources manually, from the Sequence Read Archive48 and Ensembl,50 respectively. These steps can be automated by passing the sample table to ZARP-cli instead, slightly modified to replace sample names with SRA run identifiers and to leave the sample library and genome resource path columns empty. The modified sample table for use with ZARP-cli is also provided in our example data repository.58 Preparing a sample table is straightforward when the necessary metadata is readily available. However, in our own experience, due to a lack of widely adopted reporting standards that would reliably enable at least the most common types of downstream analyses for RNA-seq data, this is rarely the case when data is fetched from external sources. Especially for large sample sets from different cohorts, preparing a correct sample table will at the very least require painstaking literature mining. To address this issue, and to make ZARP even more accessible, we have included in ZARP-cli our RNA-seq “metadata sniffer” HTSinfer, which attempts to infer important (and widely underreported) metadata, such as read orientation and 3’ adapters, from the sample libraries themselves. Our supplementary data repository58 includes an example of how this functionality allows ZARP users to work, in a single run, with a sparse sample table in which parts or even all of the metadata are missing, across samples from multiple organisms and prepared according to different protocols. For quick checks, and in its simplest invocation, ZARP-cli can be called with a “two-word” command like zarp SRR23590181. Examples of how ZARP-cli can be called are summarized in Figure 4.
We note that HTSinfer is still experimental, as its attempts at inferring metadata will often fail, most commonly because the data is compatible with more than one outcome and there is insufficient resolution in our inference metrics to distinguish them (e.g., a sample is compatible with two or more source organisms). Nevertheless, even the partial information that HTSinfer does provide saves substantial time on the part of the user. Moreover, ZARP(-cli) will continuously be updated with the latest version of HTSinfer to ensure that the most accurate metadata inference is made available to its users.
In summary, our test cases demonstrate how ZARP can be used to quickly gain informative insights (Figures 5 & 6) from non-trivial real-world RNA-seq datasets in a reasonable timeframe (Figure 7) and how the dedicated ZARP command line interface can be used to trigger RNA-seq analyses with unprecedented ease (Figure 4).
ZARP is a general purpose, easy-to-use, reliable and efficient RNA-seq processing workflow that can be used by molecular biologists with minimal programming experience and bioinformatics experts alike. Scientists with access to a UNIX-based computer with at least 32 GB RAM (Linux and 128+ GB RAM preferred), or to a computing cluster or commercial cloud service, can run the workflow to get an initial view of their data on a relatively short time scale. For laboratories frequently generating RNA-seq data, running ZARP on incoming data can be easily incorporated into an automated data acquisition workflow through ZARP-cli’s Python API. Similarly, ZARP-cli is particularly well suited for analyzing large-scale systematic studies that rely on hundreds or even thousands of sequencing libraries. ZARP has been specifically fine-tuned to process bulk RNA-seq datasets, allowing users to run it out of the box with default parameters. At the same time, ZARP allows advanced users to customize workflow behavior, thereby making it a helpful and flexible tool for edge cases, where a more generic analysis with default settings is unsuitable. The outputs that ZARP provides can serve as entry points for other project-specific analyses, such as differential gene and transcript expression analyses. ZARP and ZARP-cli are publicly available under a permissive open source license (Apache License, Version 2.0), and contributions from the bioinformatics community are welcome. Please address all development-related inquiries as issues at the official GitHub repositories for ZARP and ZARP-cli.
ZARP’s ease of use, coupled with its versatility, the use of state-of-the-art tools and modern technologies, its high level of adherence to best software development, Open Source and Open Science practices, and its strict focus on a well defined problem make it stand out among its competition, in particular with respect to analyzing large numbers of samples in a reproducible manner.
• Gene Expression Omnibus59: The neuromuscular junction is a focal point of mTORC1 signaling in sarcopenia [TSCmKO data set] (Mus musculus; samples: GSM4134108 to GSM4134127). Accession number GSE139213; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139213.60
• Gene Expression Omnibus: SM_STG1_T0_2 (Zea mays; series: Nitrogen fixation and mucilage production on maize aerial roots is controlled by aerial root development and border cell functions). Accession number GSM5137669; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5137669.61
• Gene Expression Omnibus: Naive_Propy_20uM_Myeloid TAGCGCTC_ATAGCCTT (Danio rerio; series: RNAseq analysis of the herbicide propyzamide on a pre-clinical model of IBD on Danio rerio model). Accession number GSM5835373; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5835373.62
• Gene Expression Omnibus: C. elegans, with bacteria, 0h, rep1 (Caenorhabditis elegans; series: Long-term imaging reveals behavioral plasticity during C. elegans dauer exit). Accession number GSM6601040; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6601040.63
• Gene Expression Omnibus: 95Cb.del_rep2 (Drosophila melanogaster; series: Deletions of singular U1 snRNA gene significantly interfere with transcription and 3′-end mRNA formation rather than pre-mRNA splicing). Accession number GSM7051046; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7051046.64
• Gene Expression Omnibus: B-P-D6-1h-9_S36 (Mus musculus; series: Transmission of stimulus-induced epigentic changes through cell division are coupled to changes in transcription factor activity). Accession number GSM7058404; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7058404.65
Zenodo66: ZARP: Supplementary materials.58 DOI: 10.5281/zenodo.10797372.
This project contains the following extended data:
• zarp_use_cases.zip: Includes instructions, convenience scripts and input data required to reproduce the use cases described in this article.
• mouse_sarcopenia_example_outputs.zip: Contains the cross-sample outputs (gene/transcript expression tables, PCA, MultiQC report and Snakemake report) produced by a ZARP run for the described mouse sarcopenia RNA-seq experiment (not including indexes), as well as the sample-specific outputs for the smallest of the 20 samples.
• zarp_cli_example_outputs.zip: Contains a representative fraction of the outputs produced by the described ZARP-cli runs; in particular, it contains the artifacts generated by the SRA download workflow for a single C. elegans sample (SRR21711080), the outputs produced by the HTSinfer workflow for all 25 samples (20 from the mouse sarcopenia dataset and 5 from the metadata inference demonstration run), the genome resources created by genomepy for the C. elegans genome WBcel235, and the C. elegans-specific ZARP workflow results, including indexes.
Data and code/scripts are available under the terms of the CC-BY 4.0 and Apache License, Version 2.0 licenses, respectively. More detailed license information is provided in the record itself.
Source code available from: https://github.com/zavolanlab/zarp (ZARP) and https://github.com/zavolanlab/zarp-cli (ZARP-cli)
Archived source code at time of publication: 10.5281/zenodo.10797025 (ZARP v1.0.0-rc.1)67 and 10.5281/zenodo.10789819 (ZARP-cli v1.0.0-rc.1)68
License: Apache License, Version 2.0.
On both GitHub (source code) and Zenodo (code archives), the software is accessible for anyone to download without prior registration. For most purposes, we recommend fetching the software from GitHub, as it will host the most up-to-date code.
Calculations were performed at the sciCORE scientific computing center at the University of Basel. We would like to thank the sciCORE team for their time and efforts to aid us in this project. We would also like to express our deepest gratitude towards all members of the Zavolan Lab who contributed to this work with numerous pieces of advice and feedback, during the initial development, as well as by testing the workflow in later stages. Finally, we would like to thank the Bioconda community for helping us package and distribute some of the custom tools we developed.