Keywords
FastQC, high-throughput sequencing, quality control
This article has been updated to address reviewer responses. Changes to the text were made in all sections. Tables 3 and 4 were expanded to include time measurements for FastQC on long-read samples. No other table or figure
was altered from the first version of the manuscript. The accompanying code for Falco has also undergone updates for this review. Falco version 0.2.4 was used in this revised manuscript. The code that relates to the core computations has not been altered since version 0.1.0 (used in the previous version of the manuscript), and we have verified that the times reported in Table 3 remain the same in both versions.
The main changes to the manuscript are listed below:
(1) The abstract was changed to highlight the memory comparison between QC software tools, and no longer mentions that FastQC does not run on long-read samples.
(2) The "Introduction" section includes more detail about quality control applications.
(3) The "Implementation choices" subsection under "Methods" now highlights that Falco does not contain a user interface, and that Falco was designed for UNIX systems.
(4) The "Methods" section now contains a "system requirements" subsection that describes the memory and disk requirements to run Falco.
(5) The subsection "Falco scales for larger nanopore reads" has been removed and replaced with an additional paragraph in the section "Falco is faster than popular QC tools", where the memory usage of each tool on each tested sample is discussed.
(6) Instructions for reporting bugs and errors are now provided in the "Software availability" section.
(7) Formatting corrections were performed across the manuscript: "Falco" is now written in uppercase, superfluous line breaks were removed, reference formatting and the usage of the Oxford comma were standardized, links were separated from punctuation, and two references were added.
See the authors' detailed response to the review by Weihong Qi
See the authors' detailed response to the review by R Henrik Nilsson
High-throughput sequencing is routinely used to profile copy number variations in cancers1, assemble genomes of microbial organisms2,3, quantify gene expression4, identify cell populations from single-cell transcriptomes in a variety of tissues5, and track epigenetic changes in developing organisms and diseases6, among numerous other applications. New sequencing protocols are constantly being introduced7,8, and as the cost of sequencing per base decreases, sequencing data is growing in abundance, dataset size, and read length9.
Quality control (QC) is often the first step in high-throughput sequencing data analysis pipelines. The QC step measures a set of statistics in a file of sequenced reads to assess if its content matches the experiment expectations and if the data is suitable for downstream analysis. Common QC tests include counting relative frequency of nucleotides in each position of a set of reads to detect potential deviations from expected frequencies, summarizing the distribution of Phred10 quality scores to identify base positions with globally low quality (suggesting degeneration in the sequencing process), and measuring the frequency of sequencing adapters and contaminants that are not expected to be biological DNA from the sample.
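As a toy illustration of the first test mentioned above, the snippet below tallies nucleotide counts at each read position over three hypothetical reads; real QC tools perform this tally across every read in the input file.

```shell
# Per-position nucleotide tally (toy data; reads are hypothetical).
counts=$(printf 'ACGT\nAGGT\nACGA\n' |
  awk '{ for (i = 1; i <= length($0); i++)
           tally[i, substr($0, i, 1)]++ }
       END { print tally[1, "A"], tally[2, "C"], tally[4, "T"] }')
echo "$counts"   # position 1 'A' seen 3x, position 2 'C' 2x, position 4 'T' 2x
```

Dividing each count by the number of reads gives the relative frequencies that are compared against expected values.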
Data that passes specific QC tests then undergoes downstream analysis steps, which may include adapter trimming, filtering contaminants and low-quality reads, and mapping the resulting reads to a reference genome or transcriptome. With the exception of sequence assembly applications, read mapping should be the most computationally expensive step early in analysis pipelines. In comparison, the time and computation required for QC should be negligible. However, the efficiency of mapping algorithms has improved substantially over the past decade, while software for QC has received far less attention. As a consequence, the computation required for QC is appreciable, and can no longer be ignored when considering the total cost of sequencing.
The most commonly used tool for quality control of sequencing data is FastQC11, which, since its release, has incorporated a wide range of QC tests covering multiple use cases. Its analysis reports have become the standard for several QC tools, and automated analysis pipelines often rely on its result as a criterion to proceed with downstream steps or, alternatively, to filter, trim, or ultimately discard the data12,13. FastQC reports ten analysis modules that summarize the content of a sequencing file (Table 1). An input file may pass or fail the tests run in each module, and high-quality sequencing data from most protocols is expected to pass all tests.
In FastQC’s implementation, each module computation is executed sequentially after an input sequence is read. This design allows new modules to be incorporated easily, but it implies that the time required to process each read is the sum of the processing times for each module. If multiple modules compute similar measurements, such as nucleotide content or Phred quality scores, the same calculation will be performed multiple times, causing the total analysis run time to increase.
Several QC software tools have been introduced since FastQC, many focusing on speed improvements, more flexible module visualization, incorporation of paired-end reads, and filtering sequences that failed QC tests. Despite proposing different alternatives to calculate and present QC results, the modules available in these tools are largely similar to FastQC’s (Table 1).
At the same time, FastQC’s analysis results are already part of many standard initial analysis pipelines. If a new QC software tool is incorporated in these pipelines, it is desirable that its results, and its output formats, remain consistent with those of FastQC.
To address potential speed limitations in FastQC’s implementation while retaining its behavior, we developed FastQC Alternative Code (Falco)14, an emulation of the FastQC software tool. We show that Falco generates the same results as FastQC across a wide variety of datasets of different read lengths, sizes, file formats, and library preparation protocols at significantly shorter running times and using less memory. While the text outputs are comparable to FastQC, Falco also provides more flexible interaction with graphical plots in its HTML report using the same visualization standards set by FastQC.
Falco14 is an Open Source C++ implementation of the FastQC software tool built for UNIX-based operating systems. We designed it to faithfully emulate FastQC's calculations, results, and text reports. The goal of Falco is to minimize the effort required to replace the command-line behavior of FastQC in the context of larger automated analysis pipelines. We use the same set of command-line arguments, configuration file names, and input file formats as FastQC. We also produce the same plain text format output, and the same report structure, allowing users to take advantage of improved speed without adjusting to different program behaviors. Falco is intended to be used in a command-line environment. Unlike FastQC, Falco cannot be run through a graphical user interface.
There are major differences between the implementations of Falco and FastQC. While FastQC’s code emphasizes modularity, which allows new QC metrics to be added easily and uniformly, Falco’s design centralizes the function to read sequences from the input file and collects the minimum data necessary to subsequently create all modules after file processing. To ensure consistency with FastQC, we wrote each module’s source code based on FastQC’s implementation, adapting the portions that relate to sequence processing and maintaining the postprocessing functions that define how the collected data is used to generate summaries and reports.
Compilation of Falco requires a GNU GCC compiler version 5.0.0 (July 16, 2015; full support for the C++11 standard) or greater. Once compiled, Falco can be run on uncompressed files (FASTQ and SAM) without any additional dependencies. In order to process files in gzip compressed FASTQ and BAM formats, Falco must be compiled with the ZLib15 and HTSLib16 libraries, respectively. The full documentation on how to compile, install dependencies, and run the program is available in the README file in the Falco repository.
Like FastQC, Falco14 can be applied to any file of sequenced reads in the formats accepted by FastQC. The only required command-line argument is the path to the input file. Also like FastQC, a wide range of options can be provided if users only require a given subset of its analysis modules or outputs. The letters and symbols used for command-line arguments were chosen to maintain consistency with FastQC’s options. As mentioned above, this choice is to facilitate integration with larger pipelines that already employ FastQC and depend on its behaviors.
Falco can be run on a FASTQ format file named example.fq with the following simple command:
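```shell
# Minimal invocation (assumes the compiled falco binary is on the PATH):
falco example.fq
```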
This will generate three files:
1. fastqc_data.txt: The complete numerical values generated in each module’s individual analysis.
2. fastqc_report.html: A visual page display of the text report’s data and plots generated in modules.
3. summary.txt: A short summary indicating whether the input file passed or failed each module, and whether any warnings were raised.
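As a sketch of how a pipeline might consume summary.txt, the snippet below builds a hypothetical excerpt (the format mirrors FastQC's: status, module name, and input filename, separated by tabs) and counts the modules that did not pass:

```shell
# Hypothetical summary.txt excerpt in FastQC's tab-separated format.
printf '%s\t%s\t%s\n' \
  PASS 'Basic Statistics'          example.fq \
  WARN 'Per base sequence content' example.fq \
  FAIL 'Overrepresented sequences' example.fq > summary.txt

# A pipeline gate might count modules that raised a warning or failed:
failures=$(grep -c -E '^(WARN|FAIL)' summary.txt)
echo "$failures"
```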
Default configuration files are contained in a Configuration directory that is included with the program, but Falco also allows users to manually define the thresholds to pass or fail each module, the list of adapters to search for in reads, and the list of contaminants to compare with overrepresented sequences by using configuration files in the same format used by FastQC.
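For example, custom configuration files could be supplied with the FastQC-style options below; the file names are placeholders, and the flag names follow FastQC's conventions, which Falco is designed to accept.

```shell
# Override thresholds, adapter list, and contaminant list
# (placeholder file names; flags mirror FastQC's options).
falco --limits my_limits.txt \
      --adapters my_adapters.txt \
      --contaminants my_contaminants.txt \
      example.fq
```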
Falco requires little memory and disk space to run, and there are no constraints on the minimum or maximum FASTQ input size or number of reads. Reads are analyzed sequentially, with one read stored in memory at a time, so the amount of memory necessary depends on the largest read length in a dataset, not on the size of the input file. For instance, processing a short-read sample, with reads of length at most 1,000 bases, requires 100 MB of available RAM, whereas processing a long-read sample containing at least one read with 1 million bases requires 500 MB of RAM. The total disk space necessary to store the three output files generated by Falco is no more than 1 MB.
We compared the output of Falco14 to its FastQC counterpart using 11 datasets (Table 2). The tests consist of Illumina files originating from a range of different library preparation protocols for DNA, RNA, and epigenetic experiments, as well as reads from the nanopore17 technology. For simplicity, Illumina paired-end datasets were only tested on the first read end.
test | accession | reference | file size (FASTQ) | reads | length (bp) | protocol |
---|---|---|---|---|---|---|
1 | SRR10124060 | unpublished | 7.3GB | 25,172,255 | 130 | RNA-Seq |
2 | SRR10143153 | unpublished | 11.0GB | 15,949,900 | 150 | miRNA-Seq |
3 | SRR3897196 | 23 | 4.2GB | 15,570,348 | 100 | BS-Seq |
4 | SRR9624732 | 24 | 1.6GB | 18,807,797 | 150 | ChIP-Seq |
5 | SRR1853178 | 25 | 130.0GB | 510,210,716 | 60 | Drop-Seq |
6 | SRR6387347 | 26 | 20.0GB | 305,434,830 | 100 | 10x genomics |
7 | SRR6954584 | 5 | 56.0GB | 152,853,013 | 150 | Microwell-Seq |
8 | SRR891268 | 27 | 46.0GB | 192,904,649 | 50 | ATAC-Seq |
9 | SRR9878537 | unpublished | 38.0MB | 3,284 | 64,000 | Nanopore |
10 | wgs-FAB49164 | 19 | 8.4GB | 746,333 | 180,000 | Nanopore |
11 | SRR6059706 | unpublished | 1.4GB | 892,313 | 150,000 | Nanopore |
FASTQ files available in the Sequence Read Archive (SRA)18 were downloaded using the fastq-dump command from the SRA toolkit. We used the following flags when running fastq-dump: --skip-technical, --readids, --read-filter pass, --dumpbase, --split-3, and --clip. One dataset was downloaded from the Whole Human Genome Sequencing Project19.
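As an illustration, one of the SRA samples from Table 2 could be fetched with a command of the following shape (requires the SRA toolkit on the PATH):

```shell
# Download test 3 (SRR3897196) as FASTQ with the flags listed above.
fastq-dump --skip-technical --readids --read-filter pass \
           --dumpbase --split-3 --clip SRR3897196
```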
We directly compared the text summary for each output of Falco to FastQC’s output summary files, obtaining the same outputs (pass, warning, or fail) for all tested criteria in all datasets.
To assess if Falco’s output is consistent with FastQC’s format, we used the fastqcr20 R package version 0.1.2 and MultiQC12 version 1.9. Both tools can successfully parse the text reports generated by Falco for the tested files. Differences in the fastqc_data.txt files between the two programs result from choices for numerical precision output, or as a result of Falco calculating certain averages based on more of the data within each file.
Some alternative software tools exist for quality control of sequencing data, and users may opt for them due to their efficiency in cases where not all FastQC analysis modules are necessary. Among these, fastp21 has gained popularity for its speed and versatile set of options for trimming. fastp has demonstrated superior runtime to FastQC even when generating FASTQ format output files corrected by trimming adapters and filtering (which requires both input and output). HTQC22 is another tool that was developed with the intent to both improve speed performance and incorporate trimming functions after quality control. The two programs were used as benchmarks to compare Falco with.
Although most fastp modules are both calculated and displayed equivalently to FastQC, one major difference between these tools is how overrepresented sequences are estimated. While fastp counts the sequences at every P reads (which users may specify), FastQC stores the first 100,000 reads encountered for the first time, and subsequently checks if the following sequences match any of the stored candidates. This choice of implementation causes fastp’s runtime to greatly differ when overrepresentation is enabled. Conversely, FastQC’s runtime does not seem to be affected by disabling the overrepresented sequences module. For a comprehensive comparison between programs, we have measured the run times for our test datasets both with and without the overrepresented sequences module enabled. Programs were compared both in compressed (gzip FASTQ) and uncompressed (plain FASTQ) file formats.
Files used to assess Falco’s output comparison to FastQC (Table 2) were also used for speed and memory comparison. Tests were executed in an Intel Xeon CPU E5-2640 v3 2.60GHz processor with a CentOS Linux 7 operating system. All file I/O was done using local disk to reduce variability in execution runtime. Both fastp and FastQC were instructed to run using a single thread.
FastQC version 0.11.8 was run with default parameters and the configuration limits, adapters and contaminants provided with the software. fastp version 0.20.0 was run with the -A, -G, -Q and -L flags to disable adapter trimming, poly-G trimming, quality filtering and length filtering, thus requiring the program to only perform QC tests without generating a new FASTQ file. When testing for overrepresented sequences, we set the -p flag to enable this module, and set the frequency of counts to the program’s default value of P = 20. We ran the ht-stat program on the tested files using the -S flag for single-ended reads. HTQC was not tested on gzip FASTQ files as this file format is not accepted by the program. We used the GNU time command to measure the total running times for each program, using the total elapsed wall time as measurement. The benchmarking results (Table 3 and Table 4) show that Falco performs faster than fastp and FastQC in all datasets, with an average 3 times faster runtime than FastQC, both with the overrepresented sequences module on and off. Despite HTQC failing to process most test datasets due to unaccepted header formats, the two tests that ran to completion demonstrate that Falco’s analysis times are also significantly smaller in comparison.
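The wall-time measurement described above can be made explicit with GNU time's format string; the sketch below is illustrative, with falco standing in for whichever QC command is under test:

```shell
# '%e' is GNU time's elapsed wall-clock time in seconds; the explicit
# /usr/bin/time path avoids the shell's built-in 'time', which differs.
/usr/bin/time -f "%e" falco example.fq
```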
Asterisks (*) indicate tests in which tools did not run to completion.
The memory required to run Falco differs between short-read samples (tests 1-8; Table 2) and long-read samples (tests 9-11). All programs demonstrated similar behavior in memory usage, with all short-read samples having similar memory requirements, and test 10 requiring the most memory (as it contains the longest read). The total memory usage was also measured using the GNU time command. For Falco, short-read samples required 92 MB of RAM, whereas long-read samples used at most 342 MB of RAM. In short-read samples, FastQC and fastp used 319 MB and 568 MB of RAM, respectively. In long-read samples, FastQC and fastp used at most 4.88 GB and 1.28 GB of RAM, respectively. This comparison suggests that Falco's memory requirement is also the lowest across all tests.
Despite FastQC’s clarity in its HTML reports, graphs are displayed as static images and have limited visualization flexibility, such as tile heatmaps not displaying raw deviations from average Phred scores in base positions, or raw values in line plots not being visible. We have opted to display Falco’s analysis results using the Plotly JavaScript library28, which allows interactive changes of axis labels, hovering on data points to visualize raw values, and screenshots from specific positions on the plot (Figure 1). This choice of presentation provides greater options to explore and interpret QC results while maintaining the visualization standards set by FastQC.
Falco14 is a faster alternative to calculate the wide range of QC metrics reported by FastQC. It is entirely based on emulating the analysis modules FastQC provides while running faster than popular QC tools and generating dynamic visual summaries of analysis results. Falco’s text output provides the same information generated by FastQC, so tools that parse this file for custom visualization and downstream analysis can seamlessly incorporate Falco into their pipeline.
Datasets used to compare Falco and FastQC are shown in Table 2. Guidance on how to access accession wgs-FAB49164 is available from the Benchmark directory of the Falco GitHub page.
Source code for Falco is available at: https://github.com/smithlabcode/falco.
Users may report errors, bugs, installation problems, and improvement suggestions under the "Issues" section of the same page provided to download the source code.
The scripts used to download files and reproduce the benchmarking steps described are also available in the same repository within the “benchmark” directory.
Archived source code at time of publication: http://doi.org/10.5281/zenodo.4429381 (reference 14).
License: GNU General Public License version 3.0.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Metabarcoding ; molecular ecology ; systematics ; mycology
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genome informatics.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
References
1. Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database Collaboration: The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012; 40(Database issue): D54-6. PubMed Abstract | Publisher Full Text
Alongside their report, reviewers assign a status to the article:
| | Invited Reviewer 1 | Invited Reviewer 2 |
---|---|---|
Version 2 (revision), 27 Jan 21 | read | |
Version 1, 07 Nov 19 | read | read |