Keywords
FastQC, high-throughput sequencing, quality control
This article is included in the Cell & Molecular Biology gateway.
This article is included in the Bioinformatics gateway.
High-throughput sequencing is routinely used to profile copy number variations in cancers1, assemble genomes of microbial organisms2,3, quantify gene expression4, identify cell populations from single-cell transcriptomes in a variety of tissues5 and track epigenetic changes in developing organisms and diseases6, among numerous other applications. New sequencing protocols are constantly being introduced7,8, and as the cost of sequencing per base decreases, sequencing data is growing in abundance, dataset size and read length9.
When high-throughput sequencing data is generated, it often undergoes common upstream analysis steps involving quality control (QC), adapter trimming, filtering contaminants and low-quality reads, and mapping reads to a reference genome or transcriptome. Excluding sequence assembly applications, read mapping should be the most computationally expensive step early in analysis pipelines. In comparison, the time and computation required for QC should be negligible. However, the efficiency of mapping algorithms has improved substantially over the past decade, while software for QC has received far less attention. As a consequence, the computation required for QC is appreciable, and can no longer be ignored when considering the total cost of sequencing.
The most commonly used tool for quality control of sequencing data is FastQC10, which, since 2011, has incorporated a wide range of QC checks covering multiple use cases. FastQC has been cited over 3,000 times, with citations increasing steadily since its introduction. Its analysis reports have become the standard for several QC tools, and automated analysis pipelines often rely on its evaluation as a safety criterion to proceed with downstream steps or, alternatively, to filter, trim or ultimately discard the data11,12.
FastQC is implemented in a modular design, where multiple independent analysis procedures are run sequentially after an input record is read. This design allows new modules to be incorporated easily, but it implies that each analysis module is applied independently to each read, so the time required to process each read is the sum of the processing times for each module. If multiple modules use similar measurements, such as nucleotide content or average sequence quality, the same measurement will be calculated multiple times, causing the total analysis run time to increase.
Several QC software tools have been introduced since FastQC, many focusing on speed improvements, more flexible module visualization, incorporation of paired-end reads and filtering sequences that failed QC checks. Despite proposing different alternatives to calculate and present QC results, the modules available in these tools are largely similar to FastQC’s (Table 1).
At the same time, FastQC’s analysis results are already part of many standard initial analysis pipelines. If a new QC software tool were to be incorporated in these pipelines, it is desirable that its results, and its output formats, remain consistent with those of FastQC.
To improve the speed of quality control while retaining the behaviour of FastQC, we developed FastQC Alternative Code (falco)13, an emulation of FastQC’s current analysis modules. We show that falco generates the same results as FastQC across a wide variety of datasets of different read lengths, sizes, file formats and library preparation protocols at significantly shorter running times. We also present example datasets from the public domain where FastQC fails to generate reports even when run on high-performance computing hardware, demonstrating that falco expands the range of possible cases in which these quality control metrics can be applied.
We designed falco13 to faithfully emulate FastQC’s calculations, results and text reports. Our goal was to minimize the effort required to replace FastQC with falco in the context of larger automated analysis pipelines. We use the same set of command line arguments, configuration file names and formats. We also produce the same plain text format output, and the same report structure as FastQC, allowing users to take advantage of improved speed without adjusting to different program behaviors.
There are major differences between the implementations of falco and FastQC. While FastQC’s code emphasizes modularity in a way that allows for additional types of QC information to be added easily and uniformly, falco’s design centralizes the function to read sequences from the input file and collects the minimum data necessary to subsequently create all modules after file processing. To ensure consistency with FastQC, we wrote each module’s source code based on FastQC’s implementation, adapting the portions that relate to sequence processing and maintaining the postprocessing functions that define how the collected data is used to generate summaries and reports.
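The centralized design described above can be illustrated with a minimal Python sketch. This is not falco’s source code (falco is written in C++); the function names `collect_stats`, `per_base_quality` and `per_base_gc` are hypothetical, and Phred+33 quality encoding is assumed. The point is only that one pass over the reads gathers shared tallies, and several module summaries are then derived from those tallies without touching the reads again.

```python
from collections import defaultdict


def collect_stats(records):
    """Single pass over (sequence, quality) pairs, gathering shared tallies."""
    base_counts = defaultdict(lambda: defaultdict(int))  # position -> base -> count
    qual_sums = defaultdict(int)  # position -> sum of Phred scores
    depth = defaultdict(int)      # position -> number of reads covering it
    for seq, qual in records:
        for pos, (base, q) in enumerate(zip(seq, qual)):
            base_counts[pos][base] += 1
            qual_sums[pos] += ord(q) - 33  # assumption: Phred+33 encoding
            depth[pos] += 1
    return {"base_counts": base_counts, "qual_sums": qual_sums, "depth": depth}


def per_base_quality(stats):
    """Mean Phred score at each position, derived from the shared tallies."""
    return [stats["qual_sums"][p] / stats["depth"][p] for p in sorted(stats["depth"])]


def per_base_gc(stats):
    """GC fraction at each position, derived from the same tallies."""
    fractions = []
    for p in sorted(stats["depth"]):
        counts = stats["base_counts"][p]
        fractions.append((counts.get("G", 0) + counts.get("C", 0)) / stats["depth"][p])
    return fractions


# Two toy reads; 'I' encodes Phred 40 and '5' encodes Phred 20.
reads = [("ACGT", "IIII"), ("GGCA", "I5I5")]
stats = collect_stats(reads)
print(per_base_quality(stats))
print(per_base_gc(stats))
```

Both summaries reuse the same counters, so adding a further summary that depends on nucleotide content or quality would not require another pass over the input.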
Compilation of falco requires a GNU GCC compiler version 5.0.0 (July 16, 2015; full support for the C++11 standard) or greater. Once installed, falco can be run on uncompressed files (FASTQ and SAM) without any additional dependencies. In order to process files in gzip compressed FASTQ and BAM formats, falco must be compiled with the ZLib14 and HTSLib15 libraries. The full documentation on how to compile, install dependencies and run the program is available in the README file in the falco repository.
Like FastQC, falco13 can be applied to any sequencing data file (i.e. a file of sequenced reads) in the accepted formats. The only required command line argument is the path to the input file. Also like FastQC, a wide range of options can be provided if users only require a given subset of its analysis modules or outputs. The letters and symbols used for command line arguments were chosen to maintain consistency with FastQC’s options. As mentioned above, this choice is to facilitate integration with larger pipelines that already employ FastQC and depend on its behaviours.
Falco can be run on a FASTQ format file named example.fastq with the following simple command:
$ falco example.fastq
This will generate three files:
1. fastqc_data.txt: The complete numerical values generated in each module’s individual analysis.
2. fastqc_report.html: A visual page display of the text report’s data and plots generated in modules.
3. summary.txt: A short summary indicating whether the input file passed or failed each module, and whether any warnings were raised.
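As an illustration of how the text report might be consumed downstream, the following Python sketch splits a fastqc_data.txt-style report into modules. It assumes the layout used by FastQC’s text report, in which each module starts with a line `>>Module name<TAB>status`, ends with `>>END_MODULE`, and carries a `#`-prefixed tab-separated column header; the function name `parse_fastqc_data` and the example content are hypothetical.

```python
def parse_fastqc_data(text):
    """Split a FastQC-style text report into {module name: status, header, rows}."""
    modules = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            current = None
        elif line.startswith(">>"):
            name, _, status = line[2:].partition("\t")
            current = {"status": status, "header": None, "rows": []}
            modules[name] = current
        elif current is not None and line.startswith("#"):
            # If a module has several '#' lines, the last one is kept as header.
            current["header"] = line[1:].split("\t")
        elif current is not None and line:
            current["rows"].append(line.split("\t"))
    return modules


# Hypothetical, heavily shortened report used only to exercise the parser.
example_report = (
    "##FastQC\t0.11.8\n"
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "Filename\texample.fastq\n"
    "Total Sequences\t2\n"
    ">>END_MODULE\n"
    ">>Per base sequence quality\twarn\n"
    "#Base\tMean\n"
    "1\t40.0\n"
    ">>END_MODULE\n"
)
modules = parse_fastqc_data(example_report)
print({name: module["status"] for name, module in modules.items()})
```

Lines outside any module, such as the leading version line, are ignored by this sketch.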
Default configuration files are contained in a Configuration directory that is included with the program. Users can also supply their own configuration files, in the same format used by FastQC, to define the thresholds at which a statistic is reported as a pass, warning or fail, the list of adapters to search for in reads, and the list of contaminants against which overrepresented sequences are checked.
We compared the output of falco13 to its FastQC counterpart using 11 datasets (Table 2). The tests consist of Illumina files originating from a range of different library preparation protocols for DNA, RNA and epigenetic experiments, as well as reads from the nanopore16 technology. For simplicity, Illumina paired-end datasets were only tested on their first read.
test | accession | reference | file size (FASTQ) | reads | length (bp) | protocol |
---|---|---|---|---|---|---|
1 | SRR10124060 | unpublished | 7.3GB | 25,172,255 | 130 | RNA-Seq |
2 | SRR10143153 | unpublished | 11.0GB | 15,949,900 | 150 | miRNA-Seq |
3 | SRR3897196 | 21 | 4.2GB | 15,570,348 | 100 | BS-Seq |
4 | SRR9624732 | 22 | 1.6GB | 18,807,797 | 150 | ChIP-Seq |
5 | SRR1853178 | 23 | 130.0GB | 510,210,716 | 60 | Drop-Seq |
6 | SRR6387347 | 24 | 20.0GB | 305,434,830 | 100 | 10x genomics |
7 | SRR6954584 | 5 | 56.0GB | 152,853,013 | 150 | Microwell-Seq |
8 | SRR891268 | 25 | 46.0GB | 192,904,649 | 50 | ATAC-Seq |
9 | SRR9878537 | unpublished | 38.0MB | 3,284 | 64,000 | Nanopore |
10 | wgs-FAB49164 | 17 | 8.4GB | 746,333 | 180,000 | Nanopore |
11 | SRR6059706 | unpublished | 1.4GB | 892,313 | 150,000 | Nanopore |
FASTQ files available in the Sequence Read Archive (SRA) were downloaded using the fastq-dump command from the SRA toolkit. We used the following flags when running fastq-dump: --skip-technical, --readids, --read-filter pass, --dumpbase, --split-3 and --clip. One dataset was downloaded from the Whole Human Genome Sequencing Project17.
We directly compared the text summary for each output of falco to FastQC’s output summary files, obtaining the same outputs (pass/warn/fail) for all tested criteria in all datasets.
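A comparison of this kind could be scripted roughly as follows. The sketch assumes that each summary line holds the status, the module name and the file name separated by tabs, as in FastQC’s summary.txt; the helper names `load_statuses` and `diff_summaries` and the example summaries are hypothetical, and this is not the script used for the results reported here.

```python
def load_statuses(summary_text):
    """Map module name -> status from tab-separated summary lines."""
    statuses = {}
    for line in summary_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            statuses[fields[1]] = fields[0]
    return statuses


def diff_summaries(summary_a, summary_b):
    """Return {module: (status_a, status_b)} for every module that disagrees."""
    a = load_statuses(summary_a)
    b = load_statuses(summary_b)
    return {
        module: (a.get(module), b.get(module))
        for module in sorted(set(a) | set(b))
        if a.get(module) != b.get(module)
    }


# Hypothetical summaries from two tools run on the same input file.
tool_a = "PASS\tBasic Statistics\texample.fastq\nWARN\tAdapter Content\texample.fastq\n"
tool_b = "PASS\tBasic Statistics\texample.fastq\nWARN\tAdapter Content\texample.fastq\n"
tool_c = "PASS\tBasic Statistics\texample.fastq\nFAIL\tAdapter Content\texample.fastq\n"

print(diff_summaries(tool_a, tool_b))
print(diff_summaries(tool_a, tool_c))
```

An empty result means the two tools agree on every module; a module missing from one summary is reported with `None` on that side.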
To assess if falco’s output is consistent with FastQC’s format, we used the fastqcr18 R package version 0.1.2 and MultiQC11 version 1.7. Both tools can successfully parse the text reports generated by falco for the tested files. Differences in the fastqc_data.txt files between the two programs result from choices for numerical precision output, or from falco calculating certain averages based on more of the data within each file.
Some alternative software tools exist for quality control of sequencing data, and users may opt for them due to their efficiency in cases where not all FastQC analysis modules are necessary. Among these, fastp19 has gained popularity for its speed and versatile set of options for trimming. fastp has demonstrated superior runtime to FastQC even when generating FASTQ format output files corrected by trimming adapters and filtering (which requires both input and output). HTQC20 is another tool that was developed with the intent to both improve speed performance and incorporate trimming functions after quality control. The two programs were used as benchmarks to compare with falco’s performance.
Although most fastp modules are both calculated and displayed equivalently to FastQC, one major difference between these tools is how overrepresented sequences are estimated. fastp counts the sequences at every P reads (which users may specify), whereas FastQC stores the first 100,000 reads encountered for the first time, and subsequently checks if the following sequences match any of the stored candidates. This choice of implementation causes fastp’s runtime to greatly differ when over-representation is enabled. Conversely, FastQC’s runtime does not seem to be affected by disabling the overrepresented sequences module. For a comprehensive comparison between programs, we have measured the run times for our test datasets both with and without the overrepresented sequences module enabled. Programs were compared both in compressed (gzipped FASTQ) and uncompressed (plain FASTQ) file formats.
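The two counting strategies described above can be contrasted with a simplified Python sketch. It follows only the description in the text: the function names are hypothetical, the real tools apply further processing that is omitted here, and the exact way fastp selects every P-th read is an assumption.

```python
from collections import Counter


def count_first_distinct(reads, max_tracked=100_000):
    """FastQC-like strategy: track the first max_tracked distinct sequences,
    then count later reads only if they match a tracked sequence."""
    counts = {}
    for seq in reads:
        if seq in counts:
            counts[seq] += 1
        elif len(counts) < max_tracked:
            counts[seq] = 1
        # Otherwise the sequence is new but the table is full: ignore it.
    return counts


def count_every_pth(reads, p=20):
    """fastp-like strategy: count only one read out of every p.
    Assumption: sampling starts at the first read."""
    return Counter(seq for i, seq in enumerate(reads) if i % p == 0)


reads = ["AAAA", "CCCC", "AAAA", "GGGG", "AAAA", "GGGG"]
first = count_first_distinct(reads, max_tracked=2)
sampled = count_every_pth(reads, p=2)
print(first)
print(sampled)
```

With a table limited to two sequences, `GGGG` is never tracked even though it occurs twice, which shows how the first strategy depends on the order of reads, whereas the second depends on the sampling interval.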
The files used to compare falco’s output with FastQC’s (Table 2) were also used for speed benchmarking. Tests were executed on an Intel Xeon CPU E5-2640 v3 2.60GHz processor running the CentOS Linux 7 operating system. All file I/O was done using local disk to reduce variability in execution runtime. Programs were instructed to run using a single thread.
FastQC version 0.11.8 was run with default parameters and the configuration limits, adapters and contaminants provided with the software. fastp version 0.20.0 was run with the -A, -G, -Q and -L flags to disable adapter trimming, poly-G trimming, quality filtering and length filtering, thus requiring the program to only perform QC checks without generating a new FASTQ file. When testing for overrepresented sequences, we set the -p flag to enable this module, and set the frequency of counts to the program’s default value of P = 20. We ran the ht-stat program on the tested files using the -S flag for single-ended reads. HTQC was not tested on gzip compressed files as this file format is not accepted by the program. We used the time command (the BASH shell keyword) to measure the total running time of each program, taking the real time (total wall clock time from program start to finish) as the measurement.
The benchmarking results (Table 3 and Table 4) show that falco runs faster than fastp and FastQC on all datasets, with an average runtime 3x faster than FastQC’s, both with the overrepresented sequences module on and off. Although HTQC failed to process most test datasets due to unaccepted header formats, the two tests that ran to completion show that falco’s analysis times are also significantly shorter in comparison.
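For readers who prefer to script such measurements, the following Python sketch is an analogue of timing a command by wall clock; it is not the BASH `time` keyword used for the benchmarks above. The helper names `wall_clock_seconds` and `speedup` are hypothetical, and the timed command is a placeholder (a Python interpreter that exits immediately) standing in for a QC tool invocation.

```python
import subprocess
import sys
import time


def wall_clock_seconds(argv):
    """Run a command to completion and return the elapsed wall clock time."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        check=True,  # raise if the command exits with a non-zero status
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate ran than the baseline."""
    return baseline_seconds / candidate_seconds


# Placeholder command; replace with the actual tool and input file.
elapsed = wall_clock_seconds([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.3f} s")
print(f"speedup for 90 s versus 30 s: {speedup(90.0, 30.0)}x")
```

Because `check=True` raises on failure, a run that does not complete is reported as an error and cannot be mistaken for a fast run.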
Asterisks (*) indicate tests in which tools did not run to completion.
Nanopore sequencing is gaining popularity in genome assembly applications and as a low-cost protocol to quantify short reads26. Nanopore sequencers can generate reads of up to millions of bases, and assessing quality metrics for these datasets is fundamental to test for potential problems in quality or bias in specific regions of such long reads. While FastQC is capable of making summaries for protocols such as 454 (ref. 27) and PacBio (ref. 28), which generate sequences with around 10,000 bases per read, we have observed that it does not run to completion when given files with larger reads of over 100,000 bases. Files for which FastQC’s analysis does not finish are marked with an asterisk in Table 3 and Table 4. Falco successfully completes its analysis on these datasets, demonstrating that it can equally be used as a QC tool for longer reads.
Despite the clarity of FastQC’s HTML reports, its graphs are displayed as static images with limited visualization flexibility: for example, tile heatmaps do not display raw deviations from average Phred scores at base positions, and raw values in line plots are not visible. We have opted to display falco’s analysis results using the Plotly JavaScript library29, which allows interactive changes of axis labels, hovering on data points to visualize raw values and screenshots from specific positions on the plot (Figure 1). This choice of presentation provides greater options to explore and interpret QC results while maintaining the visualization standards set by FastQC.
Falco13 is a faster alternative to calculate the wide range of QC metrics generated by FastQC. It is entirely based on emulating the analysis modules FastQC provides while running faster than popular QC tools and generating dynamic visual summaries of analysis results. Both falco’s text and HTML outputs provide the same information generated by FastQC’s report, so tools that parse these files for custom visualization and downstream analysis can seamlessly incorporate falco into their pipeline.
Datasets used to compare falco and FastQC are shown in Table 2. Guidance for how to access accession wgs-FAB49164 is available from the Benchmark directory of the falco GitHub page.
Source code for falco is available at: https://github.com/smithlabcode/falco.
The scripts used to download files and reproduce the benchmarking steps described are also available in the same repository within the “benchmark” directory.
Archived source code at time of publication: http://doi.org/10.5281/zenodo.3520933 (ref. 13).
License: GNU General Public License version 3.0.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genome informatics.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
References
1. Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database Collaboration: The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012; 40 (Database issue): D54-6.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Metabarcoding; molecular ecology; systematics; mycology
Version 2 (revision): 27 Jan 21
Version 1: 07 Nov 19