Software Tool Article

Falco: high-speed FastQC emulation for quality control of sequencing data

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 07 Nov 2019

This article is included in the Cell & Molecular Biology gateway.

This article is included in the Bioinformatics gateway.

Abstract

Quality control is an essential first step in sequencing data analysis, and software tools for quality control are deeply entrenched in standard pipelines at most sequencing centers. Although the associated computations are straightforward, in many settings the total computing effort required for quality control is appreciable and warrants optimization. We present falco, an emulation of the popular FastQC tool that runs on average three times faster while generating equivalent results. Compared to FastQC, falco also provides greater scalability for datasets with longer reads and more flexible visualization of HTML reports.

Keywords

FastQC, high-throughput sequencing, quality control

Introduction

High-throughput sequencing is routinely used to profile copy number variations in cancers1, assemble genomes of microbial organisms2,3, quantify gene expression4, identify cell populations from single-cell transcriptomes in a variety of tissues5 and track epigenetic changes in developing organisms and diseases6, among numerous other applications. New sequencing protocols are constantly being introduced7,8, and as the cost of sequencing per base decreases, sequencing data is growing in abundance, dataset size and read length9.

When high-throughput sequencing data is generated it often undergoes common upstream analysis steps involving quality control (QC), adapter trimming, filtering contaminants and low-quality reads, and mapping reads to a reference genome or transcriptome. Excluding sequence assembly applications, read mapping should be the most computationally expensive step early in analysis pipelines. In comparison, the time and computation required for QC should be negligible. However, the efficiency of mapping algorithms has improved substantially over the past decade, while software for QC has received far less attention. As a consequence, the computation required for QC is appreciable, and can no longer be ignored when considering the total cost of sequencing.

The most commonly used tool for quality control of sequencing data is FastQC10, which, since 2011, has incorporated a wide range of QC checks covering multiple use cases. FastQC has been cited over 3,000 times, with citations increasing steadily since its introduction. Its analysis reports have become the standard for several QC tools, and automated analysis pipelines often rely on its evaluation as a safety criterion to proceed with downstream steps or, alternatively, to filter, trim or ultimately discard the data11,12.

FastQC is implemented in a modular design, where multiple independent analysis procedures are run sequentially after an input record is read. This design allows new modules to be incorporated easily, but it implies that each analysis module is applied independently to each read, so the time required to process each read is the sum of the processing times for each module. If multiple modules use similar measurements, such as nucleotide content or average sequence quality, the same measurement will be calculated multiple times, causing the total analysis run time to increase.

Several QC software tools have been introduced since FastQC, many focusing on speed improvements, more flexible module visualization, incorporation of paired-end reads and filtering sequences that failed QC checks. Despite proposing different alternatives to calculate and present QC results, the modules available in these tools are largely similar to FastQC’s (Table 1).

Table 1. Comparison of analysis modules provided by fastp and HTQC, two commonly used QC software tools.

FastQC module                 fastp  HTQC
Per base sequence quality     No     Yes
Per base N content            Yes    Yes
Per tile sequence quality     No     Yes
Per sequence quality scores   No     Yes
Per sequence GC content       No     No
Sequence length distribution  No     Yes
Sequence duplication levels   Yes    No
Overrepresented sequences     Yes    No
Adapter content               No     No
Kmer content                  Yes    No

At the same time, FastQC’s analysis results are already part of many standard initial analysis pipelines. If a new QC software tool were to be incorporated in these pipelines, it is desirable that its results, and its output formats, remain consistent with those of FastQC.

To improve the speed of quality control while retaining the behaviour of FastQC, we developed FastQC Alternative Code (falco)13, an emulation of FastQC’s current analysis modules. We show that falco generates the same results as FastQC across a wide variety of datasets of different read lengths, sizes, file formats and library preparation protocols at significantly shorter running times. We also present example datasets from the public domain where FastQC fails to generate reports even when run on high-performance computing hardware, demonstrating that falco expands the range of possible cases in which these quality control metrics can be applied.

Methods

Implementation choices

We designed falco13 to faithfully emulate FastQC’s calculations, results and text reports. Our goal was to minimize the effort required to replace FastQC with falco in the context of larger automated analysis pipelines. We use the same set of command line arguments, configuration file names and formats. We also produce the same plain text format output, and the same report structure as FastQC, allowing users to take advantage of improved speed without adjusting to different program behaviors.

There are major differences between the implementations of falco and FastQC. While FastQC’s code emphasizes modularity in a way that allows for additional types of QC information to be added easily and uniformly, falco’s design centralizes the function to read sequences from the input file and collects the minimum data necessary to subsequently create all modules after file processing. To ensure consistency with FastQC, we wrote each module’s source code based on FastQC’s implementation, adapting the portions that relate to sequence processing and maintaining the postprocessing functions that define how the collected data is used to generate summaries and reports.
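The centralized design described above can be illustrated with a minimal Python sketch. It is not falco's source code (falco is compiled with a C++ compiler); the function names, the dictionary layout and the two example "modules" are hypothetical and chosen only to show how one pass over the reads can feed several summaries.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (sequence, quality string) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def collect_stats(handle, phred_offset=33):
    """Single pass over the file: gather only the counts that several modules share."""
    stats = {
        "reads": 0,
        "length_counts": Counter(),  # read length -> number of reads
        "base_counts": [],           # per position: Counter of nucleotides
        "qual_sums": [],             # per position: sum of Phred scores
        "gc_counts": Counter(),      # rounded GC percentage -> number of reads
    }
    for seq, qual in read_fastq(handle):
        stats["reads"] += 1
        stats["length_counts"][len(seq)] += 1
        # Grow the per-position accumulators when a longer read appears.
        while len(stats["base_counts"]) < len(seq):
            stats["base_counts"].append(Counter())
            stats["qual_sums"].append(0)
        for i, (base, q) in enumerate(zip(seq, qual)):
            stats["base_counts"][i][base] += 1
            stats["qual_sums"][i] += ord(q) - phred_offset
        if seq:
            gc = seq.count("G") + seq.count("C")
            stats["gc_counts"][round(100 * gc / len(seq))] += 1
    return stats


def per_base_mean_quality(stats):
    """Derived after file processing; reuses the shared per-position counts."""
    return [s / sum(c.values())
            for s, c in zip(stats["qual_sums"], stats["base_counts"])]


def per_base_n_fraction(stats):
    """A second summary built from the same accumulators, with no second pass."""
    return [c["N"] / sum(c.values()) for c in stats["base_counts"]]


# Hypothetical two-read input, Phred+33 encoded.
EXAMPLE = "@r1\nACGT\n+\nIIII\n@r2\nGGNA\n+\nII#I\n"
stats = collect_stats(StringIO(EXAMPLE))
```

Both summaries are computed from `stats` once reading has finished, so the nucleotide counts are gathered a single time even though two reports depend on them.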

Operation

Compilation of falco requires a GNU GCC compiler version 5.0.0 (July 16, 2015; full support for the C++11 standard) or greater. Once installed, falco can be run on uncompressed files (FASTQ and SAM) without any additional dependencies. In order to process files in gzip compressed FASTQ and BAM formats, falco must be compiled with the ZLib14 and HTSLib15 libraries. The full documentation on how to compile, install dependencies and run the program is available in the README file in the falco repository.

Use cases

Like FastQC, falco13 can be applied to any sequencing data file (i.e. a file of sequenced reads) in the accepted formats. The only required command line argument is the path to the input file. Also like FastQC, a wide range of options can be provided if users only require a given subset of its analysis modules or outputs. The letters and symbols used for command line arguments were chosen to maintain consistency with FastQC’s options. As mentioned above, this choice is to facilitate integration with larger pipelines that already employ FastQC and depend on its behaviours.

Falco can be run on a FASTQ format file named example.fastq with the following simple command:

    $ falco example.fastq

This will generate three files:

  1. fastqc_data.txt: The complete numerical values generated in each module’s individual analysis.

  2. fastqc_report.html: A visual page display of the text report’s data and plots generated in modules.

  3. summary.txt: A short summary indicating whether the input file passed or failed each module, and whether any warnings were raised.
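Because the text report follows FastQC's format, downstream scripts can read it module by module. The sketch below assumes the usual FastQC layout (a `>>Module name<TAB>status` line opening each module, `>>END_MODULE` closing it, and `#`-prefixed header lines); the function name and the sample values are hypothetical.

```python
def parse_fastqc_data(text):
    """Split a FastQC-style text report into {module name: (status, rows)}."""
    modules = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            name = None
        elif line.startswith(">>"):
            # Module header: name and pass/warn/fail status, tab separated.
            name, status = line[2:].split("\t")[:2]
            modules[name] = (status, [])
        elif name is not None and line and not line.startswith("#"):
            # Data row inside a module; '#' lines are column headers.
            modules[name][1].append(line.split("\t"))
    return modules


# Hypothetical excerpt in the assumed layout.
SAMPLE = (
    "##FastQC\t0.11.8\n"
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "Filename\texample.fastq\n"
    "Total Sequences\t2\n"
    ">>END_MODULE\n"
    ">>Per base N content\twarn\n"
    "#Base\tN-Count\n"
    "1\t0.0\n"
    ">>END_MODULE\n"
)
modules = parse_fastqc_data(SAMPLE)
```

Keeping the rows as lists of strings leaves numeric conversion to the caller, since different modules use different column types.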

Default configuration files are contained in a Configuration directory that is included with the program. Using configuration files in the same format as FastQC’s, users can also define the thresholds at which a statistic is considered a pass, warning or fail, the list of adapters to search for in reads, and the list of contaminants against which overrepresented sequences are checked.
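As a rough illustration of how pass, warning and fail thresholds might be applied, the sketch below reads a limits-style configuration and classifies a value. The assumed layout (module name, level, value, separated by whitespace, with `#` comments) and the example entries are assumptions made for illustration, and the direction of the comparison depends on the metric in practice.

```python
def parse_limits(text):
    """Read 'module level value' lines into {module: {level: value}}."""
    limits = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        module, level, value = line.split()
        limits.setdefault(module, {})[level] = float(value)
    return limits


def classify(value, warn, error):
    """Return pass/warn/fail for a metric where larger values are worse."""
    if value > error:
        return "fail"
    if value > warn:
        return "warn"
    return "pass"


# Hypothetical configuration entries.
SAMPLE_LIMITS = """
# module     level   value
n_content    warn    5
n_content    error   20
"""
limits = parse_limits(SAMPLE_LIMITS)
status = classify(10, limits["n_content"]["warn"], limits["n_content"]["error"])
```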

Results

Falco matches FastQC’s output

We compared the output of falco13 to its FastQC counterpart using 11 datasets (Table 2). The tests consist of Illumina files originating from a range of different library preparation protocols for DNA, RNA and epigenetic experiments, as well as reads from the nanopore16 technology. For simplicity, Illumina paired-end datasets were only tested on their first read.

Table 2. Datasets used for comparison with FastQC’s output and run time speed benchmarking between QC tools.

test  accession     reference    file size (FASTQ)  reads        length (bp)  protocol
1     SRR10124060   unpublished  7.3GB              25,172,255   130          RNA-Seq
2     SRR10143153   unpublished  11.0GB             15,949,900   150          miRNA-Seq
3     SRR3897196    21           4.2GB              15,570,348   100          BS-Seq
4     SRR9624732    22           1.6GB              18,807,797   150          ChIP-Seq
5     SRR1853178    23           130.0GB            510,210,716  60           Drop-Seq
6     SRR6387347    24           20.0GB             305,434,830  100          10x genomics
7     SRR6954584    5            56.0GB             152,853,013  150          Microwell-Seq
8     SRR891268     25           46.0GB             192,904,649  50           ATAC-Seq
9     SRR9878537    unpublished  38.0MB             3,284        64,000       Nanopore
10    wgs-FAB49164  17           8.4GB              746,333      180,000      Nanopore
11    SRR6059706    unpublished  1.4GB              892,313      150,000      Nanopore

FASTQ files available in the Sequence Read Archive (SRA) were downloaded using the fastq-dump command from the SRA toolkit. We used the following flags when running fastq-dump: --skip-technical, --readids, --read-filter pass, --dumpbase, --split-3 and --clip. One dataset was downloaded from the Whole Human Genome Sequencing Project17.
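For reproducibility, the download step can be scripted. The sketch below only assembles the fastq-dump argument list with the flags listed above, written in the double-dash long form that the SRA toolkit uses; the helper name is hypothetical, and the accession is test 8 from Table 2.

```python
import shlex


def fastq_dump_command(accession):
    """Build the fastq-dump argument list with the flags described in the text."""
    return [
        "fastq-dump",
        "--skip-technical",
        "--readids",
        "--read-filter", "pass",
        "--dumpbase",
        "--split-3",
        "--clip",
        accession,
    ]


# Print the command as it would be typed in a shell.
command_line = shlex.join(fastq_dump_command("SRR891268"))
print(command_line)
```

The list form can be passed directly to `subprocess.run` when the SRA toolkit is installed, which avoids shell quoting problems.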

We directly compared the text summary for each output of falco to FastQC’s output summary files, obtaining the same outputs (pass/warn/fail) for all tested criteria in all datasets.
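A comparison of this kind can be sketched as follows, assuming the tab-separated summary layout used by FastQC (status, module name, file name on each line). The function names and the sample summaries are hypothetical and do not reproduce the scripts used for the benchmark.

```python
def parse_summary(text):
    """Map module name -> status from tab-separated summary lines."""
    statuses = {}
    for line in text.splitlines():
        if line.strip():
            status, module = line.split("\t")[:2]
            statuses[module] = status
    return statuses


def summary_differences(summary_a, summary_b):
    """Return {module: (status in a, status in b)} for modules that disagree."""
    a, b = parse_summary(summary_a), parse_summary(summary_b)
    return {
        module: (a.get(module), b.get(module))
        for module in sorted(set(a) | set(b))
        if a.get(module) != b.get(module)
    }


# Hypothetical summaries for one input file.
FASTQC_SUMMARY = (
    "PASS\tBasic Statistics\texample.fastq\n"
    "WARN\tPer base N content\texample.fastq\n"
)
FALCO_SUMMARY = FASTQC_SUMMARY
MISMATCHED = FASTQC_SUMMARY.replace("WARN", "FAIL")
```

An empty dictionary from `summary_differences` corresponds to the outcome reported above, in which every module received the same status from both programs.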

To assess whether falco’s output is consistent with FastQC’s format, we used the fastqcr18 R package version 0.1.2 and MultiQC11 version 1.7. Both tools successfully parsed the text reports generated by falco for the tested files. Differences in the fastqc_data.txt files between the two programs result from choices for numerical precision output, or from falco calculating certain averages based on more of the data within each file.
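When two text reports differ only in printed precision, a tolerance-based comparison is more informative than an exact string match. The sketch below is one possible way to do this; the function name and the tolerances are arbitrary assumptions, not values used in the study.

```python
import math


def rows_match(row_a, row_b, rel_tol=1e-3, abs_tol=1e-9):
    """Compare two report rows, treating numeric fields as equal within a tolerance."""
    if len(row_a) != len(row_b):
        return False
    for a, b in zip(row_a, row_b):
        try:
            # Numeric fields: allow small differences from rounding on output.
            if not math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        except ValueError:
            # Non-numeric fields (labels, ranges) must match exactly.
            if a != b:
                return False
    return True
```

For example, `rows_match(["1", "31.25"], ["1", "31.2500001"])` is true, while a row whose value differs by more than the tolerance is reported as a mismatch.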

Falco is faster than popular QC tools

Some alternative software tools exist for quality control of sequencing data, and users may opt for them due to their efficiency in cases where not all FastQC analysis modules are necessary. Among these, fastp19 has gained popularity for its speed and versatile set of options for trimming. fastp has demonstrated superior runtime to FastQC even when generating FASTQ format output files corrected by trimming adapters and filtering (which requires both input and output). HTQC20 is another tool that was developed with the intent to both improve speed performance and incorporate trimming functions after quality control. The two programs were used as benchmarks to compare with falco’s performance.

Although most fastp modules are calculated and displayed equivalently to FastQC, one major difference between these tools is how overrepresented sequences are estimated. fastp counts the sequences in one of every P reads (where P may be specified by the user), whereas FastQC stores the first 100,000 distinct sequences it encounters and subsequently checks whether each following sequence matches any of the stored candidates. This choice of implementation causes fastp’s run time to differ greatly when over-representation analysis is enabled. Conversely, FastQC’s run time does not seem to be affected by disabling the overrepresented sequences module. For a comprehensive comparison between programs, we measured the run times for our test datasets both with and without the overrepresented sequences module enabled. Programs were compared on both compressed (gzipped FASTQ) and uncompressed (plain FASTQ) file formats.
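The two counting strategies can be contrasted with a simplified sketch. Neither function reproduces the actual FastQC or fastp code; they only mirror the descriptions above, and the function names are hypothetical.

```python
from collections import Counter


def count_first_distinct(reads, max_tracked=100_000):
    """FastQC-like: track only the first `max_tracked` distinct sequences seen."""
    counts = {}
    for seq in reads:
        if seq in counts:
            counts[seq] += 1          # match against a stored candidate
        elif len(counts) < max_tracked:
            counts[seq] = 1           # still room for a new candidate
        # otherwise the sequence is new but the candidate list is full: ignore it
    return counts


def count_every_pth(reads, p=20):
    """fastp-like (simplified): count the sequence of one read in every `p`."""
    counts = Counter()
    for i, seq in enumerate(reads):
        if i % p == 0:
            counts[seq] += 1
    return counts


# Hypothetical toy input.
reads = ["AAAA", "CCCC", "AAAA", "GGGG", "AAAA"]
```

With `max_tracked=2`, the first strategy never counts `GGGG` because it appears after the candidate list is full, whereas the second strategy can count any sequence but only sees a fraction of the reads.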

The files used to compare falco’s output with FastQC’s (Table 2) were also used for speed benchmarking. Tests were executed on an Intel Xeon CPU E5-2640 v3 2.60GHz processor running the CentOS Linux 7 operating system. All file I/O was done using local disk to reduce variability in run time. Programs were instructed to run using a single thread.

FastQC version 0.11.8 was run with default parameters and the configuration limits, adapters and contaminants provided with the software. fastp version 0.20.0 was run with the -A, -G, -Q and -L flags to disable adapter trimming, poly-G trimming, quality filtering and length filtering, thus requiring the program to only perform QC checks without generating a new FASTQ file. When testing for overrepresented sequences, we set the -p flag to enable this module, and set the frequency of counts to the program’s default value of P = 20. We ran the ht-stat program on the tested files using the -S flag for single-ended reads. HTQC was not tested on gzip compressed files as this file format is not accepted by the program.

We used the time command (the BASH shell keyword) to measure the total running time of each program, taking the real time (total wall clock time from program start to finish) as the measurement. The benchmarking results (Table 3 and Table 4) show that falco runs faster than fastp and FastQC on all datasets, with an average run time three times shorter than that of FastQC, both with the overrepresented sequences module on and off. Although HTQC failed to process most test datasets due to unaccepted header formats, the two tests that ran to completion show that falco’s analysis times are also substantially shorter in comparison.
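The measurement can be approximated in a script as follows. This sketch is not the benchmark code from the repository: the helper names are hypothetical, the `-i` input flag of fastp is an addition not mentioned in the text, and the Python timer stands in for the shell's `time` keyword.

```python
import subprocess
import sys
import time


def fastp_qc_only_command(fastq_path, overrep=False, sampling=20):
    """Build a fastp argument list that disables trimming and filtering."""
    argv = ["fastp", "-i", fastq_path, "-A", "-G", "-Q", "-L"]
    if overrep:
        # -p enables overrepresentation analysis; -P sets the sampling frequency.
        argv += ["-p", "-P", str(sampling)]
    return argv


def wall_clock_seconds(argv):
    """Run a command and return its wall clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


# Timing a trivial command so the sketch runs without any QC tool installed.
elapsed = wall_clock_seconds([sys.executable, "-c", "pass"])
```

Passing `fastp_qc_only_command("example.fastq", overrep=True)` to `wall_clock_seconds` would time a fastp run on a machine where fastp is available.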

Table 3. Real run times for falco, fastp and FastQC on uncompressed FASTQ format, with the overrepresented sequences module on and off.

Asterisks (*) indicate tests in which tools did not run to completion.

test  falco        fastp        FastQC       falco       fastp       FastQC      HTQC
      overrep off  overrep off  overrep off  overrep on  overrep on  overrep on
1     48s          1m54s        3m30s        55s         5m57s       3m23s       12m09s
2     36s          1m20s        2m08s        37s         4m32s       2m10s       *
3     27s          1m04s        1m25s        30s         2m16s       1m24s       *
4     44s          1m48s        2m40s        51s         3m37s       2m38s       *
5     15m49s       35m14s       41m27s       15m58s      44m30s      37m43s      *
6     7m59s        18m42s       22m59s       8m33s       42m50s      22m53s      134m42s
7     6m05s        13m50s       19m42s       6m49s       41m55s      19m52s      *
8     5m12s        11m47s       13m59s       5m20s       15m25s      14m08s      *
9     1s           1s           *            1s          0m26s       *           *
10    1m37s        1m50s        *            1m32s       4m37s       *           *
11    13s          24s          *            13s         1m07s       *           *
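The reported average speed-up over FastQC can be checked against the Table 3 values with a short calculation. The sketch below uses the falco and FastQC columns with the overrepresented sequences module off, for tests 1 to 8 (the tests FastQC completed); the entry printed as "6m05" for test 7 is read as 6m05s, and the variable names are hypothetical.

```python
import re


def to_seconds(duration):
    """Convert strings such as '3m30s' or '48s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = match.groups()
    return 60 * int(minutes or 0) + int(seconds)


# (falco, FastQC) run times, overrepresented sequences off, tests 1 to 8.
TABLE3_OVERREP_OFF = [
    ("48s", "3m30s"), ("36s", "2m08s"), ("27s", "1m25s"), ("44s", "2m40s"),
    ("15m49s", "41m27s"), ("7m59s", "22m59s"), ("6m05s", "19m42s"),
    ("5m12s", "13m59s"),
]

# Per-test ratio of FastQC time to falco time, then the unweighted mean.
speedups = [to_seconds(fastqc) / to_seconds(falco)
            for falco, fastqc in TABLE3_OVERREP_OFF]
mean_speedup = sum(speedups) / len(speedups)
print(f"mean speed-up over FastQC: {mean_speedup:.2f}x")
```

The mean of the per-test ratios falls slightly above three, in line with the average stated in the text.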

Table 4. Real run times for falco, fastp and FastQC on gzip compressed FASTQ format.

test  falco        fastp        FastQC       falco       fastp       FastQC
      overrep off  overrep off  overrep off  overrep on  overrep on  overrep on
1     1m19s        2m19s        3m49s        1m25s       6m23s       3m50s
2     45s          1m31s        2m21s        51s         5m23s       2m24s
3     33s          1m10s        1m35s        35s         2m26s       1m36s
4     1m01s        2m06s        3m01s        1m03s       3m59s       3m00s
5     16m05s       42m40s       44m57s       18m17s      53m09s      44m59s
6     12m26s       23m18s       26m39s       12m29s      47m32s      26m38s
7     8m40s        17m34s       22m31s       8m34s       44m41s      22m31s
8     7m08s        14m37s       16m06s       6m31s       18m19s      16m11s
9     2s           1s           *            1s          27s         *
10    2m23s        2m32s        *            2m34s       5m22s       *
11    22s          31s          *            23s         1m14s       *

Falco scales for longer Nanopore reads

Nanopore sequencing is gaining popularity in genome assembly applications and as a low-cost protocol to quantify short reads26. Nanopore sequencers can generate reads of up to millions of bases, and assessing quality metrics for these datasets is fundamental to test for potential problems in quality or bias in specific regions of such long reads. While FastQC is capable of making summaries for protocols such as 454 [27] and PacBio [28], which generate sequences with around 10,000 bases per read, we have observed that it does not run to completion when given files with longer reads of over 100,000 bases. Files for which FastQC’s analysis does not finish are marked with an asterisk in Table 3 and Table 4. Falco successfully completes its analysis on these datasets, demonstrating that it can also be used as a QC tool for longer reads.

Falco allows dynamic visualization of results

Despite the clarity of FastQC’s HTML reports, graphs are displayed as static images and have limited visualization flexibility: for example, tile heatmaps do not display raw deviations from average Phred scores at base positions, and raw values in line plots are not visible. We have opted to display falco’s analysis results using the Plotly JavaScript library29, which allows interactive changes of axis labels, hovering on data points to visualize raw values and screenshots of specific positions on the plot (Figure 1). This choice of presentation provides greater options to explore and interpret QC results while maintaining the visualization standards set by FastQC.


Figure 1. Sample HTML report for test 8 (accession SRR891268).

Layout and plots are based on FastQC’s HTML report.

Conclusions

Falco13 is a faster alternative to calculate the wide range of QC metrics generated by FastQC. It is entirely based on emulating the analysis modules FastQC provides while running faster than popular QC tools and generating dynamic visual summaries of analysis results. Both falco’s text and HTML outputs provide the same information generated by FastQC’s report, so tools that parse these files for custom visualization and downstream analysis can seamlessly incorporate falco into their pipeline.

Data availability

Datasets used to compare falco and FastQC are shown in Table 2. Guidance on how to access accession wgs-FAB49164 is available from the Benchmark directory of the falco GitHub page.

Software availability

Source code for falco is available at: https://github.com/smithlabcode/falco.

The scripts used to download files and reproduce the benchmarking steps described are also available in the same repository within the “benchmark” directory.

Archived source code at time of publication: http://doi.org/10.5281/zenodo.3520933 [13].

License: GNU General Public License version 3.0.

how to cite this article
de Sena Brandine G and Smith AD. Falco: high-speed FastQC emulation for quality control of sequencing data [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:1874 (https://doi.org/10.12688/f1000research.21142.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 30 Oct 2020
Weihong Qi, Functional Genomics Center Zurich, Zürich, Switzerland 
Approved
The authors developed falco, an emulation of the popular FastQC tool, which is faster and can handle very long Nanopore reads. It is a useful development, especially for core facilities and research labs that produce high volumes of sequencing data ...
HOW TO CITE THIS REPORT
Qi W. Reviewer Report For: Falco: high-speed FastQC emulation for quality control of sequencing data [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:1874 (https://doi.org/10.5256/f1000research.23273.r72941)
  • Author Response 27 Jan 2021
    Guilherme de Sena Brandine, Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, 90089, USA
    27 Jan 2021
    Author Response
    The reviewer has raised some important questions about points not addressed by the manuscript. We provide our responses below, and highlight changes made to the manuscript to address the reviewer’s ...
Reviewer Report 07 Jul 2020
R Henrik Nilsson, Gothenburg Global Biodiversity Centre, University of Gothenburg, Gothenburg, Sweden 
Approved with Reservations
The authors present a welcome addition to the flora of FastQC-style read processing packages. The fact that it is a drop-in replacement for FastQC is particularly nice.

The manuscript is a bit too short in my opinion. ...
HOW TO CITE THIS REPORT
Nilsson RH. Reviewer Report For: Falco: high-speed FastQC emulation for quality control of sequencing data [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:1874 (https://doi.org/10.5256/f1000research.23273.r66327)
  • Author Response 27 Jan 2021
    Guilherme de Sena Brandine, Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, 90089, USA
    27 Jan 2021
    Author Response
    The reviewer has presented a thorough feedback to the description of the Falco software tool, as well as its implementation, description and documentation. We truly appreciate the very helpful comments, ...
