Software Tool Article

NGSeasy: a next generation sequencing pipeline in Docker containers

[version 1; peer review: 3 approved with reservations]
PUBLISHED 05 Oct 2015

This article is included in the Container Virtualization in Bioinformatics collection.

Abstract

Motivation: Bioinformatic pipelines often use large numbers of components, and deploying them incurs a substantial configuration and maintenance burden that remains a significant barrier to reproducible research. Our aim is to define a new paradigm and best practices for developing, distributing and running pipelines encapsulated in Docker containers (lightweight virtualization), with a focus on next generation sequencing (NGS) workflows. This approach provides several advantages, namely efficiency, portability, versioning and reproducibility. Using the NGSeasy pipeline, a user can quickly deploy any pipeline version in any environment (e.g. operating systems, workstations, clusters, clouds). While this might also be achieved with a virtual machine (VM), VMs lack portability, have substantial overhead (disk, CPU, RAM), and require allocated resources to be provisioned statically. Docker, to a large extent, solves these issues.
Results: We demonstrate best practices for packaging and executing a multicomponent NGS pipeline using a set of container building blocks that are versioned, modular and reusable. We present a basic "proof of concept" evaluation of a next generation sequencing pipeline in Docker containers that is capable of producing meaningful results comparable with public and "best practice" workflows, with little to no impact on standard computing performance.
Availability: Both versioned Dockerfiles and container images for each component are published on GitHub and Docker Hub, respectively. The pipeline and containers can be pulled from Docker Hub and executed on any environment capable of running the Docker platform with minimum hardware requirements for running an NGS pipeline.

Keywords

next generation sequencing, docker, container, variant caller, pipeline, read alignment, reproducible, bioinformatics

Introduction

Bioinformatic pipelines are frequently composed of large numbers of loosely coupled pieces of software, each requiring substantial configuration, maintenance and management of dependencies. Historically, management frameworks such as Galaxy [1], Ruffus [2] and Taverna [3] have been developed to facilitate the packaging and reuse of pipelines. While these workflow management systems work well, portability and deployment complexity limit their usability.

Our primary motivation for developing NGSeasy was to simplify pipeline deployment for academic and clinical labs, minimising the burden of informatic support. To achieve this, we used Docker [4], an emerging container-based virtualization technology. Compared to virtual machines, Docker containers are simply a set of processes running in a multi-tenanted Linux host kernel, and so are very lightweight because there is no underlying machine to emulate. These containers capture the initial investment of effort needed to build and configure them, which greatly facilitates re-use. They can also be easily extended to modify or incorporate new components, and shared on private or public (Docker Hub) registries.

Using NGSeasy and Docker, bioinformaticians and, more importantly, researchers with fewer bioinformatic skills can very quickly deploy the pipeline to different environments (e.g. development, testing and production), with the knowledge that the containers should always run consistently. Furthermore, we support multiple versions of the NGSeasy containers on Docker Hub. Because each container packages its own dependencies and is versioned, the fidelity of the analysis is preserved in future executions, a requirement for reproducible research and clinical auditing [5].

Methods

Dockerising an NGS pipeline

NGSeasy has provided us with the opportunity to start defining best practices for building Dockerised modular pipelines, and many of these practices have been adopted in our images. Our base image (compbio/ngseasy-base) forms the foundation layer on which each pipeline container application is built.

All Dockerfiles used to generate the NGSeasy images are available at https://github.com/KHP-Informatics/ngseasy.

We include what we consider some of the best and most useful NGS "power tools" in the compbio/ngseasy-base image (Table 1). These tools all allow the user to manipulate BED/SAM/BAM/VCF files in a variety of ways.

Table 1. NGSeasy 1.0-r001 base image components.

Component | Description | Version
samtools [6] | Parse [s/b]am | 1.2-17
bcftools [7] | Parse vcf | 1.2-5-g7fa0d25
vcftools [8] | Parse vcf | v0.1.12b
vcflib [9] | Parse vcf | v1.0.0
bamUtil [10] | Parse [s/b]am | 1.0.13
bedtools [11] | Parse [s/b]am/bed | v2.23.0-10
samblaster [12] | Parse [s/b]am | 0.1.21
sambamba [13] | Parse [s/b]am | v0.5.1
seqtk [14] | Parse FASTQ | 1.0-r77
vt [15] | Parse VCF | *Latest
vawk [16] | Awk-like VCF parser | 0.0.2
bioawk [17] | Awk-like NGS parser | *Latest

Our feature-rich base image allows pipes and streamlined system calls for manipulating the output of NGS pipelines, namely BED/SAM/BAM/VCF files. We therefore built these tools into a single development environment for NGSeasy. This image is used as the base for all of our compbio/ngseasy-* tools.

A more Docker-esque approach would be to have separate containers for each NGS tool. However, this overlooks the fact that many of these tools are required to interact, e.g. through pipe calls, when used as part of a streamlined pipeline.

Many of the raw NGSeasy images are fairly heavy (2–4GB). As a result, we flattened all images in order to compress multiple Docker layers into one, creating an image with fewer and smaller layers, before committing and pushing to Docker Hub.

With the exception of the content built into the base image, each NGSeasy pipeline component (Table 2) is encapsulated in a separate container. Using separate containers helps to minimise container size, reduce unexpected interactions between components, and maximise the re-usability of containers.

Table 2. NGSeasy 1.0-r001 Components.

Component | Short description | Version
FastQC [21] | Quality reports | 0.11.2
Trimmomatic [22] | Read trimmer | 0.32
Picardtools [23] | NGS tool | 1.128
GATK [19] | NGS tool | 3.2-2
BWA [24] | Aligner | 0.7.12-r1039
Bowtie2 [25] | Aligner | 2.2.5
Stampy [26] | Aligner | 1.0.23
Snap [27] | Aligner | 1.0beta.18
Novoalign [28] | Aligner | 3.02.11
Glia [29] | Re-aligner | 03-2015
FreeBayes [30] | Variant caller | 0.9.21-5
Platypus [31] | Variant caller | 0.8

Results and discussion

Overview of the NGSeasy pipeline

A typical NGS pipeline for variant calling and discovery involves the following steps, all of which are implemented in the current version of NGSeasy (1.0-r001):

  1. Pre-alignment quality control

  2. Sequence alignment

  3. Raw alignment processing (e.g. local realignment around candidate indel sites and base quality score recalibration)

  4. Post-alignment quality control

  5. Variant calling

NGSeasy contains all of the basic tools needed for manipulation and quality control of raw FASTQ files (Illumina focused), SAM/BAM manipulation, alignment, SAM/BAM cleaning and first pass variant discovery. The software we provide as part of NGSeasy is summarised in Table 1 and Table 2.

NGSeasy follows many of the current published best practices for next generation DNA sequencing analysis. Specifically, we provide options to follow the Genome Analysis Toolkit (GATK) recommendations for de-duplication (using Picardtools MarkDuplicates), base quality score recalibration (BQSR) and realignment around indels [18–20].

We also include alternatives to GATK’s BQSR and indel realignment tools, specifically BamUtil’s recab function (http://genome.sph.umich.edu/wiki/BamUtil:recab) for recalibration and glia (https://github.com/ekg/glia) for indel realignment. These options are provided for commercial and/or clinical laboratories that do not want to use or pay for a GATK licence.

Operation and implementation

Containerised software is automatically deployed, so we have opted to provide a wide variety of tools, including multiple tools for alignment and variant calling where available.

To keep the NGSeasy pipeline small and portable, input files, indexed reference genomes and generated output should bypass the container’s root file system and instead use a host-mounted directory or volume (Figure 1).

Figure 1. Pipeline steps (orange), containerised applications (blue) each extend our base image, mounted host directories or volumes (green) are used to handle input and output.
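
As a minimal sketch of the mounting scheme in Figure 1, the helper below composes a docker run call that bind-mounts a host project directory into a container, so that inputs, reference genomes and outputs stay on the host. The function name, the container-side path /data and the example image tag are assumptions made for illustration and are not NGSeasy's actual layout; --rm, -i and -v are standard Docker options. The function prints the command rather than running it, so the call can be inspected first.

```shell
# Sketch only: mounted_run_cmd, /data and the image tag below are
# illustrative assumptions, not part of the NGSeasy code base.
# Prints a `docker run` command that bind-mounts HOST_DIR at /data so that
# the tool reads and writes on the host rather than in the container's
# own file system.
mounted_run_cmd() {
  host_dir="$1"   # host directory holding FASTQ files, references and output
  image="$2"      # container image providing the tool
  shift 2         # remaining arguments: the tool command line
  printf 'docker run --rm -i -v %s:/data %s %s\n' "$host_dir" "$image" "$*"
}

# Example: FastQC reading from and writing to the mounted directory.
mounted_run_cmd /media/scratch/ngs_projects compbio/ngseasy-fastqc \
  fastqc /data/sample_1.fq.gz --outdir /data/fastqc
```

Because nothing is written inside the container, it can be discarded after each step (--rm) without losing results.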

In certain instances it may be necessary to inspect a running container. This can be done by injecting a new process (e.g. a shell terminal) into the container with the docker exec command, a valuable feature for debugging or monitoring. For resource allocation, Docker uses cgroups to control memory and CPU allocation (hard or soft allocation).
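
For illustration, the commands involved might look as follows. The container name, image tag and limit values are placeholders chosen for this sketch. The flags themselves are standard Docker CLI options: docker exec -i -t for an interactive shell, --memory for a hard memory cap, and --cpu-shares for a relative (soft) CPU weight. The helpers print the commands rather than executing them.

```shell
# Sketch only: container/image names and limit values are placeholders.

# Print the command that opens an interactive shell inside a running container.
inspect_cmd() {
  printf 'docker exec -i -t %s /bin/bash\n' "$1"
}

# Print a `docker run` command with cgroup-backed limits:
#   --memory      hard cap on RAM available to the container
#   --cpu-shares  relative CPU weight, only enforced under contention (soft)
limited_run_cmd() {
  mem="$1"; shares="$2"; image="$3"
  shift 3
  printf 'docker run --rm --memory %s --cpu-shares %s %s %s\n' \
    "$mem" "$shares" "$image" "$*"
}

inspect_cmd ngseasy_bwa
limited_run_cmd 16g 512 compbio/ngseasy-bwa bwa mem -t 8 ref.fa r1.fq r2.fq
```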

The container images are only provided for software which is freely available. For software components which require registration (e.g. GATK), or are proprietary (e.g. novoalign), we provide a short Dockerfile to complete the build with the additional components which the user must acquire. We believe this is a pragmatic solution for packaging and publishing pipelines that provide the option to use components with a restricted licence. In this way we provide maximum automated deployment with the minimum burden on the end user.
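
A minimal sketch of what such a completing Dockerfile could contain is given below, written out from the shell. The jar file name, the target path /usr/local/gatk, the local image tag and the function name are all assumptions; the actual Dockerfiles in the NGSeasy repository should be consulted for the real build. FROM and COPY are standard Dockerfile instructions, and the base image name is the one given earlier in this article.

```shell
# Sketch only: file names and paths are assumptions, not the NGSeasy layout.
# Writes a short Dockerfile into BUILD_DIR that layers a user-supplied,
# licence-restricted jar on top of the public base image.
write_gatk_dockerfile() {
  build_dir="$1"
  mkdir -p "$build_dir"
  cat > "$build_dir/Dockerfile" <<'EOF'
FROM compbio/ngseasy-base
# GenomeAnalysisTK.jar must be obtained by the user after registration
# and placed next to this Dockerfile before building.
COPY GenomeAnalysisTK.jar /usr/local/gatk/GenomeAnalysisTK.jar
EOF
}

write_gatk_dockerfile ./gatk-build
# Then, with the jar copied into ./gatk-build:
#   docker build -t local/ngseasy-gatk ./gatk-build
```

The restricted component never passes through a public registry in this arrangement; only the recipe is shared.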

NGSeasy consists of a set of shell (bash) script wrappers that orchestrate and call all parts of the Dockerised NGS pipeline; for example, the system calls are to docker run -i -t NGSTool instead of /bin/bash NGSTool. Docker is agnostic, however, in that any workflow management software can be used to orchestrate a Docker-based pipeline (e.g. Ruffus [2] or Nextflow [32]).

Our design choice was largely influenced by our desire to provide a lightweight and fairly dependency-free solution that is "easy" to set up and maintain. We did not want the user to be tasked with installing a large number of software dependencies before being able to run NGSeasy. In this way, NGSeasy takes advantage of the fact that any modern computer, running any operating system with Docker (or for example boot2docker https://github.com/boot2docker/boot2docker-cli) installed, will come pre-packaged with all of the basic software needed to run an NGS pipeline.

NGSeasy gives the user several options to call a complete NGS pipeline, going from raw FASTQ files to aligned BAM files, variant calls (VCF) and annotations using a range of software. All options are defined in a simple configuration file that can be made, for example, using any spreadsheet application, and then saved as a tab-delimited text file. With this, the user is able to choose from a wide selection of sequence aligners and variant callers (see Table 2).
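
To illustrate how a tab-delimited configuration of this kind can drive a run, the sketch below reads one row per sample and reports the chosen aligner and variant caller. The column names (sample, fastq1, fastq2, aligner, caller) and the function name are hypothetical and do not reflect the real NGSeasy configuration schema, which is documented in the repository.

```shell
# Sketch only: the column layout is hypothetical, not the NGSeasy schema.
# Reads a tab-delimited config (header line first) and prints one plan line
# per sample.
plan_from_config() {
  config="$1"
  tab="$(printf '\t')"
  # tail -n +2 skips the header row
  tail -n +2 "$config" | while IFS="$tab" read -r sample fq1 fq2 aligner caller; do
    [ -n "$sample" ] || continue            # ignore blank lines
    printf '%s: align %s + %s with %s, call variants with %s\n' \
      "$sample" "$fq1" "$fq2" "$aligner" "$caller"
  done
}

# Example config with two samples.
cfg="$(mktemp)"
printf 'sample\tfastq1\tfastq2\taligner\tcaller\n' > "$cfg"
printf 'S1\tS1_1.fq.gz\tS1_2.fq.gz\tbwa\tfreebayes\n' >> "$cfg"
printf 'S2\tS2_1.fq.gz\tS2_2.fq.gz\tsnap\tplatypus\n' >> "$cfg"
plan_from_config "$cfg"
rm -f "$cfg"
```

Keeping every choice in one file means a run can be repeated or audited by archiving that file alongside the container versions.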

The NGSeasy scripts enforce specific naming conventions and directory structures upon the user, allowing sensible and reproducible organisation of NGS projects and associated data on the user's local machine. This also helps avoid the typographical errors that are typical of manual input.

All NGSeasy applications are run as a non-root user within each container. This is hard coded in the NGSeasy ecosystem and provides some security for Docker containers running in shared computing environments.

Many useful optimisations and recommendations were adapted from bcbio-nextgen (https://bcbio-nextgen.readthedocs.org/en/latest/), a Python toolkit providing best practice pipelines for fully automated high throughput sequencing analysis, and speedseq (https://github.com/cc2qe/speedseq), a flexible and open source framework to rapidly identify genomic variation [33].

For useful cutting edge discussion and testing of NGS pipelines, we also refer readers to the Blue Collar Bioinformatics site at http://bcb.io/.

Getting and running NGSeasy

All Dockerfiles used to generate the NGSeasy images are available at https://github.com/KHP-Informatics/ngseasy along with documentation on installing and running NGSeasy. The pre-built containers are available to download from https://registry.hub.docker.com/repos/compbio.

Getting and running NGSeasy is simple and outlined in the code block below.

Listing 1. "Getting and running NGSeasy"

## Install Docker: full instructions at https://docs.docker.com/

## Get NGSeasy
git clone https://github.com/KHP-Informatics/ngseasy.git

## Install NGSeasy
cd ngseasy
sudo make INSTALLDIR="/media/scratch" all
sudo make install

## Run NGSeasy
ngseasy -c ngseasy_test.config.tsv -d /media/scratch/ngs_projects

Users should note that deploying the pipeline containers is fairly fast, dependent on network speeds. However, downloading the reference genomes and test datasets for the resources folder can take a while. For example, the install time averages about 94 min on machines connected to relatively fast networks (i.e. > 500 Mbit/s).

For full details on obtaining, setting up and running NGSeasy, please refer to our GitHub repository documentation (https://github.com/KHP-Informatics/ngseasy).

System requirements

See Table 3 for our recommended system requirements. The hard disk requirements are based on our experience, and result from the fact that the pipeline and its tools produce a range of intermediary and temporary files for each sample. The full NGSeasy install includes indexed genomes for hg19 and b37 for all aligners, annotation files from the GATK’s resource bundle (ftp://ftp.broadinstitute.org/bundle [34]), and all of the NGSeasy Docker images.

Table 3. System requirements.

Component | Minimum | Recommended
RAM | 16GB | >48GB
CPU | 8 cores | >32 cores
Hard disk (per sample) | 50–100GB | 200–500GB
NGSeasy install | 50GB | 100GB
Indexed reference genomes | 143GB | 200GB
    hg19 | 73GB | 100GB
    b37 | 70GB | 100GB
Sample data | 50GB | 100GB

Based on our experience, a basic NGS computing system for a small lab would consist of at least 4TB disk space, 60GB RAM and at least 32 CPU cores. Network speed is a major bottleneck when dealing with NGS-sized data, and groups are encouraged to think about these issues before embarking on multi-sample or population-level studies, where computing requirements can very quickly escalate and transferring NGS data between sites becomes a major rate-limiting step.
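
As a back-of-envelope illustration of how quickly disk requirements escalate, the sketch below estimates how many samples fit on a 4TB disk using the recommended figures from Table 3. It assumes 4TB is 4000GB and counts the NGSeasy install and the indexed reference genomes as separate fixed costs; the table does not state whether the install figure already includes the genomes, so this is an assumption. The function name is illustrative.

```shell
# Sketch only: a rough disk budget using the recommended figures from
# Table 3. Whether the install figure already includes the reference
# genomes is not stated, so both are counted here (assumption).
# Usage: samples_that_fit TOTAL_GB FIXED_GB PER_SAMPLE_GB
samples_that_fit() {
  total_gb="$1"; fixed_gb="$2"; per_sample_gb="$3"
  echo $(( (total_gb - fixed_gb) / per_sample_gb ))   # integer division
}

fixed=$(( 100 + 200 ))   # NGSeasy install + indexed reference genomes (GB)
echo "4TB disk, 500GB per sample: $(samples_that_fit 4000 "$fixed" 500) samples"
echo "4TB disk, 200GB per sample: $(samples_that_fit 4000 "$fixed" 200) samples"
```

Under these assumptions a 4TB system holds the working files for roughly 7 to 18 samples at a time, which is why larger studies need storage planned in advance.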

Genome comparison and analytic testing

We tested basic NGSeasy functionality, going from raw .fastq to .bam to .vcf, on an Illumina 100bp paired-end whole exome (30x coverage) dataset available from GCAT (Genome Comparison and Analytic Testing), an analytical framework for optimizing variant discovery from personal genomes (http://www.bioplanet.com/gcat). For more details about GCAT, please refer to [35].

For this report, a basic/fast "non-GATK" based pipeline was tested. We skipped FASTQ quality control trimming, re-alignment around indels and BQSR. The selected pipeline first runs FastQC on the raw data, followed by read alignment using all of the selected aligners: stampy, snap, novoalign, and bowtie2. All reads were aligned to the UCSC hg19 reference genome available at http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/.

The alignment stage outputs a duplicate-marked (samblaster), sorted and indexed BAM file (sambamba), annotated with the appropriate read group information (e.g. sample name, platform unit etc). The alignment stage also includes generation of basic alignment statistics using sambamba’s flagstat function, and a bed file of aligned regions using the bedtools function bamtobed. These extra steps are reflected in the average run times for NGSeasy’s alignment stage (see Table 4). Note that stampy alignment is contingent on aligning reads with bwa first, and hence we chose not to report separate results for bwa.
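
The steps of the alignment stage described above can be sketched as a chain of commands. The sketch assumes bwa mem as the aligner; file names, thread count, read-group values and the function name are placeholders, and the options NGSeasy actually passes may differ. The function prints the commands for one sample rather than executing them.

```shell
# Sketch only: file names, thread count and read-group values are
# placeholders; the options NGSeasy actually passes may differ.
# Prints the alignment-stage commands for one sample.
alignment_stage_cmds() {
  sample="$1"; ref="$2"; fq1="$3"; fq2="$4"; threads="${5:-8}"
  # Line 1: align with read-group tags, mark duplicates on the stream,
  #         convert SAM to BAM.
  # Lines 2-3: coordinate-sort and index the BAM.
  # Lines 4-5: alignment statistics and a BED file of aligned regions.
  cat <<EOF
bwa mem -t $threads -R '@RG\tID:$sample\tSM:$sample\tPL:ILLUMINA' $ref $fq1 $fq2 | samblaster | sambamba view -S -f bam -o $sample.dupemk.unsorted.bam /dev/stdin
sambamba sort -t $threads -o $sample.dupemk.bam $sample.dupemk.unsorted.bam
sambamba index $sample.dupemk.bam
sambamba flagstat $sample.dupemk.bam > $sample.flagstat.txt
bedtools bamtobed -i $sample.dupemk.bam > $sample.bed
EOF
}

alignment_stage_cmds S1 hg19.fa S1_1.fq.gz S1_2.fq.gz 8
```

The first command relies on the tools sharing one environment so that they can be joined by pipes, which is the reason these utilities are bundled in the base image.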

Table 4. Average run times: 30× 100bp PE Illumina data.

NGSeasy step | 30×
FastQC | 12 mins
Read quality trimming | 15 mins
Aligner: BWA | 6–10 mins
Aligner: Bowtie2 | 60 mins
Aligner: Novoalign | 60 mins
Aligner: Snap (+index load) | 5–10 mins
Aligner: Stampy (post BWA) | 25 mins
Variant calling: Platypus | 5 mins
Variant calling: FreeBayes | 30 mins
Complete pipeline | 30–120 mins

Variant calling was performed using the haplotype-based variant callers Platypus [31] and FreeBayes [30], and the resulting VCF files were uploaded to the GCAT server for comparison with the Genome in a Bottle (GIB) call set [36]. The GCAT results for the tests listed above are available at the following URLs:

A full discussion on GIB performance statistics is beyond the scope of this paper. Briefly, for the 30x whole exome dataset, NGSeasy is achieving GIB sensitivities and specificities of 81.1–85.8% and 99.996–99.998%, respectively. There are obvious gains to be made by further pipeline optimisations, and the planned inclusion of structural variant callers and variant re-calling and filtering options.
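
For reference, the usual definitions of the two statistics are given below. This is the standard textbook form, stated here as an assumption; the exact counting rules applied by GCAT are not restated in this paper.

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}
\]
```

Here TP and FN are the true variants that were called and missed, respectively, and TN and FP are the non-variant sites that were correctly left uncalled and wrongly called. Since non-variant sites typically far outnumber variants, specificity values very close to 100% are to be expected, and sensitivity is usually the more discriminating figure.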

We are presenting these results solely as a "proof of concept". That is, we have successfully Dockerised a full NGS pipeline that is capable of producing meaningful results comparable with public and "best practice" workflows.

Run performance

For the testing carried out in this paper, NGSeasy was run on Rosalind, an OpenStack private cloud based at King's College London, using a virtual machine with 256GB RAM and 32 cores. We have also successfully tested NGSeasy on workstations running a wide variety of environments (OS X, Windows 7, Ubuntu 14.04).

Average representative run times for a full NGSeasy pipeline and its components are presented in Table 4.

The obvious winners for alignment, based purely on speed, are bwa and snap, which are comparable. The extra run time seen for snap is due to loading and reading the indexed reference genome. Once this has been done, snap runs at speed, and is the fastest aligner these authors have seen. The reported run time for stampy is dependent on bwa having been run first.

Note that FastQC and read quality trimming need only be applied once. Afterwards, the pipeline is set up to test for, and skip, these stages if they have already been run, speeding up subsequent pipeline calls that use the same data. Be aware that run times will vary depending on depth, quality of data, and compute power (e.g. available RAM and CPU).
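
One common way to implement this kind of "skip if already done" behaviour in a shell wrapper is to test for the stage's output file before running it. The sketch below shows the idiom; run_once, fake_qc and the file names are illustrative, and this is not necessarily the mechanism NGSeasy itself uses.

```shell
# Sketch only: run_once and the output-file convention are illustrative,
# not the mechanism NGSeasy itself uses.
# Runs a command unless its expected output file already exists and is
# non-empty.
# Usage: run_once OUTPUT_FILE command [args...]
run_once() {
  output="$1"; shift
  if [ -s "$output" ]; then
    echo "skip: $output already exists"
    return 0
  fi
  "$@"
}

# Example with a stand-in for a slow QC step.
workdir="$(mktemp -d)"
fake_qc() { echo "qc report" > "$1"; echo "ran QC"; }
run_once "$workdir/qc.txt" fake_qc "$workdir/qc.txt"   # first call runs the step
run_once "$workdir/qc.txt" fake_qc "$workdir/qc.txt"   # second call is skipped
rm -rf "$workdir"
```

A check on file existence alone cannot detect a truncated output from an interrupted run, so a production wrapper would likely write to a temporary name and rename on success.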

Both Platypus [31] and FreeBayes [30] are highly parallelisable and run at speed. Platypus was 6x faster than FreeBayes in our test but less sensitive: the average GIB sensitivity over all aligners was 82.40% for Platypus versus 84.15% for FreeBayes.

Running a full NGS pipeline using Docker containers produced no noticeable reduction in computing performance (run time) when compared to our original native (non-Dockerised) NGS pipeline. The differences are in the milliseconds to seconds range, and largely depend on the underlying system hardware (and data quality). These observations are similar to those reported in [37].

Strikingly, depending on available compute, read depth and the selected pipeline components, the observed run times indicate that a full clinical NGS pipeline could be run, and achieve actionable results, in less than 2 hours. This has major positive implications for molecular diagnostics and projects like the 100,000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project/). That is, alignment and variant calling are no longer a major bottleneck. More work is needed to speed up and improve library preparation, sequencing machine run times and solutions for variant annotation, prioritisation and clinical reporting.

Use cases

NGSeasy demonstrates the utility of Docker as a means to package software used in modular workflows. We envisage NGSeasy as a method for deploying drop-in analyses, in scenarios where data cannot be shared (either for size or privacy reasons) and an analysis must be carried out in situ. In such cases, using a pipeline like NGSeasy, it is simple to develop an analysis off site, package it and deploy it on computational facilities where access to the data is provided. Examples of such scenarios include the 100,000 Genomes Project and Illumina BaseSpace [38] Docker 'apps'.

In addition, NGSeasy is being tested across a select group of NHS Labs (under the NHS England Open Source Initiative) for molecular diagnostic and clinical research pipelines. In particular, a version of NGSeasy has been adapted by Viapath at King’s College Hospital (publication pending; personal communication from Dr Barnaby Clark and Dr David Brawand, http://www.viapath.co.uk/locations/kings-college-hospital). The advantages are the ease of use and set up, the built-in version control, and the ability for audit tracking and reproducibility conferred by the use of Docker and the open source community built around GitHub.

NGSeasy future developments

NGSeasy is under continual development. What we demonstrate here is the pre-production release and a basic proof of concept evaluation of NGSeasy: a next generation sequencing pipeline in Docker containers. We want to present this to the scientific community at large, especially those working in the bioinformatics domain, and wish to encourage and invite collaboration on NGSeasy and our group's efforts to Dockerise bioinformatic pipelines.

The group is currently working on a GUI for NGSeasy, along with a modular benchmarking suite. In planned extensions, NGSeasy will provide options for consensus calling, trio/family and population based calling pipelines, human leukocyte antigen (HLA) calling, structural variant calling, cancer pipelines, more optimisations, improved logging, and the latest b38 indexed genomes.

In later versions we will publish detailed benchmarking statistics for all aligners and variant calling on whole exome, genome and clinical panels from a range of depths and platforms.

Development work on Docker continues at pace. The present Docker daemon runs as root, and there remain security issues with the notion of providing access to this daemon in a shared user environment, such as a typical cluster. A solution to this exists using Linux kernel user namespaces, but this is presently undergoing review.

Software availability

how to cite this article
Folarin AA, Dobson RJ and Newhouse SJ. NGSeasy: a next generation sequencing pipeline in Docker containers [version 1; peer review: 3 approved with reservations]. F1000Research 2015, 4:997 (https://doi.org/10.12688/f1000research.7104.1)

Open Peer Review

Key to reviewer statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 23 Oct 2015
Michael Barton, Joint Genome Institute, Walnut Creek, CA, USA 
Approved with Reservations
My understanding of this article is that NGSeasy pipeline aims to simplify the distribution of common tools used in sequencing analysis. A still significant problem in bioinformatics is getting the third-party tools installed and working, by using Docker containers as ...
Barton M. Reviewer Report For: NGSeasy: a next generation sequencing pipeline in Docker containers [version 1; peer review: 3 approved with reservations]. F1000Research 2015, 4:997 (https://doi.org/10.5256/f1000research.7650.r10673)
Reviewer Report 20 Oct 2015
Brad Chapman, Bioinformatics Core, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, MA, USA 
Approved with Reservations
The authors describe NGSeasy, a containerized set of tools for variant calling on high throughput sequencing data. The implementation is open-source, maintained and well documented with instructions for getting started. The paper reports a number of useful practical results, including ...
Chapman B. Reviewer Report For: NGSeasy: a next generation sequencing pipeline in Docker containers [version 1; peer review: 3 approved with reservations]. F1000Research 2015, 4:997 (https://doi.org/10.5256/f1000research.7650.r10873)
  • Author Response 22 Oct 2015
    Stephen Newhouse, The Institute of Psychiatry, Psychology and Neuroscience, King’s College, London, UK
    22 Oct 2015
    Author Response
    Thanks for the review and comments!
    Competing Interests: No competing interests were disclosed.
Reviewer Report 06 Oct 2015
Fabien Campagne, Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, USA 
Approved with Reservations
This manuscript describes a software tool called NGSEasy, which consists of a set of configured docker images for writing pipelines to call genomic variations. The manuscript addresses a need because the portability and reproducibility of bioinformatics pipelines, such as the ...
Campagne F. Reviewer Report For: NGSeasy: a next generation sequencing pipeline in Docker containers [version 1; peer review: 3 approved with reservations]. F1000Research 2015, 4:997 (https://doi.org/10.5256/f1000research.7650.r10672)
  • Author Response 22 Oct 2015
    Stephen Newhouse, The Institute of Psychiatry, Psychology and Neuroscience, King’s College, London, UK
    22 Oct 2015
    Author Response
    Great - thanks! We are waiting for a 3rd reviewer comments before addressing all of these points.
    Competing Interests: No competing interests were disclosed.