Keywords
metaWRAP, shotgun metagenomics, high-performance computing, big data, Snakemake
As sequencing technology has become cheaper and more readily accessible, the need for increased computational capacity to process the resulting data has become apparent. High-throughput sequencing has proven particularly useful in the field of metagenomics, and substantial effort has been devoted to developing software and computational pipelines, such as metaWRAP, that cater to this growing area of research. metaWRAP is a modular wrapper that combines many of the tools needed to process reads, create bins, and visualize data within a robust modular design (Uritskiy, DiRuggiero, & Taylor, 2018). The primary limitation of this design is its inability to scale automatically to massive datasets; however, its bin-refinement code, which produces exceptionally high-quality bins from the bins generated by other binning programs, is invaluable. Snakemake is a widely used, Python-based workflow management system that automates repetitive tasks, allowing software processes to be both scalable and reproducible (Mölder et al., 2021). By integrating the original metaWRAP processes into Snakemake, the customizable, modular nature of metaWRAP is preserved even when large datasets are processed automatically through a single workflow. Our SnakeWRAP software automates the tasks performed within metaWRAP, allowing individual modules to be toggled on and off using Snakemake-defined rules.
As this tool makes use of Snakemake, individual steps of the workflow are broken into rules, many of which can be toggled (Mölder et al., 2021). To do this, Snakemake requires a YAML configuration file as input, which determines which steps are run as well as detailed parameters such as which assembly tool to use. This configuration file also specifies the location of a metadata file, which lists the files associated with the run. Examples of both the YAML and metadata files are included in the GitHub repository, and a sketch of the configuration is given below. SnakeWRAP can be submitted for scheduling from the command line using the submission script, also found within the repository. Running the script requires complete installations of both Snakemake and metaWRAP, as well as all of their dependencies. We recommend installing Snakemake and metaWRAP with the latest version of Miniconda, following the instructions found in their corresponding GitHub repositories (Mölder et al., 2021; Orjuela, Huang, Hembach, Robinson, & Soneson, 2019).
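To make the configuration concrete, a minimal sketch of such a YAML file is shown below. The key names are hypothetical placeholders chosen for illustration; the example files in the GitHub repository define the actual keys the workflow expects.

    # config.yaml (illustrative sketch; key names are hypothetical)
    metadata: "metadata.txt"     # metadata file listing the fastq files for this run
    fastq_dir: "data/raw"        # location of the input reads
    output_dir: "results"        # destination for all outputs
    assembly_tool: "megahit"     # which assembler metaWRAP should use
    threads: 24
    # toggles for the optional, memory-heavy figure-generating modules
    run_blobology: false
    run_kraken: true
    run_quant_bins: true
    run_annotate_bins: false
    run_classify_bins: true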
The advantage of using SnakeWRAP over the base metaWRAP program is its ability to support large-scale jobs while requiring less user input, because the entire process is automated. This keeps jobs better organized and improves the consistency of outputs. We recommend running this tool on a computing cluster or other high-performance infrastructure for larger jobs. In such cases, we have found that a minimum of 100 GB of RAM across 24 CPUs is sufficient, with larger jobs requiring additional resources.
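For cluster users, a submission script along the following lines can request the resources suggested above. The script in the repository is the reference version; the scheduler directives, environment name, and paths below are assumptions for a SLURM system.

    #!/bin/bash
    #SBATCH --job-name=snakewrap
    #SBATCH --mem=100G            # matches the suggested minimum of 100 GB RAM
    #SBATCH --cpus-per-task=24    # matches the suggested 24 CPUs
    #SBATCH --time=72:00:00

    # Make conda available in the batch shell, then activate an environment
    # containing Snakemake and metaWRAP (environment name is illustrative).
    source "$(conda info --base)/etc/profile.d/conda.sh"
    conda activate snakewrap

    # Launch the workflow with the user-edited configuration file.
    snakemake --snakefile Snakefile --configfile config.yaml --cores 24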
The rules found within the Snakemake file correspond to different steps within the original metaWRAP software. The initial required steps include read quality control, assembly, binning, and bin refinement and reassembly. We have provided toggle settings for the memory-heavy modules that generate figures: Blobology, Kraken, bin quantification, bin annotation, and bin classification. This design reduces resource usage by generating only the desired figures, and these figures can still be created at a later time by toggling the module on in the YAML file, deleting any outputs already created for that module, and resubmitting the job. Snakemake will skip any rules that are already complete and run only the required rule(s), as illustrated in the sketch below.
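As a sketch of how such a toggle can work in Snakemake, the fragment below requests the Kraken output only when the corresponding flag is set in the YAML file. The rule names, paths, and configuration keys are hypothetical and do not reproduce the actual SnakeWRAP rules.

    # Snakefile fragment (illustrative sketch, not the actual SnakeWRAP rules)
    configfile: "config.yaml"

    # Sample names parsed from the metadata file named in the configuration
    # (assumed here to contain one sample name per line).
    SAMPLES = [line.strip() for line in open(config["metadata"])]

    def final_targets(wildcards):
        # Bin refinement output is always required.
        targets = expand("results/{sample}/bin_refinement/done.txt", sample=SAMPLES)
        # Kraken figures are requested only when the module is toggled on.
        if config.get("run_kraken", False):
            targets += expand("results/{sample}/kraken/kronagram.html", sample=SAMPLES)
        return targets

    rule all:
        input: final_targets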
As stated above, this tool works best with a high-performance computing infrastructure. We anticipate that its most effective use is on big metagenomics data, such as meta-analyses of existing datasets or datasets with many samples. To assess the capabilities of this tool, the publicly available samples SRR13296364 and SRR13296365, accessed from the NCBI Sequence Read Archive study SRP299130, were run together successfully. We then ran a larger set of files from study SRP257563 through the metaWRAP pipeline using this software. We found that this tool runs as expected as long as sufficient memory is provided and all dependencies are installed correctly. Inputs include sequencing files in FASTQ format, a metadata file listing all FASTQ filenames, and a YAML file describing the paths for all file sources and destinations. Outputs include high-quality genomic bins, as well as a number of generated figures such as a blobplot, a heatmap, and a kronagram.
While the processing of metagenomics datasets is intractable for most personal computers, researchers with access to high-performance computing infrastructure can take full advantage of this software. The core functions found within the original metaWRAP, which include read quality control, assembly, and binning, are required for the analysis and cannot be toggled off. These functions include several refinement steps unique to metaWRAP that allow it to create higher-quality bins than existing stand-alone programs (Uritskiy et al., 2018). Our design lets the user decide whether computationally expensive modules that assist with figure generation, such as Kraken and Blobology, should be skipped.
Snakemake automatically generates a directed acyclic graph (DAG) to order tasks, track the progress of each task for each sample, and eliminate duplicate tasks for the same sample (Mölder et al., 2021). This is vital for efficiently processing and analyzing large datasets, as jobs can fail due to insufficient memory or timeouts. An advantage of implementing this workflow in Snakemake is the ability to resume the workflow from the step that was underway when a process failed. The input paths, output paths, and parameters for each job are assigned by the end user within a configuration file. This file is read by Snakemake to prevent data from being incorrectly assigned or lost and to facilitate reuse and customization of the workflow. We anticipate that this advance in automating metagenomics data processing will facilitate the generation, analysis, and re-analysis of larger datasets in the future.
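In practice, this means a failed run can be resumed by resubmitting essentially the same command. The minimal example below uses standard Snakemake options: the dry-run flag (-n) previews the remaining rules, and --rerun-incomplete re-runs any step that was interrupted while writing its output; the configuration file name is illustrative.

    # Preview which rules remain after a failure, then resume the run;
    # Snakemake skips rules whose outputs are already complete.
    snakemake --configfile config.yaml -n
    snakemake --configfile config.yaml --cores 24 --rerun-incomplete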
Source code available from: https://github.com/jkrapohl/SnakeWRAP.
Archived DOI at time of publication: https://doi.org/10.5281/zenodo.5719960.
License: Open.
We would like to thank the Brigham Young University (BYU) Office of Research Computing and BYU for providing the facilities for this work.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Metagenomics analysis, pipeline development, strain inference, MAG binning, assembly graphs, evolutionary biology
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics; genomics; metagenomics
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, High Performance Computing, Computational RNA Biology, Cardiovascular Disease