Keywords
metaWRAP, shotgun metagenomics, high-performance computing, big data, Snakemake
As sequencing technology has become cheaper and more readily accessible, the need for increased computational capacity to process the resulting data has become apparent. High-throughput sequencing has proven particularly useful in the field of metagenomics, and substantial effort has been devoted to developing software and computational pipelines, such as metaWRAP, that cater to this growing area of research. metaWRAP is a modular wrapper that combines many of the tools needed to process reads, create bins, and visualize data within a robust modular design (Uritskiy, DiRuggiero, & Taylor, 2018). The primary limitation of this design is its inability to scale automatically to massive datasets; however, its bin-refinement code, which produces exceptionally high-quality bins from the bins generated by other binning programs, is invaluable. Snakemake is a widely used, Python-based workflow management system that automates repetitive tasks, allowing software processes to be both scalable and reproducible (Mölder et al., 2021). By integrating the original metaWRAP processes into Snakemake, the customizable, modular nature of metaWRAP is preserved even when large datasets are processed automatically through a single workflow. Our SnakeWRAP software automates the tasks performed within metaWRAP, allowing individual modules to be toggled on and off using Snakemake-defined rules.
As this tool makes use of Snakemake, individual steps of the workflow are broken into rules, many of which can be toggled (Mölder et al., 2021). To do this, Snakemake requires a YAML configuration file as input, which determines which steps are run as well as detailed parameters such as which assembly tool to use. This configuration file also specifies the location of a metadata file, which lists the files associated with the run. Examples of both the YAML and metadata files are included in the GitHub repository, and a sketch of the configuration is given below. SnakeWRAP can be submitted for scheduling from the command line using the submission script, also found within the repository. Running the script requires complete installations of both Snakemake and metaWRAP, as well as all of their dependencies. We recommend installing Snakemake and metaWRAP with the latest version of Miniconda, following the instructions found in their corresponding GitHub repositories (Mölder et al., 2021; Orjuela, Huang, Hembach, Robinson, & Soneson, 2019).
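To make the configuration concrete, a minimal sketch of such a YAML file is shown below. The key names are hypothetical placeholders chosen for illustration; the example files in the GitHub repository define the actual keys the workflow expects.

    # config.yaml (illustrative sketch; key names are hypothetical)
    metadata: "metadata.txt"     # metadata file listing the fastq files for this run
    fastq_dir: "data/raw"        # location of the input reads
    output_dir: "results"        # destination for all outputs
    assembly_tool: "megahit"     # which assembler metaWRAP should use
    threads: 24
    # toggles for the optional, memory-heavy figure-generating modules
    run_blobology: false
    run_kraken: true
    run_quant_bins: true
    run_annotate_bins: false
    run_classify_bins: true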
The advantage of using SnakeWRAP over the base metaWRAP program is its ability to support large-scale jobs while requiring less user input, because the entire process is automated. This keeps jobs better organized and improves the consistency of outputs. We recommend running this tool on a computing cluster or other high-performance infrastructure for larger jobs. In such cases, we have found that a minimum of 100 GB of RAM across 24 CPUs is sufficient, with larger jobs requiring additional resources.
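For cluster users, a submission script along the following lines can request the resources suggested above. The script in the repository is the reference version; the scheduler directives, environment name, and paths below are assumptions for a SLURM system.

    #!/bin/bash
    #SBATCH --job-name=snakewrap
    #SBATCH --mem=100G            # matches the suggested minimum of 100 GB RAM
    #SBATCH --cpus-per-task=24    # matches the suggested 24 CPUs
    #SBATCH --time=72:00:00

    # Make conda available in the batch shell, then activate an environment
    # containing Snakemake and metaWRAP (environment name is illustrative).
    source "$(conda info --base)/etc/profile.d/conda.sh"
    conda activate snakewrap

    # Launch the workflow with the user-edited configuration file.
    snakemake --snakefile Snakefile --configfile config.yaml --cores 24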
The rules found within the Snakemake file correspond to different steps within the original metaWRAP software. The initial required steps include read quality control, assembly, binning, and bin refinement and reassembly. We have provided toggle settings for the memory-heavy modules that generate figures: Blobology, Kraken, bin quantification, bin annotation, and bin classification. This design reduces resource usage by generating only the desired figures, and these figures can still be created at a later time by toggling the module on in the YAML file, deleting any outputs already created for that module, and resubmitting the job. Snakemake will skip any rules that are already complete and run only the required rule(s), as illustrated in the sketch below.
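As a sketch of how such a toggle can work in Snakemake, the fragment below requests the Kraken output only when the corresponding flag is set in the YAML file. The rule names, paths, and configuration keys are hypothetical and do not reproduce the actual SnakeWRAP rules.

    # Snakefile fragment (illustrative sketch, not the actual SnakeWRAP rules)
    configfile: "config.yaml"

    # Sample names parsed from the metadata file named in the configuration
    # (assumed here to contain one sample name per line).
    SAMPLES = [line.strip() for line in open(config["metadata"])]

    def final_targets(wildcards):
        # Bin refinement output is always required.
        targets = expand("results/{sample}/bin_refinement/done.txt", sample=SAMPLES)
        # Kraken figures are requested only when the module is toggled on.
        if config.get("run_kraken", False):
            targets += expand("results/{sample}/kraken/kronagram.html", sample=SAMPLES)
        return targets

    rule all:
        input: final_targets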
As stated above, this tool works best with a high-performance computing infrastructure. We anticipate that its most effective use is on big metagenomics data, such as meta-analyses of existing datasets or datasets with many samples. To assess the capabilities of this tool, the publicly available samples SRR13296364 and SRR13296365, accessed from the NCBI Sequence Read Archive study SRP299130, were run together successfully. We then ran a larger set of files from study SRP257563 through the metaWRAP pipeline using this software. We found that this tool runs as expected as long as sufficient memory is provided and all dependencies are installed correctly. Inputs include sequencing files in FASTQ format, a metadata file listing all FASTQ filenames, and a YAML file describing the paths for all file sources and destinations. Outputs include high-quality genomic bins, as well as a number of generated figures such as a blobplot, a heatmap, and a kronagram.
While the processing of metagenomics datasets is intractable for most personal computers, researchers with access to high-performance computing infrastructure can take full advantage of this software. The core functions found within the original metaWRAP, which include read quality control, assembly, and binning, are required for the analysis and cannot be toggled off. These functions include several refinement steps unique to metaWRAP that allow it to create higher-quality bins than existing stand-alone programs (Uritskiy et al., 2018). Our design lets the user decide whether computationally expensive modules that assist with figure generation, such as Kraken and Blobology, should be skipped.
Snakemake automatically generates a directed acyclic graph (DAG) to order tasks, track the progress of each task for each sample, and eliminate duplicate tasks for the same sample (Mölder et al., 2021). This is vital for efficiently processing and analyzing large datasets, as jobs can fail due to insufficient memory or timeouts. An advantage of implementing this workflow in Snakemake is the ability to resume the workflow from the step that was underway when a process failed. The input paths, output paths, and parameters for each job are assigned by the end user within a configuration file. This file is read by Snakemake to prevent data from being incorrectly assigned or lost and to facilitate reuse and customization of the workflow. We anticipate that this advance in automating metagenomics data processing will facilitate the generation, analysis, and re-analysis of larger datasets in the future.
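In practice, this means a failed run can be resumed by resubmitting essentially the same command. The minimal example below uses standard Snakemake options: the dry-run flag (-n) previews the remaining rules, and --rerun-incomplete re-runs any step that was interrupted while writing its output; the configuration file name is illustrative.

    # Preview which rules remain after a failure, then resume the run;
    # Snakemake skips rules whose outputs are already complete.
    snakemake --configfile config.yaml -n
    snakemake --configfile config.yaml --cores 24 --rerun-incomplete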
Source code available from: https://github.com/jkrapohl/SnakeWRAP.
Archived DOI at time of publication: https://doi.org/10.5281/zenodo.5719960.
License: Open.
We would like to thank the Brigham Young University (BYU) Office of Research Computing and BYU for providing the facilities for this work.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Metagenomics analysis, pipeline development, strain inference, MAG binning, assembly graphs, evolutionary biology
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics; genomics; metagenomics
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, High Performance Computing, Computational RNA Biology, Cardiovascular Disease