Keywords
Snakemake, metagenomics, microbiology, genomics, bioinformatics, microbial ecology
Following the comments from the two reviewers, the authors have made the following updates:
i) we released a new version of our workflow (v1.1.0) that now integrates GUNC, a software tool for the detection of chimerism and contamination in prokaryotic genomes;
ii) we showed that the MAGs recovered by SnakeMAGs only did not significantly differ from the MAGs recovered by both workflows in terms of completeness, contamination, genome size and relative abundance;
iii) we demonstrated that regardless of the MAG quality criteria, SnakeMAGs produces more MAGs, and more diverse MAGs, than ATLAS.
See the authors' detailed response to the review by Aram Mikaelyan
See the authors' detailed response to the review by Célio Dias Santos Júnior
Over the last few years, microbial ecology has progressively made the transition from gene-centric to genome-centric analyses,1 allowing the clear assignment of (sometimes novel) microbial taxa to specific functions and metabolisms.2–5 Indeed, technical and methodological advances such as binning methods applied to metagenomics,6 single-cell genome sequencing7 and high-throughput cultivation methods8 have contributed to the continuing, exponential increase in available prokaryotic genomes.9 This is particularly true for metagenomics, which offers the possibility to reconstruct metagenome-assembled genomes (MAGs) at large scale and from various environments, and has thus generated a huge number of new prokaryotic genomes.10,11
Although the use of MAGs in microbial ecology is becoming common practice nowadays, processing raw metagenomic reads up to genome reconstruction involves various steps and software tools, which can represent a major technical obstacle, especially for non-specialists. To address this problem, several workflows such as MetaWRAP,12 its Snakemake version called SnakeWRAP,13 ATLAS14 and, more recently, MAGNETO,15 have been developed to automatically reconstruct genomes from metagenomes. However, these workflows contain various modules and perform more tasks than simply generating MAGs. For instance, they taxonomically assign the metagenomic reads, create gene catalogs or perform functional annotations. They rely on numerous dependencies, require significant computational resources and generate many outputs that are not essential to most research projects. To simplify this procedure and make it more accessible, while remaining efficient, reproducible and biologically relevant, we used the popular Snakemake workflow management system16 to develop a configurable and easy-to-use workflow called SnakeMAGs, which reconstructs MAGs in just a few steps. It integrates state-of-the-art bioinformatic tools to sequentially process Illumina raw reads: quality filtering of the reads, adapter trimming, an optional step of host sequence removal, assembly of the reads, binning of the contigs, quality assessment of the bins, taxonomic classification of the MAGs and estimation of the relative abundance of these MAGs.
Our tool was built with Snakemake by integrating the set of software tools needed to process metagenomic datasets. No additional equations or mathematics are needed to recreate this tool.
The workflow was developed with the workflow management system Snakemake v7.0.0,16 which is based on the Python language. Snakemake enables reproducible and scalable data analyses as well as independent management of the required software within a workflow. SnakeMAGs is composed of two main files:
The Snakefile, named “SnakeMAGs.smk”, contains the workflow script. It is divided into successive rules, which correspond to individual steps. Our workflow includes a total of 17 distinct rules. Each rule requires input files and relies on a single software tool, installed independently in a dedicated conda v4.12.0 environment when the workflow is first started. At the end of each rule, output files are generated in a dedicated folder, together with a log file (stored in the logs folder) summarizing the events of the software run and a benchmark file (stored in the benchmarks folder) containing the central processing unit (CPU) run time, the wall clock time and the maximum memory usage required to complete the rule. Thanks to Snakemake wildcards, our rules are generalized, so one can process multiple datasets in parallel without having to adjust the source code manually.
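The rule structure described above can be sketched as follows. This is a hypothetical, illustrative rule written in the Snakemake DSL: the rule name, file paths, environment file and shell command are assumptions for the purpose of the example, not the actual SnakeMAGs source.

```python
rule quality_filtering:
    input:
        r1 = "raw_fq/{sample}_R1.fastq.gz",
        r2 = "raw_fq/{sample}_R2.fastq.gz"
    output:
        r1 = "QC_fq/{sample}_R1.qc.fastq.gz",
        r2 = "QC_fq/{sample}_R2.qc.fastq.gz"
    conda:
        "envs/quality_filtering.yaml"   # dedicated conda environment for this rule
    threads: 8
    log:
        "logs/quality_filtering_{sample}.log"
    benchmark:
        "benchmarks/quality_filtering_{sample}.txt"   # CPU time, wall clock, max memory
    shell:
        # placeholder command; the actual tool invocation differs
        "<quality-filtering-command> {input.r1} {input.r2} "
        "{output.r1} {output.r2} > {log} 2>&1"
```

The `{sample}` wildcard is what lets Snakemake apply the same rule to every dataset and process them in parallel without manual edits to the source code.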
The configuration file,43 named “config.yaml”, is used to define some variable names (e.g. names of the input files), paths (e.g. working directory, location of the reference databases), software parameters and computational resource allocations (threads, memory) for each of the main steps.
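Such a configuration file might look like the following. This is a hypothetical excerpt: the key names and values are illustrative assumptions, not the exact keys used by SnakeMAGs (see the example config on the GitHub repository for the real ones).

```yaml
working_dir: "/path/to/working_directory/"
raw_fastq: "/path/to/raw_fastq/"           # location of the input reads
suffix_1: "_R1.fastq.gz"                   # naming convention of the read files
suffix_2: "_R2.fastq.gz"
adapters: "/path/to/adapters.fa"           # adapter sequences for trimming
host_removal: true                         # optional host-sequence removal step
host_genome: "/path/to/host_genomes.fa"
gtdb_path: "/path/to/GTDB/"                # reference database for classification
threads_assembly: 48                       # per-step resource allocations
mem_assembly: "200G"
```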
To run the workflow, the user only requires Snakemake. It can be easily installed, for instance via Conda, as explained in the GitHub repository:
conda create -n snakemake_7.0.0 snakemake=7.0.0
After that, the user will only have to edit the config file (an example is provided on the GitHub repository) and then run SnakeMAGs:
#Example of command on a Slurm cluster
snakemake --snakefile SnakeMAGs.smk \
 --cluster 'sbatch -p <cluster_partition> --mem <memory> -c <threads> \
 -o "cluster_logs/{wildcards}.{rule}.{jobid}.out" \
 -e "cluster_logs/{wildcards}.{rule}.{jobid}.err"' \
 --jobs <number_of_jobs> --use-conda --conda-frontend conda \
 --conda-prefix /path/to/SnakeMAGs_conda_env/ \
 --jobname "{rule}.{wildcards}.{jobid}" \
 --configfile /path/to/config.yaml
During the first use of the workflow, a dedicated Conda environment is installed for each of the bioinformatic tools to avoid conflicts. Then the input files are processed sequentially. Output files are stored in eight dedicated folders: logs, benchmarks, QC_fq (containing FASTQ files), Assembly, Binning, Bins_quality (all three containing FASTA files), Classification (containing FASTA files and text files with the taxonomic information), and MAGs_abundances (text files).
The workflow has been successfully used on a workstation with Ubuntu 22.04 as well as on high-performance computer clusters with Slurm v18.08.7 and SGE v8.1.9.
The minimal system requirements to run the workflow depend on the size of the metagenomic dataset. Small datasets (e.g. the test files provided on the GitHub repository) have been successfully analyzed on a workstation with an Intel Xeon Silver 4210, 2.20 GHz (10 cores/20 threads) processor and 96 GB of RAM. Larger datasets should be processed on a computer cluster or within a high-performance infrastructure. For instance, the performance evaluation of publicly available metagenomes (see below) was carried out on a computer cluster running the CentOS Linux release 7.4.1708 distribution with Slurm 18.08.7, on a node with an Intel Xeon CPU E7-8890 v4, 2.20 GHz (96 cores/192 threads) and 512 GB of RAM.
SnakeMAGs integrates a series of bioinformatic tools to sequentially process Illumina raw reads: quality filtering of the reads with illumina-utils v2.12,17 adapter trimming with Trimmomatic v0.3918 (RRID:SCR_011848), an optional step of host sequence removal (e.g. animal or plant sequences) with Bowtie2 v2.4.519 (RRID:SCR_016368), assembly of the reads with MEGAHIT v1.2.920 (RRID:SCR_018551), binning of the contigs with MetaBAT2 v2.1521 (RRID:SCR_019134), quality assessment of the bins with CheckM v1.1.322 (RRID:SCR_016646) and optionally with GUNC v1.0.5,23 classification of the MAGs with GTDB-Tk v2.1.024 (RRID:SCR_019136) and estimation of the relative abundance of these MAGs with CoverM v0.6.1. An overview of the workflow is presented in Figure 1.
To demonstrate the benefits and potential of our workflow, we compared it to another Snakemake workflow named ATLAS v2.9.1.14 To produce a fair comparison, ATLAS was run with the MEGAHIT assembler, without co-binning and dereplicating only 100% similar MAGs. To test these two workflows, we downloaded and analyzed ten publicly available termite gut metagenomes (accession numbers: SRR10402454; SRR14739927; SRR8296321; SRR8296327; SRR8296329; SRR8296337; SRR8296343; DRR097505; SRR7466794; SRR7466795) from five studies25–29 and belonging to ten different termite species.
SnakeMAGs requires only a limited number of input files: the raw metagenomic reads in FASTQ format from the 10 above-mentioned metagenomes, a FASTA file containing the adapter sequences,43 a YAML configuration file specifying the variable names, paths and computational resource allocations (available on the GitHub repository and on Zenodo), and, since we worked here with host-associated metagenomes, a FASTA file containing the termite genome sequences.42 Regarding the outputs, SnakeMAGs produced quality-controlled FASTQ files, without adapters or termite sequences, in the QC_fq folder. The reads assembled into contigs and scaffolds (FASTA files) were then saved in the Assembly folder. Products of the binning procedure were stored in the Binning folder. Bins with >50% completeness and <10% contamination (according to CheckM) were considered medium-quality MAGs30 and stored in the Bins_quality folder. At this step, it is also possible to use GUNC to remove potentially chimeric and contaminated genomes. Subsequently, the results of the MAG classification and relative abundance estimation were sent to the Classification and MAGs_abundances folders, respectively. ATLAS requires similar input files and produces, among others, similar output files.
ATLAS appeared to be faster than SnakeMAGs (Wilcoxon test, P = 0.002) at reconstructing MAGs from metagenomes (Figure 2A), with similar memory usage (Wilcoxon test, P = 0.393). However, SnakeMAGs always recovered at least as many MAGs (>50% completeness and <10% contamination according to CheckM) per metagenome as ATLAS, and often more (Figure 2B). From the ten metagenomes, SnakeMAGs produced a total of 65 MAGs while ATLAS generated only 37 MAGs. Additionally, SnakeMAGs was able to recover MAGs encompassing a higher diversity of bacterial phyla (n = 15 phyla) compared to ATLAS (n = 11 phyla). Only one phylum, namely Patescibacteria, represented by a single MAG, was recovered by ATLAS and not by SnakeMAGs. Conversely, ATLAS failed to reconstruct MAGs belonging to Verrucomicrobiota, Planctomycetota, Synergistota, Elusimicrobiota and Acidobacteriota where SnakeMAGs succeeded (Figure 2C). We found no difference in MAG quality or genome size between the two workflows (Wilcoxon test, P = 0.15 for completeness; P = 0.60 for contamination and P = 0.64 for genome size). We also found that the additional MAGs recovered by SnakeMAGs did not differ from the others: MAGs belonging to phyla generated by SnakeMAGs only and MAGs recovered by both SnakeMAGs and ATLAS did not significantly differ in terms of quality, relative abundance or genome size (Wilcoxon test, P = 0.19 for completeness; P = 0.43 for contamination; P = 0.51 for relative abundance and P = 0.19 for genome size).
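The phylum-level comparison above amounts to simple set arithmetic on the classification outputs. A minimal Python sketch, using only the phyla explicitly named in the text (not the complete n = 15 / n = 11 phylum lists from the study):

```python
# Phyla named in the text for each workflow (illustrative subset only):
snakemags_phyla = {
    "Spirochaetota", "Firmicutes", "Bacteroidota",          # dominant, found by both
    "Verrucomicrobiota", "Planctomycetota", "Synergistota",
    "Elusimicrobiota", "Acidobacteriota",                   # recovered by SnakeMAGs only
}
atlas_phyla = {"Spirochaetota", "Firmicutes", "Bacteroidota", "Patescibacteria"}

only_snakemags = snakemags_phyla - atlas_phyla   # phyla missed by ATLAS
only_atlas = atlas_phyla - snakemags_phyla       # {'Patescibacteria'}
shared = snakemags_phyla & atlas_phyla           # dominant phyla found by both
```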
A. CPU time (in seconds) required to process each metagenome. B. Number of MAGs reconstructed from each metagenome. On both boxplots, gray lines link the result obtained with ATLAS and the one obtained with SnakeMAGs for each of the 10 analyzed termite metagenomes. C. Number of bacterial MAGs at the phylum level recovered from each workflow.
We then evaluated the effect of MAG quality criteria on our workflow. Using an estimated quality threshold ≥50 (with quality defined as completeness − 5 × contamination),31 SnakeMAGs still allowed the recovery of more MAGs (n = 46) than ATLAS (n = 31). In terms of diversity, SnakeMAGs also recovered more phyla (n = 13) than ATLAS (n = 10). Similarly, using GUNC combined with CheckM for genome quality assessment reduced the number of MAGs recovered by both workflows, but a similar trend was observed: SnakeMAGs produced more MAGs, and more diverse MAGs (n = 59 MAGs, encompassing 13 phyla), than ATLAS (n = 29 MAGs, encompassing 9 phyla). In summary, the advantages of our workflow are robust to the choice of MAG quality criteria.
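The two quality criteria used above can be expressed as simple filters. A minimal Python sketch; the bin names and completeness/contamination values are invented for illustration:

```python
def estimated_quality(completeness: float, contamination: float) -> float:
    """Estimated quality score: completeness - 5 x contamination."""
    return completeness - 5.0 * contamination

def is_medium_quality(completeness: float, contamination: float) -> bool:
    """Medium-quality MAG criterion used in the text: >50% complete, <10% contaminated."""
    return completeness > 50.0 and contamination < 10.0

# Hypothetical CheckM results: bin name -> (completeness %, contamination %)
bins = {
    "bin.1": (92.4, 1.3),   # high completeness, low contamination
    "bin.2": (55.0, 9.0),   # medium quality, but quality score 55 - 45 = 10 < 50
    "bin.3": (48.0, 0.5),   # too incomplete for either criterion
}

medium = {b for b, (cp, ct) in bins.items() if is_medium_quality(cp, ct)}
high_score = {b for b, (cp, ct) in bins.items() if estimated_quality(cp, ct) >= 50}
print(sorted(medium))      # ['bin.1', 'bin.2']
print(sorted(high_score))  # ['bin.1']
```

As the example shows, the score-based threshold is stricter than the medium-quality criterion alone when contamination is close to the 10% ceiling.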
Using metagenomic datasets from the gut of various termite species, our analyses revealed that, while being slower, SnakeMAGs allowed the recovery of more MAGs encompassing more diverse phyla compared to ATLAS, another similar Snakemake workflow. More importantly, our results showed that SnakeMAGs was able to recover MAGs encompassing the major bacterial phyla found in termite guts,32,33 and that some of these phyla were not recovered by ATLAS. Indeed, taxa belonging to Verrucomicrobiota,34 Planctomycetota,33,35 Synergistota,36 Elusimicrobiota37 and Acidobacteriota38,39 have been repeatedly found in the guts of various termite species. As such, they represent relevant targets for genome-centric analyses of the termite gut microbiota. Although we found no significant difference in relative abundance between the MAGs belonging to phyla recovered by SnakeMAGs only and the MAGs recovered by both workflows, it is worth mentioning that Verrucomicrobiota, Planctomycetota, Synergistota, Elusimicrobiota and Acidobacteriota are usually less abundant than Spirochaetota, Firmicutes and Bacteroidota, the dominant phyla in termite guts,32,33 which were recovered by both workflows in our study. This suggests that SnakeMAGs is not restricted to the most abundant taxa. Altogether, we showed that SnakeMAGs has the potential not only to retrieve quantitatively more genomic information from metagenomes but also to extract genomic features of biological interest.
Thanks to the inherent flexibility of Snakemake, SnakeMAGs offers users the possibility to easily tune the parameters of the workflow (e.g. resource allocations for each rule, options of specific tools) to adapt their analysis to the datasets and to the computational infrastructure. Additionally, advanced users have the opportunity to edit or add new rules to the workflow. Regarding the future of SnakeMAGs, several avenues will be considered for the next versions of the workflow. Firstly, the workflow could give more freedom to users by offering a choice of different tools to perform the same task (e.g. different trimming, assembly or binning software). Secondly, with the current emergence of metagenomic datasets generated with long-read DNA sequencing,40 it might be relevant to adapt our workflow to long-read sequencing by including bioinformatic tools specific to this technology.41 Meanwhile, since the majority of metagenomic datasets have been, and are still, generated with Illumina short-read technology, SnakeMAGs can be widely used to explore the genomic content of various ecosystems via metagenomics.
Source code available from: https://github.com/Nachida08/SnakeMAGs
Archived source code at time of publication: https://doi.org/10.5281/zenodo.7665149.42
License: CeCILL v2.1
Termite genome references used for removing host sequences and their Bowtie2 index are available at: https://zenodo.org/record/6908287#.Y1JLANJBzUR
The termite gut metagenomes analyzed in the present study are available on NCBI with the following accession numbers: SRR10402454; SRR14739927; SRR8296321; SRR8296327; SRR8296329; SRR8296337; SRR8296343; DRR097505; SRR7466794; SRR7466795.
Zenodo. Reconstruction of prokaryotic genomes from ten termite gut metagenomes using two distinct workflows: SnakeMAGs and ATLAS: https://doi.org/10.5281/zenodo.7661004.43
- SnakeMAGs_config.yaml (The configuration file used to analyze the 10 termite gut metagenomes with SnakeMAGs)
- ATLAS_config.yaml (The configuration file used to analyze the 10 termite gut metagenomes with ATLAS)
- MAGs_SnakeMAGs.zip (A zipped folder containing the genomes of the 65 MAGs reconstructed with SnakeMAGs)
- MAGs_ATLAS.zip (A zipped folder containing the genomes of the 37 MAGs reconstructed with ATLAS)
- taxonomic_assignment_MAGs.csv (A text file containing the taxonomic assignment of all the MAGs reconstructed by both workflows)
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
The authors thank Emmanuelle Morin and Hélène Gardon for their valuable advice and feedback during the workflow development.
Reviewer report (Competing Interests: No competing interests were disclosed. Reviewer Expertise: Bioinformatics, microbiology, biochemistry, biotechnology):
- Is the rationale for developing the new software tool clearly explained? Yes
- Is the description of the software tool technically sound? Partly
- Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
- Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
- Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Reviewer report (Competing Interests: No competing interests were disclosed. Reviewer Expertise: Entomology, Microbial Ecology, Microbial systematics, Symbiosis, Bioinformatics, Evolution):
- Is the rationale for developing the new software tool clearly explained? Partly
- Is the description of the software tool technically sound? Yes
- Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
- Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
- Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
References
1. Lai S, Pan S, Sun C, Coelho LP, et al.: metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol. 2022; 23(1): 242.
Version 2 (revision): 27 Feb 2023
Version 1: 15 Dec 2022