<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.54418.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Large-scale quality assessment of prokaryotic genomes with metashot/prok-quality</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 2 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Albanese</surname>
                        <given-names>Davide</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-9493-3850</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Donati</surname>
                        <given-names>Claudio</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Research and Innovation Centre, Fondazione Edmund Mach, San Michele all&#x2019;Adige, TN, 38098, Italy</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:davide.albanese@fmach.it">davide.albanese@fmach.it</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>17</day>
                <month>8</month>
                <year>2021</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2021</year>
            </pub-date>
            <volume>10</volume>
            <elocation-id>822</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>12</day>
                    <month>8</month>
                    <year>2021</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2021 Albanese D and Donati C</copyright-statement>
                <copyright-year>2021</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/10-822/pdf"/>
            <abstract>
                <p>Metagenomic sequencing allows large-scale identification and genomic characterization of microbial species. Binning is the process of recovering genomes from complex mixtures of sequence fragments (metagenome contigs) of unknown bacterial and archaeal species. Assessing the quality of genomes recovered from metagenomes requires the use of complex pipelines involving many independent steps, often difficult to reproduce and maintain. A comprehensive, automated and easy-to-use computational workflow for the quality assessment of draft prokaryotic genomes, based on container technology, would greatly improve reproducibility and reusability of published results. We present metashot/prok-quality, a container-enabled Nextflow pipeline for quality assessment and genome dereplication. The metashot/prok-quality tool produces genome quality reports that are compliant with the Minimum Information about a Metagenome-Assembled Genome (MIMAG) standard, and can run out-of-the-box on any platform that supports Nextflow, Docker or Singularity, including computing clusters or batch infrastructures in the cloud. metashot/prok-quality is part of the metashot 
                    <ext-link ext-link-type="uri" xlink:href="https://metashot.github.io">collection of analysis pipelines</ext-link>. Workflow and documentation are available under GPL3 licence on 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/metashot/prok-quality">GitHub</ext-link>.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>metagenome-assembled genome</kwd>
                <kwd>MAG</kwd>
                <kwd>genome quality</kwd>
                <kwd>MIMAG</kwd>
                <kwd>dereplication</kwd>
                <kwd>completeness</kwd>
                <kwd>contamination</kwd>
                <kwd>nextflow</kwd>
                <kwd>docker</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1">
                    <funding-source>Autonomous Province of Trento (Accordo di Programma)</funding-source>
                </award-group>
                <funding-statement>This work was supported by the Autonomous Province of Trento (Accordo di Programma).</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec1" sec-type="intro">
            <title>Introduction</title>
            <p>Genome-resolved metagenomics is one of the most promising approaches to identify and characterize novel microbial species. Large-scale environmental and host-associated studies demonstrated how metagenomics can expand our knowledge of uncultivated prokaryotes, recovering thousands of metagenome-assembled genomes (MAGs) of new archaeal and bacterial species.
                <sup>
                    <xref ref-type="bibr" rid="ref1">1</xref>,
                    <xref ref-type="bibr" rid="ref2">2</xref>
                </sup> For this reason, automated and reproducible methods for assessing the quality of MAGs play a critical role.</p>
            <p>To recover MAGs, metagenomic sequence reads are first assembled into contigs using specific algorithms.
                <sup>
                    <xref ref-type="bibr" rid="ref3">3</xref>
                </sup> Contigs are then processed by tools like MetaBAT 2
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> or VAMB
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup> that use tetra-nucleotide frequency (TNF) profiles and abundance patterns to group sequences that are likely to belong to the same organism (binning). Binning improves the interpretability of metagenomic data, but at the same time represents (together with assembly) a significant source of error.
                <sup>
                    <xref ref-type="bibr" rid="ref6">6</xref>
                </sup> Manual refinement
                <sup>
                    <xref ref-type="bibr" rid="ref7">7</xref>
                </sup> can increase the quality of resulting MAGs, but undermines the reproducibility of the analysis and is unfeasible for large-scale studies.</p>
            <p>The recently introduced Minimum Information about a Metagenome-Assembled Genome (MIMAG) standard
                <sup>
                    <xref ref-type="bibr" rid="ref8">8</xref>
                </sup> recommends a set of measures for assessing the quality of MAGs. These comprise basic assembly statistics (e.g. N50), genome 
                <italic toggle="yes">completeness</italic>, 
                <italic toggle="yes">contamination</italic> and the presence of ribosomal RNA (rRNA) and transfer RNA (tRNA) genes.</p>
            <p>Recovering this information involves computational pipelines composed of a series of specialized tools that are often difficult to use and install. Moreover, each task can require parameters and custom scripts that are often poorly documented, making reproducibility of results challenging. Tools and standards such as Galaxy,
                <sup>
                    <xref ref-type="bibr" rid="ref9">9</xref>
                </sup> Nextflow
                <sup>
                    <xref ref-type="bibr" rid="ref10">10</xref>
                </sup> and the Common Workflow Language,
                <sup>
                    <xref ref-type="bibr" rid="ref11">11</xref>
                </sup> coupled with container technologies like 
                <ext-link ext-link-type="uri" xlink:href="https://www.docker.com/">Docker</ext-link>, allow researchers to circumvent these issues, providing a way to build, run and share reproducible computational workflows.
                <sup>
                    <xref ref-type="bibr" rid="ref12">12</xref>
                </sup>
            </p>
            <p>We present metashot/prok-quality, a comprehensive and easy-to-use Nextflow pipeline for assessing the quality of draft prokaryotic genomes. Metashot/prok-quality reports the quality statistics and estimates recommended by the MIMAG standard. Basic assembly statistics, completeness, both redundant and non-redundant contamination, rRNA and tRNA genes are reported in a single, comprehensive table.</p>
        </sec>
        <sec id="sec2" sec-type="methods">
            <title>Methods</title>
            <sec id="sec3">
                <title>Implementation</title>
                <p>Metashot/prok-quality is written using the Nextflow domain-specific language. Nextflow is a framework for building scalable scientific workflows using containers, allowing implicit parallelism on a wide range of computing systems. Reproducibility is guaranteed by versioned Docker images, which enclose software applications together with their dependencies, allowing isolation from the host environment and portability across platforms. metashot/prok-quality v1.2.0 is composed of five main modules (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>) and includes several custom scripts, designed to manipulate the output of the different tasks.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Metashot/prok-quality workflow.</title>
                        <p>The workflow takes a series of genomes (input bins) in FASTA format and returns: i) a tab-separated values (TSV) file including, for each input genome, the quality information recommended by the MIMAG standard (genome info table); ii) a directory containing the bins filtered according to the completeness and contamination thresholds; iii) a TSV file listing the cluster membership of each genome after the optional dereplication step and iv) a directory containing the cluster representatives. The original outputs of each task (e.g. Barrnap&#x2019;s GFF output) are also reported in dedicated folders.</p>
                    </caption>
                    <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/57902/2224e994-68ee-4f41-8118-aa038895a752_figure1.gif"/>
                </fig>
                <p>Software included in version 1.2.0:</p>
                <p>
                    <italic toggle="yes">CheckM v1.1.2.</italic> Several tools have been developed for the assessment of completeness and contamination of MAGs. The proposed workflow includes the widely used CheckM
                    <sup>
                        <xref ref-type="bibr" rid="ref13">13</xref>
                    </sup>, which estimates these metrics using catalogs of ubiquitous and lineage-specific single-copy core genes (SCGs). CheckM is also used to recover the basic assembly statistics.</p>
                <p>
                    <italic toggle="yes">GUNC v1.0.1.</italic> SCG-based tools like CheckM can have very low sensitivity towards contamination by fragments from unrelated organisms (non-redundant contamination).
                    <sup>
                        <xref ref-type="bibr" rid="ref6">6</xref>
                    </sup> In order to circumvent this problem, the recent GUNC
                    <sup>
                        <xref ref-type="bibr" rid="ref14">14</xref>
                    </sup> tool was added to the pipeline. GUNC quantifies the lineage homogeneity of contigs with respect to the full gene complement, accurately detecting chimerism induced by both redundant and non-redundant contamination.</p>
                <p>
                    <italic toggle="yes">Barrnap v0.9.</italic> The presence of 5S, 23S and 16S rRNA genes is predicted by the BAsic Rapid Ribosomal RNA Predictor (
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/tseemann/barrnap">Barrnap</ext-link>) using hidden Markov models (HMMs). Both the bacterial and the archaeal databases are used.</p>
                <p>
                    <italic toggle="yes">tRNAscan-SE v2.0.6.</italic> tRNA genes are detected with tRNAscan-SE
                    <sup>
                        <xref ref-type="bibr" rid="ref15">15</xref>
                    </sup> using the bacterial and archaeal covariance models. The numbers of tRNAs and of tRNA isotypes found are reported.</p>
                <p>
                    <italic toggle="yes">dRep v2.6.2.</italic> Dereplication is a procedure that groups the input genomes according to their whole-genome similarity, using metrics such as the Average Nucleotide Identity
                    <sup>
                        <xref ref-type="bibr" rid="ref16">16</xref>
                    </sup> (ANI). Dereplication dramatically simplifies downstream analysis when the input genomes come from different sources.
                    <sup>
                        <xref ref-type="bibr" rid="ref17">17</xref>
                    </sup> In the proposed workflow, filtered genomes (genomes that pass completeness, contamination and GUNC filters) are optionally dereplicated using dRep.
                    <sup>
                        <xref ref-type="bibr" rid="ref18">18</xref>
                    </sup> For each cluster, dRep selects as the cluster representative the best-scoring MAG according to CheckM&#x2019;s quality estimates. The score is computed using the following formula:</p>
                <p>score = completeness &#x2212; 5 &#x00d7; contamination + 0.5 &#x00d7; log(N50)</p>
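                <p>As an illustration only, the scoring formula above can be written as a short Python sketch. The function names <monospace>drep_score</monospace> and <monospace>pick_representative</monospace> and the example values are hypothetical, and the base-10 logarithm is an assumption, since the formula does not state the base. The sketch is not dRep&#x2019;s own code.</p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```python
import math


def drep_score(completeness, contamination, n50):
    """Score from the formula above (base-10 logarithm is an assumption)."""
    return completeness - 5.0 * contamination + 0.5 * math.log10(n50)


def pick_representative(cluster):
    """Return the name of the best-scoring genome in one cluster.

    `cluster` maps a genome name to (completeness %, contamination %, N50 bp).
    """
    return max(cluster, key=lambda name: drep_score(*cluster[name]))


# Hypothetical cluster of two similar MAGs (values are made up).
example_cluster = {
    "bin_01.fa": (95.0, 2.0, 100_000),
    "bin_02.fa": (98.0, 4.0, 10_000),
}
best = pick_representative(example_cluster)
```
]]></preformat>
                <p>In this made-up cluster the first genome is selected despite its lower completeness, because contamination is weighted five times more heavily than completeness.</p>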
                <p>
                    <italic toggle="yes">Python3 custom scripts.</italic> The workflow includes three Python3 custom scripts, designed to manipulate the output of the different steps. The scripts make use of 
                    <ext-link ext-link-type="uri" xlink:href="https://numpy.org/">NumPy</ext-link>,
                    <sup>
                        <xref ref-type="bibr" rid="ref17">17</xref>
                    </sup> 
                    <ext-link ext-link-type="uri" xlink:href="https://pandas.pydata.org/">Pandas</ext-link> and 
                    <ext-link ext-link-type="uri" xlink:href="https://scikit-learn.org/">scikit-learn</ext-link> libraries.</p>
            </sec>
            <sec id="sec4">
                <title>Operation</title>
                <p>metashot/prok-quality v1.2.0 requires Docker and Nextflow (tested on v20.07.1). Alternatively, the Singularity container 
                    <ext-link ext-link-type="uri" xlink:href="https://singularity.lbl.gov/">engine</ext-link> can be used in place of Docker. By default, at least 70 GB of RAM is required, a limit imposed by CheckM (v1.1.2). The workflow can also run on a workstation with 16 GB of RAM using the options 
                    <monospace>--reduced_tree</monospace> and 
                    <monospace>--max_memory 16.GB</monospace>.</p>
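                <p>For scripted runs, the command line can be assembled programmatically. The following Python sketch is one possible way to do so; the helper <monospace>build_command</monospace> and its <monospace>low_memory</monospace> argument are hypothetical, and only options described in this article are used.</p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```python
import shlex


def build_command(genomes, outdir, revision="1.2.0", low_memory=False):
    """Assemble the nextflow argument list for metashot/prok-quality."""
    cmd = ["nextflow", "run", "metashot/prok-quality", "-r", revision,
           "--genomes", genomes, "--outdir", outdir]
    if low_memory:
        # Options named in the text for a workstation with 16 GB of RAM.
        cmd += ["--reduced_tree", "--max_memory", "16.GB"]
    return cmd


cmd = build_command("bins/*.fa", "results", low_memory=True)
command_line = shlex.join(cmd)  # quoted string that can be pasted into a shell
```
]]></preformat>
                <p>The list <monospace>cmd</monospace> could then be passed to <monospace>subprocess.run</monospace>; because no shell is involved, the glob pattern reaches Nextflow unexpanded.</p>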
            </sec>
        </sec>
        <sec id="sec5">
            <title>Use case</title>
            <p>As mentioned above, metagenome assembly tools combine the sequence reads into larger regions called contigs. Recently, many metagenomic assembly tools have been proposed. Amongst these, metaSPAdes
                <sup>
                    <xref ref-type="bibr" rid="ref3">3</xref>
                </sup> and MEGAHIT
                <sup>
                    <xref ref-type="bibr" rid="ref19">19</xref>
                </sup> have been shown to handle large-scale short-read sequencing data efficiently, producing high-quality contigs. Metagenomic contigs are then processed by tools like MetaBAT 2
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> in order to group sequences that are likely to belong to the same organism (binning). After binning, it is essential to assess the quality of the resulting candidate draft genomes.</p>
            <p>In this section, we show how to assess the quality of draft prokaryotic genomes using metashot/prok-quality. Given a series of candidate genomes in FASTA format stored in the &#x201c;bins&#x201d; directory, version 1.2.0 of the workflow can be run with the following command line:

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">nextflow run metashot/prok-quality -r 1.2.0 \
  --genomes 'bins/*.fa' \
  --outdir results</preformat>
            </p>
            <p>A series of files and directories is created in the output directory (&#x201c;results&#x201d;). The main output file is &#x201c;genome_info.tsv&#x201d;. This TSV file contains, for each input genome, a set of quality statistics, including completeness, contamination, the GUNC filter result, N50, the rRNA genes found, and the number of tRNAs and tRNA types. The columns included in this file are:
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>Genome:</monospace> the genome filename;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>Completeness, Contamination, Strain heterogeneity:</monospace> CheckM estimates;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>GUNC pass:</monospace> a genome that does not pass the GUNC analysis is likely to be chimeric;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>Genome size (bp), ... , # predicted genes:</monospace> basic genome statistics (see 
                            <ext-link ext-link-type="uri" xlink:href="https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands#qa">https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands#qa</ext-link>);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>5S rRNA, 23S rRNA, 16S rRNA:</monospace> &#x201c;Yes&#x201d; if the rRNA gene was found;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace># tRNA, # tRNA types:</monospace> the number of tRNAs and of tRNA types found, respectively.</p>
                    </list-item>
                </list>
            </p>
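            <p>The table can be post-processed with a few lines of Python. The sketch below assigns each genome to a MIMAG quality tier (high: more than 90% completeness, less than 5% contamination, the three rRNA genes and at least 18 tRNAs; medium: at least 50% completeness and less than 10% contamination; low: less than 50% completeness and less than 10% contamination). The function <monospace>mimag_tier</monospace> and the example rows are hypothetical, and using the <monospace># tRNA types</monospace> column for the tRNA criterion is an assumption.</p>
            <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```python
import csv
import io


def mimag_tier(row):
    """Classify one genome_info.tsv row into a MIMAG quality tier."""
    completeness = float(row["Completeness"])
    contamination = float(row["Contamination"])
    has_rrnas = all(row[name] == "Yes"
                    for name in ("5S rRNA", "16S rRNA", "23S rRNA"))
    # Assumption: the tRNA criterion is checked on the "# tRNA types" column.
    has_trnas = int(row["# tRNA types"]) >= 18
    if contamination >= 10:
        return "unclassified"
    if completeness > 90 and contamination < 5 and has_rrnas and has_trnas:
        return "high"
    if completeness >= 50:
        return "medium"
    return "low"


# Made-up rows standing in for genome_info.tsv (only the columns needed here).
example_tsv = (
    "Genome\tCompleteness\tContamination\t5S rRNA\t23S rRNA\t16S rRNA\t# tRNA types\n"
    "bin_01.fa\t97.1\t1.2\tYes\tYes\tYes\t20\n"
    "bin_02.fa\t97.1\t1.2\tYes\tNo\tYes\t20\n"
    "bin_03.fa\t42.0\t3.0\tNo\tNo\tNo\t9\n"
)
reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
tiers = {row["Genome"]: mimag_tier(row) for row in reader}
```
]]></preformat>
            <p>To apply the same function to a real run, the in-memory example can be replaced by an open file handle on &#x201c;genome_info.tsv&#x201d;.</p>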
            <p>The directory &#x201c;filtered&#x201d; contains the genomes (in FASTA format) filtered according to 
                <monospace>--min_completeness</monospace>, 
                <monospace>--max_contamination</monospace> and 
                <monospace>--gunc_filter</monospace> options (see below). The TSV file &#x201c;genome_info_filtered.tsv&#x201d; includes the same information as &#x201c;genome_info.tsv&#x201d;, but for the filtered genomes only. Representative genomes obtained by dereplication (default ANI threshold: 0.95) are stored in the &#x201c;filtered_repr&#x201d; folder. The companion file &#x201c;derep_info.tsv&#x201d; contains the summary of the dereplication procedure, including the genome filename, the cluster ID and whether the genome is the representative of its cluster. A set of secondary directories contains the original output of each tool included in the pipeline:
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>checkm:</monospace> contains CheckM&#x2019;s original &#x201c;qc&#x201d; file;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>gunc</monospace>: contains the original GUNC output file;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>barrnap</monospace>: includes the predicted rRNA sequences for bacteria (.bac) and archaea (.arc) models in GFF and FASTA formats;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>trnascan_se:</monospace> includes the predicted tRNA sequences for bacteria (.bac) and archaea (.arc) models in TSV and FASTA formats;</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>drep:</monospace> dRep original data tables, figures and log file.</p>
                    </list-item>
                </list>
            </p>
            <p>The command options are:</p>
            <p>Input and output
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--genomes:</monospace> input genomes/bins in FASTA format (default &#x201c;data/*.fa&#x201d;);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--ext:</monospace> FASTA files extension, files with different extensions will be ignored (default &#x201c;fa&#x201d;);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--outdir:</monospace> output directory (default &#x201c;results&#x201d;);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--gunc_db:</monospace> the GUNC database. If &#x201c;none&#x201d;, the database is downloaded automatically and placed in the output folder (gunc_db directory) (default &#x201c;none&#x201d;);</p>
                    </list-item>
                </list>CheckM
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--reduced_tree:</monospace> reduce CheckM&#x2019;s memory requirement to approximately 14 GB; when enabled, set --max_memory to 16.GB (default false);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--checkm_batch_size:</monospace> run CheckM on &#x201c;checkm_batch_size&#x201d; genomes at once in order to avoid memory issues, see 
                            <ext-link ext-link-type="uri" xlink:href="https://github.com/Ecogenomics/CheckM/issues/118">https://github.com/Ecogenomics/CheckM/issues/118</ext-link> (default 1000);</p>
                    </list-item>
                </list>GUNC
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--gunc_batch_size:</monospace> run GUNC on &#x201c;gunc_batch_size&#x201d; genomes at once (default 100);</p>
                    </list-item>
                </list>Filtering
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--min_completeness:</monospace> discard genomes with less than &#x201c;min_completeness&#x201d; % completeness (default 50);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--max_contamination:</monospace> discard genomes with more than &#x201c;max_contamination&#x201d; % contamination (default 10);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--gunc_filter:</monospace> if true, discard genomes that do not pass the GUNC filter (default false);</p>
                    </list-item>
                </list>Dereplication
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--skip_dereplication:</monospace> skip the dereplication step (default false);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--ani_thr:</monospace> ANI threshold for dereplication (&gt; 0.90) (default 0.95);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--min_overlap:</monospace> minimum required overlap in the alignment between genomes to compute ANI (default 0.30);</p>
                    </list-item>
                </list>Resource limits
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--max_cpus:</monospace> maximum number of CPUs for each process (default 8);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--max_memory:</monospace> maximum memory for each process (default 70.GB);</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>
                            <monospace>--max_time:</monospace> maximum time for each process (default 96.h).</p>
                    </list-item>
                </list>
            </p>
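            <p>The combined effect of the filtering options can be sketched in Python as follows. This is a minimal illustration of the documented behaviour and defaults, not the workflow&#x2019;s own code; the function <monospace>passes_filters</monospace> and the example genomes are hypothetical, and the GUNC result is assumed to be available as a boolean.</p>
            <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```python
def passes_filters(completeness, contamination, gunc_pass,
                   min_completeness=50.0, max_contamination=10.0,
                   gunc_filter=False):
    """Mirror the documented filtering options for a single genome."""
    if completeness < min_completeness:
        return False  # less than min_completeness % completeness
    if contamination > max_contamination:
        return False  # more than max_contamination % contamination
    if gunc_filter and not gunc_pass:
        return False  # GUNC result is only enforced when gunc_filter is true
    return True


# Made-up genomes: (completeness %, contamination %, GUNC pass).
genomes = {
    "bin_01.fa": (97.1, 1.2, True),
    "bin_02.fa": (88.0, 3.5, False),  # flagged as likely chimeric by GUNC
    "bin_03.fa": (42.0, 3.0, True),   # below the default completeness
}
kept_default = sorted(g for g, v in genomes.items() if passes_filters(*v))
kept_strict = sorted(g for g, v in genomes.items()
                     if passes_filters(*v, gunc_filter=True))
```
]]></preformat>
            <p>With the default options the likely chimeric genome is retained, because the GUNC filter is disabled unless <monospace>--gunc_filter</monospace> is set to true.</p>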
        </sec>
        <sec id="sec6">
            <title>Software availability</title>
            <p>Source code available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/metashot/prok-quality">https://github.com/metashot/prok-quality</ext-link>
            </p>
            <p>Archived source code at time of publication: 
                <ext-link ext-link-type="uri" xlink:href="http://doi.org/10.5281/zenodo.4475355">http://doi.org/10.5281/zenodo.4475355</ext-link>.
                <sup>
                    <xref ref-type="bibr" rid="ref20">20</xref>
                </sup>
            </p>
            <p>License: 
                <ext-link ext-link-type="uri" xlink:href="https://opensource.org/licenses/GPL-3.0">GPL-3.0</ext-link>
            </p>
            <p>Docker image definitions available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/metashot/docker">https://github.com/metashot/docker</ext-link>
            </p>
        </sec>
        <sec id="sec7">
            <title>Data availability</title>
            <sec id="sec8">
                <title>Underlying data</title>
                <p>Zenodo: metashot/prok-quality v1.2.0 with test data, v1.2.0, 
                    <ext-link ext-link-type="uri" xlink:href="http://doi.org/10.5281/zenodo.4475355">http://doi.org/10.5281/zenodo.4475355</ext-link>.
                    <sup>
                        <xref ref-type="bibr" rid="ref20">20</xref>
                    </sup>
                </p>
                <p>This project contains test data and workflow documentation.</p>
                <p>Data are available under the terms of 
                    <ext-link ext-link-type="uri" xlink:href="https://opensource.org/licenses/GPL-3.0">GNU General Public License version 3</ext-link> (GPL-3).</p>
            </sec>
            <sec id="sec9">
                <title>Extended data</title>
                <p>Docker Hub: metashot docker images, 
                    <ext-link ext-link-type="uri" xlink:href="https://hub.docker.com/u/metashot">https://hub.docker.com/u/metashot</ext-link>
                </p>
                <p>This registry contains the pre-built Docker images.</p>
                <p>GitHub: metashot/docker, 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/metashot/docker">https://github.com/metashot/docker</ext-link>
                </p>
                <p>This project contains Docker image definitions.</p>
            </sec>
        </sec>
    </body>
    <back>
        <ack>
            <title>Acknowledgements</title>
            <p>The authors wish to thank Giuseppe Cossu and the Information Technology team of the Fondazione Edmund Mach for technical support.</p>
        </ack>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pasolli</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Asnicar</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Manara</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle.</article-title>
                    <source>

                        <italic toggle="yes">Cell.</italic>
</source>
                    <year>2019 Jan 24</year>;<volume>176</volume>(<issue>3</issue>):<fpage>649</fpage>&#x2013;<lpage>62.e20</lpage>.
                    <pub-id pub-id-type="pmid">30661755</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.cell.2019.01.001</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6349461</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Parks</surname>
                            <given-names>DH</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rinke</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chuvochina</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life.</article-title>
                    <source>

                        <italic toggle="yes">Nat Microbiol.</italic>
</source>
                    <year>2017 Nov</year>;<volume>2</volume>(<issue>11</issue>):<fpage>1533</fpage>&#x2013;<lpage>1542</lpage>.
                    <pub-id pub-id-type="pmid">29234139</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41564-017-0083-5</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Nurk</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Meleshko</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Korobeynikov</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>metaSPAdes: a new versatile metagenomic assembler.</article-title>
                    <source>

                        <italic toggle="yes">Genome Res.</italic>
</source>
                    <year>2017 May</year>;<volume>27</volume>(<issue>5</issue>):<fpage>824</fpage>&#x2013;<lpage>834</lpage>.
                    <pub-id pub-id-type="pmid">28298430</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.213959.116</pub-id>
                    <pub-id pub-id-type="pmcid">PMC5411777</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kang</surname>
                            <given-names>DD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kirton</surname>
                            <given-names>E</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies.</article-title>
                    <source>

                        <italic toggle="yes">PeerJ.</italic>
</source>
                    <year>2019 Jul 26</year>;<volume>7</volume>:<fpage>e7359</fpage>.
                    <pub-id pub-id-type="pmid">31388474</pub-id>
                    <pub-id pub-id-type="doi">10.7717/peerj.7359</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6662567</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Nissen</surname>
                            <given-names>JN</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Johansen</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Alles&#x00f8;e</surname>
                            <given-names>RL</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Improved metagenome binning and assembly using deep variational autoencoders.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2021 Jan 4</year>.
                    <pub-id pub-id-type="doi">10.1038/s41587-020-00777-4</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>L-X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Anantharaman</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Shaiber</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Accurate and complete genomes from metagenomes.</article-title>
                    <source>

                        <italic toggle="yes">Genome Res.</italic>
</source>
                    <year>2020 Mar</year>;<volume>30</volume>(<issue>3</issue>):<fpage>315</fpage>&#x2013;<lpage>333</lpage>.
                    <pub-id pub-id-type="doi">10.1101/gr.258640.119</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Shaiber</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Eren</surname>
                            <given-names>AM</given-names>
                        </name>
</person-group>:
                    <article-title>Composite Metagenome-Assembled Genomes Reduce the Quality of Public Genome Repositories.</article-title>
                    <source>

                        <italic toggle="yes">MBio.</italic>
</source>
                    <year>2019 Jun 4</year>;<volume>10</volume>(<issue>3</issue>).
                    <pub-id pub-id-type="pmid">31164461</pub-id>
                    <pub-id pub-id-type="doi">10.1128/mBio.00725-19</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6550520</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bowers</surname>
                            <given-names>RM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kyrpides</surname>
                            <given-names>NC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Stepanauskas</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2017 Aug 8</year>;<volume>35</volume>(<issue>8</issue>):<fpage>725</fpage>&#x2013;<lpage>731</lpage>.
                    <pub-id pub-id-type="doi">10.1038/nbt.3893</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Afgan</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baker</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Batut</surname>
                            <given-names>B</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2018 Jul 2</year>;<volume>46</volume>(<issue>W1</issue>):<fpage>W537</fpage>&#x2013;<lpage>44</lpage>.
                    <pub-id pub-id-type="pmid">29790989</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gky379</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6030816</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Di Tommaso</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chatzou</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Floden</surname>
                            <given-names>EW</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Nextflow enables reproducible computational workflows.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2017 Apr 11</year>;<volume>35</volume>(<issue>4</issue>):<fpage>316</fpage>&#x2013;<lpage>319</lpage>.
                    <pub-id pub-id-type="pmid">28398311</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3820</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Strozzi</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Janssen</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wurmus</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Scalable Workflows and Reproducible Data Analysis for Genomics.</article-title>
                    <source>

                        <italic toggle="yes">Methods Mol Biol.</italic>
</source>
                    <year>2019</year>;<volume>1910</volume>:<fpage>723</fpage>&#x2013;<lpage>745</lpage>.
                    <pub-id pub-id-type="pmid">31278683</pub-id>
                    <pub-id pub-id-type="doi">10.1007/978-1-4939-9074-0_24</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <label>12</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ewels</surname>
                            <given-names>PA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Peltzer</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Fillinger</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The nf-core framework for community-curated bioinformatics pipelines.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2020 Mar</year>;<volume>38</volume>(<issue>3</issue>):<fpage>276</fpage>&#x2013;<lpage>278</lpage>.
                    <pub-id pub-id-type="pmid">32055031</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41587-020-0439-x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Parks</surname>
                            <given-names>DH</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Imelfort</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Skennerton</surname>
                            <given-names>CT</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes.</article-title>
                    <source>

                        <italic toggle="yes">Genome Res.</italic>
</source>
                    <year>2015 Jul</year>;<volume>25</volume>(<issue>7</issue>):<fpage>1043</fpage>&#x2013;<lpage>1055</lpage>.
                    <pub-id pub-id-type="pmid">25977477</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.186072.114</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4484387</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Orakov</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Fullam</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Coelho</surname>
                            <given-names>LP</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>GUNC: Detection of Chimerism and Contamination in Prokaryotic Genomes.</article-title>
                    <source>

                        <italic toggle="yes">bioRxiv.</italic>
</source>
                    <year>2020</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://www.biorxiv.org/content/early/2020/12/16/2020.12.16.422776">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chan</surname>
                            <given-names>PP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lowe</surname>
                            <given-names>TM</given-names>
                        </name>
</person-group>:
                    <article-title>tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences.</article-title>
                    <source>

                        <italic toggle="yes">Methods Mol Biol.</italic>
</source>
                    <year>2019</year>;<volume>1962</volume>:<fpage>1</fpage>&#x2013;<lpage>14</lpage>.
                    <pub-id pub-id-type="pmid">31020551</pub-id>
                    <pub-id pub-id-type="doi">10.1007/978-1-4939-9173-0_1</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6768409</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goris</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Konstantinidis</surname>
                            <given-names>KT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Klappenbach</surname>
                            <given-names>JA</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>DNA-DNA hybridization values and their relationship to whole-genome sequence similarities.</article-title>
                    <source>

                        <italic toggle="yes">Int J Syst Evol Microbiol.</italic>
</source>
                    <year>2007 Jan</year>;<volume>57</volume>(<issue>Pt 1</issue>):<fpage>81</fpage>&#x2013;<lpage>91</lpage>.
                    <pub-id pub-id-type="pmid">17220447</pub-id>
                    <pub-id pub-id-type="doi">10.1099/ijs.0.64483-0</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Evans</surname>
                            <given-names>JT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Denef</surname>
                            <given-names>VJ</given-names>
                        </name>
</person-group>:
                    <article-title>To Dereplicate or Not To Dereplicate?</article-title>
                    <source>

                        <italic toggle="yes">mSphere.</italic>
</source>
                    <year>2020 May 20</year>;<volume>5</volume>(<issue>3</issue>).
                    <pub-id pub-id-type="doi">10.1128/mSphere.00971-19</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Olm</surname>
                            <given-names>MR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>CT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Brooks</surname>
                            <given-names>B</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication.</article-title>
                    <source>

                        <italic toggle="yes">ISME J.</italic>
</source>
                    <year>2017 Dec</year>;<volume>11</volume>(<issue>12</issue>):<fpage>2864</fpage>&#x2013;<lpage>2868</lpage>.
                    <pub-id pub-id-type="doi">10.1038/ismej.2017.126</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>C-M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Luo</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2015</year>.
                    <pub-id pub-id-type="pmid">25609793</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btv033</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Albanese</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Donati</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <article-title>metashot/prok-quality v1.2.0 with test data (Version 1.2.0).</article-title>
                    <source>

                        <italic toggle="yes">Zenodo.</italic>
</source>
                    <year>2021 Jan 28</year>.
                    <pub-id pub-id-type="doi">10.5281/zenodo.4475355</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report100040">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.57902.r100040</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Baurain</surname>
                        <given-names>Denis</given-names>
                    </name>
                    <xref ref-type="aff" rid="r100040a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-2388-6185</uri>
                </contrib>
                <aff id="r100040a1">
                    <label>1</label>InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Li&#x00e8;ge, Li&#x00e8;ge, Belgium</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>31</day>
                <month>1</month>
                <year>2022</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2022 Baurain D</copyright-statement>
                <copyright-year>2022</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport100040" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.54418.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This tool is useful and works as advertised by the authors. The manuscript reads well and I only have very minor comments.</p>
            <p> </p>
            <p> 
                <bold>Functionality:</bold>
            </p>
            <p> </p>
            <p> I tested metashot/prok-quality on an HPC facility running the SLURM workload manager. The authors' instructions were nearly enough to make it work on my first attempt. However, getting the Singularity options right was a little trickier than expected. In this regard, Docker is easier to use but is generally not available in HPC environments. I contacted the authors by email (posing as a regular user) and they replied immediately, confirming that the reported issue was on my side and not a bug in their package. Unfortunately, I had to attend to more pressing matters and only got back to the issue recently, hence this delayed review report. I apologize for this. For the record, here are the lines required to run the tool successfully on my system:</p>
            <p> </p>
            <p> # in&#x00a0;prok-quality/nextflow.config</p>
            <p> singularity.enabled = true</p>
            <p> singularity.cacheDir = "$PWD"</p>
            <p> singularity.autoMounts = false</p>
            <p> singularity.runOptions = "-B /path-to-user-home/prok-quality-1.2.0_with_test_data/ -B /tmpscratch/username"</p>
            <p> </p>
            <p> The interface of metashot/prok-quality is straightforward and well designed, with attention to detail, such as the ability to cope with limited RAM (
                <underline>--reduced_tree</underline> option for CheckM) and to economize resources (
                <underline>--gunc_db</underline> option to avoid downloading the GUNC database multiple times). I only have a minor complaint: the 
                <underline>--genomes</underline> option expects input filenames but if these do not match the 
                <underline>--ext</underline> option (e.g., .fasta instead of .fa), infiles are silently ignored. In my view, the 
                <underline>--ext</underline> option would make more sense with a 
                <underline>--genomes</underline> option expecting an input directory. Indeed, when infiles are specified, there is no real need for filtering.</p>
            <p> </p>
            <p> The tool's outputs are easy to understand and well documented. Testing it with default thresholds on a chimeric bacterial genome (Cornet and Baurain 2022, Genome Biology, in press) indeed flags it as contaminated and excludes it from the downstream dereplication step. In contrast, the two corresponding clean genomes go through all the steps of the pipeline, as expected.</p>
            <p> </p>
            <p> 
                <bold>Manuscript:</bold> 
                <list list-type="bullet">
                    <list-item>
                        <p>
                            <bold>Abstract:</bold> 
                            <list list-type="bullet">
                                <list-item>
                                    <p>Eukaryotic sequences can be found in metagenomic samples. These are more difficult to deal with, but should be mentioned in the second sentence.</p>
                                </list-item>
                                <list-item>
                                    <p>"... platform that supports Nextflow, Docker or Singularity" should read "... platform that supports Nextflow and either Docker or Singularity".</p>
                                </list-item>
                            </list> </p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Introduction: </bold>I would suggest adding a short paragraph about the need to use multiple tools when assessing genomic contamination. This paragraph could include (and slightly expand) the rationale for GUNC (presently located in the Methods). For some ideas, see Lupo 
                            <italic>et al. </italic>(2021
                            <sup>
                                <xref ref-type="bibr" rid="rep-ref-100040-1">1</xref>
                            </sup>).</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Methods:</bold> 
                            <list list-type="bullet">
                                <list-item>
                                    <p>Some passages say either too much or too little, for example those about the rationale for dRep. I think that one more sentence would help, and that it would be better to move the whole idea of dereplication to the end of the Introduction. Indeed, Methods should not explain concepts.</p>
                                </list-item>
                                <list-item>
                                    <p>Figure 1: The figure is nice. If I had to quibble, I would say that "filtered bins" can be misleading: here, the authors mean "bins satisfying their criteria of completeness and contamination" whereas one could imagine that they mean "bins cleaned of contaminating sequences" (i.e., bins are passed "as is" and are not modified by the pipeline). Moreover, showing the parameters that control the thresholds on the figure would help the reader realize that they can be user-specified. Finally, a word is missing in the legend: "according the completeness.." should read "according to the completeness..."</p>
                                </list-item>
                                <list-item>
                                    <p>The score formula should specify whether the completeness and contamination metrics are expressed as percentages or as fractions. Moreover, it might be interesting to make the formula user-tweakable (only a suggestion).</p>
                                </list-item>
                            </list> </p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Use case:</bold> I am not sure about the position of the backslash characters in the code snippet. I would have put them at the end of the lines.</p>
                    </list-item>
                </list>
            </p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>bioinformatics, including genomic contaminations</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-100040-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics.</article-title>
                        <source>
                            <italic>Front Microbiol</italic>
                        </source>.<year>2021</year>;<volume>12</volume>:
                        <elocation-id>10.3389/fmicb.2021.755101</elocation-id>
                        <fpage>755101</fpage>
                        <pub-id pub-id-type="pmid">34745061</pub-id>
                        <pub-id pub-id-type="doi">10.3389/fmicb.2021.755101</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report100039">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.57902.r100039</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Schmidt</surname>
                        <given-names>Thomas S. B.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r100039a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-8587-4177</uri>
                </contrib>
                <contrib contrib-type="author">
                    <name>
                        <surname>Fullam</surname>
                        <given-names>Anthony</given-names>
                    </name>
                    <xref ref-type="aff" rid="r100039a1">1</xref>
                    <role>Co-referee</role>
                </contrib>
                <contrib contrib-type="author">
                    <name>
                        <surname>Orakov</surname>
                        <given-names>Askarbek</given-names>
                    </name>
                    <xref ref-type="aff" rid="r100039a1">1</xref>
                    <role>Co-referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6823-5269</uri>
                </contrib>
                <aff id="r100039a1">
                    <label>1</label>Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>24</day>
                <month>11</month>
                <year>2021</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2021 Schmidt TSB et al.</copyright-statement>
                <copyright-year>2021</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport100039" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.54418.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors present &#x2018;prok-quality&#x2019;, a one-stop-shop Nextflow workflow for prokaryotic genome quality assessment, wrapping several state-of-the-art tools into a user-friendly pipeline. prok-quality fills two relevant niches: it provides an easy-to-install, easy-to-use interface to relevant bioinformatics tools for non-expert users, and a portable, scalable workflow to process datasets of increasing size.</p>
            <p> </p>
            <p> The code is available on GitHub. We were able to download, install and use the workflow as advertised. We have only very minor comments, to be addressed at the authors&#x2019; discretion:</p>
            <p>
                <list list-type="bullet">
                    <list-item>
                        <p>The authors restrict their discussion to metagenome-assembled genomes. However, there is no inherent reason why the prok-quality workflow shouldn&#x2019;t be used for reference genomes or isolate sequencing as well, and this may be worth emphasizing. In practice, many users work on large integrated datasets of MAGs and reference genomes, where fast, consistent quality control and clustering using a workflow such as prok-quality is highly relevant.</p>
                    </list-item>
                    <list-item>
                        <p>In the bigger picture of an entire genome-resolved metagenomics workflow, from raw reads to biological analyses, prok-quality covers a reasonable chunk: the quality control and &#x2018;dereplication&#x2019; (clustering) of genomes. That way, it can be used as a module, independently of upstream (assembly &amp; binning) and downstream (annotation, analysis) tool choices. However, in our view, prok-quality would benefit from the addition of a taxonomic classifier, e.g. GTDB-Tk. Taxonomic information would fit in very well with the reported quality metrics. In particular, the workflow could (optionally) provide consensus taxonomies for dRep 95% ANI clusters, which would add considerable value for non-expert users.</p>
                    </list-item>
                    <list-item>
                        <p>Minor: the GUNC preprint has in the meantime been peer reviewed and published (Orakov 
                            <italic>et al. </italic>(2021
                            <sup>
                                <xref ref-type="bibr" rid="rep-ref-100039-1">1</xref>
                            </sup>)).</p>
                    </list-item>
                    <list-item>
                        <p>Minor: An option to switch from running with Docker to running with Singularity (e.g. '-profile singularity' with config profiles) would greatly enhance the ease of using this workflow, as some clusters prohibit the use of Docker.</p>
                    </list-item>
                </list>
            </p>
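            <p>As a rough illustration of the last point, Nextflow configuration profiles could expose the container engine as a run-time choice. The following is only a minimal sketch under our own assumptions: the profile names and the options set inside them are hypothetical and are not taken from the prok-quality repository.</p>
            <preformat>
// nextflow.config (hypothetical sketch, not the pipeline's actual configuration)
profiles {
    // Selected with: nextflow run metashot/prok-quality -profile docker
    docker {
        docker.enabled = true
    }
    // Selected with: nextflow run metashot/prok-quality -profile singularity
    singularity {
        singularity.enabled = true
        // Let Nextflow set up the bind mounts; site-specific paths may still be needed
        singularity.autoMounts = true
    }
}
</preformat>
            <p>With profiles of this kind, users on clusters where Docker is unavailable could select Singularity from the command line, although site-specific bind paths might still have to be set locally.</p>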
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>metagenomics; microbiome</p>
            <p>We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-100039-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>GUNC: detection of chimerism and contamination in prokaryotic genomes</article-title>.
                        <source>
                            <italic>Genome Biology</italic>
                        </source>.<year>2021</year>;<volume>22</volume>(<issue>1</issue>):
                        <elocation-id>10.1186/s13059-021-02393-0</elocation-id>
                        <pub-id pub-id-type="doi">10.1186/s13059-021-02393-0</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
    </sub-article>
</article>
