<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.16804.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Method Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Do you cov me? Effect of coverage reduction on species identification and genome reconstruction in complex biological matrices by metagenome shotgun high-throughput sequencing</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 3 not approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Cattonaro</surname>
                        <given-names>Federica</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-8819-7458</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Spadotto</surname>
                        <given-names>Alessandro</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Radovic</surname>
                        <given-names>Slobodanka</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Marroni</surname>
                        <given-names>Fabio</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-1556-5907</uri>
                    <xref ref-type="corresp" rid="c2">b</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>IGA Technology Services Srl, Udine, Udine, 33100, Italy</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:fcattonaro@igatechnology.com">fcattonaro@igatechnology.com</email>
                </corresp>
                <corresp id="c2">
                    <label>b</label>
                    <email xlink:href="mailto:marroni@appliedgenomics.org">marroni@appliedgenomics.org</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>22</day>
                <month>3</month>
                <year>2019</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2018</year>
            </pub-date>
            <volume>7</volume>
            <elocation-id>1767</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>18</day>
                    <month>3</month>
                    <year>2019</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Cattonaro F et al.</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/7-1767/pdf"/>
            <abstract>
                <p>Shotgun metagenomics sequencing is a powerful tool for the characterization of complex biological matrices, enabling analysis of prokaryotic and eukaryotic organisms and viruses in a single experiment, with the possibility of reconstructing 
                    <italic toggle="yes">de novo</italic> the whole metagenome or a set of genes of interest. One of the main factors limiting the use of shotgun metagenomics in wide-scale projects is the high cost associated with the approach. However, we demonstrate that, for some applications, it is possible to use shallow shotgun metagenomics to characterize complex biological matrices while reducing costs. We measured the variation of several summary statistics while simulating a decrease in sequencing depth by randomly subsampling reads. The main statistics compared were alpha diversity estimates, species abundance, detection threshold, and the ability to reconstruct the metagenome in terms of length and completeness. Our results show that a classification of prokaryotic, eukaryotic and viral communities can be accurately performed even with a very low number of reads, both in mock communities and in real complex matrices. With samples of 100,000 reads, the alpha diversity estimates were in most cases comparable to those obtained with the full sample, and the estimated abundances of all species present were in excellent agreement with those obtained with the full sample. In contrast, any task involving the reconstruction of the metagenome performed poorly, even with the largest simulated subsample (1M reads). The reconstructed assembly was shorter than the one obtained with the full dataset, and the proportion of conserved genes identified in the metagenome was drastically reduced compared to the full sample. Shallow shotgun metagenomics can be a useful tool to describe the structure of complex matrices, but it is not adequate to reconstruct the metagenome, even partially.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>high-throughput sequencing</kwd>
                <kwd>metagenome</kwd>
                <kwd>metagenomics</kwd>
                <kwd>next generation sequencing</kwd>
                <kwd>alpha diversity</kwd>
                <kwd>complex matrices</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1">
                    <funding-source>Coordinamento Regionale Veneto per la Libert&#x00e0; delle Vaccinazioni</funding-source>
                </award-group>
                <funding-statement>Metagenome sequencing of B1 and B2 (MPRV vaccines, Priorix Tetra, GlaxoSmithKline) was financed by Corvelva (non-profit association, Veneto, Italy), in the frame of a contract work with IGA Technology Services. No other grants were involved in supporting the work.</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 1</title>
                <p>In this version we incorporated all the suggestions of the reviewers. Reviewer Alejandro Sanchez-Flores: 
                    <list list-type="bullet">
                        <list-item>
                            <p>We added a mock community (sample A1)</p>
                        </list-item>
                        <list-item>
                            <p>We added several details on sample type and input DNA quality.</p>
                        </list-item>
                        <list-item>
                            <p>We clarified that the aim of the work is to estimate the effect of the sequencing depth on metagenomics studies; thus, we chose heterogeneous samples to obtain results of general applicability.</p>
                        </list-item>
                        <list-item>
                            <p>We added an estimate of the detection threshold of rare species at varying sequencing depths.</p>
                        </list-item>
                        <list-item>
                            <p>We provide more detail regarding the use of megahit for 
                                <italic>de novo</italic> assembly. We share online a pipeline to reproduce the main manuscript analyses.</p>
                        </list-item>
                        <list-item>
                            <p>We now run BUSCO separately on each species and report average statistics.</p>
                        </list-item>
                        <list-item>
                            <p>We replaced Krona charts with a barplot. Krona charts are available online as html interactive graphs.</p>
                        </list-item>
                    </list> Reviewer Jos&#x00e9; F. Cobo Diaz: 
                    <list list-type="bullet">
                        <list-item>
                            <p>We rewrote the introduction to clarify that the aim of the work is to analyze the effect of varying sequencing depth in the characterization of complex matrices sequenced via whole genome shotgun. The first version erroneously convinced the readers that our focus was on analysis of functional data and/or on pathogens detection.</p>
                        </list-item>
                        <list-item>
                            <p>We added an estimate of the detection threshold of rare species at varying sequencing depths.</p>
                        </list-item>
                        <list-item>
                            <p>We describe parameters used for bioinformatics analysis. We share online a pipeline to reproduce the main manuscript analyses.</p>
                        </list-item>
                        <list-item>
                            <p>We provide several measures related to species detection by our approach. Our approach for species classification is accurate. However, a small number of reads (possibly due to sequencing errors) is responsible for the inflation of the number of observed species.</p>
                        </list-item>
                        <list-item>
                            <p>We incorporated Good&#x2019;s coverage and Pielou&#x2019;s evenness index in the analysis.</p>
                        </list-item>
                        <list-item>
                            <p>We modified the discussion to clarify the aim of the work and the conclusions that can be drawn based on our observations.</p>
                        </list-item>
                    </list>
                </p>
            </sec>
        </notes>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Shotgun metagenomics offers the possibility to assess the complete taxonomic composition of biological matrices and to estimate the relative abundances of each species in an unbiased way
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>,
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>. It allows the agnostic characterization of complex communities containing eukaryotes, fungi, bacteria and also viruses.</p>
            <p>Metagenome shotgun high-throughput sequencing has progressively gained popularity in parallel with advances in next-generation sequencing (NGS) technologies
                <sup>
                    <xref ref-type="bibr" rid="ref-3">3</xref>,
                    <xref ref-type="bibr" rid="ref-4">4</xref>
                </sup>, which provide more data in less time at a lower cost than previous sequencing techniques. This has allowed its extensive application to the study of a wide variety of biological mixtures, such as environmental samples
                <sup>
                    <xref ref-type="bibr" rid="ref-5">5</xref>,
                    <xref ref-type="bibr" rid="ref-6">6</xref>
                </sup>, gut samples
                <sup>
                    <xref ref-type="bibr" rid="ref-7">7</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup>, skin samples
                <sup>
                    <xref ref-type="bibr" rid="ref-10">10</xref>
                </sup>, clinical samples for diagnostics and surveillance purposes
                <sup>
                    <xref ref-type="bibr" rid="ref-11">11</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref-14">14</xref>
                </sup> and food ecosystems
                <sup>
                    <xref ref-type="bibr" rid="ref-15">15</xref>,
                    <xref ref-type="bibr" rid="ref-16">16</xref>
                </sup>. Another, more traditional approach currently used to assign taxonomy to DNA sequences is based on the sequencing of targeted conserved regions. This metabarcoding method relies on conserved sequences to characterize the communities of complex matrices. These include the highly variable region of the 16S rRNA gene in bacteria
                <sup>
                    <xref ref-type="bibr" rid="ref-17">17</xref>
                </sup>, the nuclear ribosomal internal transcribed spacer (ITS) region for fungi
                <sup>
                    <xref ref-type="bibr" rid="ref-18">18</xref>
                </sup>, the 18S rRNA gene in eukaryotes
                <sup>
                    <xref ref-type="bibr" rid="ref-19">19</xref>
                </sup>, cytochrome c oxidase subunit I (
                <italic toggle="yes">COI</italic> or 
                <italic toggle="yes">cox1</italic>) for taxonomic identification of animals
                <sup>
                    <xref ref-type="bibr" rid="ref-20">20</xref>
                </sup>, 
                <italic toggle="yes">rbcL</italic>, 
                <italic toggle="yes">matK</italic> and 
                <italic toggle="yes">ITS2</italic> as the plant barcode
                <sup>
                    <xref ref-type="bibr" rid="ref-21">21</xref>
                </sup>. Metabarcoding has the advantage of reducing sequencing needs, since it does not require sequencing of the full genome, but just a marker region. On the other hand, given the commonly used approaches, characterization of microbial and eukaryotic communities requires different primers and library preparations
                <sup>
                    <xref ref-type="bibr" rid="ref-22">22</xref>
                </sup>. In addition, several studies suggested that whole shotgun metagenome sequencing is more effective in the characterization of metagenomic samples compared to targeted amplicon approaches, with the additional capability of providing functional information regarding the studied communities
                <sup>
                    <xref ref-type="bibr" rid="ref-23">23</xref>
                </sup>.</p>
            <p>Current whole shotgun metagenome experiments are performed obtaining several million reads
                <sup>
                    <xref ref-type="bibr" rid="ref-5">5</xref>,
                    <xref ref-type="bibr" rid="ref-8">8</xref>
                </sup>. However, a broad characterization of the relative abundance of different species might be achievable with a lower number of reads.</p>
            <p>To test this hypothesis, we analyzed ten samples (eight sequenced in the framework of this study and two retrieved from the literature) derived from different complex matrices using a whole-metagenome approach, and tested the accuracy of several summary statistics as a function of the reduction in the number of reads used for analysis. The selection of samples belonging to different matrices with distinct characteristics enabled us to assess whether the results are generally applicable and, if not, which features have the greatest impact on the results.</p>
            <p>In summary, the aim of the present work is to test the effect of the reduction of sequencing depth on 1) estimates of diversity and species richness in complex matrices; 2) estimates of abundance of the species present in the complex matrix, and 3) completeness of 
                <italic toggle="yes">de novo</italic> reconstruction of the genome of the species present in the samples. To assess the consistency of our approach, we selected samples characterized by different levels of species richness and by different relative abundance of prokaryotic and eukaryotic organisms and viruses. In addition, publicly available viral particle enriched sequencing data was used to extend our analysis to viruses. Finally, we included in the study a mock community sample with known species composition.</p>
            <p>Some of the samples were predominantly composed of eukaryotic organisms, while others were composed of prokaryotes or viruses; some were represented by very few dominant species, while others had greater diversity. Results observed across such dissimilar samples are likely to be of general validity.</p>
        </sec>
        <sec sec-type="methods">
            <title>Methods</title>
            <sec>
                <title>Sample description and DNA extraction</title>
                <p>The following samples were used in the present work: the mock community DNA sample &#x201c;20 Strain Staggered Mix Genomic Material&#x201d; ATCC
                    <sup>&#x00ae;</sup> MSA-1003
                    <sup>TM</sup> (short name: A1), two biological medicines (B1 and B2), two horse fecal samples (F1 and F2), three food samples (M1, M2, and M3), and two human fecal samples (V1 and V2).</p>
                <p>The biological medicines were two different lots of a live attenuated MPRV vaccine, widely used for immunisation against measles, mumps, rubella and chickenpox in infants. Lyophilised vaccines were resuspended in 500 &#x03bc;l sterile water for injection and DNA was extracted from 250 &#x03bc;l using the Maxwell
                    <sup>&#x00ae;</sup> 16 Instrument and the Maxwell
                    <sup>&#x00ae;</sup> 16 Tissue DNA Purification Kit (Promega, Madison, WI, USA) according to the manufacturer's instructions. The vaccine composition declared by the producer consists of the following live attenuated viruses: 1) measles (ssRNA) Schwarz strain, cultured in chick embryo cell cultures; 2) mumps (ssRNA) strain RIT 4385, derived from the Jeryl Lynn strain, cultured in chick embryo cell cultures; 3) rubella (ssRNA) Wistar RA 27/3 strain, grown in human diploid cells (MRC-5); and 4) varicella (dsDNA) Oka strain, grown in human diploid cells (MRC-5).</p>
                <p>Horse feces from two individuals were processed as follows: 100 mg of starting material stored in 70% ethanol were used for DNA extraction using the QIAamp PowerFecal DNA Kit (QIAGEN GmbH, Hilden, Germany), according to the manufacturer's instructions.</p>
                <p>Food samples were raw materials of animal and plant origin, used to industrially prepare bouillon cubes. DNA extractions from those three samples were performed starting from 2 grams of material each, using the DNeasy mericon Food Kit (QIAGEN GmbH, Hilden, Germany), according to the manufacturer's instructions. The declared sample composition was 
                    <italic toggle="yes">Agaricus bisporus</italic> for M1, spice (
                    <italic toggle="yes">Piper nigrum</italic>) for M2 and a mix of animal extracts for M3.</p>
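                <p>The declared components of the mock community are: 0.18% <italic toggle="yes">Acinetobacter baumannii</italic> (ATCC 17978), 0.02% <italic toggle="yes">Actinomyces odontolyticus</italic> (ATCC 17982), 1.80% <italic toggle="yes">Bacillus cereus</italic> (ATCC 10987), 0.02% <italic toggle="yes">Bacteroides vulgatus</italic> (ATCC 8482), 0.02% <italic toggle="yes">Bifidobacterium adolescentis</italic> (ATCC 15703), 1.80% <italic toggle="yes">Clostridium beijerinckii</italic> (ATCC 35702), 0.18% <italic toggle="yes">Cutibacterium acnes</italic> (ATCC 11828), 0.02% <italic toggle="yes">Deinococcus radiodurans</italic> (ATCC BAA-816), 0.02% <italic toggle="yes">Enterococcus faecalis</italic> (ATCC 47077), 18.0% <italic toggle="yes">Escherichia coli</italic> (ATCC 700926), 0.18% <italic toggle="yes">Helicobacter pylori</italic> (ATCC 700392), 0.18% <italic toggle="yes">Lactobacillus gasseri</italic> (ATCC 33323), 0.18% <italic toggle="yes">Neisseria meningitidis</italic> (ATCC BAA-335), 18.0% <italic toggle="yes">Porphyromonas gingivalis</italic> (ATCC 33277), 1.80% <italic toggle="yes">Pseudomonas aeruginosa</italic> (ATCC 9027), 18.0% <italic toggle="yes">Rhodobacter sphaeroides</italic> (ATCC 17029), 1.80% <italic toggle="yes">Staphylococcus aureus</italic> (ATCC BAA-1556), 18.0% <italic toggle="yes">Staphylococcus epidermidis</italic> (ATCC 12228), 1.80% <italic toggle="yes">Streptococcus agalactiae</italic> (ATCC BAA-611), 18.0% <italic toggle="yes">Streptococcus mutans</italic> (ATCC 700610).</p>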
                <p>DNA purity and concentration were estimated using a NanoDrop Spectrophotometer (NanoDrop Technologies Inc., Wilmington, DE, USA) and Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA, USA).</p>
                <p>Human fecal samples V1 and V2 derive from a study investigating the virome composition of feces of Amerindians
                    <sup>
                        <xref ref-type="bibr" rid="ref-24">24</xref>
                    </sup>. The two samples with the highest sequencing depth were chosen. Sequences were retrieved from SRA (SRR6287060 and SRR6287079, respectively).</p>
            </sec>
            <sec>
                <title>Whole metagenome DNA library construction and sequencing</title>
                <p>DNA libraries were prepared according to the manufacturer&#x2019;s protocol, using the Ovation
                    <sup>&#x00ae;</sup> Ultralow System V4 1&#x2013;96 kit (Nugen, San Carlos, CA). Library preparation was monitored and validated using both a Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA, USA) and an Agilent 2100 Bioanalyzer with the DNA High Sensitivity Analysis kit (Agilent Technologies, Santa Clara, CA). The DNA concentrations obtained were as follows: A1 8 ng/&#x00b5;l (total amount = 640 ng), B1 10.7 ng/&#x00b5;l (total amount = 535 ng), B2 9.41 ng/&#x00b5;l (total amount = 470.5 ng), F1 42.3 ng/&#x00b5;l (total amount = 4,230 ng), F2 22.6 ng/&#x00b5;l (total amount = 2,260 ng), M1 16.6 ng/&#x00b5;l (total amount = 1,494 ng), M2 1.87 ng/&#x00b5;l (total amount = 168.3 ng), M3 16 ng/&#x00b5;l (total amount = 640 ng).</p>
                <p>Cluster generation was then performed on an Illumina cBot with a HiSeq SBS V4 flowcell (250 cycles), and libraries were sequenced on an Illumina HiSeq2500 sequencer, producing 125 bp paired-end reads.</p>
                <p>Samples F1 and F2 were loaded on a HiSeq Rapid SBS Kit v2 flowcell (500 cycles), producing 250 bp paired-end reads. The estimated library insert sizes were: 539 bp (A1), 531 bp (B1), 536 bp (B2), 620 bp (F1), 620 bp (F2), 342 bp (M1), 178 bp (M2), 496 bp (M3). Samples were sequenced in different runs and pooled with other libraries of similar insert sizes.</p>
                <p>The CASAVA Illumina Pipeline version 1.8.2 was used for base-calling and de-multiplexing. Adapters were masked using cutadapt
                    <sup>
                        <xref ref-type="bibr" rid="ref-25">25</xref>
                    </sup>. Masked and low-quality bases were filtered using 
                    <ext-link ext-link-type="uri" xlink:href="http://erne.sourceforge.net/">erne-filter</ext-link> version 1.4.6
                    <sup>
                        <xref ref-type="bibr" rid="ref-26">26</xref>
                    </sup>.</p>
            </sec>
            <sec>
                <title>Bioinformatics analysis</title>
                <p>The bioinformatics analyses performed in the present work are summarized in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref>; a standard pipeline for reproducing the main steps of analysis is available on 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/fabiomarroni/doyoucovme">GitHub</ext-link>
                    <sup>
                        <xref ref-type="bibr" rid="ref-27">27</xref>
                    </sup>.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Workflow of the main bioinformatics analyses performed in the present work.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure1.gif"/>
                </fig>
                <p>Since different read lengths among samples may constitute an additional confounder in the analysis, the 250 bp reads belonging to F1, F2, V1 and V2 were trimmed to a length of 125 bp using fastx-toolkit version 0.0.13 (
                    <ext-link ext-link-type="uri" xlink:href="http://hannonlab.cshl.edu/fastx_toolkit/">http://hannonlab.cshl.edu/fastx_toolkit/</ext-link>) before subsequent analysis.</p>
                <p>Reduction in coverage was simulated by randomly sampling a fixed number of reads from the full set of reads. Subsamples of 10,000, 25,000, 50,000, 100,000, 250,000, 500,000 and 1,000,000 reads were extracted from the raw reads using 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/lh3/seqtk">seqtk</ext-link> version 1.3. To estimate the variability due to random effects, subsampling was replicated five times for each simulated depth and 99% confidence limits were estimated and plotted.</p>
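                <p>The subsampling design described above can be illustrated with a minimal Python sketch. seqtk was the tool actually used; the code below only mimics the idea with a generic reservoir sampler. The seeds, the toy read names and the use of a Student t interval for the 99% confidence limits are assumptions made for illustration, since the text does not state how the limits were computed.</p>
                <code language="python">
```python
import math
import random
import statistics

N_REPLICATES = 5  # replicates per simulated depth, as stated in the text

# Two-sided 99% Student t quantile for 4 degrees of freedom (5 replicates).
T_995_DF4 = 4.604


def reservoir_sample(records, n, seed):
    """Draw n records uniformly at random, without replacement, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i >= n:
            # Keep the new record with probability n / (i + 1).
            j = rng.randint(0, i)
            if n > j:
                reservoir[j] = record
        else:
            reservoir.append(record)
    return reservoir


def confidence_limits_99(values):
    """Mean minus/plus t times the standard error, for five replicates."""
    if len(values) != N_REPLICATES:
        raise ValueError("T_995_DF4 is only valid for five replicates")
    mean = statistics.mean(values)
    half_width = T_995_DF4 * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


# Toy usage: 1,000 fake read pairs, subsampled to 100 pairs five times.
mate1 = [f"read{i}/1" for i in range(1000)]
mate2 = [f"read{i}/2" for i in range(1000)]
for replicate in range(N_REPLICATES):
    seed = 100 + replicate  # hypothetical seeds, one per replicate
    sub1 = reservoir_sample(mate1, 100, seed)
    sub2 = reservoir_sample(mate2, 100, seed)
    # Same seed, same positions: the two mates of each pair stay together.
    assert [r[:-2] for r in sub1] == [r[:-2] for r in sub2]
```
                </code>
                <p>Reusing one seed for both mate files keeps the pairs synchronized, because the positions selected depend only on the seed and on the record index.</p>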
                <p>To classify the largest possible number of prokaryotes, eukaryotes and viruses, reads were classified against the complete NCBI nt database using kraken2, version 2.0.6
                    <sup>
                        <xref ref-type="bibr" rid="ref-28">28</xref>
                    </sup>. The nt database was converted to kraken2 format using the built-in kraken2-build script with default parameters. Among the most significant parameters, kmer size for the database is by default set to 35 and the minimizer length to 31. A simplified representation of species composition was obtained using Krona
                    <sup>
                        <xref ref-type="bibr" rid="ref-29">29</xref>
                    </sup>.</p>
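                <p>As a hedged illustration of how per-species read counts might be pulled from a classification result, the sketch below parses the standard six-column, tab-separated kraken2 report (percentage, clade reads, directly assigned reads, rank code, taxid, indented name). The choice of clade-level counts and of the plain "S" rank code is an assumption, not necessarily what the authors did, and the example report is made up.</p>
                <code language="python">
```python
def species_counts_from_report(report_text):
    """Map species name to the number of reads in the clade rooted at it."""
    counts = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or reports with extra columns
        _percent, clade_reads, _direct_reads, rank, _taxid, name = fields
        if rank == "S":  # species rank only; S1, S2 mark sub-species levels
            counts[name.strip()] = int(clade_reads)
    return counts


# Made-up report fragment, for illustration only.
example_report = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t5\tR\t1\troot",
    " 60.00\t600\t0\tG\t561\t      Escherichia",
    " 60.00\t600\t590\tS\t562\t        Escherichia coli",
    " 30.00\t300\t300\tS\t1280\t        Staphylococcus aureus",
])
print(species_counts_from_report(example_report))
```
                </code>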
                <p>Observed number of taxa, Chao1 species richness
                    <sup>
                        <xref ref-type="bibr" rid="ref-30">30</xref>
                    </sup>, Good&#x2019;s coverage
                    <sup>
                        <xref ref-type="bibr" rid="ref-31">31</xref>
                    </sup>, Shannon&#x2019;s diversity index
                    <sup>
                        <xref ref-type="bibr" rid="ref-32">32</xref>
                    </sup> and Pielou&#x2019;s index
                    <sup>
                        <xref ref-type="bibr" rid="ref-33">33</xref>
                    </sup> were estimated using the R package vegan version 2.4.2
                    <sup>
                        <xref ref-type="bibr" rid="ref-34">34</xref>
                    </sup> or base R functions. The number of observed taxa was computed as the number of species to which at least one read was assigned. The number of singletons is defined as the number of species identified by only one read. The number of core species is the number of species with a frequency equal to or greater than 1&#x2030;. We then define the measure S90, obtained as follows: a) sort species by decreasing abundance, b) compute the cumulative sum of the species abundances, and c) report how many of the ordered species are needed to reach an abundance equal to or greater than 90% of the total number of reads.</p>
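                <p>The definitions above can be sketched in a few lines of Python. The analysis itself was run in R with vegan; this is only an illustrative reimplementation under stated assumptions: the input is a hypothetical mapping of species to read counts, the total is taken over species-level reads only, and Chao1 is written in its bias-corrected form, which may differ from the variant the authors obtained.</p>
                <code language="python">
```python
import math


def summarize(counts):
    """Summary statistics from a mapping of species name to read count."""
    abundances = sorted((c for c in counts.values() if c > 0), reverse=True)
    total = sum(abundances)
    if total == 0:
        raise ValueError("no reads assigned to any species")
    observed = len(abundances)
    singletons = sum(1 for c in abundances if c == 1)
    doubletons = sum(1 for c in abundances if c == 2)

    # Core species: frequency of at least 1 per mille (integer arithmetic).
    core = sum(1 for c in abundances if c * 1000 >= total)

    # S90: walk the sorted list until 90% of the reads are accounted for.
    s90, cumulative = 0, 0
    for c in abundances:
        s90 += 1
        cumulative += c
        if cumulative * 10 >= total * 9:
            break

    shannon = -sum((c / total) * math.log(c / total) for c in abundances)
    # Pielou's evenness is undefined when only one species is observed.
    pielou = shannon / math.log(observed) if observed > 1 else float("nan")
    goods_coverage = 1 - singletons / total
    # Bias-corrected Chao1 (assumed variant).
    chao1 = observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
    return {
        "observed": observed, "singletons": singletons, "core": core,
        "s90": s90, "shannon": shannon, "pielou": pielou,
        "goods_coverage": goods_coverage, "chao1": chao1,
    }


# Made-up counts: one dominant species and a tail of rare ones.
toy = {"a": 9000, "b": 500, "c": 300, "d": 190, "e": 8, "f": 1, "g": 1}
print(summarize(toy))
```
                </code>
                <p>In the toy example a single species already holds 90% of the reads, so S90 is 1 even though seven species are observed; this is the kind of contrast between richness and dominance that S90 is meant to capture.</p>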
                <p>Assembly of the metagenome was performed using megahit version 1.1.2
                    <sup>
                        <xref ref-type="bibr" rid="ref-35">35</xref>
                    </sup> with default parameters, with kmer sizes varying as follows: 21, 29, 39, 59, 79, 99, 119, 141. Reconstructed contigs were classified at the species level using kraken2. Completeness of the assemblies of each species was assessed using BUSCO
                    <sup>
                        <xref ref-type="bibr" rid="ref-36">36</xref>
                    </sup>. For each species, the proportion of reconstructed genes was measured as the proportion of genes that were fully reconstructed, plus the proportion of genes that were partially reconstructed. For each sample, results were then averaged over species to provide the average proportion of reconstructed genes. BUSCO analysis was performed using the prokaryotic database for all samples, with the exception of M1 (predominantly composed of fungi), for which the fungal database was used.</p>
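                <p>The averaging step can be written out as a short sketch. The input layout (species mapped to complete, fragmented and total BUSCO gene counts) and the numbers are hypothetical; only the arithmetic follows the description above, with "partially reconstructed" read as the BUSCO fragmented class, which is an assumption.</p>
                <code language="python">
```python
def reconstructed_fraction(complete, fragmented, total):
    """Fraction of BUSCO genes found either completely or partially."""
    return (complete + fragmented) / total


def sample_average(per_species):
    """Unweighted mean of the per-species reconstructed fractions."""
    fractions = [reconstructed_fraction(*v) for v in per_species.values()]
    return sum(fractions) / len(fractions)


# Hypothetical BUSCO counts: (complete, fragmented, total genes searched).
example = {
    "species_1": (50, 10, 100),
    "species_2": (10, 10, 100),
}
print(sample_average(example))
```
                </code>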
                <p>Unless otherwise specified, all analyses were performed using R 3.3.3
                    <sup>
                        <xref ref-type="bibr" rid="ref-37">37</xref>
                    </sup>.</p>
            </sec>
        </sec>
        <sec sec-type="results">
            <title>Results</title>
            <sec>
                <title>Sample composition and downsampling</title>
                <p>Summary statistics for the full samples included in the study are shown in 
                    <xref ref-type="table" rid="T1">Table 1</xref>.</p>
                <table-wrap id="T1" orientation="portrait" position="anchor">
                    <label>Table 1. </label>
                    <caption>
                        <title>Summary statistics for the full samples included in the study.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Sample</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N reads</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Core</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">N species</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Singletons</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">% Top 20</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">% Top 100</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">S90</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">A1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">4,969,245</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">16</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2,571</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,191</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98.91</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99.66</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">7</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">B1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">11,031,061</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">4</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2,507</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,299</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99.75</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99.81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">B2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">3,830,083</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">9</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">4,597</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,795</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98.83</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99.27</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">F1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">12,472,553</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">29,660</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">14,750</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">21.16</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">38.45</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2,795</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">F2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">10,780,450</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">106</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">25,607</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">12,374</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">19.94</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">36.67</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2,947</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">M1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,898,011</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">3,206</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,469</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99.03</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99.35</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">M2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,558,975</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">132</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">9,637</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">3,377</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">36.68</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">61.68</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,218</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">M3</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,867,879</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">19</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">5,566</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,999</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">95</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">97.38</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">7</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">V1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,300,221</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">6,372</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2,114</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">73.91</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">86.05</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">186</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">V2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2,001,984</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">9</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">3,177</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,605</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98.96</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">99.38</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                            </tr>
                        </tbody>
                    </table>
                    <table-wrap-foot>
                        <fn>
                            <p>

                                <bold>Core</bold>: number of species with frequency greater than 1&#x2030;. 
                                <bold>N species</bold>: number of species identified in the sample, <italic toggle="yes">i.e.</italic> species identified by one or more reads. 
                                <bold>Singletons</bold>: number of species identified in the sample by only one read. 
                                <bold>% Top 20</bold>: percentage of reads assigned to the 20 most abundant species. 
                                <bold>% Top 100</bold>: percentage of reads assigned to the 100 most abundant species. 
                                <bold>S90</bold>: number of species accounting for 90% of the reads.</p>
                        </fn>
                    </table-wrap-foot>
                </table-wrap>
                <p>The number of reads obtained in the samples selected for the present study ranged from slightly more than 1 million (sample V1) to more than 12 million (sample F1). The number of species identified in each sample was very high, ranging from 2,508 in sample B1 to 29,661 in sample F1. However, the 20 most abundant species accounted for a large proportion of the reads in each sample, from 74.62% in M2 to 99.75% in B1, and the 100 most abundant species accounted for 84.7% in M2 and 99.8% in B1. In sample A1, 98.8% of the reads were assigned to the 20 declared species, and only 1.2% of reads were either unassigned or incorrectly attributed to other species. To ensure that our conclusions have general validity, we selected samples originating from very different sources with different compositions, and sequenced them at different depths. 
                    <xref ref-type="fig" rid="f2">Figure 2</xref> summarizes the composition of each sample at the phylum level. Viruses are aggregated at the division level. Only phyla more abundant than 1% were plotted. Reads that were either unclassified or assigned to rare phyla were aggregated under the name &#x201c;Unknown/Other&#x201d;. Samples B1, B2 and M3 were mainly composed of Chordata, sample M1 was mostly composed of Basidiomycota, and sample V2 was mainly composed of Viruses. Samples F1, F2 and, to a lesser extent, M2 were characterized by a large proportion of reads that were unclassified or assigned to rare phyla. For a more detailed view of the raw taxonomic composition, interactive HTML Krona plots are available for download on Open Science Framework (
                    <ext-link ext-link-type="uri" xlink:href="https://osf.io/y7c39/">https://osf.io/y7c39/</ext-link>), under the project &#x201c;Do you cov me&#x201d;, DOI: 10.17605/OSF.IO/Y7C39.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>Phylum composition of the samples.</title>
                        <p>Only phyla represented by at least 1% of the reads are shown. Viruses are presented at division level. Unclassified reads and reads assigned to rare phyla are aggregated under the name &#x201c;Unknown/Other&#x201d;.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure2.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Mock community analysis</title>
                <p>The mock community sample &#x201c;20 Strain Staggered Mix Genomic Material&#x201d; (ATCC
                    <sup>&#x00ae;</sup> MSA-1003
                    <sup>TM</sup>) was used as a reference to assess the performance of the sequencing and classification procedures at various depths. The community includes a total of 20 bacterial species, of which 5 have a frequency of 0.02%, 5 a frequency of 0.18%, 5 a frequency of 1.8% and 5 a frequency of 18%.</p>
                <p>
                    <xref ref-type="fig" rid="f3">Figure 3</xref> shows the scatterplot (in logarithmic scale) of the observed and expected abundance of organisms of the mock community at different taxonomic levels, from species to phylum, when using the full dataset (4.9 M reads). The correlation between the two measures is high even at the species level (r=0.87), and increases at higher taxonomic levels, reaching 0.95 at the class and phylum levels.</p>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>Figure 3. </label>
                    <caption>
                        <title>Log-log scatterplot of observed and expected abundance of bacterial organisms present in the mock community &#x201c;20 Strain Staggered Mix Genomic Material&#x201d; (ATCC&#x00ae; MSA-1003
                            <sup>TM</sup>).</title>
                        <p>In red 
                            <italic toggle="yes">Actinomyces odontolyticus</italic> identified at frequency &lt;0.002%, arbitrarily plotted at 0.002%.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure3.gif"/>
                </fig>
                <p>The correlation between expected and observed abundances of the 20 mock species remained high when decreasing sequencing depth, and Pearson&#x2019;s correlation coefficient remained stable at 0.87 at all the investigated sequencing depths. Results for the whole depth, 1,000,000 reads, 25,000 reads and 10,000 reads, together with 95% confidence intervals, are shown in 
                    <xref ref-type="fig" rid="f4">Figure 4</xref>.</p>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>Figure 4. </label>
                    <caption>
                        <title>Log-log scatterplot of observed and expected abundance of bacterial species present in the mock community &#x201c;20 Strain Staggered Mix Genomic Material&#x201d; (ATCC
                            <sup>&#x00ae;</sup> MSA-1003
                            <sup>TM</sup>) at varying sequencing depths.</title>
                        <p>In red 
                            <italic toggle="yes">Actinomyces odontolyticus</italic> identified at frequency &lt;0.002%, arbitrarily plotted at 0.002%. Error bars represent 95% confidence intervals obtained from five resampling experiments.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure4.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Diversity analysis</title>
                <p>
                    <xref ref-type="fig" rid="f5">Figure 5</xref> shows the variation of several summary statistics as a function of the number of reads used for the analysis, from the smallest subset (10,000 reads) on the left to the full dataset on the right. Panels A and B show the observed number of taxa and the value of Chao1 (expected number of taxa), respectively. The two measures show very similar trends, with a swift decrease in horse feces (F1 and F2) when going from the full set to 1,000,000 reads, and a relatively slow decrease in all other samples and subsets.</p>
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>Figure 5. </label>
                    <caption>
                        <p>Effect of reduction of sequencing depth on: 
                            <bold>A</bold>) Observed number of taxa, 
                            <bold>B</bold>) Chao1 estimated number of taxa, 
                            <bold>C</bold>) Good&#x2019;s Coverage, 
                            <bold>D</bold>) Shannon&#x2019;s diversity index, 
                            <bold>E</bold>) Pielou&#x2019;s diversity index, and 
                            <bold>F</bold>) Total length of 
                            <italic toggle="yes">de novo</italic> assembly. In all panels X axis is in log scale and Y axis is in linear scale with the exception of panel F, in which both axes are in log scale. Shaded areas represent the confidence limits of resampling experiments. &#x201c;Full&#x201d; represents the values obtained with the full set of reads.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure5.gif"/>
                </fig>
                <p>Downsampling has different effects on the observed and estimated number of species in different samples. For most samples, even substantial downsampling led to only a slight reduction in the estimated species richness. However, for samples F1 and F2, characterized by a high number of species including rare ones, downsampling led to a marked reduction (panels A and B). Good&#x2019;s coverage (panel C) remained nearly constant when more than 100,000 reads were sequenced. Lower sequencing depth led to a decrease in Good&#x2019;s coverage, especially for samples F1, F2, M2 and V1.</p>
                <p>Shannon&#x2019;s diversity index (panel D) is a widely used method to assess biological diversity of ecological and microbiological communities. The effect of sequencing depth on Shannon&#x2019;s diversity index is negligible for all samples.</p>
                <p>Pielou&#x2019;s index (panel E) is a measure of the species&#x2019; distribution evenness. Values close to 1 denote equifrequent species, and lower values denote uneven distribution of species relative abundance. The effect of the number of reads on Pielou&#x2019;s index is moderate.</p>
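                <p>For readers who wish to reproduce these indices outside R, the following Python sketch uses the textbook definitions: Shannon&#x2019;s index as the negative sum of p ln p over species, Pielou&#x2019;s index as Shannon&#x2019;s index divided by the natural logarithm of the number of observed species, and Good&#x2019;s coverage as one minus the ratio of singletons to total reads. These formulas and the function name are assumptions of this sketch; the values reported here were computed with vegan or base R.</p>
                <preformat>
```python
import math


def diversity_indices(counts):
    """Return (shannon, pielou, goods_coverage) for a list of read counts."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    richness = len(positive)

    # Shannon's index with natural logarithm.
    shannon = -sum((c / total) * math.log(c / total) for c in positive)

    # Pielou's evenness is undefined for a single species.
    pielou = shannon / math.log(richness) if richness > 1 else float("nan")

    # Good's coverage: share of reads not belonging to singleton species.
    singletons = sum(1 for c in positive if c == 1)
    goods_coverage = 1.0 - singletons / total

    return shannon, pielou, goods_coverage
```
                </preformat>
                <p>Four equally abundant species give a Pielou&#x2019;s index of 1, whereas a sample dominated by one species yields a value close to 0.</p>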
            </sec>
            <sec>
                <title>Species abundance and detection threshold</title>
                <p>
                    <xref ref-type="fig" rid="f6">Figure 6</xref> shows the correlation in species abundance estimation between the full dataset and a reduced dataset of 100,000 reads. The linear correlation coefficient between the two datasets was &gt;0.99 in all five subsampling replicates. The plot is in log-log scale to emphasize differences in low abundance species. Only the relative abundance estimation of species with frequencies lower than 0.01% (
                    <italic toggle="yes">i.e.</italic> species represented by 1 read out of 10,000) was affected by subsampling. The same pattern was observed in all examined samples.</p>
                <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                    <label>Figure 6. </label>
                    <caption>
                        <title>Correlation of species abundance estimated using the full dataset and a set composed of 100,000 reads.</title>
                        <p>Data for all five subsampled replicates are plotted. Each point (colored by sample of origin) represents a given species. Both axes are plotted in log scale to facilitate visualization of low abundance species. A red box encompasses datapoints of species that were present in the full set and absent in the reduced set, for which the frequency in the reduced set was set at &#x201c;&lt;=0.001%&#x201d;.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure6.gif"/>
                </fig>
                <p>In 
                    <xref ref-type="fig" rid="f7">Figure 7</xref> we show the results obtained by reducing the number of sampled reads to 10,000 per sample. As observed at a depth of 100,000 reads, the linear correlation coefficient between species abundance estimates in the full and the reduced datasets was high (r&gt;0.95) for all samples and in all five subsampling replicates. Only rare species with frequencies lower than 1/1000 (0.1%) in the full dataset showed some deviation.</p>
                <fig fig-type="figure" id="f7" orientation="portrait" position="float">
                    <label>Figure 7. </label>
                    <caption>
                        <title>Correlation of species abundance estimated using the full dataset and a set composed of 10,000 reads.</title>
                        <p>Data for all five subsampled replicates are plotted. Each point (colored by sample of origin) represents a given species. Both axes are plotted in log scale to facilitate visualization of low abundance species. A red box encompasses datapoints of species that were present in the full set and absent in the reduced set, for which the frequency in the reduced set was set at &#x201c;&lt;0.01%&#x201d;.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure7.gif"/>
                </fig>
                <p>Since reducing the sequencing depth inevitably affects the ability to detect rare species, we determined the minimum frequency required for a species to be identified at each sequencing depth. This detection threshold at any given sequencing depth was defined as follows: a) for each sample, identify all the species that are present in all five subsampling replicates; b) among the species identified, for each sample select the one with the lowest frequency in the full dataset; c) average the lowest frequencies across all samples. 
                    <xref ref-type="table" rid="T2">Table 2</xref> shows the average and standard deviation of the detection threshold across the ten samples at each sequencing depth. At a depth of 10,000 reads, the detection threshold was 0.0124%. This means that species with frequencies higher than 0.0124% in the full dataset were consistently identified in the reduced datasets as well, while species with lower frequencies may be lost. At a depth of 1,000,000 reads the detection threshold was 0.00006% (
                    <italic toggle="yes">i.e.</italic> 0.6 reads per million).</p>
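                <p>Steps a) to c) above can be sketched in Python as follows. The data layout (a dictionary of full-dataset read counts per sample and one set of detected species per subsampling replicate) and the function names are assumptions made for illustration; the sketch returns frequencies as fractions, to be multiplied by 100 for percentages.</p>
                <preformat>
```python
from statistics import mean


def sample_threshold(full_counts, replicate_species):
    """Lowest full-dataset frequency among species found in every replicate."""
    total = sum(full_counts.values())
    # a) species present in all subsampling replicates of this sample
    always_found = set.intersection(*(set(r) for r in replicate_species))
    # b) lowest frequency among them, measured in the full dataset
    return min(full_counts[s] / total for s in always_found)


def detection_threshold(samples):
    """c) Average the per-sample thresholds.

    `samples` is a list of (full_counts, replicate_species) pairs
    (hypothetical layout for this sketch).
    """
    return mean(sample_threshold(full, reps) for full, reps in samples)
```
                </preformat>
                <p>In a toy sample where a species with a full-dataset frequency of 1% is missed by some replicates while one at 9% is always found, the per-sample threshold is 9%.</p>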
                <table-wrap id="T2" orientation="portrait" position="anchor">
                    <label>Table 2. </label>
                    <caption>
                        <title>Detection threshold as a function of the sequencing depth.</title>
                        <p>
                            <bold>N. reads:</bold> Number of reads. 
                            <bold>Detection threshold (%):</bold> Detection threshold averaged across the ten samples. 
                            <bold>SD(%):</bold> Standard deviation of the detection threshold.</p>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">N. reads</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Detection threshold (%)</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">SD (%)</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">10,000</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.01242</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.006312</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">25,000</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.00348</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.001719</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">50,000</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.00189</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.001078</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">100,000</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.00069</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.000536</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">250,000</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.0001</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">6.53E-05</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">500,000</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.00007</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">4.77E-05</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">1,000,000</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.00006</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">4.9E-05</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec>
                <title>Completeness of de novo assembly</title>
                <p>We investigated the effect of coverage reduction on the completeness of 
                    <italic toggle="yes">de novo</italic> assembly. We reconstructed the metagenome of the full and reduced datasets and compared the completeness of the reconstructed genomes. Results are summarized in 
                    <xref ref-type="fig" rid="f5">Figure 5</xref> (panel F). As expected, the size of the assembly was strongly influenced by the sequencing depth. Assembly size for the full dataset ranged from less than 1 Mb (V2) to nearly 100 Mb (F1 and F2). A decrease in the sequencing depth led to a steady decrease in assembly size in all samples. At 1,000,000 reads the size ranged from slightly more than 100 kb (V2) to slightly more than 10 Mb (A1 and M1).</p>
                <p>BUSCO analysis
                    <sup>
                        <xref ref-type="bibr" rid="ref-36">36</xref>
                    </sup> was used as an additional measure to assess the completeness of the reconstructed metagenome. The proportion of reconstructed genes in full (X axis) and reduced (Y axis) datasets obtained by randomly sampling 1,000,000 reads is shown in 
                    <xref ref-type="fig" rid="f8">Figure 8</xref>. In samples A1 and M1, on average 80% of the BUSCO genes were reconstructed in the full dataset. Reducing sequencing depth to 1,000,000 reads lowered the proportion of reconstructed genes in the two samples to 50% or less. In the remaining samples the proportion of reconstructed genes was very low even in the full dataset, and the reduction of sequencing depth did not substantially alter the proportion.</p>
            </sec>
        </sec>
        <sec sec-type="discussion">
            <title>Discussion</title>
            <p>We set out to test the effect of the reduction of sequencing depth on 1) estimates of diversity and species richness; 2) estimates of abundance of the species present, and 3) completeness of 
                <italic toggle="yes">de novo</italic> reconstruction of the genomes of the species present in complex matrices. We selected ten heterogeneous samples, all of which underwent whole-genome DNA sequencing. This was also the case for vaccine samples B1 and B2, several components of which are ssRNA viruses that could not be detected using this approach. Indeed, the determination of the ssRNA components in vaccines was not the aim of the present study.</p>
            <p>We started by determining the general characteristics of our samples. All samples proved to be mixtures of a large number of species, nearly half of which were singletons (
                <italic toggle="yes">i.e.</italic> represented by one read). The control sample A1 comprised 2,572 species, whereas it should contain only 20. However, the A1 core set (species with a frequency of at least 0.1%) was made up of 16 species. Based on product specifications, 15 species in the mock community had a frequency greater than 0.1% and we observed all of them. In addition, we erroneously identified 
                <italic toggle="yes">Staphylococcus lugdunensis</italic> (with a frequency of 0.11%), probably due to misclassification of other 
                <italic toggle="yes">Staphylococcus</italic> reads. We devised the S90 measure, which reports the number of species (sorted by decreasing abundance) accounting for 90% of the reads. For several samples the S90 is less than 10, while for highly complex matrices such as F1 and F2 it is 2,795 and 2,947, respectively. The abundance of rare species might be genuine for samples with very high complexity such as feces. Still, species represented by only one read are unlikely to be real. A proportion of singleton species probably originates from sequencing errors and/or from errors in the classification against the database. In addition, especially for low-input samples, it is possible that contaminants in laboratory reagents artificially increase the number of observed species
                <sup>
                    <xref ref-type="bibr" rid="ref-38">38</xref>,
                    <xref ref-type="bibr" rid="ref-39">39</xref>
                </sup>. Nevertheless, determining the relative and absolute contribution of these biases to metagenomics studies is beyond the scope of the present paper.</p>
            <p>The choice of the database against which sequences are matched can affect results. In the present study, we matched our sequences against the full NCBI nt database, because this allows reads belonging to any organism to be classified. However, this might come at some cost in accuracy. As an example, in the vaccine sample B1, we identified 61 reads attributed to 
                <italic toggle="yes">Elaeophora elaphi</italic>, a nematode found only as a parasite of the liver of deer
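            <p>The S90 measure and the proportion of singleton species discussed above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names s90 and singleton_fraction and the toy read counts are hypothetical, and the input is assumed to be a mapping from species name to number of assigned reads.</p>
            <code language="python">
```python
def s90(read_counts, fraction=0.9):
    """Return how many species, taken in decreasing order of abundance,
    are needed to account for `fraction` of all classified reads."""
    counts = sorted((c for c in read_counts.values() if c > 0), reverse=True)
    total = sum(counts)
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        if covered >= fraction * total:
            return rank
    return 0  # reached only when the sample contains no reads


def singleton_fraction(read_counts):
    """Fraction of observed species that are represented by exactly one read."""
    observed = [c for c in read_counts.values() if c > 0]
    if not observed:
        return 0.0
    return sum(1 for c in observed if c == 1) / len(observed)


# Hypothetical toy sample: species name -> number of assigned reads.
toy_sample = {"sp_a": 60, "sp_b": 25, "sp_c": 10, "sp_d": 3, "sp_e": 1, "sp_f": 1}

print(s90(toy_sample))                 # species needed to reach 90% of the reads
print(singleton_fraction(toy_sample))  # share of species seen only once
```
            </code>
            <p>In the toy sample the three most abundant species already account for more than 90% of the reads, while a third of the listed species are singletons; this mirrors, on a small scale, the contrast between a low S90 and a long tail of rare species.</p>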
                <sup>
                    <xref ref-type="bibr" rid="ref-40">40</xref>
                </sup>. It is therefore highly unlikely that such an organism is really present in the vaccine sample. When we repeated the analysis against the standard database, consisting only of 
                <italic toggle="yes">Homo sapiens</italic>, bacteria and viruses, 57 of the 61 reads were assigned to 
                <italic toggle="yes">Homo sapiens</italic> and the remaining 4 were unassigned (data not shown). Possible explanations are that a) some contamination from 
                <italic toggle="yes">Homo sapiens</italic> is present in the deposited sequence of 
                <italic toggle="yes">Elaeophora elaphi</italic>, or b) some reads belonging to 
                <italic toggle="yes">Homo sapiens</italic> are attributed by mistake to genuine 
                <italic toggle="yes">Elaeophora elaphi</italic> sequences.</p>
            <p>Such marginal misclassification problems do not affect the results of our study, but they clearly indicate that researchers should be very cautious when reporting contaminants or unexpected results from metagenomics studies.</p>
            <p>In our study we kept the read length constant at 125 bp across experiments. Previous studies (although limited to targeted approaches) showed the effect of read length on the evaluation of the composition of complex matrices
                <sup>
                    <xref ref-type="bibr" rid="ref-41">41</xref>
                </sup>. Even though an extensive assessment of the effect of read length on the ability to characterize complex matrices was beyond the scope of the present work, we compared the results obtained for the horse fecal samples (F1 and F2) with those obtained using 250 bp reads. The use of the shorter reads led to a strong increase in the proportion of unclassified reads, from 56% to 74% in F1 and from 58% to 75% in F2.</p>
            <p>We performed a benchmark of the entire workflow with the help of a mock community with known composition. By comparing the expected and observed relative abundance of the 20 bacterial species included in the mock community we concluded that the workflow is accurate at all taxonomic levels (
                <xref ref-type="fig" rid="f3">Figure 3</xref>). One species, 
                <italic toggle="yes">Actinomyces odontolyticus</italic>, with an expected frequency of 0.02%, was observed at a much lower frequency (&lt;0.002%, represented as a red dot in 
                <xref ref-type="fig" rid="f3">Figure 3</xref> and 
                <xref ref-type="fig" rid="f4">Figure 4</xref>). Other species showed only slight deviations from expected frequencies in our experiment. To the best of our knowledge, this is the first published work reporting the observed frequencies of a mock community using WGS. However, previous works performed very extensive studies on targeted 16S sequencing of mock communities and reported large deviations from expectation, depending on sequencing primers, extraction method and sequencing platform
                <sup>
                    <xref ref-type="bibr" rid="ref-42">42</xref>
                </sup>. We tested the effect of decreasing sequencing depth on deviations from expected frequency (
                <xref ref-type="fig" rid="f4">Figure 4</xref>) and observed that even when sampling 10,000 reads the average correlation between expected and observed abundances remained high (r=0.87), although the variance among resampling experiments was high.</p>
            <p>To assess the sequencing depth required for characterizing complex matrices, we measured the variation of several diversity indices while reducing the sequencing depth. We measured the number of observed taxa, Chao1 (or number of expected taxa), Good&#x2019;s coverage, Shannon&#x2019;s diversity index and Pielou&#x2019;s evenness index.</p>
            <p>The Chao1 estimator is obtained as</p>
            <p>
                <disp-formula>
                    <mml:math display="block" id="math1">
                        <mml:mrow>
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mrow>
                                    <mml:mi>C</mml:mi>
                                    <mml:mi>h</mml:mi>
                                    <mml:mi>a</mml:mi>
                                    <mml:mi>o</mml:mi>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mtext>&#x2009;</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:mspace width="0.3em"/>
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mrow>
                                    <mml:mi>O</mml:mi>
                                    <mml:mi>b</mml:mi>
                                    <mml:mi>s</mml:mi>
                                    <mml:mtext>&#x2009;</mml:mtext>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mo>+</mml:mo>
                            <mml:mtext>&#x2009;</mml:mtext>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>f</mml:mi>
                                        <mml:mn>1</mml:mn>
                                    </mml:msub>
                                    <mml:mo stretchy="false">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>f</mml:mi>
                                        <mml:mn>1</mml:mn>
                                    </mml:msub>
                                    <mml:mtext>&#x2009;</mml:mtext>
                                    <mml:mo>&#x2212;</mml:mo>
                                    <mml:mn>1</mml:mn>
                                    <mml:mo stretchy="false">)</mml:mo>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>2</mml:mn>
                                    <mml:mtext>&#x2009;</mml:mtext>
                                    <mml:mo stretchy="false">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>f</mml:mi>
                                        <mml:mn>2</mml:mn>
                                    </mml:msub>
                                    <mml:mtext>&#x2009;</mml:mtext>
                                    <mml:mo>+</mml:mo>
                                    <mml:mn>1</mml:mn>
                                    <mml:mo stretchy="false">)</mml:mo>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:mrow>
                    </mml:math>
                </disp-formula>
            </p>
            <p>where 
                <italic toggle="yes">S
                    <sub>obs</sub>
                </italic> is the number of observed species in the sample, 
                <italic toggle="yes">f
                    <sub>1</sub>
</italic> is the number of species observed once, and 
                <italic toggle="yes">f
                    <sub>2</sub>
                </italic> is the number of species observed twice.</p>
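            <p>As a worked illustration of the formula above, the following minimal sketch computes the Chao1 estimate from a list of per-species read counts. The function name chao1 and the toy counts are hypothetical and are not taken from the pipeline used in the study.</p>
            <code language="python">
```python
def chao1(read_counts):
    """Chao1 richness estimate in the form given above:
    S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))."""
    counts = [c for c in read_counts if c > 0]
    s_obs = len(counts)                      # number of observed species
    f1 = sum(1 for c in counts if c == 1)    # species observed once
    f2 = sum(1 for c in counts if c == 2)    # species observed twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical toy sample: one entry per species, value = number of reads.
toy_counts = [10, 5, 2, 2, 1, 1, 1]

# 7 observed species, f1 = 3, f2 = 2, so the estimate is 7 + (3 * 2) / (2 * 3).
print(chao1(toy_counts))
```
            </code>
            <p>Because the correction term depends only on the species seen once or twice, the estimate reacts strongly to the small fraction of reads assigned to rare taxa.</p>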
            <p>Under our experimental conditions, the number of observed and estimated taxa followed similar trends. Both of them were heavily affected by the small proportion of reads attributed to unique or rare taxa.</p>
            <p>Good&#x2019;s coverage (G) is defined as</p>
            <p>
                <disp-formula id="e2">
                    <mml:math display="block" id="math2">
                        <mml:mrow>
                            <mml:mi>G</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mn>1</mml:mn>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>f</mml:mi>
                                        <mml:mn>1</mml:mn>
                                    </mml:msub>
                                </mml:mrow>
                                <mml:mi>N</mml:mi>
                            </mml:mfrac>
                        </mml:mrow>
                    </mml:math>
                </disp-formula>
            </p>
            <p>where 
                <italic toggle="yes">f
                    <sub>1</sub>
                </italic> is the number of singletons and N is the total number of reads. G is heavily affected by the sequencing depth. Significant variation in G is observed when using 100,000 reads or fewer.</p>
            <p>Shannon&#x2019;s diversity index is estimated as</p>
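            <p>The definition of G can be illustrated with a short sketch. The function name goods_coverage and the two toy samples are hypothetical; the samples are chosen so that both contain two singletons but differ in the total number of reads.</p>
            <code language="python">
```python
def goods_coverage(read_counts):
    """Good's coverage G = 1 - f1 / N, where f1 is the number of singleton
    species and N is the total number of reads."""
    counts = [c for c in read_counts if c > 0]
    n_reads = sum(counts)
    if n_reads == 0:
        raise ValueError("Good's coverage is undefined for a sample with no reads")
    f1 = sum(1 for c in counts if c == 1)
    return 1 - f1 / n_reads


# Hypothetical toy samples: one entry per species, value = number of reads.
deep_sample = [500, 300, 150, 40, 6, 2, 1, 1]   # 1,000 reads, 2 singletons
shallow_sample = [5, 3, 1, 1]                   # 10 reads, 2 singletons

print(goods_coverage(deep_sample))     # 1 - 2/1000
print(goods_coverage(shallow_sample))  # 1 - 2/10
```
            </code>
            <p>With the same number of singletons, the sample with fewer reads has a markedly lower coverage, which is one way to see why G is sensitive to sequencing depth.</p>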
            <p>
                <disp-formula>
                    <mml:math display="block" id="math3">
                        <mml:mrow>
                            <mml:mi>H</mml:mi>
                            <mml:mspace width="0.3em"/>
                            <mml:mo>=</mml:mo>
                            <mml:mspace width="0.3em"/>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:mspace width="0.3em"/>
                            <mml:mstyle displaystyle="true">
                                <mml:munderover>
                                    <mml:mo>&#x2211;</mml:mo>
                                    <mml:mrow>
                                        <mml:mi>i</mml:mi>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mo>=</mml:mo>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mn>1</mml:mn>
                                    </mml:mrow>
                                    <mml:mi>N</mml:mi>
                                </mml:munderover>
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mi>p</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mo>*</mml:mo>
                                    <mml:mspace width="0.3em"/>
                                    <mml:mi>ln</mml:mi>
                                    <mml:mo>&#x2061;</mml:mo>
                                    <mml:mo stretchy="false">(</mml:mo>
                                    <mml:msub>
                                        <mml:mi>p</mml:mi>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mo stretchy="false">)</mml:mo>
                                </mml:mrow>
                            </mml:mstyle>
                            <mml:mtext>&#x2009;</mml:mtext>
                        </mml:mrow>
                    </mml:math>
                </disp-formula>
            </p>
            <p>where 
                <italic toggle="yes">N</italic> is the total number of species and 
                <italic toggle="yes">p
                    <sub>i</sub>
                </italic> is the frequency of the species 
                <italic toggle="yes">i</italic>. Thus, Shannon&#x2019;s diversity index is affected more by variation in the frequencies of highly abundant species than by the loss of rare species. In our study, Shannon&#x2019;s index was very stable across sample sizes.</p>
            <p>Pielou&#x2019;s evenness index is estimated as</p>
            <p>
                <disp-formula id="e4">
                    <mml:math display="block" id="math4">
                        <mml:mrow>
                            <mml:mi>J</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mi>H</mml:mi>
                                <mml:mrow>
                                    <mml:mtext>ln</mml:mtext>
                                    <mml:mspace width="0.1em"/>
                                    <mml:mi>S</mml:mi>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:mrow>
                    </mml:math>
                </disp-formula>
            </p>
            <p>where 
                <italic toggle="yes">H</italic> is Shannon&#x2019;s diversity index and S is the total number of observed species. The value 
                <italic toggle="yes">ln S</italic> corresponds to the maximum possible value of 
                <italic toggle="yes">H,</italic> observed when all species have the same frequency, thus Pielou&#x2019;s index approaches 1 when all the species are evenly distributed. In our study, Pielou&#x2019;s index showed a slight increase as the number of sampled reads decreased.</p>
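            <p>The behavior of the two frequency-based indices can be illustrated with a minimal sketch. The function names shannon_index and pielou_evenness and the toy counts are hypothetical; returning 0 for a sample with a single species, where ln S is zero, is an assumption made only for this example.</p>
            <code language="python">
```python
import math


def shannon_index(read_counts):
    """Shannon's diversity index H = -sum(p_i * ln(p_i)) over observed species."""
    counts = [c for c in read_counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


def pielou_evenness(read_counts):
    """Pielou's evenness J = H / ln(S), with S the number of observed species."""
    n_species = sum(1 for c in read_counts if c > 0)
    if n_species > 1:
        return shannon_index(read_counts) / math.log(n_species)
    return 0.0  # assumption: J is reported as 0 when ln(S) would be zero


# Hypothetical toy samples: one entry per species, value = number of reads.
even_sample = [25, 25, 25, 25]        # all species equally frequent
dominated_sample = [97, 1, 1, 1]      # one species dominates
with_rare = [500, 300, 200, 1, 1, 1]  # three common species plus three singletons
without_rare = [500, 300, 200]        # same sample after losing the singletons

for name, sample in [("even", even_sample), ("dominated", dominated_sample),
                     ("with rare", with_rare), ("without rare", without_rare)]:
    print(name, round(shannon_index(sample), 3), round(pielou_evenness(sample), 3))
```
            </code>
            <p>In this toy example, removing the three singletons changes H only marginally, whereas J increases because ln S decreases; this is consistent with the stability of Shannon&#x2019;s index and the increase in Pielou&#x2019;s index observed when rare species are lost at lower depth.</p>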
            <p>Horse fecal samples F1 and F2 are characterized by a very large number of observed species (29,660 and 25,607, respectively), while all the other samples have a lower number of species, ranging from 2,507 in B1 to 9,637 in M2. Measures such as the number of observed taxa and Chao1 capture these differences, showing that F1 and F2 have greater diversity estimates. Shannon&#x2019;s and Pielou&#x2019;s indices, on the contrary, rely on the frequency distribution of the species. Therefore, samples that have a relatively high number of common species with comparable frequencies tend to have high Shannon&#x2019;s diversity indices. Samples dominated by a single species (such as M1) have very low Shannon&#x2019;s and Pielou&#x2019;s indices. The effect of sequencing depth on nearly all indices is moderate; we thus conclude that biological matrices with different levels of complexity, composed of different admixtures of prokaryotes, eukaryotes and viruses, can be satisfactorily characterized via WGS even at sequencing depths lower than 1,000,000 reads.</p>
            <p>We then set out to assess the changes in the estimated relative frequency of each individual species when reducing the number of sequenced reads. Accurate estimation of the relative abundance of each species is important when the aim is a) to detect species with a relative abundance above any given threshold, b) to differentiate two samples based on the different abundance of any given species, or c) to cluster samples based on their species composition. Our results show that, even in the case of a substantial reduction in the number of sequenced reads, species abundances as low as 0.1% can be reliably estimated (
                <xref ref-type="fig" rid="f6">Figure 6</xref> and 
                <xref ref-type="fig" rid="f7">Figure 7</xref>).</p>
            <p>In addition, we aimed to determine the threshold of detection for rare species at low sequencing depth. This statistic is of interest when researchers want to detect the presence of a species that might be rare in the sample. Our results show that even very rare species can be accurately detected at low sequencing depth. When subsampling 1,000,000 reads, the frequency threshold for a species to be detected in the reduced sample was measured as 60 reads out of 1,000,000 (0.00006%). Even when the number of reads was unrealistically low (10,000), rare species could still be detected, with a detection threshold estimated to be 0.012%. While the detection threshold can vary according to sample characteristics, we can assume that for most samples rare species can be accurately detected even at low sequencing depth.</p>
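            <p>For planning purposes, the relation between sequencing depth and the detectability of a rare species can be approximated with a simple sampling model. The sketch below is not the empirical resampling procedure used in this study: it assumes that reads are drawn independently, so that a species with relative frequency p is seen at least once among n reads with probability 1 &#x2212; (1 &#x2212; p)^n. The function names and the 95% detection probability are hypothetical choices made for illustration.</p>
            <code language="python">
```python
import math


def detection_probability(frequency, n_reads):
    """Probability of observing at least one read of a species with the given
    relative frequency among n_reads independently drawn reads."""
    if frequency >= 1:
        return 1.0
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_reads * math.log1p(-frequency))


def min_detectable_frequency(n_reads, probability=0.95):
    """Smallest relative frequency detected with at least `probability`
    (strictly between 0 and 1) under the same independence assumption."""
    return -math.expm1(math.log1p(-probability) / n_reads)


# Illustrative depths; the thresholds printed here come from the model only.
for depth in (10_000, 100_000, 1_000_000):
    threshold = min_detectable_frequency(depth)
    print(depth, f"{100 * threshold:.5f}%")
```
            </code>
            <p>Under this model the detectable frequency scales roughly inversely with the number of reads. Thresholds measured on real samples may differ, because classification errors and sample composition are not captured by the model.</p>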
            <p>Finally, we assessed the effect of a reduction in the sequencing coverage on the accuracy of 
                <italic toggle="yes">de novo</italic> assembly of the metagenome. Our results show that downsampling had a strongly negative effect on the total length of the reconstructed metagenome and on the proportion of reconstructed genes (
                <xref ref-type="fig" rid="f5">Figure 5F</xref> and 
                <xref ref-type="fig" rid="f8">Figure 8</xref>).</p>
            <fig fig-type="figure" id="f8" orientation="portrait" position="float">
                <label>Figure 8. </label>
                <caption>
                    <title>Completeness of the BUSCO genes in the full dataset (X axis) and in the largest of the reduced datasets (consisting of 1,000,000 reads, Y axis); error bars are based on the five replicate experiments performed for each sample.</title>
                    <p>The plot is in log-log scale.</p>
                </caption>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/20298/b1002bc6-73ef-4b88-a6ab-c4309ee9c5ca_figure8.gif"/>
            </fig>
            <p>BUSCO is widely used for assessing the completeness of genome and transcriptome assemblies for individual organisms, and has benchmark datasets for several lineages. Our results clearly indicate that even 1,000,000 reads is a suboptimal depth for fully sampling the genes present in complex matrices. This observation needs to be taken into account at the experimental design stage. Our conclusions are also important for researchers interested in reconstructing a specific part of the metagenome, such as genes involved in antibiotic resistance
                <sup>
                    <xref ref-type="bibr" rid="ref-43">43</xref>
                </sup>. The decrease in performance observed in gene reconstruction will likely be observed for any gene category. Researchers aiming at a 
                <italic toggle="yes">de novo</italic> reconstruction of the metagenome (even a partial one) must keep in mind that several million reads are needed to attain reliable results. In addition, the proportion of genes reconstructed with BUSCO in the full dataset was very low for all samples, with the exception of M1, predominantly composed of one fungal species, and A1, composed of a limited number of small genomes. These results indicate that a complete reconstruction of the metagenome of a complex matrix requires at least several million reads. In the present work we tested the feasibility of using shallow metagenome shotgun high-throughput sequencing to analyze complex samples for the presence of eukaryotic, prokaryotic and viral nucleic acids for monitoring, surveillance, quality control and traceability purposes. We show that, if the aim of the experiment is the taxonomic characterization of the sample or the identification and quantification of species, low-coverage WGS is a good choice. On the other hand, if one of the aims of the study relies on 
                <italic toggle="yes">de novo</italic> assembly, substantial sequencing efforts are required. The number of reads required for the reconstruction of the metagenome depends on several factors, such as the number of species in the sample, their genome size and abundance, and the length of the sequencing reads. An estimation needs to be performed for each experiment based on specific goals and sample characteristics.</p>
        </sec>
        <sec>
            <title>Data availability</title>
            <sec>
                <title>Underlying data</title>
                <p>Raw reads generated in the present study are available at NCBI Sequence Read Archive.</p>
                <p>Sample A1 is available under accession number 
                    <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra/?term=SRP174028">SRP174028</ext-link>: 
                    <ext-link ext-link-type="uri" xlink:href="https://identifiers.org/insdc.sra/SRP174028">https://identifiers.org/insdc.sra/SRP174028</ext-link>.</p>
                <p>Samples F1 and F2 are available under accession number 
                    <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra/?term=SRP163102">SRP163102</ext-link>: 
                    <ext-link ext-link-type="uri" xlink:href="https://identifiers.org/insdc.sra/SRP163102">https://identifiers.org/insdc.sra/SRP163102</ext-link>.</p>
                <p>Samples B1 and B2 are available under accession number 
                    <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra/?term=SRP163096">SRP163096</ext-link>: 
                    <ext-link ext-link-type="uri" xlink:href="https://identifiers.org/insdc.sra/SRP163096">https://identifiers.org/insdc.sra/SRP163096</ext-link>.</p>
                <p>Samples M1, M2 and M3 are available under accession number 
                    <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra/?term=SRP163007">SRP163007</ext-link>: 
                    <ext-link ext-link-type="uri" xlink:href="https://identifiers.org/insdc.sra/SRP163007">https://identifiers.org/insdc.sra/SRP163007</ext-link>.</p>
            </sec>
            <sec>
                <title>Extended data</title>
                <p>Open Science Framework: Do you cov me. 
                    <ext-link ext-link-type="uri" xlink:href="https://dx.doi.org/10.17605/OSF.IO/Y7C39">https://doi.org/10.17605/OSF.IO/Y7C39</ext-link>
                    <sup>
                        <xref ref-type="bibr" rid="ref-44">44</xref>
                    </sup>.</p>
                <p>This project contains the raw HTML graphs produced using Krona.</p>
            </sec>
            <sec>
                <title>Software availability</title>
                <p>The pipeline for performing the standard analysis included in this work is available from: 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/fabiomarroni/doyoucovme">https://github.com/fabiomarroni/doyoucovme</ext-link>.</p>
                <p>Archived code at time of publication: 
                    <ext-link ext-link-type="uri" xlink:href="https://dx.doi.org/10.5281/zenodo.2593798">https://doi.org/10.5281/zenodo.2593798</ext-link>
                    <sup>
                        <xref ref-type="bibr" rid="ref-27">27</xref>
                    </sup>.</p>
                <p>License: 
                    <ext-link ext-link-type="uri" xlink:href="https://opensource.org/licenses/GPL-3.0">GNU GPL-3.0</ext-link>.</p>
            </sec>
        </sec>
    </body>
    <back>
        <ack>
            <title>Acknowledgments</title>
            <p>The authors would like to thank Dr Loretta Bolgan for fruitful scientific discussions and Corvelva (non-profit association, Veneto, Italy) for permission to use their metagenome sequencing data (samples B1 and B2) for the purposes of this paper; Dr Federica Cattapan (M&#x00e9;rieux NutriSciences Italia and Chelab S.r.l., Italy) for providing the DNA of samples M1, M2 and M3; and Dr Carol Hughes (Phytorigins Ltd., United Kingdom) for providing the biological samples F1 and F2. We thank both of them for permission to use their samples for whole metagenome sequencing and analysis.</p>
        </ack>
        <ref-list>
            <ref id="ref-1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Quince</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Walker</surname>
                            <given-names>AW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Simpson</surname>
                            <given-names>JT</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Shotgun metagenomics, from sampling to analysis.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2017</year>;<volume>35</volume>(<issue>9</issue>):<fpage>833</fpage>&#x2013;<lpage>44</lpage>.
                    <pub-id pub-id-type="pmid">28898207</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3935</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Forbes</surname>
                            <given-names>JD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Knox</surname>
                            <given-names>NC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ronholm</surname>
                            <given-names>J</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Metagenomics: The Next Culture-Independent Game Changer.</article-title>
                    <source>

                        <italic toggle="yes">Front Microbiol.</italic>
</source>
                    <year>2017</year>;<volume>8</volume>:<fpage>1069</fpage>.
                    <pub-id pub-id-type="pmid">28725217</pub-id>
                    <pub-id pub-id-type="doi">10.3389/fmicb.2017.01069</pub-id>
                    <pub-id pub-id-type="pmcid">5495826</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bragg</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Tyson</surname>
                            <given-names>GW</given-names>
                        </name>
</person-group>:
                    <article-title>Metagenomics using next-generation sequencing.</article-title>
                    <source>

                        <italic toggle="yes">Methods Mol Biol.</italic>
</source>
                    <year>2014</year>;<volume>1096</volume>:<fpage>183</fpage>&#x2013;<lpage>201</lpage>.
                    <pub-id pub-id-type="pmid">24515370</pub-id>
                    <pub-id pub-id-type="doi">10.1007/978-1-62703-712-9_15</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Desai</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Antonopoulos</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gilbert</surname>
                            <given-names>JA</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>From genomics to metagenomics.</article-title>
                    <source>

                        <italic toggle="yes">Curr Opin Biotechnol.</italic>
</source>
                    <year>2012</year>;<volume>23</volume>(<issue>1</issue>):<fpage>72</fpage>&#x2013;<lpage>6</lpage>.
                    <pub-id pub-id-type="pmid">22227326</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.copbio.2011.12.017</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sunagawa</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Coelho</surname>
                            <given-names>LP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chaffron</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Ocean plankton. Structure and function of the global ocean microbiome.</article-title>
                    <source>

                        <italic toggle="yes">Science.</italic>
</source>
                    <year>2015</year>;<volume>348</volume>(<issue>6237</issue>):<fpage>1261359</fpage>.
                    <pub-id pub-id-type="pmid">25999513</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.1261359</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wilhelm</surname>
                            <given-names>RC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cardenas</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Leung</surname>
                            <given-names>H</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A metagenomic survey of forest soil microbial communities more than a decade after timber harvesting.</article-title>
                    <source>

                        <italic toggle="yes">Sci Data.</italic>
</source>
                    <year>2017</year>;<volume>4</volume>:<fpage>170092</fpage>.
                    <pub-id pub-id-type="pmid">28765786</pub-id>
                    <pub-id pub-id-type="doi">10.1038/sdata.2017.92</pub-id>
                    <pub-id pub-id-type="pmcid">5525643</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hamady</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Knight</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>Microbial community profiling for human microbiome projects: Tools, techniques, and challenges.</article-title>
                    <source>

                        <italic toggle="yes">Genome Res.</italic>
</source>
                    <year>2009</year>;<volume>19</volume>(<issue>7</issue>):<fpage>1141</fpage>&#x2013;<lpage>52</lpage>.
                    <pub-id pub-id-type="pmid">19383763</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.085464.108</pub-id>
                    <pub-id pub-id-type="pmcid">3776646</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Qin</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Raes</surname>
                            <given-names>J</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A human gut microbial gene catalogue established by metagenomic sequencing.</article-title>
                    <source>

                        <italic toggle="yes">Nature.</italic>
</source>Nature Publishing Group;<year>2010</year>;<volume>464</volume>(<issue>7285</issue>):<fpage>59</fpage>&#x2013;<lpage>65</lpage>.
                    <pub-id pub-id-type="pmid">20203603</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nature08821</pub-id>
                    <pub-id pub-id-type="pmcid">3779803</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <collab>Human Microbiome Project Consortium</collab>:
                    <article-title>Structure, function and diversity of the healthy human microbiome.</article-title>
                    <source>

                        <italic toggle="yes">Nature.</italic>
</source>Nature Publishing Group;<year>2012</year>;<volume>486</volume>(<issue>7402</issue>):<fpage>207</fpage>&#x2013;<lpage>14</lpage>.
                    <pub-id pub-id-type="pmid">22699609</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nature11234</pub-id>
                    <pub-id pub-id-type="pmcid">3564958</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-10">
                <label>10</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Oh</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Byrd</surname>
                            <given-names>AL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Deming</surname>
                            <given-names>C</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Biogeography and individuality shape function in the human skin metagenome.</article-title>
                    <source>

                        <italic toggle="yes">Nature.</italic>
</source>Nature Publishing Group;<year>2014</year>;<volume>514</volume>(<issue>7520</issue>):<fpage>59</fpage>&#x2013;<lpage>64</lpage>.
                    <pub-id pub-id-type="pmid">25279917</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nature13786</pub-id>
                    <pub-id pub-id-type="pmcid">4185404</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-11">
                <label>11</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wilson</surname>
                            <given-names>MR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Suan</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Duggins</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A novel cause of chronic viral meningoencephalitis: Cache Valley virus.</article-title>
                    <source>

                        <italic toggle="yes">Ann Neurol.</italic>
</source>
                    <year>2017</year>;<volume>82</volume>(<issue>1</issue>):<fpage>105</fpage>&#x2013;<lpage>14</lpage>.
                    <pub-id pub-id-type="pmid">28628941</pub-id>
                    <pub-id pub-id-type="doi">10.1002/ana.24982</pub-id>
                    <pub-id pub-id-type="pmcid">5546801</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-12">
                <label>12</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wilson</surname>
                            <given-names>MR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Naccache</surname>
                            <given-names>SN</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Samayoa</surname>
                            <given-names>E</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Actionable diagnosis of neuroleptospirosis by next-generation sequencing.</article-title>
                    <source>

                        <italic toggle="yes">N Engl J Med.</italic>
</source>Massachusetts Medical Society;<year>2014</year>;<volume>370</volume>(<issue>25</issue>):<fpage>2408</fpage>&#x2013;<lpage>17</lpage>.
                    <pub-id pub-id-type="pmid">24896819</pub-id>
                    <pub-id pub-id-type="doi">10.1056/NEJMoa1401268</pub-id>
                    <pub-id pub-id-type="pmcid">4134948</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Greninger</surname>
                            <given-names>AL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Messacar</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dunnebacke</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Clinical metagenomic identification of 
                        <italic toggle="yes">Balamuthia mandrillaris</italic> encephalitis and assembly of the draft genome: the continuing case for reference genome sequencing.</article-title>
                    <source>

                        <italic toggle="yes">Genome Med.</italic>
</source>
                    <year>2015</year>;<volume>7</volume>(<issue>1</issue>):<fpage>113</fpage>.
                    <pub-id pub-id-type="pmid">26620704</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13073-015-0235-2</pub-id>
                    <pub-id pub-id-type="pmcid">4665321</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Forbes</surname>
                            <given-names>JD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Knox</surname>
                            <given-names>NC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Peterson</surname>
                            <given-names>CL</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Highlighting Clinical Metagenomics for Enhanced Diagnostic Decision-making: A Step Towards Wider Implementation.</article-title>
                    <source>

                        <italic toggle="yes">Comput Struct Biotechnol J.</italic>
</source>Elsevier;<year>2018</year>;<volume>16</volume>:<fpage>108</fpage>&#x2013;<lpage>20</lpage>.
                    <pub-id pub-id-type="pmid">30026887</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.csbj.2018.02.006</pub-id>
                    <pub-id pub-id-type="pmcid">6050174</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Mayo</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rachid</surname>
                            <given-names>CT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Alegria</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Impact of next generation sequencing techniques in food microbiology.</article-title>
                    <source>

                        <italic toggle="yes">Curr Genomics.</italic>
</source>
                    <year>2014</year>;<volume>15</volume>(<issue>4</issue>):<fpage>293</fpage>&#x2013;<lpage>309</lpage>.
                    <pub-id pub-id-type="pmid">25132799</pub-id>
                    <pub-id pub-id-type="doi">10.2174/1389202915666140616233211</pub-id>
                    <pub-id pub-id-type="pmcid">4133952</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Oniciuc</surname>
                            <given-names>EA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Likotrafiti</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Alvarez-Molina</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The Present and Future of Whole Genome Sequencing (WGS) and Whole Metagenome Sequencing (WMS) for Surveillance of Antimicrobial Resistant Microorganisms and Antimicrobial Resistance Genes across the Food Chain.</article-title>
                    <source>

                        <italic toggle="yes">Genes (Basel).</italic>
</source>
                    <year>2018</year>;<volume>9</volume>(<issue>5</issue>):<elocation-id>E268</elocation-id>.
                    <pub-id pub-id-type="pmid">29789467</pub-id>
                    <pub-id pub-id-type="doi">10.3390/genes9050268</pub-id>
                    <pub-id pub-id-type="pmcid">5977208</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Caporaso</surname>
                            <given-names>JG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lauber</surname>
                            <given-names>CL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Walters</surname>
                            <given-names>WA</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample.</article-title>
                    <source>

                        <italic toggle="yes">Proc Natl Acad Sci U S A.</italic>
</source>
                    <year>2011</year>;<volume>108</volume>(<supplement>Suppl 1</supplement>):<fpage>4516</fpage>&#x2013;<lpage>22</lpage>.
                    <pub-id pub-id-type="pmid">20534432</pub-id>
                    <pub-id pub-id-type="doi">10.1073/pnas.1000080107</pub-id>
                    <pub-id pub-id-type="pmcid">3063599</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Schoch</surname>
                            <given-names>CL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Seifert</surname>
                            <given-names>KA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Huhndorf</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for 
                        <italic toggle="yes">Fungi</italic>.</article-title>
                    <source>

                        <italic toggle="yes">Proc Natl Acad Sci U S A.</italic>
</source>National Academy of Sciences;<year>2012</year>;<volume>109</volume>(<issue>16</issue>):<fpage>6241</fpage>&#x2013;<lpage>6</lpage>.
                    <pub-id pub-id-type="pmid">22454494</pub-id>
                    <pub-id pub-id-type="doi">10.1073/pnas.1117018109</pub-id>
                    <pub-id pub-id-type="pmcid">3341068</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hugerth</surname>
                            <given-names>LW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Muller</surname>
                            <given-names>EE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hu</surname>
                            <given-names>YO</given-names>
                        </name>

                        <etal/>
</person-group>:
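                    <article-title>Systematic design of 18S rRNA gene primers for determining eukaryotic diversity in microbial consortia.</article-title><person-group person-group-type="editor"><name name-style="western"><surname>Voolstra</surname><given-names>CR</given-names></name></person-group>, editor.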
                    <source>

                        <italic toggle="yes">PLoS One.</italic>
</source>Public Library of Science;<year>2014</year>;<volume>9</volume>(<issue>4</issue>):<fpage>e95567</fpage>.
                    <pub-id pub-id-type="pmid">24755918</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pone.0095567</pub-id>
                    <pub-id pub-id-type="pmcid">3995771</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hebert</surname>
                            <given-names>PD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cywinska</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ball</surname>
                            <given-names>SL</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Biological identifications through DNA barcodes.</article-title>
                    <source>

                        <italic toggle="yes">Proc Biol Sci.</italic>
</source>
                    <year>2003</year>;<volume>270</volume>(<issue>1512</issue>):<fpage>313</fpage>&#x2013;<lpage>21</lpage>.
                    <pub-id pub-id-type="pmid">12614582</pub-id>
                    <pub-id pub-id-type="doi">10.1098/rspb.2002.2218</pub-id>
                    <pub-id pub-id-type="pmcid">1691236</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-21">
                <label>21</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Fazekas</surname>
                            <given-names>AJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kuzmina</surname>
                            <given-names>ML</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Newmaster</surname>
                            <given-names>SG</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>DNA barcoding methods for land plants.</article-title>
                    <source>

                        <italic toggle="yes">Methods Mol Biol.</italic>
</source>
                    <year>2012</year>;<volume>858</volume>:<fpage>223</fpage>&#x2013;<lpage>52</lpage>.
                    <pub-id pub-id-type="pmid">22684959</pub-id>
                    <pub-id pub-id-type="doi">10.1007/978-1-61779-591-6_11</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-22">
                <label>22</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Uyaguari-Diaz</surname>
                            <given-names>MI</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chan</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chaban</surname>
                            <given-names>BL</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A comprehensive method for amplicon-based and metagenomic characterization of viruses, bacteria, and eukaryotes in freshwater samples.</article-title>
                    <source>

                        <italic toggle="yes">Microbiome.</italic>
</source>BioMed Central;<year>2016</year>;<volume>4</volume>(<issue>1</issue>):<fpage>20</fpage>.
                    <pub-id pub-id-type="pmid">27391119</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s40168-016-0166-1</pub-id>
                    <pub-id pub-id-type="pmcid">5011856</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-23">
                <label>23</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ranjan</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rani</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Metwally</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing.</article-title>
                    <source>

                        <italic toggle="yes">Biochem Biophys Res Commun.</italic>
</source>NIH Public Access;<year>2016</year>;<volume>469</volume>(<issue>4</issue>):<fpage>967</fpage>&#x2013;<lpage>77</lpage>.
                    <pub-id pub-id-type="pmid">26718401</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.bbrc.2015.12.083</pub-id>
                    <pub-id pub-id-type="pmcid">4830092</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Siqueira</surname>
                            <given-names>JD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dominguez-Bello</surname>
                            <given-names>MG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Contreras</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Complex virome in feces from Amerindian children in isolated Amazonian villages.</article-title>
                    <source>

                        <italic toggle="yes">Nat Commun.</italic>
</source>Nature Publishing Group;<year>2018</year>;<volume>9</volume>(<issue>1</issue>):<fpage>4270</fpage>.
                    <pub-id pub-id-type="pmid">30323210</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41467-018-06502-9</pub-id>
                    <pub-id pub-id-type="pmcid">6189175</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-25">
                <label>25</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Martin</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <article-title>Cutadapt removes adapter sequences from high-throughput sequencing reads.</article-title>
                    <source>

                        <italic toggle="yes">EMBnet J.</italic>
</source>
                    <year>2011</year>;<volume>17</volume>(<issue>1</issue>):<fpage>10</fpage>&#x2013;<lpage>2</lpage>.
                    <pub-id pub-id-type="doi">10.14806/ej.17.1.200</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-26">
                <label>26</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Del Fabbro</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Scalabrin</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Morgante</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
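                    <article-title>An extensive evaluation of read trimming effects on Illumina NGS data analysis.</article-title><person-group person-group-type="editor"><name name-style="western"><surname>Seo</surname><given-names>JS</given-names></name></person-group>, editor.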
                    <source>

                        <italic toggle="yes">PLoS One.</italic>
</source>Public Library of Science;<year>2013</year>;<volume>8</volume>(<issue>12</issue>):<fpage>e85024</fpage>.
                    <pub-id pub-id-type="pmid">24376861</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pone.0085024</pub-id>
                    <pub-id pub-id-type="pmcid">3871669</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-27">
                <label>27</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Marroni</surname>
                            <given-names>F</given-names>
                        </name>
</person-group>:
                    <article-title>fabiomarroni/doyoucovme v1.2 (Version v1.2).</article-title>
                    <source>

                        <italic toggle="yes">Zenodo.</italic>
</source>
                    <year>2019</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.doi.org/10.5281/zenodo.2593798">http://www.doi.org/10.5281/zenodo.2593798</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-28">
                <label>28</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wood</surname>
                            <given-names>DE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Salzberg</surname>
                            <given-names>SL</given-names>
                        </name>
</person-group>:
                    <article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments.</article-title>
                    <source>

                        <italic toggle="yes">Genome Biol.</italic>
</source>BioMed Central;<year>2014</year>;<volume>15</volume>(<issue>3</issue>):<fpage>R46</fpage>.
                    <pub-id pub-id-type="pmid">24580807</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
                    <pub-id pub-id-type="pmcid">4053813</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-29">
                <label>29</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ondov</surname>
                            <given-names>BD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bergman</surname>
                            <given-names>NH</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Phillippy</surname>
                            <given-names>AM</given-names>
                        </name>
</person-group>:
                    <article-title>Interactive metagenomic visualization in a Web browser.</article-title>
                    <source>

                        <italic toggle="yes">BMC Bioinformatics.</italic>
</source>
                    <year>2011</year>;<volume>12</volume>(<issue>1</issue>):<fpage>385</fpage>.
                    <pub-id pub-id-type="pmid">21961884</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-12-385</pub-id>
                    <pub-id pub-id-type="pmcid">3190407</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-30">
                <label>30</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chao</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>Nonparametric estimation of the number of classes in a population.</article-title>
                    <source>

                        <italic toggle="yes">Scand J Statist.</italic>
</source>
                    <year>1984</year>;<volume>11</volume>(<issue>4</issue>):<fpage>265</fpage>&#x2013;<lpage>70</lpage>.
                    <ext-link ext-link-type="uri" xlink:href="http://dns2.asia.edu.tw/~ysho/YSHO-English/1000%20Taiwan%20(Independent)/PDF/Sca%20J%20Sta11,%20265.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-31">
                <label>31</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Good</surname>
                            <given-names>IJ</given-names>
                        </name>
</person-group>:
                    <article-title>The Population Frequencies of Species and the Estimation of Population Parameters.</article-title>
                    <source>

                        <italic toggle="yes">Biometrika.</italic>
</source>
                    <year>1953</year>;<volume>40</volume>(<issue>3/4</issue>):<fpage>237</fpage>&#x2013;<lpage>264</lpage>.
                    <pub-id pub-id-type="doi">10.2307/2333344</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-32">
                <label>32</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Shannon</surname>
                            <given-names>CE</given-names>
                        </name>
</person-group>:
                    <article-title>A Mathematical Theory of Communication.</article-title>
                    <source>

                        <italic toggle="yes">Bell Syst Tech J.</italic>
</source>
                    <year>1948</year>;<volume>27</volume>(<issue>3</issue>):<fpage>379</fpage>&#x2013;<lpage>423</lpage>.
                    <pub-id pub-id-type="doi">10.1002/j.1538-7305.1948.tb01338.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-33">
                <label>33</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pielou</surname>
                            <given-names>EC</given-names>
                        </name>
</person-group>:
                    <article-title>The measurement of diversity in different types of biological collections.</article-title>
                    <source>

                        <italic toggle="yes">J Theor Biol.</italic>
</source>
                    <year>1966</year>;<volume>13</volume>:<fpage>131</fpage>&#x2013;<lpage>44</lpage>.
                    <pub-id pub-id-type="doi">10.1016/0022-5193(66)90013-0</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-34">
                <label>34</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Oksanen</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Blanchet</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Friendly</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>vegan: Community Ecology Package</article-title>.<year>2017</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://cran.r-project.org/web/packages/vegan/vegan.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-35">
                <label>35</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>CM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Luo</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct 
                        <italic toggle="yes">de Bruijn</italic> graph.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2015</year>;<volume>31</volume>(<issue>10</issue>):<fpage>1674</fpage>&#x2013;<lpage>6</lpage>.
                    <pub-id pub-id-type="pmid">25609793</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btv033</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-36">
                <label>36</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sim&#x00e3;o</surname>
                            <given-names>FA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Waterhouse</surname>
                            <given-names>RM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ioannidis</surname>
                            <given-names>P</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2015</year>;<volume>31</volume>(<issue>19</issue>):<fpage>3210</fpage>&#x2013;<lpage>2</lpage>.
                    <pub-id pub-id-type="pmid">26059717</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btv351</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-37">
                <label>37</label>
                <mixed-citation publication-type="journal">
                    <collab>R Core Team</collab>:
                    <article-title>R: A language and environment for statistical computing.</article-title>R Foundation for Statistical Computing, Vienna, Austria.<year>2018</year>.</mixed-citation>
            </ref>
            <ref id="ref-38">
                <label>38</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wally</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Schneider</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Thannesberger</surname>
                            <given-names>J</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Plasmid DNA contaminant in molecular reagents.</article-title>
                    <source>

                        <italic toggle="yes">Sci Rep.</italic>
</source>
                    <year>2019</year>;<volume>9</volume>(<issue>1</issue>):<fpage>1652</fpage>.
                    <pub-id pub-id-type="pmid">30733546</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41598-019-38733-1</pub-id>
                    <pub-id pub-id-type="pmcid">6367390</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-39">
                <label>39</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Salter</surname>
                            <given-names>SJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cox</surname>
                            <given-names>MJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Turek</surname>
                            <given-names>EM</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Reagent and laboratory contamination can critically impact sequence-based microbiome analyses.</article-title>
                    <source>

                        <italic toggle="yes">BMC Biol.</italic>
</source>
                    <year>2014</year>;<volume>12</volume>(<issue>1</issue>):<fpage>87</fpage>.
                    <pub-id pub-id-type="pmid">25387460</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s12915-014-0087-z</pub-id>
                    <pub-id pub-id-type="pmcid">4228153</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-40">
                <label>40</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hern&#x00e1;ndez Rodr&#x00ed;guez</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Mart&#x00ed;nez G&#x00f3;mez</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Guti&#x00e9;rrez Palomino</surname>
                            <given-names>P</given-names>
                        </name>
</person-group>:
                    <article-title>
                        <italic toggle="yes">Elaeophora elaphi</italic> n. sp. (Filarioidea: Onchocercidae) parasite of the red deer 
                        <italic toggle="yes">(Cervus elaphus)</italic>. With a key of species of the genus 
                        <italic toggle="yes">Elaeophora</italic>.</article-title>
                    <source>

                        <italic toggle="yes">Ann Parasitol Hum Comp.</italic>
</source>
                    <year>1986</year>;<volume>61</volume>(<issue>4</issue>):<fpage>457</fpage>&#x2013;<lpage>63</lpage>.
                    <pub-id pub-id-type="pmid">3813427</pub-id>
                    <pub-id pub-id-type="doi">10.1051/parasite/1986614457</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-41">
                <label>41</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wommack</surname>
                            <given-names>KE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bhavsar</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ravel</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>Metagenomics: read length matters.</article-title>
                    <source>

                        <italic toggle="yes">Appl Environ Microbiol.</italic>
</source>
                    <year>2008</year>;<volume>74</volume>(<issue>5</issue>):<fpage>1453</fpage>&#x2013;<lpage>63</lpage>.
                    <pub-id pub-id-type="pmid">18192407</pub-id>
                    <pub-id pub-id-type="doi">10.1128/AEM.02181-07</pub-id>
                    <pub-id pub-id-type="pmcid">2258652</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-42">
                <label>42</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Fouhy</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Clooney</surname>
                            <given-names>AG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Stanton</surname>
                            <given-names>C</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>16S rRNA gene sequencing of mock microbial populations- impact of DNA extraction method, primer choice and sequencing platform.</article-title>
                    <source>

                        <italic toggle="yes">BMC Microbiol.</italic>
</source>
                    <year>2016</year>;<volume>16</volume>(<issue>1</issue>):<fpage>123</fpage>.
                    <pub-id pub-id-type="pmid">27342980</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s12866-016-0738-z</pub-id>
                    <pub-id pub-id-type="pmcid">4921037</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-43">
                <label>43</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Adu-Oppong</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gasparrini</surname>
                            <given-names>AJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dantas</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <article-title>Genomic and functional techniques to mine the microbiome for novel antimicrobials and antimicrobial resistance genes.</article-title>
                    <source>

                        <italic toggle="yes">Ann N Y Acad Sci.</italic>
</source>
                    <year>2017</year>;<volume>1388</volume>(<issue>1</issue>):<fpage>42</fpage>&#x2013;<lpage>58</lpage>.
                    <pub-id pub-id-type="pmid">27768825</pub-id>
                    <pub-id pub-id-type="doi">10.1111/nyas.13257</pub-id>
                    <pub-id pub-id-type="pmcid">5280215</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-44">
                <label>44</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Marroni</surname>
                            <given-names>F</given-names>
                        </name>
</person-group>:
                    <article-title>Do you cov me</article-title>.<year>2019</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.doi.org/10.17605/OSF.IO/Y7C39">http://www.doi.org/10.17605/OSF.IO/Y7C39</ext-link>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report48341">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.20298.r48341</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Dal Grande</surname>
                        <given-names>Francesco</given-names>
                    </name>
                    <xref ref-type="aff" rid="r48341a1">1</xref>
                    <xref ref-type="aff" rid="r48341a2">2</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-1865-6281</uri>
                </contrib>
                <aff id="r48341a1">
                    <label>1</label>LOEWE Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany</aff>
                <aff id="r48341a2">
                    <label>2</label>Senckenberg Biodiversity and Climate Research Centre, Senckenberg Gesellschaft f&#x00fc;r Naturforschung, Frankfurt am Main, Germany</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>30</day>
                <month>5</month>
                <year>2019</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Dal Grande F</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport48341" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.16804.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>In this manuscript the authors aimed at evaluating the use of shallow shotgun metagenomic sequencing for the characterisation of species diversity and the reconstruction of genomes in complex Illumina read sets. Overall, the manuscript is well written and contains interesting information that may be useful to others in figuring out a required metagenomic sequencing depth for a given goal.</p>
            <p>The manuscript has been vastly improved in the current version; however, I feel that it still needs a thorough revision to address a few major issues in order to ensure the general validity of the findings.</p>
            <p> The three major issues to address are, in my opinion, the following: 
                <list list-type="order">
                    <list-item>
                        <p>
                            <bold>Overestimation of diversity</bold>: The authors decided to base their diversity analyses on the raw output from kraken2. However, as the authors themselves mention, "species represented by only one read are unlikely to be real". This is quite evident in the report for the 20-species mock community, which instead comprises &gt;2000 species. I strongly recommend using a threshold (e.g., 0.005% of the total number of reads) to filter out likely false positives. For this purpose, the authors could use the mock community to evaluate results at different thresholds and thereby optimise threshold selection.</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Inaccuracy of species-level abundances</bold>: in their analysis the authors assumed that read abundances reflect species abundance. However, this is often not the case, especially when closely related taxa are present in the sample; the accuracy of abundance estimation further depends on the database used (Lu 
                            <italic>et al.</italic> 2017). The authors themselves hint at this when discussing the misclassification of 
                            <italic>Staphylococcus lugdunensis</italic>, likely due to the presence of other confounding 
                            <italic>Staphylococcus</italic> reads. To address this issue, the authors could use Bracken (from the developers of kraken, Lu 
                            <italic>et al.</italic> 2017). Bracken uses the classification results of kraken to re-estimate relative species abundances, taking into account how much sequence from each species is identical to other genomes in the database.</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Inaccurate assessment of genome reconstruction ability</bold>: considering the classification biases mentioned above and the complexity of the investigated metagenomic data sets, it might be better to base the assessment of the effects of coverage reduction on metagenome reconstruction solely on the mock community data. First, the authors would need to bin the metagenomic contigs into individual species (using kraken2 and/or other binning approaches). The individual bins (i.e., species) should then be evaluated for completeness using BUSCO and compared.</p>
                    </list-item>
                </list> </p>
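            <p>As an illustration of the relative-abundance filter suggested in point 1, the following is a minimal sketch and not the authors' pipeline. It assumes that per-species read counts have already been extracted from a classifier report into a dictionary. The function name, the example counts, and the choice of the summed counts as the denominator are hypothetical.</p>
            <preformat preformat-type="code">
```python
def filter_by_relative_abundance(read_counts, min_fraction=0.00005):
    """Keep species whose share of the supplied reads is at least min_fraction.

    read_counts maps a species name to its number of assigned reads.
    The default min_fraction of 0.00005 corresponds to 0.005%.
    Assumption: the denominator is the sum of the supplied counts.
    """
    total = sum(read_counts.values())
    if total == 0:
        # Nothing was classified, so nothing can pass the filter.
        return {}
    return {
        species: count
        for species, count in read_counts.items()
        if count / total >= min_fraction
    }


# Hypothetical counts, invented for illustration only.
counts = {"species_A": 900_000, "species_B": 99_990, "species_C": 10}
kept = filter_by_relative_abundance(counts)
print(sorted(kept))
```
</preformat>
            <p>In this invented example the species supported by only 10 of 1,000,000 reads (0.001%) falls below the 0.005% cut-off and is dropped, while the other two are retained.</p>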
            <p> In summary, this work (and, by extension, future studies using a similar approach) could greatly benefit from the inclusion of a baseline estimate for species diversity and metagenome reconstruction, even if it is derived from a single mock community. The additional data sets could then be used to validate these estimates against real data.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the method technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>metagenomics, metatranscriptomics, community ecology, symbiosis, population genomics, metabarcoding, biotic interactions</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-48341-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author">
                            <name>
                                <surname>Lu</surname>
                                <given-names>J</given-names>
                            </name>
                            <etal/>
                        </person-group>:
                        <article-title>Bracken: estimating species abundance in metagenomics data</article-title>.
                        <source>
                            <italic>PeerJ Computer Science</italic>
                        </source>.<year>2017</year>;<volume>3</volume>:
                        <elocation-id>e104</elocation-id>
                        <pub-id pub-id-type="doi">10.7717/peerj-cs.104</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment4771-48341">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Cattonaro</surname>
                            <given-names>Federica</given-names>
                        </name>
                        <aff>IGA Technology Services Srl, Italy</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>23</day>
                    <month>7</month>
                    <year>2019</year>
                </pub-date>
            </front-stub>
            <body>
                <p>
                    <bold>
                        <italic>In this manuscript the authors aimed at evaluating the use of shallow shotgun metagenomic sequencing for the characterisation of species diversity and the reconstruction of genomes in complex Illumina read sets. Overall, the manuscript is well written and contains interesting information that may be useful to others in figuring out a required metagenomic sequencing depth for a given goal.</italic>
                    </bold>
                </p>
                <p>
                    <bold>
                        <italic>The manuscript has been vastly improved in the current version; however, I feel that it still needs a thorough revision to address a few major issues in order to ensure the general validity of the findings.</italic>
                    </bold>
                </p>
                <p>We thank the reviewer for the suggestions. We implemented them and updated the manuscript accordingly.</p>
                <p>
                    <bold>
                        <italic>The three major issues to address are, in my opinion, the following: </italic>
                    </bold> 
                    <list list-type="order">
                        <list-item>
                            <p>
                                <bold>
                                    <italic>Overestimation of diversity: The authors decided to base their diversity analyses on the raw output from kraken2. However, as the authors themselves mention, "species represented by only one read are unlikely to be real". This is quite evident in the report for the 20-species mock community, which instead comprises &gt;2000 species. I strongly recommend using a threshold (e.g., 0.005% of the total number of reads) to filter out likely false positives. For this purpose, the authors could use the mock community to evaluate results at different thresholds and thereby optimise threshold selection.</italic>
                                </bold>
                            </p>
                            <p>See answer to point 2.</p>
                        </list-item>
                        <list-item>
                            <p>
                                <bold>
                                    <italic>Inaccuracy of species-level abundances: in their analysis the authors assumed that read abundances reflect species abundance. However, this is often not the case, especially when closely related taxa are present in the sample; the accuracy of abundance estimation further depends on the database used (Lu et al. 2017). The authors themselves hint at this when discussing the misclassification of Staphylococcus lugdunensis, likely due to the presence of other confounding Staphylococcus reads. To address this issue, the authors could use Bracken (from the developers of kraken, Lu et al. 2017). Bracken uses the classification results of kraken to re-estimate relative species abundances, taking into account how much sequence from each species is identical to other genomes in the database.</italic>
                                </bold>
                            </p>
                            <p>We used suggestions 1 and 2 (together with suggestions from reviewer 1) to improve the estimation of species abundances. After classifying reads with kraken2, we used bracken to re-estimate species abundance only for species represented by at least 10 reads. Then, using the only gold standard we had (the mock community), we measured performance at different detection thresholds. Our results suggested that a detection threshold of 0.1% gave the highest F1 score, i.e., the best balance between minimizing false negatives and false positives and maximizing true positives.</p>
                        </list-item>
                        <list-item>
                            <p>
                                <bold>
                                    <italic>Inaccurate assessment of genome reconstruction ability: considering the classification biases mentioned above and the complexity of the investigated metagenomic data sets, it might be better to base the assessment of the effects of coverage reduction on metagenome reconstruction solely on the mock community data. First, the authors would need to bin the metagenomic contigs into individual species (using kraken2 and/or other binning approaches). The individual bins (i.e., species) should then be evaluated for completeness using BUSCO and compared.</italic>
                                </bold>
                            </p>
                            <p>The results presented in version 2 of our paper were already based on a binning approach: we classified contigs using kraken, ran BUSCO for each species, and then averaged the proportion of BUSCO genes across species. However, in version 2 we made what we now consider a mistake: we averaged the proportion of BUSCO genes only across species for which at least one BUSCO gene was reconstructed. This led to a slight overestimation of the number of reconstructed BUSCO genes. We therefore repeated the analysis, averaging the proportion of BUSCO genes over all species above the detection threshold, including those for which no BUSCO gene was reconstructed. The new approach is explained in the Methods section, and the new plot is now Figure 7. In addition, we liked the idea of using the mock community and performed a new analysis, now shown in Figure 6. The results are very interesting. With the full set of reads (around 5M), the majority of BUSCO genes could be reconstructed for species with a nominal abundance of 18% and 1.8%, but not for the rarer species (for which essentially no gene could be reconstructed). When only 1M reads are used for the assembly, the proportion of reconstructed BUSCO genes is nearly unchanged in abundant species and drops to less than 10% in species with a nominal frequency of 1.8%. The results and their implications for study design are briefly discussed in the paper.</p>
                        </list-item>
                    </list>
                </p>
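                <p>The difference between the two averaging schemes for BUSCO genes can be sketched as follows. This is a minimal illustration and not the pipeline used in the paper: the function name, the species labels, the gene counts and the single shared BUSCO set size are all hypothetical, and in practice the BUSCO set size depends on the lineage dataset chosen for each species.</p>
                <code language="python">
```python
def mean_busco_fraction(found_by_species, detected_species, busco_set_size,
                        include_zero=True):
    """Average the fraction of reconstructed BUSCO genes across species.

    found_by_species: dict, species -> number of BUSCO genes reconstructed
    detected_species: species above the detection threshold
    busco_set_size:   number of genes in the BUSCO set (assumed shared here)
    include_zero:     if False, skip species with no reconstructed gene
    """
    fractions = []
    for species in detected_species:
        found = found_by_species.get(species, 0)
        if found == 0 and not include_zero:
            continue
        fractions.append(found / busco_set_size)
    return sum(fractions) / len(fractions) if fractions else 0.0


# Hypothetical toy data: four detected species, two with no BUSCO gene found.
detected = ["sp_A", "sp_B", "sp_C", "sp_D"]
found = {"sp_A": 90, "sp_B": 50}

only_nonzero = mean_busco_fraction(found, detected, 100, include_zero=False)
all_detected = mean_busco_fraction(found, detected, 100, include_zero=True)
print(round(only_nonzero, 2), round(all_detected, 2))
```
</code>
                <p>In this toy example, averaging only over species with at least one reconstructed gene gives about 0.70, whereas averaging over all detected species gives about 0.35, which shows how excluding species without any reconstructed gene inflates the average.</p>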
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report46099">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.20298.r46099</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Cobo Diaz</surname>
                        <given-names>Jos&#x00e9; F.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r46099a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0898-2358</uri>
                </contrib>
                <aff id="r46099a1">
                    <label>1</label>Laboratoire Universitaire de Biodiversit&#x00e9; et Ecologie Microbienne, IBSAM, ESIAB, Universit&#x00e9; de Brest, Plouzan&#x00e9;, France</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>25</day>
                <month>3</month>
                <year>2019</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Cobo Diaz JF</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport46099" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.16804.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>I appreciate the changes made to the introduction, because the objective of the present study is now clearer. Although the manuscript has improved considerably, there is still a big problem with the data analysis, mainly in read filtering.</p>
            <p> </p>
            <p>Now that you have included a mock community sample, you need to use it to tune the parameters of read filtering, the clustering step (I assume you have done some kind of clustering, since you talk about singletons) and taxonomic assignment until you obtain the expected number of species, 20 in this case. You may obtain somewhat fewer because of problems with species assignment, but it is crazy to use a 20-species mock community and report 2571 species in this sample. For example, singletons (clusters or OTUs (Operational Taxonomic Units) with a single sequence) are usually removed in metabarcoding pipelines, and in some cases OTUs with less than 0.1% abundance are removed, on the assumption that these sequences are sequencing errors (and PCR errors in metabarcoding). Therefore, you have to use the mock sample to estimate the minimum percentage abundance at which a taxon is considered real (and not due to errors), and apply this cutoff to the rest of the samples.</p>
            <p> </p>
            <p>Along the same lines, it is not correct to say that 2,507 and 4,597 species were found in the vaccines, where you can expect the DNA from varicella (the other viruses are ssRNA) and the DNA from the human and chicken cells used for culture.</p>
            <p> </p>
            <p> Some small changes I suggest: 
                <list list-type="bullet">
                    <list-item>
                        <p>Rewrite or remove the last paragraph of the introduction, which looks more appropriate for the Methods.</p>
                    </list-item>
                    <list-item>
                        <p>Add some disadvantages of the metabarcoding approach (the main one being primer bias, with over- or under-estimation of some taxa depending on the primers used).</p>
                    </list-item>
                    <list-item>
                        <p>At the end of the samples description, you need to state what SRA stands for (and add the corresponding web address).</p>
                    </list-item>
                    <list-item>
                        <p>In the samples description, there is a mistake with &#x201c;human faecal&#x201d; (it should be &#x201c;human fecal&#x201d;).</p>
                    </list-item>
                    <list-item>
                        <p>Remove this sentence from results: To ensure that our conclusions have a general validity, we selected samples originating from very different sources with different compositions, and sequenced them at different depths.</p>
                    </list-item>
                    <list-item>
                        <p>For Figure 3, the species and genus levels are enough.</p>
                    </list-item>
                </list> Thus, the read filtering, and hence all the statistical analysis, has to be redone. I do not expect big changes, even at the taxonomic level (where only a reduction of "rare species" and unclassified sequences is expected), but it is not appropriate to present results with such a great overestimation of species richness.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>No</p>
            <p>Is the description of the method technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>No</p>
            <p>Reviewer Expertise:</p>
            <p>microbial ecology, metabarcoding sequencing, NGS data analysis, bacterial communities, fungal communities</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment4772-46099">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Cattonaro</surname>
                            <given-names>Federica</given-names>
                        </name>
                        <aff>IGA Technology Services Srl, Italy</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>23</day>
                    <month>7</month>
                    <year>2019</year>
                </pub-date>
            </front-stub>
            <body>
                <p>
                    <bold>
                        <italic>I appreciate the changes made to the introduction, because the objective of the present study is now clearer. Although the manuscript has improved considerably, there is still a big problem with the data analysis, mainly in read filtering.</italic>
                    </bold>
                </p>
                <p>
                    <bold>
                        <italic>Now that you have included a mock community sample, you need to use it to tune the parameters of read filtering, the clustering step (I assume you have done some kind of clustering, since you talk about singletons) and taxonomic assignment until you obtain the expected number of species, 20 in this case. You may obtain somewhat fewer because of problems with species assignment, but it is crazy to use a 20-species mock community and report 2571 species in this sample. For example, singletons (clusters or OTUs (Operational Taxonomic Units) with a single sequence) are usually removed in metabarcoding pipelines, and in some cases OTUs with less than 0.1% abundance are removed, on the assumption that these sequences are sequencing errors (and PCR errors in metabarcoding). Therefore, you have to use the mock sample to estimate the minimum percentage abundance at which a taxon is considered real (and not due to errors), and apply this cutoff to the rest of the samples.</italic>
                    </bold>
                </p>
                <p>
                    <bold>
                        <italic>Along the same lines, it is not correct to say that 2,507 and 4,597 species were found in the vaccines, where you can expect the DNA from varicella (the other viruses are ssRNA) and the DNA from the human and chicken cells used for culture.</italic>
                    </bold>
                </p>
                <p>Following your suggestions (and similar suggestions from reviewer 3), we have now adopted more stringent criteria for determining the presence of a species. As both reviewers suggested, we used the mock community to define a threshold. We use Bracken to refine the species abundance estimates (Bracken already applies a very permissive threshold, i.e. it ignores OTUs with fewer than 10 reads). We then compared the Bracken results with the known composition of the mock community and chose the abundance threshold that maximized the F1 score (the harmonic mean of precision and recall). The threshold giving the best trade-off was 0.1%.</p>
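                <p>As an illustration only, the selection of an abundance threshold by maximizing the F1 score against a known mock composition can be sketched as below. This is not the code used in the paper: the function names, the species labels, the toy abundances and the list of candidate thresholds are all hypothetical, and the abundances are not real Bracken output.</p>
                <code language="python">
```python
def precision_recall_f1(abundances, truth, threshold):
    """Score species calls (abundance >= threshold) against a known truth set."""
    called = {sp for sp, frac in abundances.items() if frac >= threshold}
    tp = len(called.intersection(truth))   # expected species that were called
    fp = len(called.difference(truth))     # called species not in the mock
    fn = len(truth.difference(called))     # expected species that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def best_threshold(abundances, truth, candidates):
    """Return the candidate threshold with the highest F1 score.

    On ties, max() keeps the first candidate in the given order.
    """
    return max(candidates,
               key=lambda t: precision_recall_f1(abundances, truth, t)[2])


# Hypothetical toy data (relative abundances as fractions, 0.001 = 0.1%).
truth = {"sp_A", "sp_B", "sp_C"}
abundances = {"sp_A": 0.60, "sp_B": 0.397, "sp_C": 0.002,
              "sp_X": 0.0006, "sp_Y": 0.0004}
candidates = [0.0001, 0.001, 0.01]

chosen = best_threshold(abundances, truth, candidates)
print(chosen, precision_recall_f1(abundances, truth, chosen))
```
</code>
                <p>In this toy example the lowest candidate admits two false positives, the highest one drops a true low-abundance species, and the intermediate candidate recovers exactly the expected species, so it is the one selected.</p>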
                <p>As a side effect of filtering out OTUs with a frequency below 0.1%, we no longer have any narrow-sense singletons. As a consequence, the number of observed taxa and the Chao1 index coincide, and Good&#x2019;s coverage estimator is always 1. We therefore removed these two statistics from our panel plot.</p>
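                <p>Why the two statistics become uninformative once singletons are gone can be checked with a short sketch. It assumes the bias-corrected form of Chao1 and the usual definition of Good&#x2019;s coverage; which exact formulas were used in the paper is not stated here, and the read counts and function names below are hypothetical.</p>
                <code language="python">
```python
def singletons(counts):
    """Number of taxa observed exactly once (F1)."""
    return sum(1 for c in counts if c == 1)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = singletons(counts)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def goods_coverage(counts):
    """Good's coverage: 1 - F1 / N, with N the total number of reads."""
    return 1 - singletons(counts) / sum(counts)


# Hypothetical read counts per taxon, before and after a 0.1% abundance filter.
raw = [5000, 1200, 300, 2, 1, 1]
total = sum(raw)
filtered = [c for c in raw if c / total >= 0.001]

print(chao1(raw), goods_coverage(raw))
print(chao1(filtered), goods_coverage(filtered), len(filtered))
```
</code>
                <p>With no singletons left, the correction term of Chao1 vanishes, so the estimate equals the number of observed taxa, and Good&#x2019;s coverage equals 1 regardless of sequencing depth.</p>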
                <p>In addition, we removed the paragraph on the &#x201c;detection threshold&#x201d; and the corresponding Table 2, since we now determine the threshold 
                    <italic>a priori</italic> based on the mock community and these parts are no longer needed.</p>
                <p>
                    <bold>
                        <italic>Some small changes I suggest: </italic>
                    </bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>
                                    <italic>Rewrite or remove the last paragraph of the introduction, which looks more appropriate for the Methods. </italic>
                                </bold>
                            </p>
                        </list-item>
                    </list> We removed the last paragraph. 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>
                                    <italic>Add some disadvantages of the metabarcoding approach (the main one being primer bias, with over- or under-estimation of some taxa depending on the primers used). </italic>
                                </bold>
                            </p>
                        </list-item>
                    </list> We added a sentence and a reference on the limitations of metabarcoding approaches to the introduction. 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>
                                    <italic>At the end of the samples description, you need to state what SRA stands for (and add the corresponding web address). </italic>
                                </bold>
                            </p>
                        </list-item>
                    </list> Done. 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>
                                    <italic>In the samples description, there is a mistake with &#x201c;human faecal&#x201d; (it should be &#x201c;human fecal&#x201d;). </italic>
                                </bold>
                            </p>
                        </list-item>
                    </list> Amended. 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>
                                    <italic>Remove this sentence from results: To ensure that our conclusions have a general validity, we selected samples originating from very different sources with different compositions, and sequenced them at different depths. </italic>
                                </bold>
                            </p>
                        </list-item>
                    </list> Sentence removed. 
                    <list list-type="bullet">
                        <list-item>
                            <p>
                                <bold>
                                    <italic>For Figure 3, the species and genus levels are enough. </italic>
                                </bold>
                            </p>
                        </list-item>
                    </list> While modifying the figure as requested by the reviewer, we realized that the species-level results presented in Figure 3 are also presented in the first panel of Figure 4. Since the genus-level results did not add much information, we decided to remove Figure 3.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report42422">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.18370.r42422</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Cobo Diaz</surname>
                        <given-names>Jos&#x00e9; F.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r42422a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0898-2358</uri>
                </contrib>
                <aff id="r42422a1">
                    <label>1</label>Laboratoire Universitaire de Biodiversit&#x00e9; et Ecologie Microbienne, IBSAM, ESIAB, Universit&#x00e9; de Brest, Plouzan&#x00e9;, France</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>4</day>
                <month>1</month>
                <year>2019</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Cobo Diaz JF</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport42422" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.16804.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors proposed and evaluated the influence of reduced sequencing effort (number of sequences) in a whole metagenome shotgun analysis, using the Illumina platform, on the species composition and diversity indexes of the communities studied. Although the idea and hypothesis are good, some problems were found in the experimental design and data analysis.</p>
            <p>With regard to the questions in the peer review form, this is not a new method, only the adaptation of a current methodology to optimize cost and increase the number of samples that can be analyzed per run on the Illumina platform. Although the introduction clearly explains the reasons for using shotgun sequencing, mainly to analyze viral data and functional data for all organisms, these points are not emphasized in the results and discussion. The samples used (vaccines, horse fecal samples and food samples) and the introduction point to pathogen detection as the main objective of the approach, including viruses, which cannot be screened by amplicon approaches such as metabarcoding sequencing. I suggest adapting the text to focus on the pathogens (mainly viruses) found across the sub-samples taken from each sample. For that purpose, some contaminated samples (or uncontaminated samples spiked with known amounts of DNA from pathogenic viruses) have to be used to determine the lowest pathogen concentration that can be detected at each proposed shotgun sequencing coverage.</p>
            <p>Many problems were found with the methodology, mainly regarding the parameters used in each step and/or software for data filtering and analysis, which are critical because the results can vary strongly depending on them. Hence, the methodology as described does not allow replication of the method used. Moreover, there are mistakes in species designation: the at least 2508 species found in the vaccine samples indicate big problems in read filtering and data analysis, because such a number of species is typically found in more complex systems, such as soil samples from agricultural fields. In addition, species-level classification using taxonomic markers such as ITS or 16S rRNA is risky with sequences shorter than 400 bp, and sometimes even with longer sequences. In the current manuscript, the use of non-marker sequences of 150 bp enormously increases the number of sequences not correctly assigned at the species level, and in several cases also at higher taxonomic levels (genus, family...). Therefore, I suggest clarifying how the species assignment was done, because it looks as though each gene-species combination was considered as one species, so that each gene found for a single species was counted as a new species.</p>
            <p>The alpha diversity indexes employed are not, in my opinion, the best ones to describe or compare the sub-samples proposed in this manuscript. The Chao1 index, an estimator of richness, is strongly influenced by the number of singletons obtained in the samples, which, given the complexity of the sample data, tends to be high. The Shannon index is influenced by both richness (number of taxa) and evenness (equitability, Pielou index), and the reduction of richness due to the loss of rare taxa has a strong influence on this index. I propose using the number of observed taxa instead of estimated taxa, and an evenness index, such as the Pielou index, instead of the Shannon index. Moreover, a coverage index, such as Good&#x2019;s coverage index, could be useful for comparing the loss of information associated with sample size or coverage.</p>
            <p> </p>
            <p>In conclusion, although the raw data may contain some important information, the manuscript has to be improved with new &#x201c;pathogen contaminated&#x201d; samples and rewritten to focus on the detection of pathogens, which, owing to their low abundance in the samples, may go undetected depending on the shotgun coverage.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>No</p>
            <p>Is the description of the method technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>No</p>
            <p>Reviewer Expertise:</p>
            <p>microbial ecology, metabarcoding sequencing, NGS data analysis, bacterial communities, fungal communities</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report40445">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.18370.r40445</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Sanchez-Flores</surname>
                        <given-names>Alejandro</given-names>
                    </name>
                    <xref ref-type="aff" rid="r40445a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-0476-3139</uri>
                </contrib>
                <aff id="r40445a1">
                    <label>1</label>Institute of Biotechnology, National Autonomous University of Mexico (UNAM), Cuernavaca, Mexico</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>27</day>
                <month>11</month>
                <year>2018</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2018 Sanchez-Flores A</copyright-statement>
                <copyright-year>2018</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport40445" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.16804.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors propose and evaluate a whole metagenome shotgun analysis via a&#x00a0;low sequencing yield approach, using the Illumina platform.</p>
            <p> </p>
            <p> In general, the idea and hypothesis are good, but the experimental design itself lacks important controls and there are many variables that are not analyzed and that can potentially bias the results.</p>
            <p> </p>
            <p>My main concern is that the samples used differ in many variables and, despite the use of a "replicate" for each case, samples of the same type were very different. The nature of each sample could also have an effect on DNA isolation, in particular for the vaccine samples. Regarding the vaccines, it is also not clear to me whether what the authors are looking for is the DNA of potential contaminants, since all viruses in the vaccine are ssRNA. That would be my guess, but it is not clear from the text.</p>
            <p> </p>
            <p>The main problem is that, to test the influence of sequencing yield, it would be extremely important to know the initial DNA concentration of each organism in the sample. Therefore, a mock metagenome or controlled sample would be much better as a reference against which to compare real-life cases. In real-life cases, the detection of an organism through its DNA is not necessarily an indicator of the presence of living organisms. Depending on the case, the mere presence of an organism&#x2019;s DNA could be an indicator of contamination, which in the case of vaccines could be really bad. In the case of food material, however, finding pathogen DNA has to be backed up by microbiology tests. Moreover, with low sequencing yield, it is very probable that DNA present in very low amounts will be missed, even if this does not change diversity indexes such as Chao1 and Shannon.</p>
            <p> </p>
            <p> Finally, the main difference where low yield has a significant impact can be observed in the fecal samples. This is expected since among all the tested samples, fecal ones are the most diverse and sub-sampling will really affect them as observed in Figure 3.</p>
            <p> </p>
            <p>Since the composition of each sample is not known 
                <italic>a priori</italic>, several factors can contribute to biases. As mentioned, the DNA concentration, but also its integrity (fragmentation), will affect library construction; the cited kit requires DNA amplification, which will introduce a bias towards GC-rich genomic regions; the library size was not described, and it was not mentioned whether the samples were pooled with other libraries with different insert sizes, which affects not only the sequencing quality but also the yield.</p>
            <p> </p>
            <p>In terms of bioinformatics analysis, the parameters used for each program must be reported, in case someone wants to reproduce the work. For Kraken2, it is important to know the k-mer size used to index the database. For the MEGAHIT assembly, it is important to know the k-mer and step sizes used. For the completeness assessment, the authors used BUSCO, but apparently they used the whole assembly to assess completeness. This is not correct: they must first separate the contigs into bins to establish which genomes were actually reconstructed, and only then assess the completeness of each. They could then report an average completeness value for all the reconstructed genomes. With binning, they would obtain a better analysis of what was really reconstructed and how complete it was.</p>
            <p> </p>
            <p>The use of Krona in Figure 2 is not very convenient. The whole point of a Krona graph is that it is interactive. If the authors want to provide the Krona data for download, that would be possible and recommended. Having said that, I recommend using bar plots to represent the relative abundance and composition of the samples at a given taxonomic level.</p>
            <p> </p>
            <p> Again, the idea is very good but the work needs to be improved before indexing.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the method technically sound?</p>
            <p>No</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Genomics, Transcriptomics, Metagenomics, Bioinformatics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment4267-40445">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Cattonaro</surname>
                            <given-names>Federica</given-names>
                        </name>
                        <aff>IGA Technology Services Srl, Italy</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>30</day>
                    <month>11</month>
                    <year>2018</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We are grateful for the constructive comments. We agree with all of them and we are planning corrective actions, listed below.</p>
                <p>
                    <bold>
                        <italic>My main concern is that the samples used differ in many variables and, despite the use of a "replicate" for each case, samples of the same type were very different.</italic>
                    </bold>
                </p>
                <p>The observation is correct. In fact, the diversity of the samples was sought on purpose, in order to be able to generalize the conclusions of our paper. The fact that diversity estimates and species abundance estimates remain reliable even with strong down-sampling for all of the samples encourages us to think that this is a general (although not necessarily universal) observation. The same is true for the observation that de novo assembly quickly loses accuracy as the number of sequenced reads decreases. Maybe this was not made clear enough in the paper, and we will clarify it.</p>
                <p>
                    <bold>
                        <italic>Also the nature of each sample could have an effect in the DNA isolation, in particular for the vaccine ones. </italic>
                    </bold>
                </p>
                <p>Quantities of DNA isolated from vaccine samples (B1 and B2) were estimated to be ~2 &#x00b5;g using a Qubit fluorometer. However, we will provide a table with full details of the quantity, concentration, quality and size of the starting DNA for all samples used in the study.</p>
                <p>
                    <bold>
                        <italic>Also, regarding the vaccines, it is not clear to me, if what they are looking for is DNA of potential contaminants, since all viruses in the vaccine are ssRNA. That would be my guess, but is not clear from the text.</italic>
                    </bold>
                </p>
                <p>The vaccine composition declared by the producer is the following:</p>
                <p>Live attenuated viruses: Measles (ssRNA) Schwarz strain, cultured in chick embryo cell cultures; Mumps (ssRNA) strain RIT 4385, derived from the Jeryl Lynn strain, cultured in chick embryo cell cultures; Rubella (ssRNA) Wistar RA 27/3 strain, grown in human diploid cells (MRC-5); Varicella (dsDNA) OKA strain, grown in human diploid cells (MRC-5).</p>
                <p>By DNA-seq we expected to find Varicella (dsDNA) OKA strain DNA, which was found and confirmed by variant analysis with respect to AB097932.1 (Human herpesvirus 3 DNA, sub strain vOka). In addition, we also found human and chicken DNA. For the human DNA, we confirmed the MRC-5 cell origin by mitochondrial genome variant analysis.</p>
                <p>Genotyping analyses gave us confidence in the validity of the obtained results, even though they were beyond the scope of this work.</p>
                <p>To identify the vaccine&#x2019;s ssRNA viruses, we extracted RNA from the same B1 and B2 samples and performed RNA-seq. This aspect is also beyond the scope of this work.</p>
                <p>
                    <bold>
                        <italic>The main problem is that to test the influence of the sequencing yield, it would be extremely important to know the initial DNA concentration of each organism in the sample. Therefore, a mock metagenome or controlled sample would be much better as a reference to compare real life cases. </italic>
                    </bold>
                </p>
                <p>A mock community experiment is already under way using &#x2018;10 Strain Staggered Mix Genomic Material (ATCC&#x00ae; MSA-1001&#x2122;)&#x2019;. The resulting data will be incorporated into the analysis.</p>
                <p>
                    <bold>
                        <italic>In real life cases, the presence of certain organisms detected by the presence of its DNA, is not necessarily an indicator of the availability of alive organisms. Depending on the case, the presence of just the organism DNA could be an indicator of contamination which in the case of vaccines could be really bad. However, in the case of food material, finding DNA of pathogens, has to be associated with microbiology tests. </italic>
                    </bold>
                </p>
                <p>We agree with the reviewer&#x2019;s observation. However, the aim of this work is to determine whether low-pass whole genome sequencing is an appropriate approach to broadly describe a complex matrix; finding and confirming contaminants in vaccines or pathogen DNA in food samples was beyond the scope of the paper.</p>
                <p>
                    <bold>
                        <italic>However, with low sequencing yield, is very probable that very DNA in low amounts will be missed, even if this is not changing diversity indexes such as Chao1 and Shannon. Finally, the main difference where low yield has a significant impact can be observed in the fecal samples. This is expected since among all the tested samples, fecal ones are the most diverse and sub-sampling will really affect them as observed in Figure 3.</italic>
                    </bold>
                </p>
                <p>We agree with the reviewer and add some thoughts to clarify. We did observe that extremely rare species (with frequencies lower than 1/10,000) are lost when subsampling to the most extreme levels. When subsampling to 100K reads we lose species with a frequency of around 1/100,000 (a very approximate estimate). However, the effect of losing such species on the global sample diversity, as estimated by the Shannon diversity index, is negligible (see Figure 4, in which we show that reduction in sequencing depth has no dramatic effect on Shannon&#x2019;s diversity index). The situation is different for the Chao1 estimator. This is expected and is due to the way Chao1 is computed: the estimator relies heavily on the number of singletons (i.e. species represented by only one read). When subsampling, singletons (i.e. the rarest species) are very likely to be lost. The same phenomenon can be inferred from Figures 5 and 6. These are scatterplots of the relative abundance of species in the full sample and in the reduced samples (100K and 10K reads, respectively). The plots are shown on a log-log scale to emphasize differences for low-frequency species. Only low-frequency species show some variation in frequency estimation. However, even when sampling only 10K reads, species with a frequency of around 0.1% (i.e. 1/1000) are appropriately quantified. All of these observations led us to conclude that coverage reduction doesn&#x2019;t prevent a satisfactory characterization of complex matrices (with the sole exception of the Chao1 estimator).</p>
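                <p>The contrast between the two indices can be illustrated with a minimal sketch. It is not the pipeline used in the paper: the community composition, the function names and the random seed are all hypothetical, and the bias-corrected form of Chao1 is assumed because the text does not state which variant was used. The sketch draws reads without replacement from a made-up community and recomputes both indices on the reduced sample.</p>
                <preformat>
```python
import math
import random
from collections import Counter


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over species with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 (assumed variant): S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)  # species seen in exactly one read
    doubletons = sum(1 for c in counts if c == 2)  # species seen in exactly two reads
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def subsample(counts, n_reads, seed=0):
    """Draw n_reads reads without replacement and return per-species counts."""
    rng = random.Random(seed)
    reads = [species for species, c in enumerate(counts) for _ in range(c)]
    drawn = Counter(rng.sample(reads, n_reads))
    return [drawn.get(species, 0) for species in range(len(counts))]


# Hypothetical community (illustrative only): 5 dominant species plus 200 rare ones.
full = [200_000, 100_000, 50_000, 20_000, 10_000] + [5] * 200
reduced = subsample(full, 10_000)

print(f"Shannon: full={shannon_index(full):.3f} reduced={shannon_index(reduced):.3f}")
print(f"Observed species: full={sum(c > 0 for c in full)} reduced={sum(c > 0 for c in reduced)}")
print(f"Chao1: full={chao1(full):.1f} reduced={chao1(reduced):.1f}")
```
                </preformat>
                <p>In this toy community each rare species accounts for 5 of 381,000 reads, so its expected count in a 10K-read subsample is well below one read and most rare species drop out. Because those species carry very little of the total abundance, the Shannon index should change only slightly, whereas Chao1 depends directly on how many singletons and doubletons survive the subsampling.</p>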
                <p>
                    <bold>
                        <italic>Since the composition of each sample is not known a priori, then there are some factors that can contribute to biases. As mentioned, the DNA concentration but also its integrity (fragmentation) will affect the library construction; the cited kit requires DNA amplification which will have a bias towards GC rich genomic regions; library size was not described. </italic>
                    </bold>
                </p>
                <p>The Nugen Ovation&#x00ae; Ultralow System V4 kit used is a standard kit for NGS library preparation (
                    <ext-link ext-link-type="uri" xlink:href="https://www.nugen.com/sites/default/files/DS_v2-Ovation_Ultralow_V2.pdf">https://www.nugen.com/sites/default/files/DS_v2-Ovation_Ultralow_V2.pdf</ext-link>).
                </p>
                <p>It is a standard protocol widely used by the scientific community to perform DNA-seq, including from low input DNA quantities (1 ng), although in our case the input DNA was of moderate quantity. The mock community experiment will shed light on possible biases.</p>
                <p>DNA concentration and integrity, as well as the input DNA quantities used in library construction and the library insert sizes, will be reported in version 2 of the paper.</p>
                <p>
                    <bold>
                        <italic>It was not mentioned if the samples were pooled with other libraries with different insert sizes, which affect not only the sequencing quality but the yield.</italic>
                    </bold>
                </p>
                <p>Samples were sequenced in different runs and pooled with other libraries of similar insert sizes. The number of reads obtained per sample reflects the quantity of each library, 
                    <italic>i.e.</italic> the nmol loaded on the sequencer.</p>
                <p>
                    <bold>
                        <italic>In terms of bioinformatics analysis, it will be required to put the parameters used for each program, in case someone wants to reproduce this. For Kraken2, it is important to know what is the kmer size to index the database. For MEGAHIT assembly it will be important to know the kmer and step sizes used. </italic>
                    </bold>
                </p>
                <p>All these details will be provided in version 2 of the paper.</p>
                <p>
                    <bold>
                        <italic>For the completeness assessment, the authors used BUSCO, but apparently they are using the whole assembly to assess the completeness. This is not correct, since they must first separate in bins which genomes they have really reconstructed and then they can assess the completeness of them. Probably they can report the an average completeness value for all the reconstructed genomes. By doing the binning they can have a better analysis of what was really reconstructed and how complete it was.</italic>
                    </bold>
                </p>
                <p>This is a good point. While our aim was to estimate the total proportion of BUSCO genes that were reconstructed, irrespective of the species to which they belong, we understand that a practical application is likely to require separating the reconstructed genomes. We will supplement our analysis by binning the reconstructed genomes.</p>
                <p>
                    <bold>
                        <italic>The use of Krona in Figure 2 is not very convenient. The whole point of a Krona graph is that is interactive. If authors want to provide the Krona data to be downloaded it would be possible and recommended. Having said that, I recommend to use bar plots to represent the relative abundance and composition of the samples at a given taxa level.</italic>
                    </bold>
                </p>
                <p>We will provide a link to interactive Krona graphs, bar plots reporting the relative abundance and composition of the samples, or both.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
