<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.19675.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Large-scale sequence comparisons with 
                    <italic>sourmash</italic>
                </article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 2 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>Pierce</surname>
                        <given-names>N. Tessa</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-2942-5331</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>Irber</surname>
                        <given-names>Luiz</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-4371-9659</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>Reiter</surname>
                        <given-names>Taylor</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>Brooks</surname>
                        <given-names>Phillip</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-3987-244X</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Brown</surname>
                        <given-names>C. Titus</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6001-2677</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Department of Population Health and Reproduction, University of California, Davis, Davis, California, 95616, USA</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:ctbrown@ucdavis.edu">ctbrown@ucdavis.edu</email>
                </corresp>
                <fn>
                    <p id="fn1">*Equal contributors</p>
                </fn>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>4</day>
                <month>7</month>
                <year>2019</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2019</year>
            </pub-date>
            <volume>8</volume>
            <elocation-id>1006</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>28</day>
                    <month>6</month>
                    <year>2019</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Pierce NT et al.</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/8-1006/pdf"/>
            <abstract>
                <p>The sourmash software package uses MinHash-based sketching to create &#x201c;signatures&#x201d;, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at 
                    <ext-link ext-link-type="uri" xlink:href="http://github.com/dib-lab/sourmash">http://github.com/dib-lab/sourmash</ext-link>.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>sequence analysis</kwd>
                <kwd>MinHash</kwd>
                <kwd>k-mer</kwd>
                <kwd>sourmash</kwd>
                <kwd>bioinformatics</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="http://dx.doi.org/10.13039/100000936">
                    <funding-source>Gordon and Betty Moore Foundation</funding-source>
                    <award-id>GBMF4551</award-id>
                </award-group>
                <award-group id="fund-2" xlink:href="http://dx.doi.org/10.13039/501100008982">
                    <funding-source>National Science Foundation</funding-source>
                    <award-id>1711984</award-id>
                </award-group>
                <funding-statement>This work is funded in part by the Gordon and Betty Moore Foundation&#x2019;s Data-Driven Discovery Initiative [GBMF4551 to CTB]. NTP was supported by a National Science Foundation Postdoctoral Fellowship in Biology [1711984].</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Bioinformatic analyses rely on sequence comparison for many applications, including variant analysis, taxonomic classification and functional annotation. As the Sequence Read Archive now contains over 20 Petabases of data
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup>, there is great need for methods to quickly and efficiently conduct similarity searches on a massive scale. MinHash techniques
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup> use random sampling of k-mer content to generate small subsets known as "sketches", such that the Jaccard similarity (the size of the intersection divided by the size of the union) of two sketches remains approximately equal to the true Jaccard similarity of the full sequence data sets
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>,
                    <xref ref-type="bibr" rid="ref-3">3</xref>
                </sup>. The many-fold size reductions gained via MinHash open the door to extremely large-scale searches.</p>
            <p>While the initial k-mer MinHash implementation focused on enabling Jaccard similarity comparisons
                <sup>
                    <xref ref-type="bibr" rid="ref-3">3</xref>
                </sup>, it has since been modified and extended to enable k-mer abundance comparisons
                <sup>
                    <xref ref-type="bibr" rid="ref-4">4</xref>
                </sup>, decrease runtime and memory requirements
                <sup>
                    <xref ref-type="bibr" rid="ref-5">5</xref>
                </sup>, and work on streaming input data
                <sup>
                    <xref ref-type="bibr" rid="ref-6">6</xref>
                </sup>. Furthermore, as Jaccard similarity is impacted by the relative size of the sets being compared, containment searches
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>,
                    <xref ref-type="bibr" rid="ref-7">7</xref>,
                    <xref ref-type="bibr" rid="ref-8">8</xref>
                </sup> have been developed to enable detection of a small set within a larger set, such as a genome within a metagenome.</p>
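            <p>The effect of set size on these two measures can be illustrated with a minimal Python sketch. It operates on plain sets of integers standing in for k-mer hashes rather than on sourmash signatures, and the function names 
                <monospace>jaccard</monospace> and 
                <monospace>containment</monospace> are chosen here for illustration only.</p>
            <p>
                <preformat preformat-type="computer code" xml:space="preserve">
```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def containment(query, target):
    """Fraction of the query set that is also present in the target set."""
    return len(query.intersection(target)) / len(query) if query else 0.0


# Hypothetical data: a small "genome" fully contained in a large "metagenome".
genome = set(range(100))
metagenome = set(range(10_000))

print(jaccard(genome, metagenome))      # 0.01
print(containment(genome, metagenome))  # 1.0
```
                </preformat>
            </p>
            <p>Although every element of the smaller set is present in the larger one, the Jaccard similarity is only 0.01 because the union is dominated by the larger set, whereas the containment of the smaller set is 1.0.</p>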
            <p>Here we present version 2.0 of sourmash
                <sup>
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup>, a Python library for building and utilizing MinHash sketches of DNA, RNA, and protein data. sourmash incorporates and extends standard MinHash techniques for sequencing data, with a particular focus on enabling efficient containment queries using large databases. This is accomplished with two modifications: (1) building sketches via a modulo approach
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>, and (2) implementing a modified Sequence Bloom Tree
                <sup>
                    <xref ref-type="bibr" rid="ref-10">10</xref>
                </sup> to enable both similarity and containment searches. In most cases, these features enable sourmash database comparisons in sub-linear time.</p>
            <p>Standard genomic MinHash techniques, first implemented in Ondov BD 
                <italic toggle="yes">et al.</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref-3">3</xref>
                </sup>, retain the minimum 
                <italic toggle="yes">n</italic> k-mer hashes as a representative subset. sourmash extends these methods by incorporating a user-defined "scaled" factor to build sourmash sketches via a modulo approach, rather than the standard bottom-hash approach
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>. Sketches built with this approach retain the same fraction, rather than number, of k-mer hashes, compressing both large and small datasets at the same rate.</p>
            <p>This enables comparisons between datasets of disparate sizes but can sacrifice some of the memory and storage benefits of standard MinHash techniques, as the signature size scales with the number of unique k-mers rather than remaining fixed
                <sup>
                    <xref ref-type="bibr" rid="ref-8">8</xref>
                </sup>. In sourmash, use of the "scaled" factor enables user modification of the trade-off between search precision and sketch size, with the caveat that searches and comparisons can only be conducted using signatures generated with identical "scaled" values (downsampled at the same rate).</p>
            <p>To enable large-scale database searches using these signatures, sourmash implements a modified Sequence Bloom Tree (SBT), the SBT-MinHash (SBTMH), that allows both similarity (sourmash search) and containment (sourmash gather) searches for taxonomic exploration and classification. Notably, Jaccard similarity searches using this modified SBT require storage of the cardinality of the smallest MinHash below each node in order to properly bound similarity. sourmash also implements a second database format, "LCA", for in-memory search when sufficient RAM is available or database size is tractable. The LCA format can be leveraged to return additional information, such as taxonomic lineage.</p>
            <p>In addition to these modifications, sourmash implements k-mer abundance tracking
                <sup>
                    <xref ref-type="bibr" rid="ref-4">4</xref>
                </sup> within signatures to allow abundance comparisons across datasets and facilitate metagenome, metatranscriptome, and transcriptome analyses, and is compatible with streaming approaches. The sourmash library is implemented in C++, Rust
                <sup>
                    <xref ref-type="bibr" rid="ref-11">11</xref>
                </sup>, and Python, and can be accessed via both the command line and the Python API. The code is available under the BSD license at 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/dib-lab/sourmash">http://github.com/dib-lab/sourmash</ext-link>.</p>
        </sec>
        <sec>
            <title>Implementation</title>
            <p>
                <monospace>sourmash</monospace> provides a user-friendly, extensible platform for MinHash signature generation and manipulation for DNA, RNA, and protein data. <monospace>sourmash</monospace> is designed to facilitate containment queries for taxonomic exploration and identification while maintaining all functionality available via standard genomic MinHash techniques.</p>
            <sec>
                <title>sourmash Signatures</title>
                <p>
                    <monospace>sourmash</monospace> modifies standard genomic MinHash techniques in two ways. First, 
                    <monospace>sourmash</monospace> scales the number of retained hashes to better represent and compare datasets of varying size and complexity. Second, 
                    <monospace>sourmash</monospace> optionally tracks the abundance of each retained hash, to better represent data of metagenomic and transcriptomic origin and allow abundance comparisons.</p>
                <p>
                    <bold>
                        <italic toggle="yes">Scaling.</italic>
                    </bold> 
                    <monospace>sourmash</monospace> implements a method inspired by modulo sketches
                    <sup>
                        <xref ref-type="bibr" rid="ref-2">2</xref>
                    </sup> to dynamically scale hash subset retention size (
                    <italic toggle="yes">n</italic>). When using scaled signatures, users provide a scaling factor (
                    <italic toggle="yes">s</italic>) that divides the hash space into 
                    <italic toggle="yes">s</italic> equal bands, retaining hashes within the minimum band as the sketch. These scaled signatures can be converted to standard bottom-hash signatures, if the subset retention size 
                    <italic toggle="yes">n</italic> is equal to or smaller than the number of hashes in the scaled signature. 
                    <monospace>sourmash</monospace> provides a signature utility, <monospace>downsample</monospace>, to convert sketches when possible. Finally, to maintain compatibility with sketches generated by other programs such as Mash
                    <sup>
                        <xref ref-type="bibr" rid="ref-3">3</xref>
                    </sup>, 
                    <monospace>sourmash</monospace> generates standard bottom-hash MinHash sketches if users specify the hash subset size 
                    <italic toggle="yes">n</italic> rather than scaling factor.</p>
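                <p>The relationship between scaled and bottom-hash sketches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the sourmash implementation: BLAKE2b from the standard library stands in for the hash function, k-mers are not canonicalized with their reverse complements, the input is a randomly generated sequence, and all function names are hypothetical.</p>
                <p>
                    <preformat preformat-type="computer code" xml:space="preserve">
```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k, scaled):
    """Keep every hash in the lowest 1/scaled band of the hash space."""
    cutoff = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if cutoff > h}


def bottom_sketch(seq, k, n):
    """Keep the n smallest distinct hashes (standard bottom-hash MinHash)."""
    return set(sorted(set(map(hash_kmer, kmers(seq, k))))[:n])


def to_bottom(scaled_hashes, n):
    """Convert a scaled sketch to a bottom-hash sketch when it is large enough."""
    if n > len(scaled_hashes):
        raise ValueError("scaled sketch holds fewer than n hashes")
    return set(sorted(scaled_hashes)[:n])


# Hypothetical input: 5,000 random bases.
rng = random.Random(42)
seq = "".join(rng.choice("ACGT") for _ in range(5000))

scaled = scaled_sketch(seq, k=21, scaled=10)
print(len(scaled))  # roughly one tenth of the distinct k-mers
print(to_bottom(scaled, 50) == bottom_sketch(seq, k=21, n=50))
```
                    </preformat>
                </p>
                <p>Because the scaled sketch holds every hash below the cutoff, its 
                    <italic toggle="yes">n</italic> smallest hashes are also the 
                    <italic toggle="yes">n</italic> smallest hashes of the whole dataset, which is why the conversion is possible only in this direction and only when the scaled sketch contains at least 
                    <italic toggle="yes">n</italic> hashes.</p>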
                <p>
                    <bold>
                        <italic toggle="yes">Streaming compatibility.</italic>
                    </bold> Scaled signature generation is streaming compatible and provides some advantages over streaming calculation using standard MinHash. As data streams in, standard MinHash replaces hashes based on the minimum hash value to maintain a fixed number of hashes in the signature. In contrast, no hash is ever removed from a scaled signature as more data is received. As a result, for searches of a database using streamed-in data, all prior matches remain valid (although their significance may change as more data is received). This allows us to place algorithmic guarantees on containment searches using streaming data.</p>
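                <p>The difference in streaming behaviour can be made concrete with a small Python example. It works directly on a hypothetical stream of integer hash values, and the generator names are illustrative rather than part of sourmash.</p>
                <p>
                    <preformat preformat-type="computer code" xml:space="preserve">
```python
def stream_scaled(hashes, cutoff):
    """Yield a snapshot of a scaled sketch after each incoming hash."""
    sketch = set()
    for h in hashes:
        if cutoff > h:  # keep every hash that falls below the cutoff
            sketch.add(h)
        yield set(sketch)


def stream_bottom(hashes, n):
    """Yield a snapshot of a bottom-n sketch after each incoming hash."""
    sketch = set()
    for h in hashes:
        sketch.add(h)
        if len(sketch) > n:  # evict the largest hash to stay at size n
            sketch.remove(max(sketch))
        yield set(sketch)


# Hypothetical stream of hash values.
stream = [50, 10, 90, 5, 30]

scaled_steps = list(stream_scaled(stream, cutoff=40))
bottom_steps = list(stream_bottom(stream, n=2))

print(scaled_steps[-1] == {5, 10, 30})  # True
print(bottom_steps[1] == {10, 50})      # True
print(bottom_steps[-1] == {5, 10})      # True
```
                    </preformat>
                </p>
                <p>Each snapshot of the scaled sketch is a superset of the previous one, so a database match found early in the stream cannot be invalidated later. The bottom-hash sketch, by contrast, evicts the hash 50 once smaller hashes arrive.</p>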
                <p>
                    <bold>
                        <italic toggle="yes">Abundance tracking.</italic>
                    </bold> 
                    <monospace>sourmash</monospace> extends MinHash functionality by implementing abundance tracking of k-mers. k-mer counts are incremented after hashing as each k-mer is added to the hash table. 
                    <monospace>sourmash</monospace> tracks abundance for k-mers in the minimum band and stores this information in the signature. These values accompany the hashes in downstream comparison processes, making signatures better representations of repetitive sequences of metagenomic and transcriptomic origin.</p>
                <p>
                    <bold>
                        <italic toggle="yes">Signatures.</italic>
                    </bold> MinHash sketches associated with a single sequence file are stored together in a &#x201c;signature&#x201d; file, which forms the basis of all 
                    <monospace>sourmash</monospace> comparisons. Signatures may include sketches generated with different 
                    <italic toggle="yes">k</italic> sizes or molecule type (nucleotide or protein) and are stored in JSON format to maintain human readability and ensure proper interoperability.</p>
                <p>Signatures can only be compared against signatures and databases made from the same parameters (
                    <italic toggle="yes">k</italic> size(s), scaled value, nucleotide or protein level). If signatures differ in their scaled value or size (
                    <italic toggle="yes">n</italic>), the larger signatures can be downsampled to become comparable with smaller signatures using the 
                    <italic toggle="yes">signature utilities</italic>, below. 
                    <monospace>sourmash</monospace> also provides 6-frame nucleotide translation to generate protein signatures from nucleotide input if desired.</p>
                <p>
                    <bold>
                        <italic toggle="yes">Signature utilities.</italic>
                    </bold> 
                    <monospace>sourmash</monospace> provides a number of utilities to facilitate set operations between signatures (
                    <monospace>merge</monospace>, 
                    <monospace>intersect</monospace>, 
                    <monospace>extract</monospace>, 
                    <monospace>downsample</monospace>, 
                    <monospace>flatten</monospace>, 
                    <monospace>subtract</monospace>, 
                    <monospace>overlap</monospace>), and handling (
                    <monospace>describe</monospace>, 
                    <monospace>rename</monospace>, 
                    <monospace>import</monospace>, 
                    <monospace>export</monospace>) of 
                    <monospace>sourmash</monospace> signatures. These can be accessed via the 
                    <monospace>sourmash signature</monospace> subcommand.</p>
            </sec>
            <sec>
                <title>SBT-MinHash</title>
                <p>
                    <monospace>sourmash</monospace> implements a modified Sequence Bloom Tree (SBT
                    <sup>
                        <xref ref-type="bibr" rid="ref-10">10</xref>
                    </sup>), the SBT-MinHash (SBTMH), which can efficiently capture large volumes of MinHashes (e.g., all microbes in GenBank) and support multiple search regimes that improve on run time of linear searches.</p>
                <p>
                    <bold>
                        <italic toggle="yes">Implementation.</italic>
                    </bold> The SBTMH is an n-ary tree (binary by default), where leaf nodes are MinHash signatures and internal nodes are Bloom Filters. Each Bloom Filter contains all the values from its children, so the root node contains all the values from all signatures. SBTMH is designed to be extensible such that signatures can be subsequently added without the need for full regeneration. Adding a new signature to SBTMH causes parent nodes up to the root to be updated, but other nodes are not affected.</p>
                <p>SBTMH trees can be combined if desired: in the simplest case, adding a new root and updating it with the content of the previous roots is sufficient, and this preserves all node information without changes. As an example, separate indices can be created for each RefSeq subdivision (bacteria, archaea, fungi, etc.) and be combined depending on the application (such as an analysis for bacteria + archaea, but not fungi). In practice, this is most useful for updating the SBTMH, as both 
                    <monospace>search</monospace> and 
                    <monospace>gather</monospace> support search over multiple databases without the need for rebuilding a single large database.</p>
                <p>
                    <bold>
                        <italic toggle="yes">Searching SBTMH.</italic>
                    </bold> Similarity searches start at the root of the SBTMH and check which query elements are present in each internal node. If the similarity estimated at a node does not reach the threshold, the subtree under that node does not need to be searched. If a leaf is reached, it is returned as a match to the query signature. To enable similarity (in addition to containment) searches, each internal node stores the cardinality of the smallest signature below it, which is needed to properly bound similarity. The full SBTMH does not need to be loaded into RAM during searches, making this method well suited to rapid searching with minimal memory requirements. However, if sufficient RAM is available, searches of databases (or many signatures) may be completed in memory via an alternate database format (discussed below).</p>
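                <p>A pruned tree search of this kind can be sketched as follows. This is a simplified illustration rather than the sourmash code: Python sets stand in for Bloom Filters (so there are no false positives), the 
                    <monospace>Leaf</monospace> and 
                    <monospace>Node</monospace> classes are hypothetical, the query is assumed to be non-empty, and the upper bound used for pruning is one valid choice that makes use of the stored minimum cardinality. Unlike the description above, the example also recomputes the exact similarity at each leaf.</p>
                <p>
                    <preformat preformat-type="computer code" xml:space="preserve">
```python
from collections import deque


class Leaf:
    """A leaf holds the hashes of one signature."""

    def __init__(self, name, hashes):
        self.name = name
        self.hashes = set(hashes)
        self.min_n = len(self.hashes)


class Node:
    """An internal node; a plain set stands in for the Bloom Filter."""

    def __init__(self, children):
        self.children = children
        # Union of all values below this node.
        self.hashes = set().union(*(child.hashes for child in children))
        # Cardinality of the smallest signature below this node.
        self.min_n = min(child.min_n for child in children)


def jaccard(a, b):
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def search(root, query, threshold):
    """Breadth-first search returning (name, similarity) for matching leaves."""
    matches = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if isinstance(node, Leaf):
            score = jaccard(query, node.hashes)
            if score >= threshold:
                matches.append((node.name, score))
            continue
        shared = len(query.intersection(node.hashes))
        # No leaf below shares more than `shared` hashes with the query, and
        # every union has at least max(len(query), min_n) elements.
        bound = shared / max(len(query), node.min_n)
        if bound >= threshold:
            queue.extend(node.children)  # otherwise the subtree is skipped
    return matches


# Hypothetical four-leaf binary tree.
a = Leaf("a", range(0, 10))
b = Leaf("b", range(5, 15))
c = Leaf("c", range(100, 110))
d = Leaf("d", range(100, 120))
tree = Node([Node([a, b]), Node([c, d])])

hits = search(tree, set(range(0, 10)), threshold=0.3)
print([name for name, _ in hits])  # ['a', 'b']
```
                    </preformat>
                </p>
                <p>In this example the internal node above leaves c and d shares no hashes with the query, so its bound is zero and neither leaf is visited.</p>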
                <p>
                    <bold>
                        <italic toggle="yes">SBTMH utilities.</italic>
                    </bold> 
                    <monospace>sourmash</monospace> provides several utilities for construction, use, and handling of SBTMH databases. These include 
                    <monospace>sbt index</monospace> to index a collection of signatures as an SBTMH for fast searching, 
                    <monospace>sbt append</monospace> to add signatures, and 
                    <monospace>sbt combine</monospace> to join two SBTMH databases.</p>
            </sec>
            <sec>
                <title>LCA database</title>
                <p>
                    <monospace>sourmash</monospace> implements an alternate database format, LCA, to support in-memory queries. This implementation utilizes two named lists to store MinHashed databases: the first containing MinHashes, and the second containing taxonomic information, with both lists named by sample name. This structure facilitates direct look-up of MinHashes, and thus can be leveraged to return additional information, such as taxonomic lineage. The LCA database structure can be prepared using the 
                    <monospace>sourmash lca index</monospace> command.</p>
            </sec>
        </sec>
        <sec>
            <title>Assessing sequence similarity</title>
            <sec>
                <title>Pairwise comparisons</title>
                <p>For sequence comparison, 
                    <monospace>sourmash compare</monospace> reimplements Jaccard sequence similarity comparison to enable comparison between scaled MinHashes. When abundance tracking of k-mers is enabled, 
                    <monospace>compare</monospace> instead calculates the cosine similarity, although we recommend using more accurate approaches for detailed comparisons
                    <sup>
                        <xref ref-type="bibr" rid="ref-6">6</xref>
                    </sup>.</p>
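                <p>As a minimal illustration of an abundance-aware comparison, cosine similarity can be computed from two mappings of hash value to k-mer count. The dictionaries below are hypothetical data, and the function is a textbook formulation rather than the sourmash code.</p>
                <p>
                    <preformat preformat-type="computer code" xml:space="preserve">
```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {hash: abundance} mappings."""
    # Only hashes present in both sketches contribute to the dot product.
    dot = sum(count * b[h] for h, count in a.items() if h in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical abundance-tracked sketches.
sample_1 = {11: 3, 22: 4}
sample_2 = {11: 3}

print(cosine_similarity(sample_1, sample_2))  # 0.6
```
                    </preformat>
                </p>
                <p>Here the dot product is 9 and the vector norms are 5 and 3, giving 9 / 15 = 0.6. A Jaccard comparison of the same two sketches would ignore the counts and report 0.5.</p>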
            </sec>
            <sec>
                <title>Database searches</title>
                <p>In addition to conducting pairwise comparisons, two types of database searches are implemented: breadth-first similarity searches (
                    <monospace>sourmash search</monospace>) and best-first containment searches (
                    <monospace>sourmash gather</monospace>), which support different biological queries. These searches can be conducted using either database format.</p>
                <p>
                    <bold>
                        <italic toggle="yes">Similarity queries.</italic>
                    </bold> Breadth-first 
                    <monospace>sourmash search</monospace> can be used to obtain all signatures in the SBTMH whose similarity to the query signature is above a specified threshold. This style of search is streaming-compatible, as the query MinHash can be augmented as the search is occurring.</p>
                <p>
                    <bold>
                        <italic toggle="yes">Containment queries.</italic>
                    </bold> Best-first 
                    <monospace>sourmash gather</monospace> implements a greedy algorithm where the SBTMH is descended on a linear path through a set of internal nodes until the highest containment leaf is reached. The hashes in this leaf are then subtracted from the query MinHash and the process is repeated until the threshold minimum is reached. 
                    <monospace>sourmash</monospace> post-processes similarity statistics after the search such that it reports percent identity and unique identity for each match.</p>
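                <p>The greedy strategy can be illustrated with a short Python example. For simplicity it scans a flat dictionary of reference hash sets instead of descending an SBTMH, and it reports only the fraction of the original query newly explained by each match; the function name, the 
                    <monospace>min_overlap</monospace> parameter, and the data are assumptions made for this sketch.</p>
                <p>
                    <preformat preformat-type="computer code" xml:space="preserve">
```python
def gather(query, database, min_overlap=1):
    """Greedily assign query hashes to the best-matching references."""
    remaining = set(query)
    results = []
    while remaining:
        best_name, best_overlap = None, set()
        for name, hashes in database.items():
            overlap = remaining.intersection(hashes)
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        # Stop when nothing matches or the best match is below the minimum.
        if best_name is None or min_overlap > len(best_overlap):
            break
        results.append((best_name, len(best_overlap) / len(query)))
        # Remove the explained hashes before searching again.
        remaining -= best_overlap
    return results


# Hypothetical query and reference database.
query = set(range(0, 100))
database = {
    "g1": set(range(0, 60)),
    "g2": set(range(40, 90)),
    "g3": set(range(200, 300)),
}

print(gather(query, database))  # [('g1', 0.6), ('g2', 0.3)]
```
                    </preformat>
                </p>
                <p>Although g2 overlaps 50 of the query hashes, 20 of them were already assigned to g1, so only the remaining 30 are attributed to g2 in the second round.</p>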
                <p>
                    <bold>
                        <italic toggle="yes">Taxonomy-resolved searches.</italic>
                    </bold> 
                    <monospace>sourmash</monospace> can conduct taxonomy-resolved searches using the &#x201c;least common ancestor&#x201d; approach (as in Kraken
                    <sup>
                        <xref ref-type="bibr" rid="ref-12">12</xref>
                    </sup>), to identify k-mers in a query. From this it can either find a consensus taxonomy between all the k-mers (
                    <monospace>sourmash classify</monospace>) or it can summarize the mixture of k-mers present in one or more signatures (
                    <monospace>sourmash summarize</monospace>).</p>
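                <p>The core of this approach, finding the deepest taxonomic rank shared by a set of lineages, can be sketched in Python as below. Lineages are represented as tuples ordered from the highest rank downward; this representation, the function name, and the example lineages are assumptions made for illustration and do not reflect how sourmash stores taxonomy.</p>
                <p>
                    <preformat preformat-type="computer code" xml:space="preserve">
```python
def common_lineage(lineages):
    """Return the longest prefix shared by all of the given lineages."""
    shared = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


# Hypothetical lineages of two genomes that contain the same k-mer.
lineages = [
    ("Bacteria", "Proteobacteria", "Escherichia"),
    ("Bacteria", "Proteobacteria", "Salmonella"),
]

print(common_lineage(lineages))  # ('Bacteria', 'Proteobacteria')
```
                    </preformat>
                </p>
                <p>A k-mer found in both genomes would therefore be assigned to the phylum rather than to either genus.</p>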
            </sec>
            <sec>
                <title>Operation</title>
                <p>
                    <monospace>sourmash</monospace> is a tool for building and utilizing MinHash signatures of DNA, RNA, and protein sequences. A straightforward workflow consists of generating a signature using 
                    <monospace>sourmash compute</monospace>, and comparing it against other signatures or databases of signatures via 
                    <monospace>sourmash compare, search, gather, lca search</monospace>, or 
                    <monospace>lca gather</monospace>. 
                    <monospace>sourmash</monospace> has no particular memory requirements, but does need to hold the largest single sequence in memory while generating a signature. For example, computing a signature from a 100 Mb human microbiome sample requires 30 MB of RAM, and searching it against a sourmash GenBank signature database takes 1&#x2013;6 minutes and requires 2&#x2013;6 GB of RAM, depending on the search type. "LCA" databases are smaller on disk but require more memory to be searched.</p>
                <p>Below we provide several use cases to demonstrate the utility of 
                    <monospace>sourmash</monospace> for sequence comparisons, starting with signature generation and proceeding into signature comparisons, tetranucleotide frequency clustering analysis, and taxonomic classification. We primarily demonstrate nucleotide-level applications in this paper; protein-level analyses will be explored further in future work. Additional information and tutorials are available at 
                    <ext-link ext-link-type="uri" xlink:href="https://sourmash.readthedocs.io">https://sourmash.readthedocs.io</ext-link>.</p>
            </sec>
        </sec>
        <sec>
            <title>Use cases</title>
            <sec>
                <title>Installation</title>
                <p>
                    <monospace>sourmash</monospace> is available for both Linux and OSX, and runs under either 
                    <monospace>Python 2.7.x</monospace> or 
                    <monospace>Python 3.5+</monospace>. To install sourmash, we recommend using 
                    <ext-link ext-link-type="uri" xlink:href="https://docs.conda.io/projects/conda/en/latest/">conda</ext-link>. For these examples, we used sourmash v2.0.1, installed with conda v4.6.14.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">conda install -c conda-forge \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -c bioconda sourmash</styled-content>
                    </preformat>
                </p>
                <p>Alternative installation instructions are available at 
                    <ext-link ext-link-type="uri" xlink:href="http://sourmash.readthedocs.io">sourmash.readthedocs.io</ext-link>.</p>
            </sec>
            <sec>
                <title>Creating a signature</title>
                <p>All 
                    <monospace>sourmash</monospace> comparisons work on signatures, compressed representations of biological sequencing data. To create a signature from sequences with abundance tracking:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download the genome</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L https://osf.io/bjh2y/download \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o GCF_000005845.2_ASM584v2_genomic.fna.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># calculate the signature</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compute -k 21,31,51 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --scaled 2000 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --track-abundance \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o GCF_000005845.2_ASM584v2_genomic.sig \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    GCF_000005845.2_ASM584v2_genomic.fna.gz</styled-content>
                    </preformat>
                </p>
                <p>Because a signature can contain multiple MinHashes, multiple k-sizes can be specified for a single sequence file. Only one scaled value can be used per signature.</p>
                <p>By default, the name of the file becomes the name of the signature. To name the signature from the first line of the sequencing file, use 
                    <monospace>--name-from-first</monospace>. Although the 
                    <monospace>--track-abundance</monospace> flag is optional, we recommend calculating all signatures with abundance tracking, because downstream comparison methods provide the 
                    <monospace>--ignore-abundance</monospace> flag to disregard abundances when they are not wanted.</p>
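                <p>To make the roles of the k-size and scaled parameters concrete, the following Python sketch builds a toy abundance-tracked signature. It is a simplified illustration under stated assumptions, not the 
                    <monospace>sourmash</monospace> implementation: we assume that a scaled value of s retains the k-mers whose 64-bit hash falls in the lowest 1/s of the hash space, we use BLAKE2b from the Python standard library as a stand-in hash function, and all function names are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
# Illustrative sketch; not the sourmash implementation.
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash space
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Treat a k-mer and its reverse complement as the same k-mer."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def hash_kmer(kmer):
    """Stand-in 64-bit hash (an assumption; chosen only for portability)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(sequence, ksize, scaled):
    """Map each retained hash to the number of times its k-mer occurs."""
    threshold = MAX_HASH // scaled
    sketch = {}
    for start in range(len(sequence) - ksize + 1):
        value = hash_kmer(canonical(sequence[start:start + ksize]))
        if threshold > value:  # keep only the lowest 1/scaled of hashes
            sketch[value] = sketch.get(value, 0) + 1
    return sketch


def compute_signature(sequence, ksizes, scaled):
    """One sketch per k-size, all sharing a single scaled value."""
    return {ksize: scaled_sketch(sequence, ksize, scaled) for ksize in ksizes}


# Hypothetical toy sequence; real inputs are whole genomes or read sets.
toy_sequence = "ACGTTGCAAGGCTTAACCGGTTAGC" * 4
toy_signature = compute_signature(toy_sequence, [4, 5], scaled=1)
```
                    </preformat>
                </p>
                <p>With a scaled value of 1 every k-mer is retained, which is the setting used later for tetranucleotide frequencies. Larger scaled values keep a smaller subset of the same hashes, so sketches built with different k-sizes can share one scaled value inside a signature.</p>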
                <p>To create a signature from protein sequences:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download amino acid sequences</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L https://osf.io/y9kra/download \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o GCF_000146045.2_R64_protein.faa.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># calculate the signature</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compute -k 11,21,31 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --scaled 2000 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --track-abundance \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o GCF_000146045.2_R64_protein.sig \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    GCF_000146045.2_R64_protein.faa.gz</styled-content>
                    </preformat>
                </p>
                <p>Signatures can also be made directly from reads. Depending on the downstream use cases, we recommend different preparation methods. When the user aims to 
                    <monospace>compare</monospace> the signature to other signatures, we recommend k-mer trimming the reads before computing the signature. Because 
                    <monospace>compare</monospace> does an all-by-all comparison of signatures, errors in the reads will falsely deflate the similarity metric. We recommend trimming RNA-seq or metagenome reads with 
                    <monospace>trim-low-abund.py</monospace> in the 
                    <ext-link ext-link-type="uri" xlink:href="https://pypi.org/project/khmer/">khmer package</ext-link>
                    <sup>
                        <xref ref-type="bibr" rid="ref-13">13</xref>
                    </sup>, a dependency of 
                    <monospace>sourmash</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download the reads</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L -o ERR458584.fq.gz \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    https://osf.io/pfxth/download</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># trim the reads</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">trim-low-abund.py ERR458584.fq.gz \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -V -Z 10 -C 3 --gzip -M 3e9 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o ERR458584.khmer.fq.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># calculate the signature from trimmed reads</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compute -k 21,31,51 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --scaled 2000 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --track-abundance \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o ERR458584.khmer.sig \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    ERR458584.khmer.fq.gz</styled-content>
                    </preformat>
                </p>
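                <p>The intuition behind k-mer trimming can be shown with a short Python sketch. It is deliberately simplified and is not the khmer algorithm, which streams through the data and handles many details omitted here. We assume only the core idea: in reads with sufficient coverage, a k-mer seen very few times is probably a sequencing error, so the read is truncated there. The parameter names 
                    <monospace>cutoff</monospace> and 
                    <monospace>min_coverage</monospace> are hypothetical and only loosely analogous to the abundance and coverage options used above.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
# Illustrative sketch; not the khmer implementation.
from collections import Counter
from statistics import median


def kmers(read, ksize):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + ksize] for i in range(len(read) - ksize + 1)]


def trim_low_abundance(reads, ksize=5, cutoff=2, min_coverage=2):
    """Truncate well-covered reads at their first rare k-mer."""
    counts = Counter(k for read in reads for k in kmers(read, ksize))
    trimmed = []
    for read in reads:
        abundances = [counts[k] for k in kmers(read, ksize)]
        # Leave short or low-coverage reads untouched: their rare k-mers
        # may be real low-abundance sequence rather than errors.
        if not abundances or min_coverage > median(abundances):
            trimmed.append(read)
            continue
        keep = len(read)
        for index, abundance in enumerate(abundances):
            if cutoff > abundance:
                # Drop the last base of the first rare k-mer and
                # everything after it.
                keep = index + ksize - 1
                break
        trimmed.append(read[:keep])
    return trimmed


# Hypothetical toy reads: five identical reads plus one with a
# substitution in its final base.
toy_reads = ["ACGTACGTAC"] * 5 + ["ACGTACGTAG"]
toy_trimmed = trim_low_abundance(toy_reads)
```
                    </preformat>
                </p>
                <p>In the toy data, only the last read contains a k-mer that occurs once, so only that read loses its final base. Removing such k-mers before sketching keeps them from lowering the similarity between samples that differ only by sequencing errors.</p>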
                <p>When using methods that compare a signature against a database such as 
                    <monospace>gather</monospace> or 
                    <monospace>search</monospace>, k-mer trimming need not be used. These methods use exact matching of hashes in the signature to those in the databases. k-mer trimming could increase false negatives, but results on k-mer trimmed data will more accurately represent the proportions of content in the data.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># calculate the signature from raw reads</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compute -k 21,31,51 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --scaled 2000 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --track-abundance \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o ERR458584.sig \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    ERR458584.fq.gz</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Comparing many signatures</title>
                <p>Signatures calculated with abundance tracking enable rapid comparison of sequences where k-mer frequency is variable, and can be leveraged for quality control and summarization methods. For example, principal component analysis (PCA) and multidimensional scaling (MDS) are standard quality control and summarization methods for count data generated during RNA-seq analysis
                    <sup>
                        <xref ref-type="bibr" rid="ref-14">14</xref>
                    </sup>. 
                    <monospace>sourmash</monospace> can be used to build this MDS plot in a reference-free (or assembly-free) manner, using k-mer abundances of the input reads. We also find this useful for comparing other types of RNA sequencing samples (mRNA, ribo-depleted, 3&#x2019; tag-seq, metatranscriptomes, and transcriptomes).</p>
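                <p>The difference between abundance-aware and presence/absence comparisons can be sketched in a few lines of Python. This is an illustration rather than a specification of 
                    <monospace>sourmash compare</monospace>: we assume an angular similarity derived from the cosine of the two abundance vectors when abundances are tracked, and a plain Jaccard index when they are ignored. The function names and toy sketches are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
# Illustrative sketch; not the sourmash implementation.
import math


def jaccard(a, b):
    """Presence/absence similarity: shared hashes over all hashes."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a.union(keys_b)
    if not union:
        return 0.0
    return len(keys_a.intersection(keys_b)) / len(union)


def angular_similarity(a, b):
    """Abundance-weighted similarity in [0, 1] from the cosine of counts."""
    dot = sum(count * b.get(h, 0) for h, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    cosine = min(1.0, dot / (norm_a * norm_b))  # guard against rounding
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def compare(sketches, ignore_abundance=False):
    """All-by-all similarity matrix for a list of hash-to-count sketches."""
    measure = jaccard if ignore_abundance else angular_similarity
    return [[measure(a, b) for b in sketches] for a in sketches]


# Hypothetical sketches: same two hashes, very different abundances.
sample_a = {11: 10, 22: 1}
sample_b = {11: 1, 22: 10}
with_abundance = compare([sample_a, sample_b])
without_abundance = compare([sample_a, sample_b], ignore_abundance=True)
```
                    </preformat>
                </p>
                <p>The two toy samples share every hash, so their Jaccard index is 1.0, yet their angular similarity is only roughly 0.13 because the abundances are reversed. This is the kind of signal that makes abundance tracking informative for expression data.</p>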
            </sec>
            <sec>
                <title>MDS</title>
                <p>Here, we use a set of four 
                    <italic toggle="yes">Saccharomyces cerevisiae</italic> RNA-seq samples: replicate wild-type samples and replicate mutant (
                    <italic toggle="yes">SNF2</italic>) samples
                    <sup>
                        <xref ref-type="bibr" rid="ref-15">15</xref>
                    </sup>. To use 
                    <monospace>sourmash</monospace> to build an MDS plot, we first trim the data to remove low abundance k-mers via khmer
                    <sup>
                        <xref ref-type="bibr" rid="ref-13">13</xref>
                    </sup>. We demonstrate the streaming capability of 
                    <monospace>sourmash</monospace> by downloading, k-mer trimming, and calculating a signature with one command. This allows the user to generate signatures without needing to store large files locally.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">curl -L https://osf.io/pfxth/download \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    | trim-low-abund.py -V -Z 10 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -C 3 -M 3e9 -o - - \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    | sourmash compute -k 31 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --scaled 2000 --track-abundance \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o ERR458584.khmer.sig -</styled-content>
                    </preformat>
                </p>
                <p>The signature is named after the input filename, which in this case is <monospace>-</monospace> (standard input). We can change the name to reflect its contents using the 
                    <monospace>signature rename</monospace> function.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">sourmash signature rename \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -k 31 -o ERR458584.khmer-named.sig \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    ERR458584.khmer.sig \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    ERR458584.khmer</styled-content>
                    </preformat>
                </p>
                <p>We can also check that the name has been changed.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">sourmash signature describe \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    ERR458584.khmer-named.sig</styled-content>
                    </preformat>
                </p>
                <p>Using signatures from four samples, we can compare the files with the 
                    <monospace>compare</monospace> function. Here we download signatures calculated and renamed using the above commands. We output the comparison matrix as a CSV file for downstream use in other platforms.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download signatures</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L -o yeast-sigs.tar.gz \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    https://osf.io/pk2w5/download</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># uncompress the signatures</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">tar xvf yeast-sigs.tar.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># compare the signatures</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compare -k 31 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --csv yeast-comp.csv \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    *named.sig</styled-content>
                    </preformat>
                </p>
                <p>We then import the 
                    <monospace>compare</monospace> similarity matrix into R (v3.4.1) to produce an MDS plot with wild-type samples (ERR459011, ERR459102) in yellow and mutant samples (ERR458584, ERR458829) in blue.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># Read data into R</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">comp_mat &lt;- read.csv("yeast-comp.csv")</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># Set row labels</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">rownames(comp_mat) &lt;- colnames(comp_mat)</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># Transform for plotting</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">comp_mat &lt;- as.matrix(comp_mat)</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># Make an MDS plot</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">fit &lt;- dist(comp_mat)</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">fit &lt;- cmdscale(fit)</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">plot(fit[ , 2] ~ fit[ , 1],</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     xlab = "Dim 1",</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     ylab = "Dim 2",</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     xlim = c(-.6, .9),</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     main = "sourmash Compare MDS")</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># add labels to the plot</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">text(fit[ , 2] ~ fit[ , 1],</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     labels = row.names(fit),</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     pos = 4, font = 1,</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     data = fit,</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">     col = c("blue", "blue",</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">             "orange", "orange"))</styled-content>
                    </preformat>
                </p>
                <p>For comparison, we also produced an MDS plot using a more traditional approach, utilizing 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/COMBINE-lab/salmon">Salmon</ext-link> (v0.11.3)
                    <sup>
                        <xref ref-type="bibr" rid="ref-16">16</xref>
                    </sup> to quantify abundance relative to an 
                    <italic toggle="yes">S. cerevisiae</italic> reference, and 
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/edgeR.html">edgeR</ext-link> (v3.22.5)
                    <sup>
                        <xref ref-type="bibr" rid="ref-17">17</xref>
                    </sup> to build an MDS plot (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>; code available online at 
                    <ext-link ext-link-type="uri" xlink:href="https://osf.io/97rt4/">https://osf.io/97rt4/</ext-link>).</p>
            </sec>
            <sec>
                <title>Tetranucleotide frequency clustering</title>
                <p>We can also use 
                    <monospace>sourmash</monospace> with abundance tracking for tetranucleotide frequency clustering. Tetranucleotide usage is species-specific, with strongest conservation in DNA coding regions
                    <sup>
                        <xref ref-type="bibr" rid="ref-18">18</xref>
                    </sup>. This is often used in metagenomics as one method to &#x201c;bin&#x201d; assembled contiguous sequences together that are from the same species
                    <sup>
                        <xref ref-type="bibr" rid="ref-19">19</xref>
                    </sup>. Recently, tetranucleotide frequency clustering using 
                    <monospace>sourmash</monospace> was used to detect microbial contamination in the domesticated olive genome
                    <sup>
                        <xref ref-type="bibr" rid="ref-20">20</xref>
                    </sup>. Here we reimplement this approach using 100 of the 11,038 scaffolds in the draft genome. We calculate the signature using a k-mer size of 4, use all 4-mers, and track abundance. Then we use 
                    <monospace>sourmash compare</monospace> to calculate the similarity between each scaffold. (The 
                    <monospace>--singleton</monospace> flag calculates a signature for each sequence in the FASTA file.)</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>The MDS plots produced from the reference-free 
                            <monospace>sourmash compare</monospace> similarity matrix and the transcript quantification analysis (salmon and edgeR) are similar.</title>
                        <p>Wild-type 
                            <italic toggle="yes">S. cerevisiae</italic> samples (ERR459011, ERR459102) are in yellow and mutant samples (ERR458584, ERR458829) in blue.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/21579/2d3f5c16-6f0f-4b0f-b03a-67cb8e230234_figure1.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download the subsampled genome</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L https://osf.io/xusfa/download \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o Oe6.scaffolds.sub.fa.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># calculate a signature for each scaffold</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compute -k 4 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --scaled 1 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --track-abundance \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --singleton \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o Oe6.scaffolds.sub.sig \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    Oe6.scaffolds.sub.fa.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># pairwise compare all scaffolds</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compare -k 4 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o Oe6.scaffolds.sub.comp \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    Oe6.scaffolds.sub.sig</styled-content>
                    </preformat>
                </p>
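                <p>For readers unfamiliar with tetranucleotide profiles, the sketch below shows what is being counted for each scaffold. It is an illustration in plain Python and not the 
                    <monospace>sourmash</monospace> implementation: it assumes that a 4-mer and its reverse complement are counted together, which leaves 136 distinct canonical tetramers, and that one profile is built per FASTA record. The function names and the toy FASTA text are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
# Illustrative sketch; not the sourmash implementation.
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Treat a k-mer and its reverse complement as the same k-mer."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


CANONICAL_TETRAMERS = sorted(
    {canonical("".join(bases)) for bases in product("ACGT", repeat=4)}
)


def read_fasta(text):
    """Yield (name, sequence) for each record in FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def tetranucleotide_profile(sequence):
    """Relative frequency of every canonical 4-mer in one sequence."""
    counts = dict.fromkeys(CANONICAL_TETRAMERS, 0)
    for start in range(len(sequence) - 3):
        kmer = sequence[start:start + 4]
        if set(kmer).issubset("ACGT"):  # skip windows with N or other codes
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in CANONICAL_TETRAMERS]


# Hypothetical two-record FASTA; one profile is built per record.
toy_fasta = ">s1 example\nACGTACGTAC\nGGTT\n>s2\nNNACGTN\n"
toy_profiles = {name: tetranucleotide_profile(seq)
                for name, seq in read_fasta(toy_fasta)}
```
                    </preformat>
                </p>
                <p>Each scaffold is thus reduced to a vector of 136 frequencies, and scaffolds from the same genome are expected to have similar vectors. Scaffolds whose vectors differ strongly from the rest are candidates for contamination.</p>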
                <p>Although 
                    <monospace>sourmash compare</monospace> supports export to a csv file, 
                    <monospace>sourmash</monospace> also has a built-in visualization function, 
                    <monospace>plot</monospace>. We will use this to visualize scaffold similarity.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">sourmash plot --labels \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    --vmin .4 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    Oe6.scaffolds.sub.comp</styled-content>
</preformat>
                </p>
                <p>In 
                    <xref ref-type="fig" rid="f2">Figure 2</xref>, we see that there is high similarity between 98 of the scaffolds, but that Oe6_s01156 and Oe6_s01003 are outliers with tetranucleotide frequency similarity around 40% to olive scaffolds. These two scaffolds are contaminants
                    <sup>
                        <xref ref-type="bibr" rid="ref-20">20</xref>
                    </sup>.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>Heatmap and dendrogram generated using 
                            <monospace>sourmash</monospace> signatures built from scaffolds in the domesticated olive genome.</title>
                        <p>Two scaffolds are outliers when using tetranucleotide frequency to calculate similarity (highlighted in green on the dendrogram).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/21579/2d3f5c16-6f0f-4b0f-b03a-67cb8e230234_figure2.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Comparisons to detect outliers</title>
                <p>MinHash comparisons are useful for outlier detection. Below we compare 50 genomes that contain the word &#x201c;
                    <italic toggle="yes">Escherichia coli</italic>&#x201d;. We have pre-calculated the signatures for each of these genomes. We then use the 
                    <monospace>plot</monospace> function to visualize our comparison.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download the signatures into a folder</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">mkdir escherichia-sigs</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">cd escherichia-sigs</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">curl -L https://osf.io/pc76j/download \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o escherichia-sigs.tar.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># decompress the signatures</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">tar xzf escherichia-sigs.tar.gz</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">rm escherichia-sigs.tar.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">cd ..</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># pairwise compare the signatures</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compare -k 31 \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o ecoli.comp \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    escherichia-sigs/*sig</styled-content>


                        <styled-content style="font-size:15px;color:#000000;"># plot the comparison</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash plot --labels \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    ecoli.comp</styled-content>
                    </preformat>
                </p>
                <p>We see that the minimum similarity in the matrix is 0%. If all 50 signatures were from the same species, we would expect to observe higher minimum similarity at a k-mer size of 31. When we look closely, we see one signature has 0% similarity with all other signatures because it is a phage (
                    <xref ref-type="fig" rid="f3">Figure 3</xref>).</p>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>Figure 3. </label>
                    <caption>
                        <title>Heatmap and dendrogram generated using 
                            <monospace>sourmash</monospace> signatures built from 50 genomes that contained the word &#x201c;
                            <italic toggle="yes">Escherichia coli</italic>&#x201d;.</title>
                        <p>One signature is an outlier (highlighted in blue on the dendrogram).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/21579/2d3f5c16-6f0f-4b0f-b03a-67cb8e230234_figure3.gif"/>
                </fig>
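                <p>The same check can be made programmatically once a similarity matrix is in hand. The Python sketch below is a hypothetical helper, not part of 
                    <monospace>sourmash</monospace>: it assumes the matrix is available as a list of rows with matching labels, and it flags any sample whose best similarity to every other sample does not exceed a chosen threshold. The labels and values shown are made up.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
# Illustrative sketch; a hypothetical helper, not a sourmash function.
def find_outliers(labels, matrix, threshold=0.0):
    """Labels whose best similarity to any other sample is at most threshold."""
    outliers = []
    for row_index, label in enumerate(labels):
        others = [value for column_index, value in enumerate(matrix[row_index])
                  if column_index != row_index]  # ignore self-similarity
        if others and threshold >= max(others):
            outliers.append(label)
    return outliers


# Hypothetical three-sample matrix: two similar genomes and one unrelated.
toy_labels = ["genome_a", "genome_b", "phage_x"]
toy_matrix = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_outliers = find_outliers(toy_labels, toy_matrix)
```
                    </preformat>
                </p>
                <p>A threshold of zero reproduces the situation described above, where one signature shares no hashes with any other. A higher threshold could be used to flag more distant relatives, although a suitable value depends on the k-mer size and the expected diversity of the collection.</p>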
            </sec>
            <sec>
                <title>Classifying signatures</title>
                <p>The 
                    <monospace>search</monospace> and 
                    <monospace>gather</monospace> functions allow the user to classify the contents of a signature by comparing it to a database of signatures. Prepared databases for microbial genomes in RefSeq and GenBank are available at 
                    <ext-link ext-link-type="uri" xlink:href="https://sourmash.readthedocs.io/en/latest/databases.html">https://sourmash.readthedocs.io/en/latest/databases.html</ext-link>. However, it is also simple to create a custom database with signatures.</p>
                <p>Below we make a database that contains 50 
                    <italic toggle="yes">Escherichia coli</italic> genomes.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">mkdir escherichia-sigs</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">cd escherichia-sigs</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">curl -L https://osf.io/pc76j/download \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    -o escherichia-sigs.tar.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">tar xzf escherichia-sigs.tar.gz</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">rm escherichia-sigs.tar.gz</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">cd ..</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">sourmash index -k 31 ecolidb \</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">    escherichia-sigs/*.sig</styled-content>
                    </preformat>
                </p>
                <p>This database can be queried with 
                    <monospace>search</monospace> and 
                    <monospace>gather</monospace> using any signature calculated with a k-size of 31.</p>
                <p>For example, below we download a small set of k-mer trimmed 
                    <italic toggle="yes">Escherichia coli</italic> reads and generate a signature with k=31.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">curl -L -o ecoli-reads.khmer.fq.gz \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    https://osf.io/26xm9/download</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash compute -k 31 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    --scaled 2000 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    ecoli-reads.khmer.fq.gz \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    -o ecoli-reads.sig</styled-content>
                    </preformat>
                </p>
                <p>Then, we search the 50-genome database created above.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">sourmash search -k 31 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    ecoli-reads.sig ecolidb \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    --containment</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">49 matches;   showing first 3:</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">similarity     match</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">----------     --------</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">65.4%         NZ_JMGW01000001.1 Escherichia coli 1-176-05_S4_C2 e117605 ...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">64.9%         NZ_GG774190.1 Escherichia coli MS 196-1 Scfld2538, whole  ...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">63.7%         NZ_JMGU01000001.1 Escherichia coli 2-011-08_S3_C2 e201108 ...</styled-content>
                    </preformat>
                </p>
                <p>Breadth-first 
                    <monospace>sourmash search</monospace> finds all matches in the SBTMH that are present in the query signature (above a specified threshold).</p>
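                <p>As a rough illustration of the two scores involved, the following Python sketch compares a query against every entry of a toy database and keeps all matches above a threshold. It is not the <monospace>sourmash</monospace> implementation: signatures are modelled as plain sets of integer hashes, the database is scanned linearly rather than through an SBT, and the function names, toy hash values and threshold are assumptions made for this example.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
def jaccard(query, match):
    """Jaccard similarity: shared hashes divided by the union of both sets."""
    union = query | match
    return len(query.intersection(match)) / len(union) if union else 0.0


def containment(query, match):
    """Fraction of the query's hashes that also occur in the match."""
    return len(query.intersection(match)) / len(query) if query else 0.0


def search(query, database, threshold, use_containment=False):
    """Return every (score, name) pair at or above the threshold, best first.

    `database` is a dict mapping a name to a set of hashes (toy stand-in
    for an indexed collection of signatures).
    """
    score = containment if use_containment else jaccard
    hits = [(score(query, hashes), name) for name, hashes in database.items()]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)


# Hypothetical toy data: four query hashes and three database entries.
query = {1, 2, 3, 4}
database = {
    "genome_a": {1, 2, 3, 9},
    "genome_b": {4, 10, 11, 12, 13, 14},
    "genome_c": {20, 21},
}

print(search(query, database, threshold=0.2, use_containment=True))
print(search(query, database, threshold=0.2))
```
                    </preformat>
                </p>
                <p>In this toy case the containment search keeps two entries while the Jaccard search keeps one, because containment ignores hashes that occur only in the database entry. This is why containment tends to suit queries, such as read sets, whose size differs from that of the reference genomes.</p>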
                <p>Now try the same search using 
                    <monospace>sourmash gather</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">sourmash gather -k 31 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    ecoli-reads.sig ecolidb</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">loaded query: ecoli_ref-5m.khmer.fq.gz ... (k=31, DNA)</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">loaded 1 databases.</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">overlap     p_query  p_match</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">---------    -------  -------</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">4.1 Mbp       65.4%     83.5%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_JMGW01000001.1 Escherichia coli 1-...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.4 Mbp        2.7%    3.3%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_GG749254.1 Escherichia coli FVEC14...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">3.4 Mbp        1.4%     1.7%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_MOGK01000001.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">3.1 Mbp        0.6%     0.7%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_LVOV01000001.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">3.1 Mbp        0.3%     0.4%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_MIWP01000001.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">3.0 Mbp        0.3%     0.4%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_APWY01000001.1 Escherichia coli 17...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">3.5 Mbp        0.2%     0.2%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_JNLZ01000001.1 Escherichia coli 3-...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">4.0 Mbp        0.2%     0.2%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_GG774190.1 Escherichia coli MS 196...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.3 Mbp        0.1%     0.2%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_KB732756.1 Escherichia coli KTE66...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.0 Mbp        0.1%     0.1%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_BBUW01000001.1 Escherichia coli O1...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.3 Mbp        0.1%     0.1%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_MOZX01000101.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.3 Mbp        0.1%     0.1%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_JSMW01000001.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">4.0 Mbp        0.1%     0.1%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_JMGU01000001.1 Escherichia coli 2-...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.0 Mbp        0.0%     0.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_MOZC01000010.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.1 Mbp        0.0%     0.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_MKJG01000001.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">1.8 Mbp        0.0%     0.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_AEKA01000453.1 Escherichia sp. TW1...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">2.6 Mbp        0.0%     0.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_LEAD01000071.1 Escherichia coli st...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">3.5 Mbp        0.0%     0.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">NZ_MIWF01000001.1 Escherichia coli st...</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">found 18 matches total;</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">the recovered matches hit 71.5% of the query</styled-content>
                    </preformat>
                </p>
                <p>Best-first 
                    <monospace>sourmash gather</monospace> finds the best match first. Here, 65.4% of our query signature matches the first 
                    <italic toggle="yes">E. coli</italic> genome, covering 83.5% of that genome. The hashes that matched (65.4% of the query) are then subtracted, and the database is queried with the remaining hashes (34.6% of the original query). This process is repeated until the overlap with the best remaining match falls below the threshold. 
                    <monospace>sourmash</monospace> post-processes similarity statistics after the search such that it reports percent identity and unique identity for each match.</p>
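                <p>The subtract-and-repeat loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the <monospace>sourmash</monospace> code: signatures are plain sets of integer hashes, the database is a dictionary, and the names <monospace>gather</monospace>, <monospace>threshold_hashes</monospace> and the result fields are hypothetical choices for this example.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
def gather(query, database, threshold_hashes=1):
    """Greedy best-first decomposition of a query into database matches.

    query: set of integer hashes; database: dict mapping a name to a set
    of hashes. Both are toy stand-ins for signatures (assumption).
    """
    remaining = set(query)
    results = []
    while remaining and database:
        # Best match: the entry sharing the most hashes with what is left.
        best = max(database, key=lambda name: len(remaining.intersection(database[name])))
        overlap = remaining.intersection(database[best])
        # Stop once the best remaining overlap drops below the threshold.
        if not overlap or threshold_hashes > len(overlap):
            break
        results.append({
            "name": best,
            "shared_hashes": len(overlap),
            "fraction_of_query": len(overlap) / len(query),
            "fraction_of_match": len(overlap) / len(database[best]),
        })
        # Subtract the explained hashes before querying again.
        remaining -= overlap
    return results


# Hypothetical toy data: genome_b shares four hashes with the query, but
# most of them are claimed by better matches first.
query = set(range(10))
database = {
    "genome_a": {0, 1, 2, 3, 4, 5, 100},
    "genome_b": {4, 5, 6, 7},
    "genome_c": {7, 8, 9, 200},
}

for row in gather(query, database):
    print(row["name"], row["shared_hashes"], round(row["fraction_of_query"], 2))
```
                    </preformat>
                </p>
                <p>In this toy run, genome_b overlaps the original query by four hashes but is reported last with a single hash, because the other three were already assigned to genome_a and genome_c. The same effect explains why later rows of the output above account for much smaller fractions of the query than the first row.</p>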
                <p>
                    <monospace>sourmash gather</monospace> is also useful for rapid metagenome decomposition. Below we calculate a signature of a metagenome using raw reads, and then use 
                    <monospace>gather</monospace> to perform a best-first search against all microbial genomes in GenBank. This approach was recently used to classify unknown genomes in a "mock" metagenome
                    <sup>
                        <xref ref-type="bibr" rid="ref-21">21</xref>
                    </sup>. The mock community was made to contain 64 genomes, but additional genomic material was inadvertently added prior to sequencing. Below we will use 
                    <monospace>gather</monospace> to investigate the content in the mock metagenome that did not map to the 64 reference genomes. For details on how this signature was created, please see Awad 
                    <italic toggle="yes">et al.</italic>
                    <sup>
                        <xref ref-type="bibr" rid="ref-22">22</xref>
                    </sup>. Note that the GenBank database is approximately 7.8 GB compressed, and 50 GB decompressed. Searches of the current GenBank database run fastest if allowed to use 11 GB of RAM.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download the signature</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L -o unmapped-qc-to-ref.fq.sig \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    https://osf.io//download</styled-content>

                        <styled-content style="font-size:15px;color:#000000;"># download the gather k 31 GenBank database</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L -o genbank-d2-k31.tar.gz \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    https://s3-us-west-2.amazonaws.com/sourmash-databases/2018-03-29/genbank-d2-k31.tar.gz</styled-content>

                        <styled-content style="font-size:15px;color:#000000;"># decompress the database</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">tar xf genbank-d2-k31.tar.gz</styled-content>

                        <styled-content style="font-size:15px;color:#000000;"># run gather</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash gather -k 31 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    -o unmapped-qc-to-ref.csv \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    unmapped-qc-to-ref.fq.sig \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    genbank-d2-k31</styled-content>
                    </preformat>
                </p>
                <p>The output to the terminal begins:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">loaded query: unmapped-qc-to-ref.fq... (k=31, DNA)</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">downsampling query from scaled=10000 to 10000</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">loaded 1 databases.</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">overlap     p_query p_match</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">---------   ------- -------</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">1.6 Mbp        1.1%   19.9%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">BA000019.2 Nostoc sp. PCC 7120 DNA, c...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">1.2 Mbp        0.8%   50.8%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">LN831027.1 Fusobacterium nucleatum su...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">1.2 Mbp        0.8%   31.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">CP001957.1 Haloferax volcanii DS2 pla...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">1.1 Mbp        0.8%   16.7%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">BX119912.1 Rhodopirellula baltica SH ...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">1.0 Mbp        0.7%   29.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">CH959311.1 Sulfitobacter sp. EE-36 sc...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">0.8 Mbp        0.6%   37.3%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">AP008226.1 Thermus thermophilus HB8 g...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">0.8 Mbp        0.6%   56.7%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">CP001941.1 Aciduliprofundum boonei T4...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">0.8 Mbp        0.5%   23.3%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">FOVK01000036.1 Proteiniclasticum rumi...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">0.7 Mbp        0.5%   15.0%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">CP000031.2 Ruegeria pomeroyi DSS-3, c...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">0.7 Mbp        0.5%   11.3%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">CP000875.1 Herpetosiphon aurantiacus ...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">0.6 Mbp        0.4%   22.9%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">BA000023.2 Sulfolobus tokodaii str. 7...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">0.6 Mbp        0.4%   13.6%</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">AP009153.1 Gemmatimonas aurantiaca T-...</styled-content>
                    </preformat>
                </p>
                <p>We see that 20.1% of k-mers match 82 genomes in GenBank. The majority of matches are to genomes present in the mock community. However, some species like 
                    <italic toggle="yes">Proteiniclasticum ruminis</italic> were not members of the mock community. These results also highlight how 
                    <monospace>sourmash gather</monospace> behaves with inexact matches such as strain variants. For example, the full set of matches includes two 
                    <italic toggle="yes">P. ruminis</italic> strains. This likely indicates that our sample contains a 
                    <italic toggle="yes">P. ruminis</italic> strain that has not been sequenced before, and that it shares more 31-mers with one of the two reference strains than with the other. (See Brown CT 
                    <italic toggle="yes">et al.</italic>
                    <sup>
                        <xref ref-type="bibr" rid="ref-23">23</xref>
                    </sup> for further analysis of this strain.)</p>
                <p>
                    <monospace>sourmash gather</monospace> and 
                    <monospace>search</monospace> also support custom databases. Using a custom database with 
                    <monospace>sourmash gather</monospace>, we can identify the dominant contamination in the domesticated olive genome
                    <sup>
                        <xref ref-type="bibr" rid="ref-20">20</xref>
                    </sup>. Below, we will use a database containing all fungal genomes in NCBI. We will then use the streaming compatibility of 
                    <monospace>sourmash</monospace> to calculate the signature as the olive genome is downloaded. Lastly, we will search the olive genome against the fungal genomes using <monospace>gather</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;"># download the fungal database</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L -o fungi-genomic.tar.gz \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    https://osf.io/7yzc4/download</styled-content>

                        <styled-content style="font-size:15px;color:#000000;"># decompress the database</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">tar xf fungi-genomic.tar.gz</styled-content>

                        <styled-content style="font-size:15px;color:#000000;"># download the olive genome and calculate the signature</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">curl -L https://osf.io/k9358/download \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">  | zcat \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">  | sourmash compute -k 31 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">  --scaled 2000 --track-abundance \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">  -o Oe6.scaffolds.sig -</styled-content>

                        <styled-content style="font-size:15px;color:#000000;"># perform gather</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">sourmash gather -k 31 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    --scaled 2000 \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    -o Oe6.scaffolds.csv \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    Oe6.scaffolds.sig \</styled-content>
                        <styled-content style="font-size:15px;color:#000000;">    fungi-k31</styled-content>
                    </preformat>
                </p>
                <p>Using gather, we see two matches both within the genus 
                    <italic toggle="yes">Aureobasidium</italic>. This is the dominant contaminant within the genome
                    <sup>
                        <xref ref-type="bibr" rid="ref-20">20</xref>
                    </sup>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#000000;">loaded query: Oe6.scaffolds.fa... (k=31, DNA)</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">loaded 1 databases.</styled-content>



                        <styled-content style="font-size:15px;color:#000000;">overlap     p_query p_match avg_abund</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">---------   ------- ------- ---------</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">140.0 kbp      0.0%    1.0%       1.2</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">LVWM01000001.1 Aureobasidium pullulan...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">68.0 kbp       0.0%    0.1%       1.0</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">MSDY01000045.1 Aureobasidium sp. FSWF...</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">found less than 30.0 kbp in common. =&gt; exiting</styled-content>


                        <styled-content style="font-size:15px;color:#000000;">found 2 matches total;</styled-content>

                        <styled-content style="font-size:15px;color:#000000;">the recovered matches hit 0.0% of the query</styled-content>
                    </preformat>
                </p>
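                <p>The overlap column can be related back to hash counts. The short Python sketch below assumes that the reported overlap is estimated as the number of shared hashes multiplied by the scaled value; this is our reading of how scaled signatures are interpreted rather than something stated in the output, and both function names are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
def estimated_overlap_bp(shared_hashes, scaled):
    """Estimate overlap in base pairs from a shared-hash count (assumed model)."""
    return shared_hashes * scaled


def shared_hashes_from_overlap(overlap_bp, scaled):
    """Invert the estimate: how many shared hashes an overlap corresponds to."""
    return round(overlap_bp / scaled)


# With --scaled 2000, the two overlaps reported for the olive genome
# (140.0 kbp and 68.0 kbp) would correspond to these hash counts.
for overlap_bp in (140_000, 68_000):
    print(overlap_bp, shared_hashes_from_overlap(overlap_bp, scaled=2000))
```
                    </preformat>
                </p>
                <p>Under this assumption, the two matches rest on 70 and 34 shared hashes respectively, which helps explain why matches with only a handful of shared hashes are cut off by the overlap threshold.</p>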
            </sec>
        </sec>
        <sec sec-type="conclusions">
            <title>Conclusions</title>
            <p>The 
                <monospace>sourmash</monospace> package provides a collection of tools to conduct sequence comparisons and taxonomic classification, and makes comparison against large-scale databases such as GenBank and SRA tractable on laptops. 
                <monospace>sourmash</monospace> signatures are small and irreversible, which means they can be used to facilitate pre-publication data sharing that may help improve classification databases and facilitate comparisons among similar datasets.</p>
        </sec>
        <sec>
            <title>Data availability</title>
            <sec>
                <title>Underlying data</title>
                <p>Open Science Framework: sourmash-use-cases. 
                    <ext-link ext-link-type="uri" xlink:href="https://dx.doi.org/10.17605/OSF.IO/KESH2">https://doi.org/10.17605/OSF.IO/KESH2</ext-link>
                </p>
                <p>This project contains the following underlying data:</p>
                <list list-type="bullet">
                    <list-item>
                        <p> data-files</p>
                        <list list-type="bullet">
                            <list-item>
                                <label>&#x2013;</label>
                                <p> ecoli-reads.khmer.fq.gz (Small set of k-mer trimmed 
                                    <italic toggle="yes">Escherichia coli</italic> reads)</p>
                            </list-item>
                            <list-item>
                                <label>&#x2013;</label>
                                <p> ERR458584.fq.gz (
                                    <italic toggle="yes">Saccharomyces cerevisiae</italic> SRA Record ERR458584 SNF2 mutant
                                    <sup>
                                        <xref ref-type="bibr" rid="ref-15">15</xref>
                                    </sup>)</p>
                            </list-item>
                            <list-item>
                                <label>&#x2013;</label>
                                <p> Oe6.scaffolds.fa.gz (domesticated olive (
                                    <italic toggle="yes">Olea europaea</italic>) genome
                                    <sup>
                                        <xref ref-type="bibr" rid="ref-20">20</xref>
                                    </sup>)</p>
                            </list-item>
                            <list-item>
                                <label>&#x2013;</label>
                                <p> Oe6.scaffolds.sub.fa.gz (Subsampled set of scaffolds from the domesticated olive (
                                    <italic toggle="yes">Olea europaea</italic>) genome
                                    <sup>
                                        <xref ref-type="bibr" rid="ref-20">20</xref>
                                    </sup>)</p>
                            </list-item>
                            <list-item>
                                <label>&#x2013;</label>
                                <p> yeast_ktrimmed.tar (k-mer trimmed 
                                    <italic toggle="yes">Saccharomyces cerevisiae</italic> reads
                                    <sup>
                                        <xref ref-type="bibr" rid="ref-15">15</xref>
                                    </sup>)</p>
                            </list-item>
                        </list>
                    </list-item>
                </list>
                <list list-type="bullet">
                    <list-item>
                        <p> index-files</p>
                        <list list-type="bullet">
                            <list-item>
                                <label>&#x2013;</label>
                                <p> escherichia-sigs.tar.gz (Sourmash signatures of 50 randomly selected 
                                    <italic toggle="yes">Escherichia coli</italic> genomes)</p>
                            </list-item>
                            <list-item>
                                <label>&#x2013;</label>
                                <p> fungi-genomic.tar.gz (Sourmash signature database of all fungal genomes in NCBI as of 12/2018)</p>
                            </list-item>
                        </list>
                    </list-item>
                </list>
                <list list-type="bullet">
                    <list-item>
                        <p> signature-files</p>
                        <list list-type="bullet">
                            <list-item>
                                <label>&#x2013;</label>
                                <p> GCF_000005845.2_ASM584v2_genomic.fna.gz (
                                    <italic toggle="yes">Escherichia coli</italic> genome str. K-12 substr. MG1655)</p>
                            </list-item>
                            <list-item>
                                <label>&#x2013;</label>
                                <p> GCF_000146045.2_R64_genomic.fna.gz (
                                    <italic toggle="yes">Saccharomyces cerevisiae</italic> S288C genome)</p>
                            </list-item>
                            <list-item>
                                <label>&#x2013;</label>
                                <p> GCF_000146045.2_R64_protein.faa.gz (
                                    <italic toggle="yes">Saccharomyces cerevisiae</italic> S288C protein sequence)</p>
                            </list-item>
                        </list>
                    </list-item>
                </list>
            </sec>
            <sec>
                <title>Extended data</title>
                <p>Open Science Framework: sourmash-use-cases. 
                    <ext-link ext-link-type="uri" xlink:href="https://dx.doi.org/10.17605/OSF.IO/KESH2">https://doi.org/10.17605/OSF.IO/KESH2</ext-link>
                </p>
                <p>This project contains the following extended data:</p>
                <list list-type="bullet">
                    <list-item>
                        <p> yeast-mds.txt (Code to generate MDS plots via Salmon and edgeR)</p>
                    </list-item>
                </list>
                <p>Data are available under the terms of the 
                    <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Zero "No rights reserved" data waiver</ext-link> (CC0 1.0 Public domain dedication).</p>
            </sec>
        </sec>
        <sec>
            <title>Software availability</title>
            <p>Source code available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/dib-lab/sourmash/">https://github.com/dib-lab/sourmash/</ext-link>
            </p>
            <p>Archived source code at time of publication: 
                <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.3240653">http://doi.org/10.5281/zenodo.3240653</ext-link>
            </p>
            <p>Licence: 
                <ext-link ext-link-type="uri" xlink:href="https://opensource.org/licenses/BSD-3-Clause">3-Clause BSD License</ext-link>
            </p>
        </sec>
    </body>
    <back>
        <ref-list>
            <ref id="ref-1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <article-title>Sequence read archive overview</article-title>.<year>2018</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Broder</surname>
                            <given-names>AZ</given-names>
                        </name>
			</person-group>:
                    <article-title>On the resemblance and containment of documents</article-title>. In
                    <italic toggle="yes">Compression and Complexity of Sequences 1997. Proceedings.</italic> IEEE.<year>1997</year>;<fpage>21</fpage>&#x2013;<lpage>29</lpage>.
                    <ext-link ext-link-type="uri" xlink:href="https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Ondov</surname>
                            <given-names>BD</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Treangen</surname>
                            <given-names>TJ</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Melsted</surname>
                            <given-names>P</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>Mash: fast genome and metagenome distance estimation using MinHash.</article-title>
                    <source>
				
                        <italic toggle="yes">Genome Biol.</italic>
			</source>
                    <year>2016</year>;<volume>17</volume>(<issue>1</issue>):<fpage>132</fpage>.
                    <pub-id pub-id-type="pmid">27323842</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-016-0997-x</pub-id>
                    <pub-id pub-id-type="pmcid">4915045</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Bovee</surname>
                            <given-names>R</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Greenfield</surname>
                            <given-names>N</given-names>
                        </name>
			</person-group>:
                    <article-title>Finch: a tool adding dynamic abundance filtering to genomic minhashing.</article-title>
                    <source>
                        <italic toggle="yes">J Open Source Softw.</italic>
                    </source>
                    <year>2018</year>;<volume>3</volume>(<issue>22</issue>):<fpage>505</fpage>.
                    <pub-id pub-id-type="doi">10.21105/joss.00505</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Zhao</surname>
                            <given-names>XF</given-names>
                        </name>
			</person-group>:
                    <article-title>BinDash, software for fast genome distance estimation on a typical personal laptop.</article-title>
                    <source>
				
                        <italic toggle="yes">Bioinformatics.</italic>
			</source>
                    <year>2019</year>;<volume>35</volume>(<issue>4</issue>):<fpage>671</fpage>&#x2013;<lpage>673</lpage>.
                    <pub-id pub-id-type="pmid">30052763</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/bty651</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Rowe</surname>
                            <given-names>WP</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Carrieri</surname>
                            <given-names>AP</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Alcon-Giner</surname>
                            <given-names>C</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>Streaming histogram sketching for rapid microbiome analytics.</article-title>
                    <source>
				
                        <italic toggle="yes">Microbiome.</italic>
			</source>
                    <year>2019</year>;<volume>7</volume>(<issue>1</issue>):<fpage>40</fpage>.
                    <pub-id pub-id-type="pmid">30878035</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s40168-019-0653-2</pub-id>
                    <pub-id pub-id-type="pmcid">6420756</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Koslicki</surname>
                            <given-names>D</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Zabeti</surname>
                            <given-names>H</given-names>
                        </name>
			</person-group>:
                    <article-title>Improving minhash via the containment index with applications to metagenomic analysis.</article-title>
                    <source>
				
                        <italic toggle="yes">Appl Math Comput.</italic>
			</source>
                    <year>2019</year>;<volume>354</volume>:<fpage>206</fpage>&#x2013;<lpage>215</lpage>.
                    <pub-id pub-id-type="doi">10.1016/j.amc.2019.02.018</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <article-title>Mash screen: What&#x2019;s in my sequencing run</article-title>?<year>2017</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://genomeinformatics.github.io/mash-screen/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>CT</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Irber</surname>
                            <given-names>L</given-names>
                        </name>
			</person-group>:
                    <article-title>sourmash: a library for MinHash sketching of DNA.</article-title>
                    <source>
				
                        <italic toggle="yes">J Open Source Softw.</italic>
			</source>
                    <year>2016</year>;<volume>1</volume>(<issue>5</issue>):<fpage>27</fpage>.
                    <pub-id pub-id-type="doi">10.21105/joss.00027</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-10">
                <label>10</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Solomon</surname>
                            <given-names>B</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Kingsford</surname>
                            <given-names>C</given-names>
                        </name>
			</person-group>:
                    <article-title>Fast search of thousands of short-read sequencing experiments.</article-title>
                    <source>
				
                        <italic toggle="yes">Nat Biotechnol.</italic>
			</source>
                    <year>2016</year>;<volume>34</volume>(<issue>3</issue>):<fpage>300</fpage>&#x2013;<lpage>2</lpage>.
                    <pub-id pub-id-type="pmid">26854477</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3442</pub-id>
                    <pub-id pub-id-type="pmcid">4804353</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-11">
                <label>11</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Matsakis</surname>
                            <given-names>ND</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Klock</surname>
                            <given-names>FS</given-names>
                            <suffix>II</suffix>
                        </name>
			</person-group>:
                    <article-title>The Rust language.</article-title>
                    <source>
				
                        <italic toggle="yes">Ada Lett.</italic>
			</source>
                    <year>2014</year>;<volume>34</volume>(<issue>3</issue>):<fpage>103</fpage>&#x2013;<lpage>104</lpage>.
                    <pub-id pub-id-type="doi">10.1145/2692956.2663188</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-12">
                <label>12</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Wood</surname>
                            <given-names>DE</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Salzberg</surname>
                            <given-names>SL</given-names>
                        </name>
			</person-group>:
                    <article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments.</article-title>
                    <source>
				
                        <italic toggle="yes">Genome Biol.</italic>
			</source>
                    <year>2014</year>;<volume>15</volume>(<issue>3</issue>):<fpage>R46</fpage>.
                    <pub-id pub-id-type="pmid">24580807</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
                    <pub-id pub-id-type="pmcid">4053813</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Crusoe</surname>
                            <given-names>MR</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Alameldin</surname>
                            <given-names>HF</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Awad</surname>
                            <given-names>S</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>The khmer software package: enabling efficient nucleotide sequence analysis [version 1; peer review: 2 approved, 1 approved with reservations].</article-title>
                    <source>
				
                        <italic toggle="yes">F1000Res.</italic>
			</source>
                    <year>2015</year>;<volume>4</volume>:<fpage>900</fpage>.
                    <pub-id pub-id-type="pmid">26535114</pub-id>
                    <pub-id pub-id-type="doi">10.12688/f1000research.6924.1</pub-id>
                    <pub-id pub-id-type="pmcid">4608353</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Conesa</surname>
                            <given-names>A</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Madrigal</surname>
                            <given-names>P</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Tarazona</surname>
                            <given-names>S</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>A survey of best practices for RNA-seq data analysis.</article-title>
                    <source>
				
                        <italic toggle="yes">Genome Biol.</italic>
			</source>
                    <year>2016</year>;<volume>17</volume>(<issue>1</issue>):<fpage>13</fpage>.
                    <pub-id pub-id-type="pmid">26813401</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-016-0881-8</pub-id>
                    <pub-id pub-id-type="pmcid">4728800</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Schurch</surname>
                            <given-names>NJ</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Schofield</surname>
                            <given-names>P</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Gierli&#x0144;ski</surname>
                            <given-names>M</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?</article-title>
                    <source>
				
                        <italic toggle="yes">RNA.</italic>
			</source>
                    <year>2016</year>;<volume>22</volume>(<issue>6</issue>):<fpage>839</fpage>&#x2013;<lpage>51</lpage>.
                    <pub-id pub-id-type="pmid">27022035</pub-id>
                    <pub-id pub-id-type="doi">10.1261/rna.053959.115</pub-id>
                    <pub-id pub-id-type="pmcid">4878611</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Patro</surname>
                            <given-names>R</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Duggal</surname>
                            <given-names>G</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Love</surname>
                            <given-names>MI</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>Salmon provides fast and bias-aware quantification of transcript expression.</article-title>
                    <source>
				
                        <italic toggle="yes">Nat Methods.</italic>
			</source>
                    <year>2017</year>;<volume>14</volume>(<issue>4</issue>):<fpage>417</fpage>&#x2013;<lpage>419</lpage>.
                    <pub-id pub-id-type="pmid">28263959</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.4197</pub-id>
                    <pub-id pub-id-type="pmcid">5600148</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Robinson</surname>
                            <given-names>MD</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>McCarthy</surname>
                            <given-names>DJ</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
			</person-group>:
                    <article-title>edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.</article-title>
                    <source>
				
                        <italic toggle="yes">Bioinformatics.</italic>
			</source>
                    <year>2010</year>;<volume>26</volume>(<issue>1</issue>):<fpage>139</fpage>&#x2013;<lpage>140</lpage>.
                    <pub-id pub-id-type="pmid">19910308</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp616</pub-id>
                    <pub-id pub-id-type="pmcid">2796818</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Pride</surname>
                            <given-names>DT</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Meinersmann</surname>
                            <given-names>RJ</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Wassenaar</surname>
                            <given-names>TM</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>Evolutionary implications of microbial genome tetranucleotide frequency biases.</article-title>
                    <source>
				
                        <italic toggle="yes">Genome Res.</italic>
			</source>
                    <year>2003</year>;<volume>13</volume>(<issue>2</issue>):<fpage>145</fpage>&#x2013;<lpage>158</lpage>.
                    <pub-id pub-id-type="pmid">12566393</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.335003</pub-id>
                    <pub-id pub-id-type="pmcid">420360</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Albertsen</surname>
                            <given-names>M</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Hugenholtz</surname>
                            <given-names>P</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Skarshewski</surname>
                            <given-names>A</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes.</article-title>
                    <source>
				
                        <italic toggle="yes">Nat Biotechnol.</italic>
			</source>
                    <year>2013</year>;<volume>31</volume>(<issue>6</issue>):<fpage>533</fpage>&#x2013;<lpage>538</lpage>.
                    <pub-id pub-id-type="pmid">23707974</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.2579</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Reiter</surname>
                            <given-names>T</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>CT</given-names>
                        </name>
			</person-group>:
                    <article-title>Microbial contamination in the genome of the domesticated olive</article-title>.<year>2018</year>.
                    <pub-id pub-id-type="doi">10.1101/499541</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-21">
                <label>21</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Shakya</surname>
                            <given-names>M</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Quince</surname>
                            <given-names>C</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Campbell</surname>
                            <given-names>JH</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities.</article-title>
                    <source>
				
                        <italic toggle="yes">Environ Microbiol.</italic>
			</source>
                    <year>2013</year>;<volume>15</volume>(<issue>6</issue>):<fpage>1882</fpage>&#x2013;<lpage>1899</lpage>.
                    <pub-id pub-id-type="pmid">23387867</pub-id>
                    <pub-id pub-id-type="doi">10.1111/1462-2920.12086</pub-id>
                    <pub-id pub-id-type="pmcid">3665634</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-22">
                <label>22</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Awad</surname>
                            <given-names>S</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Irber</surname>
                            <given-names>L</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>CT</given-names>
                        </name>
			</person-group>:
                    <article-title>Evaluating metagenome assembly on a simple defined community with many strain variants</article-title>.<year>2017</year>.
                    <pub-id pub-id-type="doi">10.1101/155358</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-23">
                <label>23</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
				
                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>CT</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>Moritz</surname>
                            <given-names>D</given-names>
                        </name>
				
                        <name name-style="western">
                            <surname>O&#x2019;Brien</surname>
                            <given-names>M</given-names>
                        </name>
				
                        <etal/>
			</person-group>:
                    <article-title>Exploring neighborhoods in large metagenome assembly graphs reveals hidden sequence diversity.</article-title>
                    <source>
				
                        <italic toggle="yes">BioRxiv.</italic>
			</source>
                    <year>2019</year>;<lpage>462788</lpage>.
                    <pub-id pub-id-type="doi">10.1101/462788</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report52218">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.21579.r52218</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Chikhi</surname>
                        <given-names>Rayan</given-names>
                    </name>
                    <xref ref-type="aff" rid="r52218a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-1099-8735</uri>
                </contrib>
                <aff id="r52218a1">
                    <label>1</label>C3BI USR 3756, CNRS (French National Center for Scientific Research), Institut Pasteur, Paris, France</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>2</day>
                <month>9</month>
                <year>2019</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Chikhi R</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport52218" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.19675.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors present sourmash 2, a tool that implements a novel combination of SBTs and MinHashes, which are both fascinating computational concepts; thus, their mix is quite an interesting one. Sourmash 2 makes it possible to perform large-scale sequence-vs-database similarity searches. The article offers a comprehensive guide to many of the software features, with biologically relevant scenarios. This is a useful contribution that is highly relevant to current needs in biology. There are a few technical issues with the current manuscript version, which I list below. Otherwise, most of my remarks are intended to add some extra perspective. I believe the manuscript can be approved after the technical fixes.</p>
            <p> Major remarks: 
                <list list-type="order">
                    <list-item>
                        <p>A quick recap of the state of the art in containment search would be helpful. Here you claim to use &#x2018;a modulo approach&#x2019;. Mash screen and containment MinHash use different approaches (see e.g. the blog post on &#x2018;Mash screen&#x2019;). It would be nice if, in this paper, the use of the modulo approach were put into perspective relative to those two methods.</p>
                    </list-item>
                    <list-item>
                        <p>In fact, in the blog post cited as reference 8, Ondov writes that &#x201c;the modulo approach is problematic for metagenomic applications (e.g. finding a virus in a metagenome).&#x201d; The problem is indirectly mentioned in the manuscript (&#x201c;can sacrifice some of the memory and storage benefits of standard MinHash techniques, as the signature size scales with the number of unique k-mers&#x201d;). It would be neat to get the authors&#x2019; comparative perspective here as to why using modulo is the better approach.</p>
                    </list-item>
                    <list-item>
                        <p>My main comment would perhaps be the lack of comparison with other software. I do not know if this is a requirement for F1000Research in the &#x201c;Software Tool Article&#x201d; category. I suppose that sourmash is the only tool that implements SBT-Minhashes, so of course there is no competitor in that category. It would however be nice to have some indication of whether sourmash is best-in-class in each of the proposed features (the use cases), or whether other tools already exist that do a similar job. And, the other way around, which are the areas where sourmash is really the only tool capable of doing X in reasonable time? I do not expect a comprehensive benchmark, but some informal indication would already be appreciated.</p>
                    </list-item>
                    <list-item>
                        <p>What, roughly, are the limits of similarity queries? E.g. sequences shorter than X or having identity below Y% have no chance of being reported.</p>
                    </list-item>
                    <list-item>
                        <p>A summary of all the features demonstrated in the main text could be helpful. For instance, reading only the introduction, it is not explicit that a natural application of sourmash is outlier/contaminant detection.</p>
                    </list-item>
                </list> Minor remarks: 
                <list list-type="order">
                    <list-item>
                        <p>&#x201c;Sequence Read Archive now contains over 20 Petabases of data1&#x201d;: seems to be over 30 PB in 2019 according to the plot in reference 1.</p>
                    </list-item>
                    <list-item>
                        <p>It is not clear what the &#x2018;LCA&#x2019; term stands for in the context of the database format introduced here. Is it the lowest common ancestor?</p>
                    </list-item>
                    <list-item>
                        <p>The description of LCA (in section &#x201c;LCA database&#x201d;) is imprecise. What does a &#x201c;named list&#x201d; mean in this context? A Figure would be helpful to see a small example.</p>
                    </list-item>
                    <list-item>
                        <p>A sentence in the manuscript mentions &#x2018;a second database format&#x2019;. The &#x2018;first format&#x2019; is supposedly the SBTMH but it is only implicit that SBTMH is a &#x2018;format&#x2019;.</p>
                    </list-item>
                    <list-item>
                        <p>The introduction mentions a bunch of features implemented in other tools (&#x201c;Jaccard similarity comparisons, .., &#x00a0;k-mer abundance comparisons, decrease runtime and memory requirements, and work on streaming input data.&#x201d;) Are all of these implemented in sourmash2, or only a subset of them? (It seems to me that most are implemented.)</p>
                    </list-item>
                    <list-item>
                        <p>The description of the modulo approach used is imprecise. How, precisely, is the hash space divided into s equal &#x2018;bands&#x2019; (undefined term)? Also, I suppose this is somewhat different from the modulo approach proposed by Broder, and clarified in Mash screen&#x2019;s blog post, but how so?</p>
                    </list-item>
                    <list-item>
                        <p>The concept of &#x2018;hash subset retention&#x2019; is not well defined. I suppose it is the set of hashes that result from a MinHash computation.</p>
                    </list-item>
                    <list-item>
                        <p>Abundance filtering (as in Finch) is not performed in sourmash2, right?</p>
                    </list-item>
                    <list-item>
                        <p>Some of the &#x2018;signature utilities&#x2019; are self-explanatory. However, what is the difference between &#x2018;intersect&#x2019; and &#x2018;overlap&#x2019;? What is &#x2018;flatten&#x2019;?</p>
                    </list-item>
                    <list-item>
                        <p>Regarding the sentence: &#x201c;although we recommend using more accurate approaches for detailed comparisons.&#x201d; To make the paper self-contained, a short explanation would be needed to delineate what sort of concrete use-case(s) is/are meant behind the term &#x2018;detailed comparisons&#x2019;.</p>
                    </list-item>
                    <list-item>
                        <p>The &#x201c;similarity queries&#x201d; and &#x201c;containment queries&#x201d; sections could benefit from at least one use-case example per query. This would illustrate the two sections, which are a bit obscure without examples. (I realize that use cases are given later in the code examples, so perhaps a forward-reference could work, albeit less elegantly.) A proposal for similarity queries: &#x2018;find all genomes in the SBTMH (leaves) that are similar to a query genome&#x2019;.</p>
                    </list-item>
                    <list-item>
                        <p>Awkward formulation of that part of the sentence: &#x00a0;[..] &#x2018;using a k-mer size of 4, use all 4-mers, and track abundance.&#x2019; (although I understood that the k-mer size was 4 and the &#x2018;use all k-mers&#x2019; refers to a scaling factor of 1)</p>
                    </list-item>
                    <list-item>
                        <p>The &#x201c;Tetranucleotide Frequency Clustering&#x201d; section is quite nice. It should be emphasized, however, that this isn&#x2019;t really a minhash sketch: all 4-mers are considered.</p>
                    </list-item>
                    <list-item>
                        <p>Regarding the sentence &#x201c;We see that the minimum similarity in the matrix is 0%&#x201d;, how is that seen? By visual inspection of ecoli.comp.matrix.png?</p>
                    </list-item>
                    <list-item>
                        <p>Regarding the sentence &#x201c;This process is repeated until the threshold is reached.&#x201d;: I forgot.. which threshold?</p>
                    </list-item>
                    <list-item>
                        <p>Regarding the sentence &#x201c;We see that 20.1% of k-mers match 82 genomes in GenBank.&#x201d;: how is this seen? Also &#x201c;we see two matches between P. ruminis strains among all matches.&#x201d; In the output above that text, I see only one match. (I could not test that section due to the missing download URL.)</p>
                    </list-item>
                </list> </p>
            <p> Regarding the software commands: 
                <list list-type="bullet">
                    <list-item>
                        <p>As an important note, one cannot easily copy-paste the command lines as short dashes (-) are converted to long dashes (&#x2013;). Nevertheless, I still automatically replaced all the dashes and tested all command lines. I&#x2019;ll report any problem below.</p>
                    </list-item>
                    <list-item>
                        <p>Extra &#x2018;\&#x2019; at the end of the command: &#x201c;curl &#x2013;L &#x2013;o ecoli&#x2013;reads.khmer.fq.gz &#x00a0; &#x00a0; https://osf.io/26xm9/download \&#x201d;.</p>
                    </list-item>
                    <list-item>
                        <p>Missing url in command (and also extra &#x2018;\&#x2019; at the end): &#x201c;[..] unmapped&#x2013;qc&#x2013;to&#x2013;ref.fq.sig https://osf.io//download \&#x201d; &#x00a0; Thus I could not test this part.</p>
                    </list-item>
                    <list-item>
                        <p>The streaming operation at the beginning of the MDS section overwrites the file &#x2018;ERR458584.khmer.sig&#x2019; produced before; perhaps make a note of that.</p>
                    </list-item>
                    <list-item>
                        <p>The URL in the command &#x201c;curl &#x2013;L &#x2013;o genbank&#x2013;d2&#x2013;k31.tar.gz \</p>
                        <p> &#x00a0; &#x00a0; &#x00a0;https:// s3&#x2013;us&#x2013;west&#x2013;2.amazonaws.com/</p>
                        <p> &#x00a0; &#x00a0; &#x00a0;sourmash&#x2013;databases/2018&#x2013;03&#x2013;29/</p>
                        <p> &#x00a0; &#x00a0; &#x00a0;genbank&#x2013;d2&#x2013;k31.tar.gz&#x201d; has extra newlines.</p>
                    </list-item>
                    <list-item>
                        <p>In the command &#x201c;sourmash index &#x2013;k 31 ecolidb \</p>
                        <p> &#x00a0; &#x00a0; &#x00a0;escherichia&#x2013;sigs /*.sig&#x201d;, a space is wrongly inserted after &#x201c;escherichia-sigs&#x201d;</p>
                    </list-item>
                    <list-item>
                        <p>And also, in the command that follows, one of the &#x2018;\&#x2019; is extra and another &#x2018;\&#x2019; is missing.</p>
                    </list-item>
                    <list-item>
                        <p>The SBT path inside the file &#x2018;fungi-k31.sbt.json&#x2019; is wrongly hardcoded. Also, after fixing it, I get &#x201c;WARNING: this is an old index version, please run `sourmash migrate` to update it.&#x201d;, although it did end up working.</p>
                    </list-item>
                </list>
            </p>
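            <p>The dash substitution noted in the first item above can be undone mechanically. The following is a minimal sketch, assuming each typographic dash stands for exactly one ASCII hyphen; the helper name fix_dashes is hypothetical and is not part of sourmash.</p>
            <preformat>
```python
# Illustrative helper (hypothetical, not part of sourmash): undo typographic
# dash substitution in command lines copied from a typeset manuscript.
# Assumption: every typographic dash replaced exactly one ASCII hyphen.
DASH_TABLE = str.maketrans({
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
})


def fix_dashes(command: str) -> str:
    """Return the command with typographic dashes mapped to '-'."""
    return command.translate(DASH_TABLE)


if __name__ == "__main__":
    pasted = "curl \u2013L \u2013o ecoli\u2013reads.khmer.fq.gz https://osf.io/26xm9/download"
    print(fix_dashes(pasted))
```
            </preformat>
            <p>A one-to-one mapping cannot recover a long option whose two leading hyphens were typeset as a single dash, so such options would still need to be checked by hand.</p>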
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Partly</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Algorithms and data structures for sequence bioinformatics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report52588">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.21579.r52588</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Solomon</surname>
                        <given-names>Brad</given-names>
                    </name>
                    <xref ref-type="aff" rid="r52588a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r52588a1">
                    <label>1</label>Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>27</day>
                <month>8</month>
                <year>2019</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Solomon B</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport52588" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.19675.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This manuscript presents an improved software library, sourmash 2.0, for the efficient construction and analysis of MinHash sketches of genomic and proteomic sequence data. The primary innovations of sourmash 2.0 are the novel applications and modified implementations of several existing sketching and indexing methods. In particular, the two main advancements are (1) a &#x2018;modulo approach&#x2019; to sketch construction and (2) the development of a modified Sequence Bloom Tree (SBT) index over MinHash sketches. In addition, unlike conventional MinHash methods, sourmash 2.0 can also track the abundance of retained hash elements, allowing some degree of abundance comparison and abundance estimation of genomic datasets.</p>
            <p> </p>
            <p> Multiple use cases for sourmash 2.0 are provided, including sketch construction, similarity comparisons, containment querying, classification, and metagenome decomposition. The manuscript is very thorough in describing sketch construction, with example run commands for full-length genomes, proteomes, and raw reads. The ease of use is further demonstrated with a single-line command that, while involving multiple pipes, is capable of constructing a signature from a streamed download. Similarly, several uses of sketch comparisons are demonstrated, including visualization and outlier detection. Lastly, the use of the SBT MinHash (SBTMH) is demonstrated for classification or containment of arbitrary sequence queries. The download instructions and example run commands are fully functional and the input datasets are reasonably sized for toy examples (most on the order of &lt;100 MB).</p>
            <p> </p>
            <p> Excluding the description of the modulo approach to sketch construction, the manuscript itself is technically sound. The topic is high impact: sketching approaches are increasingly popular solutions to a multitude of research topics in computational biology. Many of these potential use cases are demonstrated successfully in the manuscript. That said, there are numerous existing solutions to each of the use cases presented in the manuscript, and little to no attempt was made to provide benchmarking information or to demonstrate the improvements sourmash 2.0 has over its competitors. While sourmash 2.0 has an ease of use that will undoubtedly facilitate adoption, the manuscript does not adequately demonstrate an improved capacity for analysis over these other tools. This is primarily a suggestion for improvement as, based on the F1000 Research guidelines, the manuscript is correct and valid in its current state.</p>
            <p> </p>
            <p> Major Comments: 
                <list list-type="order">
                    <list-item>
                        <p>The &#x2018;modulo approach&#x2019; for sketch construction, despite being one of the main innovations of the method, is particularly unclear in the manuscript. The cited literature (Broder 1997) describes an approach that sub-samples hash values based on a modulo factor to address the inherent weakness of a Minhash in a mixture of several distinct components. However, the description of the sourmash implementation instead describes splitting the hash space into &#x2018;equal bands&#x2019; and selecting only the minimum band. As the existing modulo approach has no guarantees on equal-sized (or even equal-fraction, as the manuscript claims elsewhere) sub-sampling, this appears to be a novel and significant contribution to the field. However, there are no details that explain (1) how the hash space is divided, (2) how the minimum band is selected, and (3) how downsampling is performed.</p>
                    </list-item>
                    <list-item>
                        <p>Sourmash 2.0 is motivated by &#x201c;a particular focus towards enabling efficient containment queries using large databases&#x201d;. However, the manuscript does not include any true comparisons of sourmash&#x2019;s performance against existing tools or alternative approaches, nor benchmarking information for even conventionally sized datasets. This greatly limits the potential impact of sourmash, given that there are many competing sketch strategies and an even larger range of available implementations.</p>
                        <p> </p>
                        <p> While it is unreasonable to expect a full review of the available methods, the inclusion of even a single &#x2018;large-scale&#x2019; dataset in the test set or use cases would go a long way towards demonstrating the scalability of sourmash. Selecting a biologically relevant subset from a public genomic repository such as the NIH SRA, TCGA, or GTEx (to name just a few) would alleviate the need to host such a dataset while allowing large-scale reproducibility and benchmarking.</p>
                    </list-item>
                </list> Minor Comments: 
                <list list-type="order">
                    <list-item>
                        <p>The &#x2018;Salmon &amp; edgeR&#x2019; MDS plot in Figure 1 does not have points associated with the text labels. As there is no consistency in the placement of labels versus nodes in the first plot (Sourmash Compare MDS), even approximate values are difficult to determine.&#x00a0;</p>
                    </list-item>
                    <list-item>
                        <p>The commands listed in the manuscript are using the wrong character (&#x2018;&#x2013;&#x2019; vs &#x2018;-&#x2019;). None of them run properly without manual adjustments or retyping. I imagine this is a formatting issue more than a coding one but it would make reproducing the results a lot simpler if it was resolved.</p>
                    </list-item>
                </list> Other:&#x00a0; 
                <list list-type="order">
                    <list-item>
                        <p>The inclusion of limited abundance information is a particularly interesting improvement over standard MinHash sketches. The manuscript suggests that the abundance tracking can play a significant role when &#x2018;comparing many signatures&#x2019; but there is no concrete claim to assess. While outside the scope or focus of this work, a follow-up piece which explores the theoretical or practical impact of systematically sub-sampled counting information would be potentially high impact.&#x00a0;</p>
                    </list-item>
                </list>
            </p>
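            <p>For readers unfamiliar with the modulo sub-sampling attributed to Broder in Major Comment 1, a minimal sketch follows. It is not sourmash&#x2019;s implementation: the hash function (a truncated SHA-1), the function names, and the modulo factor m are assumptions chosen purely for illustration.</p>
            <preformat>
```python
import hashlib


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (illustrative hash choice)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_sketch(sequence: str, k: int, m: int) -> set:
    """Keep the hashes of all distinct k-mers whose value is divisible by m.

    On average about 1/m of the distinct k-mers are retained, so the sketch
    grows with the number of distinct k-mers instead of having a fixed size.
    """
    kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return {h for h in map(hash_kmer, kmers) if h % m == 0}


if __name__ == "__main__":
    genome = "ACGTACGTAC"
    fragment = "ACGTACG"
    # Every hash kept for the fragment is also kept for the genome.
    print(modulo_sketch(fragment, 4, 3).issubset(modulo_sketch(genome, 4, 3)))
```
            </preformat>
            <p>Because retention depends only on the hash value itself, the sketch of a subsequence is always a subset of the sketch of the full sequence, which is what makes containment estimation possible. The number of retained hashes is not fixed in advance, which is the lack of a size guarantee referred to in the comment.</p>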
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Partly</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Computational Biology, Algorithms and Data Structures, Genomics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
</article>
