<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.17927.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 1 approved with reservations, 1 not approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Gur</surname>
                        <given-names>Tamer</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD Cambridge, UK</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:tgur@ebi.ac.uk">tgur@ebi.ac.uk</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>4</day>
                <month>2</month>
                <year>2019</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2019</year>
            </pub-date>
            <volume>8</volume>
            <elocation-id>145</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>23</day>
                    <month>1</month>
                    <year>2019</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Gur T</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/8-145/pdf"/>
            <abstract>
                <p>Bioinformatics datasets are often closely related to each other. Search, mapping and visualization of these relations are therefore often performed, manually or programmatically, via identifiers or special keywords such as gene symbols. Although various tools exist for these tasks, the growing volume of bioinformatics datasets and the emergence of new software tools and approaches motivate new solutions. To address these needs, I present Biobtree. Biobtree fetches and indexes identifiers and special keywords, together with their related identifiers, from supported datasets and optionally from user-defined datasets. It provides a web interface, web services and a single uniform database output based on the B+ tree data structure. Biobtree can handle billions of identifiers and runs from a single executable file that requires no installation or dependencies. It also aims to keep a relatively small codebase for easy maintenance, addition of new features and extension to larger datasets. Biobtree is available to download from 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/tamerh/biobtree">GitHub</ext-link>.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>bioinformatics</kwd>
                <kwd>identifiers</kwd>
                <kwd>search</kwd>
                <kwd>mapping</kwd>
                <kwd>visualization</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Bioinformatics datasets often consist of entries, each of which is represented by a unique identifier. Depending on the dataset, each entry contains various types of information, such as sequence data, biological function, chemical structure or literature references. In addition, entries often contain cross-references to entries in other datasets via identifiers. Take as an example the proto-oncogene vav protein in humans, which is encoded by the 
                <italic toggle="yes">VAV1</italic> gene. Viewing this protein on the UniProt 
                <ext-link ext-link-type="uri" xlink:href="https://www.uniprot.org/uniprot/P15498">website</ext-link> shows cross-references to many other datasets. These cross-references represent the relations between datasets. Various tools exist to work with such data; however, the growing volume of bioinformatics datasets and the emergence of new software tools and analysis approaches motivate new solutions. Biobtree
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup>, presented here, rapidly processes large numbers of unique entry identifiers together with the related identifiers specified via cross-reference data.</p>
            <p>In addition to unique identifiers, some datasets contain information that is strongly related to entries but not necessarily unique to each entry. Species names and UniProt secondary accessions are examples of this type. Such information is a second data source for Biobtree. In Biobtree, these values are called special keywords, and each of them can be related to multiple entries across the datasets.</p>
            <p>Biobtree retrieves all these identifiers, related identifiers and special keywords from various bioinformatics resources and stores them in a single database. The data resources currently used are 
                <ext-link ext-link-type="uri" xlink:href="https://www.ebi.ac.uk/chebi/">ChEBI</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>, 
                <ext-link ext-link-type="uri" xlink:href="https://www.genenames.org/">HGNC</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-3">3</xref>
                </sup>, 
                <ext-link ext-link-type="uri" xlink:href="http://www.hmdb.ca/">HMDB</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-4">4</xref>
                </sup>, 
                <ext-link ext-link-type="uri" xlink:href="https://www.ebi.ac.uk/interpro/">InterPro</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-5">5</xref>
                </sup>, 
                <ext-link ext-link-type="uri" xlink:href="https://europepmc.org/">Europe PMC</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-6">6</xref>
                </sup> and 
                <ext-link ext-link-type="uri" xlink:href="https://www.uniprot.org/">UniProt</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-7">7</xref>
                </sup>. 
                <xref ref-type="table" rid="T1">Table 1</xref> shows details of these datasets.</p>
            <table-wrap id="T1" orientation="portrait" position="anchor">
                <label>Table 1. </label>
                <caption>
                    <title>List of datasets.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1">Dataset</th>
                            <th align="left" colspan="1" rowspan="1">Description</th>
                            <th align="left" colspan="1" rowspan="1">File Name</th>
                            <th align="left" colspan="1" rowspan="1">Location</th>
                            <th align="left" colspan="1" rowspan="1">Format</th>
                            <th align="left" colspan="1" rowspan="1">Special
                                <break/>Keywords</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">ChEBI</td>
                            <td align="left" colspan="1" rowspan="1">ChEBI reference
                                <break/>accession data</td>
                            <td align="left" colspan="1" rowspan="1">database_accession.
                                <break/>tsv</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/chebi/Flat_file_tab_delimited/">ftp.ebi.ac.uk/chebi/Flat_file_tab_delimited/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">TSV</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">HGNC</td>
                            <td align="left" colspan="1" rowspan="1">Human gene
                                <break/>nomenclature</td>
                            <td align="left" colspan="1" rowspan="1">hgnc_complete_set.
                                <break/>json</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/genenames/new/json/">ftp.ebi.ac.uk/genenames/new/json/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">JSON</td>
                            <td align="center" colspan="1" rowspan="1">name, symbol</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">HMDB</td>
                            <td align="left" colspan="1" rowspan="1">Human metabolome
                                <break/>database</td>
                            <td align="left" colspan="1" rowspan="1">hmdb_metabolites.
                                <break/>zip</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="http://www.hmdb.ca/system/downloads/current/">http://www.hmdb.ca/system/downloads/current/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">name,
                                <break/>synonyms</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">InterPro</td>
                            <td align="left" colspan="1" rowspan="1">Protein Families</td>
                            <td align="left" colspan="1" rowspan="1">interpro.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/interpro/current">ftp://ftp.ebi.ac.uk/pub/databases/interpro/current</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Literature
                                <break/>mappings</td>
                            <td align="left" colspan="1" rowspan="1">Literature pmid, pmcid
                                <break/>and doi mappings</td>
                            <td align="left" colspan="1" rowspan="1">PMID_PMCID_DOI.
                                <break/>csv.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/pmc/DOI/">ftp://ftp.ebi.ac.uk/pub/databases/pmc/DOI/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">CSV</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Taxonomy</td>
                            <td align="left" colspan="1" rowspan="1">NCBI Taxonomy</td>
                            <td align="left" colspan="1" rowspan="1">taxonomy.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/">ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">scientificName</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">UniParc</td>
                            <td align="left" colspan="1" rowspan="1">UniProt Sequence
                                <break/>Archive</td>
                            <td align="left" colspan="1" rowspan="1">uniparc_all.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniparc/">ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniparc/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">UniProt
                                <break/>reviewed</td>
                            <td align="left" colspan="1" rowspan="1">UniProt Knowledgebase
                                <break/>reviewed</td>
                            <td align="left" colspan="1" rowspan="1">uniprot_sprot.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/">ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">accession</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">UniProt
                                <break/>unreviewed</td>
                            <td align="left" colspan="1" rowspan="1">UniProt Knowledgebase
                                <break/>unreviewed</td>
                            <td align="left" colspan="1" rowspan="1">uniprot_trembl.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/">ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">accession</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">UniRef50</td>
                            <td align="left" colspan="1" rowspan="1">UniProt sequence
                                <break/>clusters</td>
                            <td align="left" colspan="1" rowspan="1">uniref50.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref50/">ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref50/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">UniRef90</td>
                            <td align="left" colspan="1" rowspan="1">UniProt sequence
                                <break/>clusters</td>
                            <td align="left" colspan="1" rowspan="1">uniref90.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref90/">ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref90/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">UniRef100</td>
                            <td align="left" colspan="1" rowspan="1">UniProt sequence
                                <break/>clusters</td>
                            <td align="left" colspan="1" rowspan="1">uniref100.xml.gz</td>
                            <td align="left" colspan="1" rowspan="1">
                                <ext-link ext-link-type="uri" xlink:href="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref100/">ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref100/</ext-link>
                            </td>
                            <td align="center" colspan="1" rowspan="1">XML</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <p>Based on the stored data, Biobtree provides search, mapping and visualization functionality via web services and a web interface. For instance, all the UniProt protein entries belonging to a gene name, or all Ensembl
                <sup>
                    <xref ref-type="bibr" rid="ref-8">8</xref>
                </sup> transcript identifiers and ENA
                <sup>
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup> sequence identifiers that map to a protein identifier can be accessed. These relations, determined via identifiers, are stored bidirectionally, so every mapping can also be followed in the opposite direction.</p>
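            <p>As an illustration only, the following Python sketch shows one way to hold relations bidirectionally in memory. It is not Biobtree's Go implementation: the function name and the placeholder identifier ID_A are hypothetical, and only P15498 and VAV1 come from the example above.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
from collections import defaultdict


def build_bidirectional_index(relations):
    """Index (source, target) identifier pairs in both directions."""
    index = defaultdict(set)
    for source, target in relations:
        index[source].add(target)  # forward: source to target
        index[target].add(source)  # reverse: target to source
    return index


# Illustrative relations; "ID_A" is a made-up placeholder identifier.
relations = [("P15498", "VAV1"), ("P15498", "ID_A")]
index = build_bidirectional_index(relations)

print(sorted(index["P15498"]))  # identifiers related to the protein
print(sorted(index["VAV1"]))    # the same relation followed in reverse
```
                </preformat>
            </p>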
            <p>Biobtree runs from a single executable file, available for each major operating system, and requires no installation or compilation. As its database, Biobtree uses the B+ tree based 
                <ext-link ext-link-type="uri" xlink:href="https://symas.com/lmdb/">LMDB</ext-link> key-value store. LMDB provides fast batch inserts and reads, and operates efficiently on large numbers of records. LMDB is embedded in the Biobtree executable, so it does not require a separate installation.</p>
        </sec>
        <sec sec-type="methods">
            <title>Methods</title>
            <sec>
                <title>Implementation</title>
                <p>Biobtree
                    <sup>
                        <xref ref-type="bibr" rid="ref-1">1</xref>
                    </sup> is implemented in the Go programming language. The C-based LMDB library is accessed from Go through the 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/bmatsuo/lmdb-go">lmdb-go</ext-link> binding library. The web interface is written in JavaScript using the 
                    <ext-link ext-link-type="uri" xlink:href="https://vuejs.org/">Vue</ext-link> and 
                    <ext-link ext-link-type="uri" xlink:href="https://bulma.io/">Bulma</ext-link> frameworks. The Biobtree workflow consists of three phases, named 
                    <italic toggle="yes">update</italic>, 
                    <italic toggle="yes">generate</italic> and 
                    <italic toggle="yes">web</italic>, which are controlled through the Biobtree command line interface (CLI) and described in the following sections.</p>
            </sec>
        </sec>
        <sec>
            <title>Update phase</title>
            <p>The purpose of the update phase is to retrieve dataset identifiers and special keywords from remote servers or a local disk, and to produce files that contain these identifiers and special keywords, together with the identifiers they refer to, as keys and values in sorted order. The produced files must be sorted so that the next phase can perform fast batch inserts into the LMDB database. The update phase is started via the Biobtree CLI with the update command. For example, the following command starts the update phase for the hgnc dataset.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <styled-content style="font-size:15px;">$ biobtree --d hgnc update</styled-content>
                </preformat>
            </p>
            <p>The update phase reads the selected datasets as streams and saves the data relevant to Biobtree in a series of files. An advantage of reading datasets as streams is that they do not need to be fully downloaded to the local disk. Datasets come in different formats, such as XML, JSON, TSV or CSV. Biobtree has a specialized parser for each dataset and uses it to produce its output files. When Biobtree runs for the first time, it retrieves its configuration, license and web interface files from the source code repository. The configuration files contain Biobtree runtime settings and dataset definitions.</p>
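            <p>The idea of stream parsing followed by sorted output can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Biobtree's Go code: the column names PMID, PMCID and DOI are assumed from the name of the literature mappings file, the tab-separated output layout is hypothetical, and emitting each pair in both directions is an assumption. The sketch also sorts in memory, which a real implementation working on billions of identifiers would have to replace with chunked sorting.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
import csv
import gzip
import io


def stream_pairs(raw_stream):
    """Yield (key, value) identifier pairs from a gzip-compressed CSV stream.

    Rows are decompressed and parsed one at a time, so the input never
    needs to be stored on disk in full.
    """
    with gzip.open(raw_stream, mode="rt", newline="") as text:
        for row in csv.DictReader(text):
            pmid, pmcid = row["PMID"], row["PMCID"]  # assumed column names
            if pmid and pmcid:
                yield pmid, pmcid  # forward mapping
                yield pmcid, pmid  # reverse mapping (assumption)


def write_sorted(pairs, out):
    """Write pairs as tab-separated lines in sorted key order.

    Sorting happens in memory here, which only suits small inputs.
    """
    for key, value in sorted(pairs):
        out.write(f"{key}\t{value}\n")


# Tiny in-memory sample standing in for a remote gzip stream.
sample = gzip.compress(b"PMID,PMCID,DOI\n2,PMC20,\n1,PMC10,\n")
out = io.StringIO()
write_sorted(stream_pairs(io.BytesIO(sample)), out)
print(out.getvalue())
```
                </preformat>
            </p>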
            <sec>
                <title>Integrate user dataset</title>
                <p>User data can be integrated into Biobtree. This feature gives data providers an alternative way to serve their data. The data should be gzipped and in an XML format compliant with the UniProt XML schema 
                    <ext-link ext-link-type="uri" xlink:href="ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot.xsd">definition</ext-link>. After the file path of the data is configured in a Biobtree configuration file, updating starts similarly:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">$ biobtree --d my_data update</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Updating on multiple computers</title>
                <p>Biobtree supports executing the update phase across multiple computers. This is useful when multiple processors are needed at the same time, for example with large datasets. The following two commands can be run on different computers, with the additional 
                    <italic toggle="yes">idx</italic> argument to guarantee that the produced files have unique names.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">$ biobtree --d uniparc --idx 1 update</styled-content>

                        <styled-content style="font-size:15px;">$ biobtree --d uniref50 --idx 2 update</styled-content>
                    </preformat>
                </p>
                <p>Although the update phase can run on multiple computers, all the produced files must be gathered in a single location before the next phase.</p>
            </sec>
        </sec>
        <sec>
            <title>Generate phase</title>
            <p>The purpose of this phase is to merge all the files produced in the update phase, preserving their sorted order, and to write the final keys and values to the LMDB database. Keys are identifiers and special keywords; values are the identifiers referred to by these keys, together with their dataset information. If the values for a key exceed a certain size threshold, they are saved in pages. The following command starts the generate phase.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <styled-content style="font-size:15px;">$ biobtree generate</styled-content>
                </preformat>
            </p>
            <p>The generated database is used in the subsequent web phase, but it can also be used directly. Example source code for using the database directly can be found on the project GitHub page.</p>
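            <p>The merge and paging steps of the generate phase can be sketched in Python as follows. This is an illustration under assumptions rather than the actual implementation: the page size of two is a hypothetical threshold (the real value is not stated here), the identifiers ID_A, ID_B and ID_C are placeholders, and the records are collected into a dictionary instead of being written to LMDB.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
import heapq
from itertools import groupby
from operator import itemgetter

PAGE_SIZE = 2  # hypothetical threshold, chosen only for this example


def merge_sorted(*sorted_sources):
    """Merge already-sorted (key, value) iterables, preserving sorted order."""
    return heapq.merge(*sorted_sources)


def to_records(merged, page_size=PAGE_SIZE):
    """Group values by key and split each value list into pages."""
    for key, group in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in group]
        pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
        yield key, pages


# Two sorted inputs standing in for files produced by the update phase.
file_a = [("tpi1", "ID_A"), ("vav1", "P15498")]
file_b = [("vav1", "ID_B"), ("vav1", "ID_C")]

records = dict(to_records(merge_sorted(file_a, file_b)))
for key, pages in records.items():
    print(key, pages)
```
                </preformat>
            </p>
            <p>Because every input is already sorted, the merge reads each input sequentially and never needs to hold a whole file in memory, which is why sorted output from the update phase matters.</p>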
        </sec>
        <sec>
            <title>Web phase</title>
            <p>The purpose of this phase is to provide web services and a web interface on top of the output of the generate phase. The following command starts the web phase.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <styled-content style="font-size:15px;">$ biobtree web</styled-content>
                </preformat>
            </p>
            <p>With this command the Biobtree web server is started instantly, serving the REST, gRPC and web interface services.</p>
            <sec>
                <title>REST service</title>
                <p>To query the produced database, Biobtree provides RESTful endpoints with JSON-formatted responses. Each dataset has a unique identifier and other meta information, such as its name and URL template. These dataset identifiers are used in all the services to distinguish the datasets. The meta information is retrieved via the following endpoint:</p>
                <p>&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;
                    <ext-link ext-link-type="uri" xlink:href="http://localhost:8888/ws/meta">http://localhost:8888/ws/meta</ext-link>
                </p>
                <p>The following are used to query single or multiple identifiers or special keywords:</p>
                <p>&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;
                    <ext-link ext-link-type="uri" xlink:href="http://localhost:8888/ws/?idlist=vav_human">http://localhost:8888/ws/?idlist=vav_human</ext-link>
                </p>
                <p>&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;
                    <ext-link ext-link-type="uri" xlink:href="http://localhost:8888/ws/?idlist=vav_human,tpi1,brca2">http://localhost:8888/ws/?idlist=vav_human,tpi1,brca2</ext-link>
                </p>
                <p>To make a paging query for a certain identifier:</p>
                <p>&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;
                    <ext-link ext-link-type="uri" xlink:href="http://localhost:8888/ws/?id=vav_human&amp;dataset=1&amp;page=1">http://localhost:8888/ws/?id=vav_human&amp;dataset=1&amp;page=1</ext-link>
                </p>
                <p>To make a filtering query based on a dataset:</p>
                <p>&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;
                    <ext-link ext-link-type="uri" xlink:href="http://localhost:8888/ws/?id=vav_human&amp;dataset=1&amp;filters=102">http://localhost:8888/ws/?id=vav_human&amp;dataset=1&amp;filters=102</ext-link>
                </p>
                <p>To make a paging query with active filtering:</p>
                <p>&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;
                    <ext-link ext-link-type="uri" xlink:href="http://localhost:8888/ws/?id=vav_human&amp;dataset=1&amp;filters=102&amp;page=1">http://localhost:8888/ws/?id=vav_human&amp;dataset=1&amp;filters=102&amp;page=1</ext-link>
                </p>
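                <p>As a convenience, the query URLs above can be assembled programmatically. The following Python sketch is an illustration only: the helper names are hypothetical, it assumes the server runs at the default local address shown above, and it makes no assumption about the structure of the JSON response.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8888/ws/"  # local address used in the examples above


def idlist_url(identifiers):
    """Build a query URL for one or more identifiers or special keywords."""
    query = urlencode({"idlist": ",".join(identifiers)}, safe=",")
    return BASE_URL + "?" + query


def entry_url(identifier, dataset, filters=None, page=None):
    """Build a paging and/or filtering query URL for a single identifier."""
    params = {"id": identifier, "dataset": dataset}
    if filters is not None:
        params["filters"] = filters
    if page is not None:
        params["page"] = page
    return BASE_URL + "?" + urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode the JSON body (needs a running Biobtree web phase)."""
    with urlopen(url) as response:
        return json.load(response)


print(idlist_url(["vav_human", "tpi1", "brca2"]))
print(entry_url("vav_human", 1, filters=102, page=1))
```
                    </preformat>
                </p>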
            </sec>
            <sec>
                <title>gRPC service</title>
                <p>RESTful services are often JSON-based and are convenient for JSON-based applications. For applications that are not JSON-based, however, they require an extra serialization and deserialization step. To address this, Biobtree provides a gRPC service with the same functionality as its RESTful service. Sample code for using gRPC in different languages can be found on the project GitHub page. The following is a snippet of the Biobtree gRPC service definition:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">service BiobtreeService {

rpc Get     (BiobtreeGetRequest)     returns (BiobtreeGetResponse);
rpc GetPage (BiobtreeGetPageRequest) returns (BiobtreeGetPageResponse);
rpc Filter  (BiobtreeFilterRequest)  returns (BiobtreeFilterResponse);
rpc Meta    (BiobtreeMetaRequest)    returns (BiobtreeMetaResponse);

}</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Web interface</title>
                <p>The web interface allows the user to visualize the produced database via the RESTful service. Once the web phase has started, it can be accessed in a browser at the following address:</p>
                <p>&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;
                    <ext-link ext-link-type="uri" xlink:href="http://localhost:8888/ui">http://localhost:8888/ui</ext-link>
                </p>
                <p>The web interface supports searching for multiple identifiers and special keywords, visualizing and filtering results, and executing bulk queries. For each entry on the result page, it provides a URL to the website of the resource where the data were originally produced. 
                    <xref ref-type="fig" rid="f1">Figure 1</xref> and 
                    <xref ref-type="fig" rid="f2">Figure 2</xref> show the main page and the result page of the web interface.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Web interface main page.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/19605/fe81097f-ee99-4c2f-967d-dfbdbcf8cb99_figure1.gif"/>
                </fig>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>Web interface result page.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/19605/fe81097f-ee99-4c2f-967d-dfbdbcf8cb99_figure2.gif"/>
                </fig>
            </sec>
        </sec>
        <sec>
            <title>Operation</title>
            <p>The Biobtree executable is available from the 
                <ext-link ext-link-type="uri" xlink:href="http://www.github.com/tamerh/biobtree/">GitHub</ext-link> page for Windows, macOS and Linux. With the default datasets and configuration, Biobtree uses up to 4 GB of memory. For large datasets, a machine with more RAM (for example 16 GB) is advised in order to speed up the update and generate phases. RAM usage for these phases can be controlled in the configuration file via the 
                <italic toggle="yes">kvgenChunkSize</italic> and 
                <italic toggle="yes">batchSize</italic> variables. By default, Biobtree uses all available CPU cores when needed. This behaviour can be restricted on the command line with the 
                <italic toggle="yes">maxcpu</italic> argument.</p>
        </sec>
        <sec>
            <title>Benchmarks</title>
            <p>Software benchmarks should often be taken with a grain of salt, especially where the input data are large, because benchmark results can be affected by many factors such as the processor, operating system, storage, network speed, input data and application parameters. With this in mind, the purpose of the Biobtree benchmarks is mainly to show the overall capabilities and resource usage of Biobtree. 
                <xref ref-type="table" rid="T2">Table 2</xref> shows the benchmark details and results. Benchmarks were primarily run at the 
                <ext-link ext-link-type="uri" xlink:href="https://www.digitalocean.com/">DigitalOcean</ext-link> London datacentres using CPU-optimized droplets with block storage volumes. The 100,000 sample queries used in the benchmarks can be found on the project 
                <ext-link ext-link-type="uri" xlink:href="https://www.github.com/tamerh/biobtree">GitHub</ext-link> page.</p>
            <table-wrap id="T2" orientation="portrait" position="anchor">
                <label>Table 2. </label>
                <caption>
                    <title>Benchmark results. </title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1">-</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-1</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-2</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-3</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-4</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-5</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-6</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-7</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-8</th>
                            <th align="center" colspan="1" rowspan="1">Benchmark-9</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Description</td>
                            <td align="center" colspan="1" rowspan="1">Runs all
                                <break/>phases
                                <break/>and 100K
                                <break/>queries
                                <break/>for default
                                <break/>datasets
                                <break/>on a MacBook
                                <break/>laptop.</td>
                            <td align="center" colspan="1" rowspan="1">Runs default
                                <break/>dataset
                                <break/>update phase
                                <break/>on
                                <break/>DigitalOcean.</td>
                            <td align="center" colspan="1" rowspan="1">Runs uniref50
                                <break/>update phase
                                <break/>on
                                <break/>DigitalOcean.</td>
                            <td align="center" colspan="1" rowspan="1">Runs uniref90
                                <break/>update phase
                                <break/>on
                                <break/>DigitalOcean.</td>
                            <td align="center" colspan="1" rowspan="1">Runs uniref100
                                <break/>update phase
                                <break/>on
                                <break/>DigitalOcean.</td>
                            <td align="center" colspan="1" rowspan="1">Runs uniprot
                                <break/>unreviewed
                                <break/>update phase
                                <break/>on
                                <break/>DigitalOcean.</td>
                            <td align="center" colspan="1" rowspan="1">Runs uniparc
                                <break/>update phase
                                <break/>on
                                <break/>DigitalOcean.</td>
                            <td align="center" colspan="1" rowspan="1">Runs generate
                                <break/>phase
                                <break/>for all data
                                <break/>on
                                <break/>DigitalOcean.</td>
                            <td align="center" colspan="1" rowspan="1">Runs 100K
                                <break/>queries against
                                <break/>all data
                                <break/>on
                                <break/>DigitalOcean.</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Operating System</td>
                            <td align="center" colspan="1" rowspan="1">macOS High
                                <break/>Sierra</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                            <td align="center" colspan="1" rowspan="1">Ubuntu
                                <break/>16.04.5 x64</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">CPU</td>
                            <td align="center" colspan="1" rowspan="1">4</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                            <td align="center" colspan="1" rowspan="1">8</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">RAM</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                            <td align="center" colspan="1" rowspan="1">16GB</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Update Phase Average
                                <break/>CPU Usage</td>
                            <td align="center" colspan="1" rowspan="1">309%</td>
                            <td align="center" colspan="1" rowspan="1">556%</td>
                            <td align="center" colspan="1" rowspan="1">201%</td>
                            <td align="center" colspan="1" rowspan="1">201%</td>
                            <td align="center" colspan="1" rowspan="1">204%</td>
                            <td align="center" colspan="1" rowspan="1">303%</td>
                            <td align="center" colspan="1" rowspan="1">351%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Update Phase Average
                                <break/>RAM Usage</td>
                            <td align="center" colspan="1" rowspan="1">9%</td>
                            <td align="center" colspan="1" rowspan="1">84%</td>
                            <td align="center" colspan="1" rowspan="1">39%</td>
                            <td align="center" colspan="1" rowspan="1">36%</td>
                            <td align="center" colspan="1" rowspan="1">36%</td>
                            <td align="center" colspan="1" rowspan="1">36%</td>
                            <td align="center" colspan="1" rowspan="1">37%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Update Phase Elapsed
                                <break/>Duration</td>
                            <td align="center" colspan="1" rowspan="1">13m</td>
                            <td align="center" colspan="1" rowspan="1">5m</td>
                            <td align="center" colspan="1" rowspan="1">1h24m</td>
                            <td align="center" colspan="1" rowspan="1">2h3m</td>
                            <td align="center" colspan="1" rowspan="1">2h43m</td>
                            <td align="center" colspan="1" rowspan="1">8h27m</td>
                            <td align="center" colspan="1" rowspan="1">8h37m</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Update Phase Files Size</td>
                            <td align="center" colspan="1" rowspan="1">838MB</td>
                            <td align="center" colspan="1" rowspan="1">823MB</td>
                            <td align="center" colspan="1" rowspan="1">3GB</td>
                            <td align="center" colspan="1" rowspan="1">2.8GB</td>
                            <td align="center" colspan="1" rowspan="1">2.4GB</td>
                            <td align="center" colspan="1" rowspan="1">25GB</td>
                            <td align="center" colspan="1" rowspan="1">38GB</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Generate Phase Average
                                <break/>CPU Usage</td>
                            <td align="center" colspan="1" rowspan="1">141%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">164%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Generate Phase Average
                                <break/>RAM Usage</td>
                            <td align="center" colspan="1" rowspan="1">19%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">90%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Generate Phase Elapsed
                                <break/>Duration</td>
                            <td align="center" colspan="1" rowspan="1">18m</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">8h5m</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Generate Phase Files Size</td>
                            <td align="center" colspan="1" rowspan="1">3.8GB</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">392GB</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Web Phase Average CPU
                                <break/>Usage While Running 100K
                                <break/>Queries</td>
                            <td align="center" colspan="1" rowspan="1">113%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">3%</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Web Phase Average RAM
                                <break/>Usage While Running 100K
                                <break/>Queries</td>
                            <td align="center" colspan="1" rowspan="1">2%</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">36%</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">100K Consecutive Queries
                                <break/>Average Response Time
                                <break/>Queries Sent From the Same
                                <break/>Machine</td>
                            <td align="center" colspan="1" rowspan="1">0.3ms</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">3ms</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">100K Consecutive Queries
                                <break/>Average Response Time
                                <break/>Queries Sent From Outside
                                <break/>Network and Machine</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">20ms</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Total Keys</td>
                            <td align="center" colspan="1" rowspan="1">60M</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">3.19B</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">Total Values</td>
                            <td align="center" colspan="1" rowspan="1">131M</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                            <td align="center" colspan="1" rowspan="1">17.07B</td>
                            <td align="center" colspan="1" rowspan="1">-</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
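            <p>The article does not describe how the average response times in Table 2 were collected. The following Python sketch shows one plausible way to time a sequence of consecutive queries; the function names average_response_ms and fake_query are hypothetical, and fake_query is only a stand-in so that the sketch runs without a server.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
import time


def average_response_ms(send_query, identifiers):
    """Send the queries one after another and return the mean latency in ms."""
    if not identifiers:
        raise ValueError("at least one identifier is required")
    total_seconds = 0.0
    for identifier in identifiers:
        started = time.perf_counter()
        send_query(identifier)  # the result is ignored; only timing matters
        total_seconds += time.perf_counter() - started
    return 1000.0 * total_seconds / len(identifiers)


def fake_query(identifier):
    """Stand-in for a real request (assumption: no server is running here)."""
    return {"id": identifier}


mean_ms = average_response_ms(fake_query, ["vav_human"] * 1000)
print(f"mean response time: {mean_ms:.4f} ms")
```
                </preformat>
            </p>
            <p>To time a running instance, fake_query would be replaced by a function that issues a real request, for example with urllib.request.urlopen against the local web service.</p>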
        </sec>
        <sec sec-type="discussion">
            <title>Discussion</title>
            <p>The benchmarks show that Biobtree
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup> produces its LMDB database output in acceptable times. It is worth considering how Biobtree would behave if UniProt provided data tens of times larger. Clearly, more disk space would be needed.</p>
            <p>Given enough disk space, the next obstacle would arise during the update phase, because UniProt currently provides a single gzip-compressed file for each dataset and Biobtree reads each file as a stream from beginning to end. Gzip does not allow random access to a file unless checkpoints are defined. This characteristic prevents a single large file from being split and processed in parallel, so additional computing resources cannot be used even when they are available. Two solutions could address this issue. The first would be for UniProt to publish its datasets in a form suitable for parallel processing, for example as separately compressed split files inside a tar archive. The second would be to implement new functionality in Biobtree that saves and decompresses these files to local disk and then accesses the decompressed files in parallel.</p>
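            <p>The second solution can be illustrated with a short Python sketch. It is not Biobtree code: it assumes line-oriented input, uses hypothetical function names, and counts lines as a placeholder for real parsing. The file is first stream-decompressed to disk, then split into byte ranges that end on line boundaries, and the ranges are processed concurrently.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
import gzip
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def decompress_to_disk(gz_path, out_path):
    """Stream-decompress a gzip file to a plain file on local disk."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def line_aligned_ranges(path, parts):
    """Split a file into byte ranges that each end on a line boundary."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as handle:
        for i in range(1, parts):
            handle.seek(max(cuts[-1], size * i // parts))
            handle.readline()  # advance to the end of the current line
            cuts.append(handle.tell())
    cuts.append(size)
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b != a]


def count_lines(path, byte_range):
    """Placeholder worker: count the newlines inside one byte range."""
    start, end = byte_range
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start).count(b"\n")


def parallel_line_count(path, parts=4):
    """Process all ranges concurrently and combine the partial results."""
    ranges = line_aligned_ranges(path, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(lambda r: count_lines(path, r), ranges))


# Small self-contained demonstration on a generated sample file.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "sample.gz")
plain_path = os.path.join(workdir, "sample.txt")
with gzip.open(gz_path, "wt") as out:
    for i in range(1000):
        out.write(f"entry-{i}\n")
decompress_to_disk(gz_path, plain_path)
total_lines = parallel_line_count(plain_path, parts=4)
print(total_lines)
```
                </preformat>
            </p>
            <p>Threads keep the sketch short. For CPU-bound parsing in Python, a process pool would be the more natural choice, at the cost of the decompressed copy occupying additional disk space in either case.</p>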
            <p>A further obstacle would arise during the generate phase, since the update phase would produce more files. The generate phase would struggle to merge all of these files, which could lead to much longer output generation times. To address this obstacle, a new phase could be implemented that runs before the generate phase and merges the files coming from the update phase.</p>
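            <p>Such a pre-merge phase could be sketched as follows in Python. This is an illustration under stated assumptions, not the Biobtree implementation: the intermediate files are assumed to be text files whose lines are already sorted, and the names merge_sorted_files and premerge are hypothetical. Groups of files are merged in a streaming fashion, so the later phase has fewer inputs to handle.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
import heapq
import os
import tempfile


def merge_sorted_files(input_paths, output_path):
    """Merge already-sorted text files into one sorted file, line by line."""
    handles = [open(path, "r", encoding="utf-8") for path in input_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            # heapq.merge keeps only one pending line per input in memory.
            out.writelines(heapq.merge(*handles))
    finally:
        for handle in handles:
            handle.close()


def premerge(paths, group_size, workdir):
    """Reduce many sorted files to fewer by merging them in groups."""
    merged = []
    for n, start in enumerate(range(0, len(paths), group_size)):
        target = os.path.join(workdir, f"merged-{n}.txt")
        merge_sorted_files(paths[start:start + group_size], target)
        merged.append(target)
    return merged


# Demonstration with four small sorted chunks of tab-separated pairs.
workdir = tempfile.mkdtemp()
chunks = [["a\t1\n", "d\t4\n"], ["b\t2\n", "e\t5\n"], ["c\t3\n"], ["f\t6\n"]]
paths = []
for n, lines in enumerate(chunks):
    path = os.path.join(workdir, f"chunk-{n}.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(lines)
    paths.append(path)

merged_paths = premerge(paths, group_size=2, workdir=workdir)
with open(merged_paths[0], encoding="utf-8") as merged_file:
    first_merged = merged_file.read().splitlines()
print(len(merged_paths), first_merged)
```
                </preformat>
            </p>
            <p>The group size trades the number of open file handles per merge against the number of files left for the following phase.</p>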
        </sec>
        <sec>
            <title>Limitations</title>
            <p>Although duplicate values for each key are discarded during the update phase for each dataset, the generate phase can, in rare cases, produce duplicate values. The user needs to discard these duplicate records manually if they occur. Another limitation is that special keywords must be fully specified in queries, including all space characters.</p>
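            <p>Discarding duplicates on the client side is straightforward. The following Python sketch assumes that the values returned for one key have already been collected into a list; the function name and the placeholder values are hypothetical and do not come from Biobtree output.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
```python
def drop_duplicate_values(values):
    """Return the values with later repeats removed, keeping first-seen order."""
    # dict preserves insertion order, so its keys act as an ordered set.
    return list(dict.fromkeys(values))


# Placeholder values for one key; they are made up for illustration only.
raw_values = ["value-a", "value-b", "value-a", "value-c", "value-b"]
clean_values = drop_duplicate_values(raw_values)
print(clean_values)
```
                </preformat>
            </p>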
        </sec>
        <sec>
            <title>Future work</title>
            <p>These limitations can be addressed and new functionality added in the future. For instance, additional bioinformatics datasets such as Ensembl
                <sup>
                    <xref ref-type="bibr" rid="ref-8">8</xref>
                </sup> or ENA
                <sup>
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup> can be integrated. Another feature would be sorting result values based on a certain criterion.</p>
        </sec>
        <sec>
            <title>Data availability</title>
            <p>All data underlying the results are available as part of the article and no additional source data are required.</p>
        </sec>
        <sec>
            <title>Software availability</title>
            <p>
                <bold>Source code and binaries available at:</bold> 
                <ext-link ext-link-type="uri" xlink:href="https://www.github.com/tamerh/biobtree">https://www.github.com/tamerh/biobtree</ext-link>.</p>
            <p>
                <bold>Archived source code at time of publication:</bold> 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.2547047">https://doi.org/10.5281/zenodo.2547047</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup>.</p>
            <p>
                <bold>License:</bold> 
                <ext-link ext-link-type="uri" xlink:href="https://opensource.org/licenses/BSD-3-Clause">BSD 3-Clause "New" or "Revised" license</ext-link>.</p>
        </sec>
    </body>
    <back>
        <ref-list>
            <ref id="ref-1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>G&#x00fc;r</surname>
                            <given-names>T</given-names>
                        </name>
</person-group>:
                    <article-title>tamerh/biobtree: biobtree v1.0.0-rc2 (Version v1.0.0-rc2).</article-title>
                    <source>

                        <italic toggle="yes">Zenodo.</italic>
</source>
                    <year>2019</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.doi.org/10.5281/zenodo.2547047">http://www.doi.org/10.5281/zenodo.2547047</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hastings</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Owen</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dekker</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>ChEBI in 2016: Improved services and an expanding collection of metabolites.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2016</year>;<volume>44</volume>(<issue>D1</issue>):<fpage>D1214</fpage>&#x2013;<lpage>D1219</lpage>.
                    <pub-id pub-id-type="pmid">26467479</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkv1031</pub-id>
                    <pub-id pub-id-type="pmcid">4702775</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Yates</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Braschi</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gray</surname>
                            <given-names>KA</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Genenames.org: the HGNC and VGNC resources in 2017.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2017</year>;<volume>45</volume>(<issue>D1</issue>):<fpage>D619</fpage>&#x2013;<lpage>D625</lpage>.
                    <pub-id pub-id-type="pmid">27799471</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkw1033</pub-id>
                    <pub-id pub-id-type="pmcid">5210531</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wishart</surname>
                            <given-names>DS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Feunang</surname>
                            <given-names>YD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Marcu</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>HMDB 4.0: the human metabolome database for 2018.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2018</year>;<volume>46</volume>(<issue>D1</issue>):<fpage>D608</fpage>&#x2013;<lpage>D617</lpage>.
                    <pub-id pub-id-type="pmid">29140435</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkx1089</pub-id>
                    <pub-id pub-id-type="pmcid">5753273</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Mitchell</surname>
                            <given-names>AL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Attwood</surname>
                            <given-names>TK</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Babbitt</surname>
                            <given-names>PC</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>InterPro in 2019: improving coverage, classification and access to protein sequence annotations.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2019</year>;<volume>47</volume>(<issue>D1</issue>):<fpage>D351</fpage>&#x2013;<lpage>D360</lpage>.
                    <pub-id pub-id-type="pmid">30398656</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gky1100</pub-id>
                    <pub-id pub-id-type="pmcid">6323941</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <collab>Europe PMC Consortium</collab>:
                    <article-title>Europe PMC: a full-text literature database for the life sciences and platform for innovation.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2015</year>;<volume>43</volume>(<issue>Database issue</issue>):<fpage>D1042</fpage>&#x2013;<lpage>D1048</lpage>.
                    <pub-id pub-id-type="pmid">25378340</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gku1061</pub-id>
                    <pub-id pub-id-type="pmcid">4383902</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <collab>The UniProt Consortium</collab>:
                    <article-title>UniProt: a worldwide hub of protein knowledge.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2019</year>;<volume>47</volume>(<issue>D1</issue>):<fpage>D506</fpage>&#x2013;<lpage>D515</lpage>.
                    <pub-id pub-id-type="pmid">30395287</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gky1049</pub-id>
                    <pub-id pub-id-type="pmcid">6323992</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Zerbino</surname>
                            <given-names>DR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Achuthan</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Akanni</surname>
                            <given-names>W</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Ensembl 2018.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2018</year>;<volume>46</volume>(<issue>D1</issue>):<fpage>D754</fpage>&#x2013;<lpage>D761</lpage>.
                    <pub-id pub-id-type="pmid">29155950</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkx1098</pub-id>
                    <pub-id pub-id-type="pmcid">5753206</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Harrison</surname>
                            <given-names>PW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Alako</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Amid</surname>
                            <given-names>C</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The European Nucleotide Archive in 2018.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2019</year>;<volume>47</volume>(<issue>D1</issue>):<fpage>D84</fpage>&#x2013;<lpage>D88</lpage>.
                    <pub-id pub-id-type="pmid">30395270</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gky1078</pub-id>
                    <pub-id pub-id-type="pmcid">6323982</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report46335">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.19605.r46335</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Lampa</surname>
                        <given-names>Samuel</given-names>
                    </name>
                    <xref ref-type="aff" rid="r46335a1">1</xref>
                    <xref ref-type="aff" rid="r46335a2">2</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6740-9212</uri>
                </contrib>
                <aff id="r46335a1">
                    <label>1</label>Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden</aff>
                <aff id="r46335a2">
                    <label>2</label>Savantic AB, Stockholm, Sweden</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>15</day>
                <month>4</month>
                <year>2019</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Lampa S</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport46335" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.17927.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The article describes a command-line tool, Biobtree, which is claimed to allow processing of relations between bioinformatics datasets based on various characteristics such as identifiers and keywords.</p>
            <p> </p>
            <p>The manuscript describes the tool clearly in technical terms, making it evident what it does and how it is supposed to be used.</p>
            <p> </p>
            <p>I was also able to install and run the tool easily and without problems on my laptop (i5 CPU, 8 GB RAM and 10-15 GB free hard drive space, Xubuntu 16.04 64-bit). It provides a simple but good-looking and easy-to-use web interface.</p>
            <p> </p>
            <p>However, I see at least two major issues with the tool and manuscript that need to be thoroughly addressed to make them acceptable.</p>
            <p> </p>
            <p> 
                <bold>Main problem 1: Visualization?</bold>
            </p>
            <p> </p>
            <p>Firstly, the title claims that the tool visualizes the database it produces. Perhaps I'm missing something, but I have not found any visualization in the tool apart from a form of search result listing. I don't think this is enough to be called "visualization", especially as it is unclear how the current form of output is supposed to be used in a concrete biological use case. With the current wording, I would expect something more graphical, like a Graphviz-like graph view of dataset relations.</p>
            <p> </p>
            <p> Suggested edits to make the tool and paper acceptable: 
                <list list-type="bullet">
                    <list-item>
                        <p>Provide graphical visualization beyond results listings (or explain how to show them, if I have missed them), or else remove "visualization" from the title and other places.</p>
                    </list-item>
                    <list-item>
                        <p>Use this/these visualizations in the use cases/demonstrators requested below, to explain how they contribute to solving concrete biological problems.</p>
                    </list-item>
                </list>
            </p>
            <p>
                <bold>Main problem 2: Lack of context and discussion of biological relevance</bold>
            </p>
            <p> </p>
            <p>The second, and most important, problem with the manuscript is that it does not provide a clear enough description of what 
                <italic>biological</italic> problem it is solving. Nor does it provide an overview of existing tools and solutions in this field. Right now, the manuscript only states what the tool can do in technical terms. It reads like a (well-written) user guide or README file, but not yet like a scientific paper. To help potential new users understand why they might need this tool, it needs to be put in context and compared with other existing tools.</p>
            <p> </p>
            <p>In my view, the following points need to be thoroughly addressed for the manuscript to be acceptable: 
                <list list-type="order">
                    <list-item>
                        <p>In the introduction: Elaborate on the field of mapping/visualising dataset relations, mentioning relevant existing tools, the typical problems in the field, and the particular problem Biobtree solves.</p>
                    </list-item>
                    <list-item>
                        <p>Explain a few examples of 
                            <italic>biological</italic> problems that can be solved with this tool, or type of tool.</p>
                    </list-item>
                    <list-item>
                        <p>In the results, for example: Provide at least one, and ideally two or three, biological demonstrators or use cases that can be addressed with the tool; these may be simple but should be relevant. Provide complete instructions on how to re-run the demo(s), and provide their outputs as figures or diagrams, together with a description of how these were produced. In this way, both reviewers and users can make sure that they understand how to operate the tool.</p>
                    </list-item>
                    <list-item>
                        <p>In the discussion: Connect back to the problem the tool is addressing and explain how the problem was solved, restating the relevance of this specific tool compared to other existing tools and the improvement it provides to end users trying to solve biological problems, as exemplified by the demonstrators or use cases.</p>
                    </list-item>
                </list>
            </p>
            <p>
                <bold>Language issues</bold>
            </p>
            <p> </p>
            <p>The manuscript also contains quite a number of language issues. I list a few suggestions below as examples, but further proofreading or language editing is highly recommended to catch any remaining issues: 
                <list list-type="order">
                    <list-item>
                        <p>
                            <bold>Methods</bold> section:</p>
                        <p> "in GO programming language" -&gt; "in the Go programming language"</p>
                        <p> (Note the "the" and that only G is uppercase in "Go").</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Update phase</bold> section:</p>
                        <p> "to LMDB" -&gt; "to the LMDB"</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Update phase</bold> section:</p>
                        <p> "Updating reads selected datasets as a stream"</p>
                        <p> I don't understand this sentence. Please language-check it.</p>
                    </list-item>
                    <list-item>
                        <p>"is used in next" -&gt; "is used in the next"</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Generate phase</bold> section:</p>
                        <p> "the project github page" -&gt; "the project's GitHub page"</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Web phase </bold>section:</p>
                        <p> "starts web phase" -&gt; "starts the web phase"</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Web interface</bold> section:</p>
                        <p> "The Web interface allow user" -&gt; "The web interface allows the user"</p>
                    </list-item>
                    <list-item>
                        <p>
                            <bold>Operation </bold>section:</p>
                        <p> "Biobtree executable" -&gt; "The biobtree executable".</p>
                    </list-item>
                </list>
            </p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Scientific workflow tools, Cheminformatics, Semantic web.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report45074">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.19605.r45074</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Shokhirev</surname>
                        <given-names>Maxim N.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r45074a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-8379-8657</uri>
                </contrib>
                <aff id="r45074a1">
                    <label>1</label>Razavi Newman Integrative Genomics and Bioinformatics Core, Salk Institute for Biological Studies, La Jolla, CA, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>7</day>
                <month>3</month>
                <year>2019</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2019 Shokhirev MN</copyright-statement>
                <copyright-year>2019</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport45074" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.17927.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>While it is important to create a consistent and queryable database of biological identifiers, it is unclear what advances this tool brings to the field. For example, how does this tool compare to other queryable database tools such as mygene.info or BioMart? The paper would greatly benefit from a comparison to these and other such tools.</p>
            <p> </p>
            <p>I downloaded and ran the tool, but I could not get through the update phase: when I run ./biobtree update, it appears to hang after uniprot_reviewed finishes, without any further messages. When I rerun using biobtree --d uniprot_reviewed update, it finishes, but with an error:</p>
            <p> </p>
            <p> Error while reading file-&gt; .//out/index/0_13.938476000.gz</p>
            <p> panic: gzip: invalid header</p>
            <p> </p>
            <p> I tried running generate and web after that regardless, but couldn't get it to work:</p>
            <p> </p>
            <p> panic: mdb_txn_commit: MDB_BAD_TXN: Transaction must abort, has a child, or is invalid</p>
            <p> </p>
            <p> Error while reading meta information file which should be produced with generate command. Please make sure you did previous steps correctly.</p>
            <p> </p>
            <p> The author needs to debug/test their code to ensure that it can be used by others.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Bioinformatics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment4474-45074">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Gur</surname>
                            <given-names>Tamer</given-names>
                        </name>
                        <aff>EMBL European Bioinformatics Institute, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>10</day>
                    <month>3</month>
                    <year>2019</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thank you for reviewing the article. I agree that several similar tools exist, with different datasets and functionalities, such as BioMart and mygene.info. However, this tool can still complement them, for the following main reasons: 
                    <list list-type="bullet">
                        <list-item>
                            <p>Biobtree can run on a local machine. This can be especially useful when a large number of requests needs to be performed. For instance, the documentation of the similar UniProt 
                                <ext-link ext-link-type="uri" xlink:href="https://www.uniprot.org/uploadlists/">tool</ext-link> currently 
                                <ext-link ext-link-type="uri" xlink:href="https://www.uniprot.org/help/uploadlists">suggests</ext-link> either splitting the requests or downloading the underlying data when the number of requests exceeds 50K. Such limits on bulk requests are sensible for fair usage of a public service, and a locally runnable tool such as Biobtree can be better suited to these workloads.</p>
                        </list-item>
                        <list-item>
                            <p>Users' custom datasets can be integrated.</p>
                        </list-item>
                        <list-item>
                            <p>The tool provides a new, intuitive web interface.</p>
                        </list-item>
                    </list> In relation to the reported errors, I have added a demo of the tool in case such errors happen again. I have also added integration tests, which run periodically on Linux, macOS and Windows via the Azure DevOps platform. These tests are publicly accessible, and links to the tests and the demo can be found on the GitHub page. Based on these tests, the tool appears to be working as expected. I believe that the hanging process is a specific issue or bug, which I am happy to resolve if I receive more information. The other reported problems are most probably due to the hung process having exited prematurely. Either starting in a new folder or passing the --clean parameter should resolve the issue.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
