Software Tool Article

Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 04 Feb 2019

This article is included in the Bioinformatics gateway.

This article is included in the EMBL-EBI collection.

Abstract

Due to their nature, bioinformatics datasets are often closely related to each other. For this reason, search, mapping and visualization of these relations are often performed manually or programmatically via identifiers or special keywords such as gene symbols. Although various tools exist for these situations, the growing volume of bioinformatics datasets and the emergence of new software tools and approaches motivate new solutions. To provide a new tool for these cases, I present the Biobtree bioinformatics tool. Biobtree fetches and indexes identifiers and special keywords, together with their related identifiers, from supported datasets and optionally from user-defined datasets. It provides a web interface, web services and direct access to a single uniform database output based on the B+ tree data structure. Biobtree can handle billions of identifiers and runs from a single executable file with no installation or dependencies required. It also aims to provide a relatively small codebase for easy maintenance, addition of new features and extension to larger datasets. Biobtree is available to download from GitHub.

Keywords

bioinformatics, identifiers, search, mapping, visualization

Introduction

Bioinformatics datasets often consist of entries, where each entry is represented by a unique identifier. Depending on the dataset, each entry contains various types of information such as sequence data, biological function, chemical structure or literature references. In addition, entries often contain cross-references to entries in other datasets via identifiers. Take as an example the proto-oncogene vav protein in humans, which is encoded by the VAV1 gene. If we display this protein on the UniProt website, we see cross-references to many other datasets. These cross-references represent the relations of datasets with each other. Various tools exist to deal with such data; however, the growing volume of bioinformatics datasets and the emergence of new software tools and analysis approaches motivate new solutions. Biobtree [1], presented herein, is capable of rapidly processing large numbers of unique entry identifiers and the related identifiers that are specified via cross-reference data.

In some datasets, in addition to unique identifiers, there is information that is strongly related to entries but not necessarily unique to each entry. Species names and UniProt secondary accessions are examples of this type. Such information is a second data source for Biobtree. In Biobtree, these types are called special keywords, and each of them can be related to multiple entries among the datasets.

Biobtree retrieves all these identifiers, related identifiers and special keywords from various bioinformatics resources and stores them in a single database. The data resources currently used are ChEBI [2], HGNC [3], HMDB [4], InterPro [5], Europe PMC [6] and UniProt [7]. Table 1 shows details of these datasets.

Table 1. List of datasets.

| Dataset | Description | File Name | Location | Format | Special Keywords |
|---|---|---|---|---|---|
| ChEBI | ChEBI reference accession data | database_accession.tsv | ftp.ebi.ac.uk/chebi/Flat_file_tab_delimited/ | TSV | - |
| HGNC | Human gene nomenclature | hgnc_complete_set.json | ftp.ebi.ac.uk/genenames/new/json/ | JSON | name, symbol |
| HMDB | Human metabolome database | hmdb_metabolites.zip | http://www.hmdb.ca/system/downloads/current/ | XML | name, synonyms |
| InterPro | Protein Families | interpro.xml.gz | ftp://ftp.ebi.ac.uk/pub/databases/interpro/current | XML | - |
| Literature mappings | Literature pmid, pmcid and doi mappings | PMID_PMCID_DOI.csv.gz | ftp://ftp.ebi.ac.uk/pub/databases/pmc/DOI/ | CSV | - |
| Taxonomy | NCBI Taxonomy | taxonomy.xml.gz | ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/ | XML | scientificName |
| Uniparc | UniProt Sequence Archive | uniparc_all.xml.gz | ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniparc/ | XML | - |
| UniProt reviewed | UniProt Knowledgebase reviewed | uniprot_sprot.xml.gz | ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/ | XML | accession |
| UniProt unreviewed | UniProt Knowledgebase unreviewed | uniprot_trembl.xml.gz | ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/ | XML | accession |
| Uniref50 | UniProt sequence clusters | uniref50.xml.gz | ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref50/ | XML | - |
| Uniref90 | UniProt sequence clusters | uniref90.xml.gz | ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref90/ | XML | - |
| Uniref100 | UniProt sequence clusters | uniref100.xml.gz | ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniref/uniref100/ | XML | - |

Based on the stored data, Biobtree provides search, mapping and visualization functionality via web services or a web interface. For instance, it is possible to access all UniProt protein entries belonging to a gene name, or all Ensembl [8] genome transcript identifiers and ENA [9] sequence identifiers that map to a protein identifier. These relations, determined via identifiers, are stored bidirectionally, so every lookup can also be performed in the opposite direction.
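
The bidirectional storage described above can be illustrated with a minimal in-memory sketch. It is not Biobtree's implementation, which stores its data in LMDB; the class name, method names and the identifier "TRANSCRIPT_1" are hypothetical and only serve the illustration.

```python
from collections import defaultdict


class BidirectionalIndex:
    """Toy in-memory index that stores every relation in both directions."""

    def __init__(self):
        # key -> set of (dataset, identifier) pairs the key is related to
        self._links = defaultdict(set)

    def add(self, source, source_dataset, target, target_dataset):
        # Store the forward link and its reverse so lookups work both ways.
        self._links[source].add((target_dataset, target))
        self._links[target].add((source_dataset, source))

    def lookup(self, key):
        # Sorted so that the result order is stable.
        return sorted(self._links.get(key, set()))


index = BidirectionalIndex()
index.add("vav_human", "uniprot", "VAV1", "hgnc")
# "TRANSCRIPT_1" is a placeholder, not a real Ensembl identifier.
index.add("vav_human", "uniprot", "TRANSCRIPT_1", "ensembl")

print(index.lookup("vav_human"))  # relations from the protein entry
print(index.lookup("VAV1"))       # the same relation, followed in reverse
```

Because each relation is written twice, a query that starts from the gene symbol finds the protein entry without any additional mapping table.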

Biobtree is managed from a single executable file for each major operating system and requires no installation or compilation. As its database, Biobtree uses LMDB, a key-value store based on the B+ tree data structure. LMDB provides fast batch inserts and reads and allows effective operation on a large number of records. LMDB is embedded into the Biobtree executable, so it does not require a separate installation.

Methods

Implementation

Biobtree [1] is implemented in the Go programming language. To use the C-based LMDB from Go, the lmdb-go binding library is used. The web interface is implemented in JavaScript with the Vue and Bulma web frameworks. The Biobtree workflow consists of three phases, named update, generate and web, which are explained in the following sections. These phases are controlled by the Biobtree command line interface (CLI).

Update phase

The purpose of the update phase is to retrieve dataset identifiers and special keywords from remote servers or a local disk and to produce files that contain these identifiers and special keywords, together with the identifiers they refer to, as keys and values in sorted order. The produced files must be sorted so that fast batch inserts into the LMDB database are possible in the next phase. The update phase is started via the Biobtree CLI with the update command. For example, the following command starts the update phase for the hgnc dataset.

$ biobtree --d hgnc update

The update phase reads the selected datasets as streams and saves the Biobtree-related data in a series of files. An advantage of reading datasets as streams is that they do not need to be fully downloaded to the local disk. Datasets can have different formats such as XML, JSON, TSV or CSV. Biobtree has a specialized parser for each dataset and uses it to produce its output files. When Biobtree runs for the first time, it retrieves its configuration, license and web interface files from the source code repository. The configuration files contain Biobtree runtime settings and dataset definitions.
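
The idea of reading a compressed dataset as a stream and emitting sorted key-value lines can be sketched as follows. This is a simplified illustration rather than Biobtree's parser: the two-column TSV layout, the function names and the sample identifiers are assumptions, and the single in-memory sort stands in for whatever chunked sorting the tool actually performs.

```python
import gzip
import io


def stream_pairs(compressed_stream):
    """Yield (key, value) pairs in both directions from a gzipped TSV stream."""
    # gzip.open accepts a file-like object and decompresses lazily while
    # lines are read, so the dataset is never fully materialised first.
    with gzip.open(compressed_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            entry_id, xref = line.rstrip("\n").split("\t")
            yield entry_id, xref
            yield xref, entry_id


def sorted_output(pairs):
    """Return 'key<TAB>value' lines in sorted order (in-memory simplification)."""
    return [f"{key}\t{value}" for key, value in sorted(pairs)]


# Hypothetical two-line dataset, compressed in memory to stand in for a remote file.
raw = "entry2\txrefB\nentry1\txrefA\n".encode("utf-8")
remote = io.BytesIO(gzip.compress(raw))
lines = sorted_output(stream_pairs(remote))
print("\n".join(lines))
```

In a real run the file-like object would be a network response or a local file handle, and the sorted lines would be written to the files consumed by the generate phase.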

Integrate user dataset

User data can be integrated into Biobtree. This feature offers data providers an alternative way to serve their data. The data should be gzipped and in an XML format compliant with the UniProt XML schema definition. After the file path of the data is configured in a Biobtree configuration file, updating starts similarly:

$ biobtree --d my_data update

Updating in multiple computers

Biobtree supports executing the update phase over multiple computers. This is useful when multiple computer processors need to be used at the same time, such as with large datasets. The following two commands can be run on different computers, with the additional idx argument guaranteeing that the produced files have unique names.

$ biobtree --d uniparc --idx 1 update
$ biobtree --d uniref50 --idx 2 update

Although Biobtree supports the updating phase occurring over multiple computers, for the next phase all the produced files have to be in a single location.

Generate phase

The purpose of this phase is to merge all the files produced in the update phase while keeping the sorted order, and to write the final keys and values to the generated LMDB database. Keys consist of identifiers and special keywords; values are the identifiers referred to by these keys, together with their dataset information. If the total size of the values for a key exceeds a certain threshold, the values are saved in pages. The following command starts the generate phase.

$ biobtree generate
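
The merge and paging steps described above can be sketched as a k-way merge of sorted runs followed by grouping per key. This is an illustration under stated assumptions, not the tool's code: the function names, the sample identifiers and the page size of two values are hypothetical, and a plain dictionary stands in for the LMDB database.

```python
import heapq
from itertools import groupby
from operator import itemgetter

PAGE_SIZE = 2  # hypothetical threshold; the real limit is a Biobtree setting


def merge_sorted_runs(runs):
    """Merge already-sorted (key, value) sequences into one sorted stream."""
    # heapq.merge is lazy: it only holds one pending item per input run.
    return heapq.merge(*runs)


def group_into_pages(sorted_pairs, page_size=PAGE_SIZE):
    """Collect the values of each key and split long value lists into pages."""
    database = {}
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        database[key] = [
            values[i:i + page_size] for i in range(0, len(values), page_size)
        ]
    return database


# Two sorted runs standing in for files produced by the update phase.
run_a = [("brca2", "id1"), ("vav1", "id2")]
run_b = [("vav1", "id3"), ("vav1", "id4")]

database = group_into_pages(merge_sorted_runs([run_a, run_b]))
print(database)
```

Because every input is already sorted, the merged stream reaches the database in key order, which is the property that makes fast batch inserts possible.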

The generated database output is used in the subsequent web phase, but it can also be used directly. Example source code for using the database directly can be found on the project GitHub page.

Web phase

The purpose of this phase is to provide web services and a web interface on top of the output produced by the generate phase. The following command starts the web phase.

$ biobtree web

With this command the Biobtree web server is started instantly, serving the REST, gRPC and web interface services.

REST service

To query the produced database, Biobtree provides RESTful endpoints with JSON-formatted responses. Each dataset has a unique identifier and other meta information such as its name and URL template. These dataset identifiers are used in all the services to distinguish datasets. This meta information is retrieved via the following endpoint:

                    http://localhost:8888/ws/meta

The following are used to query single or multiple identifiers or special keywords:

                    http://localhost:8888/ws/?idlist=vav_human

                    http://localhost:8888/ws/?idlist=vav_human,tpi1,brca2

To make a paging query for a certain identifier:

                    http://localhost:8888/ws/?id=vav_human&dataset=1&page=1

To make a filtering query based on a dataset:

                    http://localhost:8888/ws/?id=vav_human&dataset=1&filters=102

To make a paging query with active filtering:

                    http://localhost:8888/ws/?id=vav_human&dataset=1&filters=102&page=1
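
The query URLs listed above can be assembled programmatically. The following sketch only builds the URLs and does not contact a server; the helper names are hypothetical, while the parameter names (idlist, id, dataset, filters, page) and the base address are taken from the examples above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8888/ws/"


def search_url(identifiers):
    """Build a search URL for one or more identifiers or special keywords."""
    # safe="," keeps the comma-separated list readable instead of %2C.
    return BASE_URL + "?" + urlencode({"idlist": ",".join(identifiers)}, safe=",")


def entry_url(identifier, dataset, filter_dataset=None, page=None):
    """Build a paging and/or filtering URL for one identifier in a dataset."""
    params = {"id": identifier, "dataset": dataset}
    if filter_dataset is not None:
        params["filters"] = filter_dataset
    if page is not None:
        params["page"] = page
    return BASE_URL + "?" + urlencode(params)


print(search_url(["vav_human", "tpi1", "brca2"]))
print(entry_url("vav_human", 1, filter_dataset=102, page=1))
```

The resulting strings can be passed to any HTTP client once the web phase is running.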

gRPC service

RESTful services are often JSON-based and are convenient in JSON-based applications. For applications that are not JSON-based, however, they require an extra serialization and deserialization step. To address this, Biobtree provides a gRPC service with the same functionality as its RESTful service. Sample code for using gRPC in different languages can be found on the project GitHub page. The following is a snippet of the Biobtree gRPC service definitions:

service BiobtreeService {

rpc Get     (BiobtreeGetRequest)     returns (BiobtreeGetResponse);
rpc GetPage (BiobtreeGetPageRequest) returns (BiobtreeGetPageResponse);
rpc Filter  (BiobtreeFilterRequest)  returns (BiobtreeFilterResponse);
rpc Meta    (BiobtreeMetaRequest)    returns (BiobtreeMetaResponse);

}

Web interface

The web interface allows users to visualize the produced database via the RESTful service. Once the web phase is started, it can be accessed in a browser at the following address:

                    http://localhost:8888/ui

The web interface supports searching for multiple identifiers and special keywords, visualizing and filtering results, and executing bulk queries. On the result page, each result includes a URL to the main website where the data were originally produced. Figure 1 and Figure 2 show the main and result pages of the web interface.


Figure 1. Web interface main page.


Figure 2. Web interface result page.

Operation

The Biobtree executable file is available from the GitHub page for Windows, macOS and Linux operating systems. With the default datasets and configuration, Biobtree uses up to 4 GB of memory. For large datasets, a computer with more RAM, such as 16 GB, is advised to speed up the update and generate phases. RAM usage for these phases can be managed from the configuration file via the kvgenChunkSize and batchSize variables. By default, Biobtree uses all available CPU capacity when needed. This behaviour can be restricted via the CLI with the maxcpu argument.

Benchmarks

Software benchmarks should be taken with a grain of salt, especially where the input data are large, because benchmark results can be affected by many factors such as the processor, operating system, storage, network speed, input data and application parameters. With this in mind, the purpose of the Biobtree benchmarks is mainly to show the overall capabilities and resource usage of Biobtree. Table 2 shows the benchmark details and results. Benchmarks were primarily run at DigitalOcean London datacentres using CPU-optimized droplets with block storage volumes. The 100,000 sample queries used in the benchmarks can be found on the project GitHub page.

Table 2. Benchmarks results.

| - | Benchmark-1 | Benchmark-2 | Benchmark-3 | Benchmark-4 | Benchmark-5 | Benchmark-6 | Benchmark-7 | Benchmark-8 | Benchmark-9 |
|---|---|---|---|---|---|---|---|---|---|
| Description | Runs all the phases and 100K queries for default datasets on a MacBook laptop | Runs default dataset update phase at DigitalOcean | Runs uniref50 update phase at DigitalOcean | Runs uniref90 update phase at DigitalOcean | Runs uniref100 update phase at DigitalOcean | Runs uniprot unreviewed update phase at DigitalOcean | Runs uniparc update phase at DigitalOcean | Runs generate phase for all data at DigitalOcean | Runs 100K queries against all data at DigitalOcean |
| Operating System | macOS High Sierra | Ubuntu 16.04.5 x64 | Ubuntu 16.04.5 x64 | Ubuntu 16.04.5 x64 | Ubuntu 16.04.5 x64 | Ubuntu 16.04.5 x64 | Ubuntu 16.04.5 x64 | Ubuntu 16.04.5 x64 | Ubuntu 16.04.5 x64 |
| CPU | 4 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| RAM | 16GB | 16GB | 16GB | 16GB | 16GB | 16GB | 16GB | 16GB | 16GB |
| Update Phase Average CPU Usage | 309% | 556% | 201% | 201% | 204% | 303% | 351% | - | - |
| Update Phase Average RAM Usage | 9% | 84% | 39% | 36% | 36% | 36% | 37% | - | - |
| Update Phase Elapsed Duration | 13m | 5m | 1h24m | 2h3m | 2h43m | 8h27m | 8h37m | - | - |
| Update Phase Files Size | 838MB | 823MB | 3GB | 2.8GB | 2.4GB | 25GB | 38GB | - | - |
| Generate Phase Average CPU Usage | 141% | - | - | - | - | - | - | 164% | - |
| Generate Phase Average RAM Usage | 19% | - | - | - | - | - | - | 90% | - |
| Generate Phase Elapsed Duration | 18m | - | - | - | - | - | - | 8h5m | - |
| Generate Phase Files Size | 3.8GB | - | - | - | - | - | - | 392GB | - |
| Web Phase Average CPU Usage While Running 100K Queries | 113% | - | - | - | - | - | - | - | 3% |
| Web Phase Average RAM Usage While Running 100K Queries | 2% | - | - | - | - | - | - | - | 36% |
| 100K Consecutive Queries, Average Response Time, Queries Sent Inside Same Machine | 0.3ms | - | - | - | - | - | - | - | 3ms |
| 100K Consecutive Queries, Average Response Time, Queries Sent From Outside Network and Machine | - | - | - | - | - | - | - | - | 20ms |
| Total Keys | 60M | - | - | - | - | - | - | 3.19B | - |
| Total Values | 131M | - | - | - | - | - | - | 17.07B | - |

Discussion

The benchmarks show that Biobtree [1] produced its LMDB database output in acceptable times. Consider how Biobtree would behave if UniProt provided data tens of times larger. Clearly, more disk space would be needed.

If enough disk space is provided, the next obstacle would occur during the update phase, because UniProt currently provides a single gzip-compressed file for each dataset and Biobtree reads each file as a stream from beginning to end. Gzip does not allow random access to a file unless checkpoints are defined. This characteristic prevents a single large file from being processed in separate parts, and therefore prevents additional computing resources from being used when they are available. Two solutions could address the issue. The first would be for UniProt to publish its datasets in a form suitable for parallel processing, for example by splitting and compressing files inside a tar archive. The second would be to implement new functionality in Biobtree that saves and decompresses these files to local disk and then accesses the decompressed files in parallel.
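
The second solution can be illustrated with a small sketch that decompresses the data first and then parses line-aligned chunks concurrently. This feature is proposed rather than implemented, so everything here is an assumption: the function names, the two-column sample data and the use of in-memory bytes instead of a file on local disk.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def split_on_lines(data, parts):
    """Split bytes into roughly `parts` chunks that end on line boundaries."""
    chunks, start = [], 0
    target = max(1, len(data) // parts)
    while start < len(data):
        # Extend each chunk to the next newline so that no line is cut in half.
        newline = data.find(b"\n", start + target - 1)
        end = len(data) if newline == -1 else newline + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def parse_chunk(chunk):
    """Parse one chunk of tab-separated lines into (key, value) pairs."""
    pairs = []
    for line in chunk.decode("utf-8").splitlines():
        key, value = line.split("\t")
        pairs.append((key, value))
    return pairs


# Hypothetical dataset; gzip itself must still be decompressed sequentially.
compressed = gzip.compress(b"id1\txref1\nid2\txref2\nid3\txref3\nid4\txref4\n")
data = gzip.decompress(compressed)

chunks = split_on_lines(data, parts=2)
# Threads keep the sketch simple; CPU-bound parsing in CPython would need
# processes (for example ProcessPoolExecutor) to run truly in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(parse_chunk, chunks))

pairs = [pair for chunk_pairs in results for pair in chunk_pairs]
print(pairs)
```

Only the parsing is parallelised here; the decompression step remains sequential, which is why this approach trades extra disk space for shorter processing time.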

A further obstacle would occur during the generate phase, since the update phase would produce more files. The generate phase would struggle to merge all these files, which could lead to much longer output generation times. To address this, a new phase could be implemented that runs before the generate phase and pre-merges the files coming from the update phase.

Limitations

Although duplicate values for each key are discarded during the update phase for each dataset, the generate phase can, in rare cases, produce duplicate values. If such duplicates are created, the user needs to discard them manually. Another limitation is that special keywords must be fully specified in queries, including all space characters.
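
Discarding such duplicates on the client side can be done in a few lines. The sketch below assumes that the values of a key have already been read into a list of (dataset identifier, identifier) tuples; that representation and the sample values are hypothetical, not the format returned by Biobtree.

```python
def drop_duplicate_values(values):
    """Return the values with duplicates removed, keeping first-seen order."""
    # dict preserves insertion order, so the first occurrence of each value wins.
    return list(dict.fromkeys(values))


# Hypothetical values for one key, with one repeated record.
values = [(1, "idA"), (102, "idB"), (1, "idA")]
unique_values = drop_duplicate_values(values)
print(unique_values)
```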

Future work

The limitations can be addressed and new functionality added in the future. For instance, additional bioinformatics datasets such as Ensembl [8] or ENA [9] can be integrated. Another feature would be sorting result values based on a given criterion.

Data availability

All data underlying the results are available as part of the article and no additional source data are required.

Software availability

All source code and binaries are available at: https://www.github.com/tamerh/biobtree.

Archived source code at time of publication: https://doi.org/10.5281/zenodo.2547047 [1].

License: BSD 3-Clause "New" or "Revised" license.

How to cite this article
Gur T. Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2019, 8:145 (https://doi.org/10.12688/f1000research.17927.1)

Open Peer Review

Key to reviewer statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 15 Apr 2019
Samuel Lampa, Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden;  Savantic AB, Stockholm, Sweden 
Approved with Reservations
The article describes a commandline tool, Biobtree, that is claimed to allow to process relations between bioinformatics datasets based on various characteristics such as identifiers and keywords.

The manuscript describes the tool in a clear way technically, ...
How to cite this report:
Lampa S. Reviewer Report For: Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2019, 8:145 (https://doi.org/10.5256/f1000research.19605.r46335)
Reviewer Report 07 Mar 2019
Maxim N. Shokhirev, Razavi Newman Integrative Genomics and Bioinformatics Core, Salk Institute for Biological Studies, La Jolla, CA, USA 
Not Approved
While it is important to create a consistent and queryable database of biological identifiers, it is unclear what advances this tool brings to the field. For example, how does this tool compare to other queryable database tools such as mygene.info, ...
How to cite this report:
Shokhirev MN. Reviewer Report For: Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2019, 8:145 (https://doi.org/10.5256/f1000research.19605.r45074)
  • Author Response 10 Mar 2019
    Tamer Gur, EMBL European Bioinformatics Institute, UK
    Thank you for reviewing the article. I agree that there are several similar tools exist with different dataset and functionalities such as Biomart and mygene.info. However, this tool can still ...