Keywords
bioinformatics, identifiers, search, mapping, visualization
This article is included in the Bioinformatics gateway.
This article is included in the EMBL-EBI collection.
In this new version, based on the reviewers' comments, the article has been restructured and the tool has been improved by integrating new datasets and adding a major new feature for performing mapping queries via the web interface and web services.
See the author's detailed response to the review by Samuel Lampa
See the author's detailed response to the review by Maxim N. Shokhirev
Mapping bioinformatics datasets through a web interface or programmatically via identifiers or special keywords and attributes such as gene name, gene location, protein accessions and species name is a common need during genomics research. These mappings play an essential role in molecular data integration (Huang et al., 2011) and allow maximum biological insight to be gathered (Mudunuri et al., 2009) from these diverse bioinformatics datasets.
There are several existing tools for these mapping needs; these tools are gene-centric, protein-centric or provide both gene- and protein-centric solutions. Common gene-centric tools include those based on BioMart (Zhang et al., 2011), such as Ensembl BioMarts (Kinsella et al., 2011), which cover Ensembl (Zerbino et al., 2018) and Ensembl Genomes (Kersey et al., 2018) datasets. The R package biomaRt (Durinck et al., 2009) is also widely used to query BioMart-based tools. Other common gene-centric tools are MyGene.info (Xin et al., 2016), DAVID (Huang da et al., 2009) and g:Profiler (Raudvere et al., 2019). The UniProt ID mapping service (Huang et al., 2011) provides a protein-centric solution. bioDBnet (Mudunuri et al., 2009) and BridgeDb (van Iersel et al., 2010) provide services for both gene- and protein-centric solutions.
On the other hand, the size of genomics data is increasing continuously (Langmead & Nellore, 2018), especially through high-throughput sequencing. Performing these mappings rapidly and effectively on such expanding data, whether on local computers, in cloud computing or in existing computing environments, with tools that are easy to install and require minimal maintenance, is therefore a challenge (Marx, 2013).
The referenced existing gene-centric tools currently do not support large Ensembl Bacteria genomes. Existing tools either provide only an online service or require specific technical knowledge, such as a particular database or programming language, to install, use and adapt to different computational environments such as a local computer. Another limitation of the referenced tools is that they provide only one-dimensional filtering capability in a single mapping query.
Biobtree addresses these problems of existing tools. First, it can be used via a single executable file without requiring installation, specific technical knowledge or extra maintenance such as database administration. To process large datasets, it uses a specialized MapReduce-based solution, which is discussed in the next section. MapReduce is an effective way to deal with large datasets (Langmead & Nellore, 2018). After processing the data, Biobtree provides a web interface, web services, and chained mapping and filtering in a single query through its intuitive query syntax, which is demonstrated in the use cases section. Biobtree covers a range of bioinformatics datasets, including Ensembl Bacteria genomes. The data resources currently used are ChEBI (Hastings et al., 2016), HGNC (Braschi et al., 2019), HMDB (Wishart et al., 2018), InterPro (Mitchell et al., 2019), Europe PMC (Europe PMC Consortium, 2015), UniProt (UniProt Consortium, 2019), ChEMBL (Gaulton et al., 2017), Gene Ontology (The Gene Ontology Consortium, 2019), EFO (Malone et al., 2010), ECO (Giglio et al., 2019), Ensembl (Zerbino et al., 2018) and Ensembl Genomes (Kersey et al., 2018). Table 1 shows details of these datasets.
The Biobtree implementation process starts by retrieving the selected datasets listed in Table 1, together with the attributes and mapping information of their entries, from the public locations that are also listed in Table 1. During retrieval, the complete data are never stored or uncompressed on disk. Instead, data are retrieved and uncompressed in memory in a streaming manner, which avoids excessive disk usage. Only the necessary data, namely the mappings and attributes, are stored compactly on disk as chunks. While retrieval is in progress, all idle CPUs are used to recursively merge and sort these chunks with each other. The produced files must be sorted to allow fast batch inserts into LMDB, the database in which Biobtree stores its result data. Once retrieval is complete, the resulting chunk files are globally merged using the patience sort technique and inserted into the LMDB database as keys and values. Keys consist of identifiers and special keywords such as gene names or species names, and values hold attributes and information on mapped datasets. In this process, data retrieval and the creation of sorted chunks represent the map phase, while the global merge of the chunks and the creation of the database represent the reduce phase of the MapReduce solution.
Once the database is created, the Biobtree web module provides a web interface and web services for both identifier searches and mapping queries. Mapping queries are written in a query syntax that allows chains of mapping and filtering between datasets. An example use case with this syntax is demonstrated in the next section.
Biobtree uses LMDB, a key-value store based on a B+ tree data structure. LMDB provides fast batch inserts and reads, which fits the update cycle of bioinformatics datasets well: datasets are typically updated periodically, after which only read-intensive operations are performed.
LMDB is embedded into Biobtree’s executable binary code so it does not require a separate installation or special maintenance.
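The map and reduce phases described above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not Biobtree's implementation: the function names, the tab-separated "key, value" line format and the toy chunk size are all hypothetical, the work runs on a single thread rather than on idle CPUs, the standard library's heap-based k-way merge (heapq.merge) stands in for the patience sort technique, and the final sorted list stands in for the batch insert into LMDB.

```python
import gzip
import heapq
import io
import os
import tempfile


def stream_lines(compressed_source):
    """Decompress a gzip stream line by line; the raw data never touches the disk."""
    with gzip.open(compressed_source, mode="rt", encoding="utf-8") as text:
        for line in text:
            yield line.rstrip("\n")


def write_sorted_chunks(lines, chunk_size, directory):
    """Map phase: cut the stream into small chunk files, each sorted by key."""
    paths, buffer = [], []

    def flush():
        if not buffer:
            return
        buffer.sort()
        path = os.path.join(directory, f"chunk_{len(paths)}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.writelines(line + "\n" for line in buffer)
        paths.append(path)
        buffer.clear()

    for line in lines:
        buffer.append(line)
        if len(buffer) >= chunk_size:
            flush()
    flush()  # write the last, possibly shorter, chunk
    return paths


def merge_chunks(paths):
    """Reduce phase: lazily merge the sorted chunk files into one sorted stream."""
    handles = [open(path, encoding="utf-8") for path in paths]
    try:
        # Strip newlines before comparing so the order matches the in-chunk sort.
        streams = [(line.rstrip("\n") for line in handle) for handle in handles]
        yield from heapq.merge(*streams)
    finally:
        for handle in handles:
            handle.close()


def build_sorted_pairs(compressed_source, chunk_size=2):
    """Run both phases and return the globally sorted "key<TAB>value" lines."""
    with tempfile.TemporaryDirectory() as directory:
        lines = stream_lines(compressed_source)
        chunks = write_sorted_chunks(lines, chunk_size, directory)
        return list(merge_chunks(chunks))


# Toy input with made-up keys, compressed in memory to mimic a remote .gz file.
raw = "id3\tb\nid1\ta\nid2\tc\nid1\tb\nid0\tz\n"
source = io.BytesIO(gzip.compress(raw.encode("utf-8")))
sorted_pairs = build_sorted_pairs(source)
```

Because the merged stream arrives in key order, a B+ tree store can append the pairs sequentially, which is the reason sorted chunk files make batch inserts fast.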
This section covers three use cases. Each use case consists of two or three inputs, a command for running it on a local computer, and the expected output. The first input, called ‘terms’, is a comma-separated set of identifiers or special keywords such as species names. The mapping process starts from the records that belong to these terms. The second input is a mapping query, which consists of a chain of map and filter functions. The map function takes a single dataset name as its argument, and the filter function takes a Boolean expression over the attributes of the given source dataset to filter out mappings. Attributes are mostly available for the source datasets shown in Table 1 and can be explored via the web interface by searching for identifiers. The optional third input is a dataset name, which restricts the terms to that dataset when the same value occurs in more than one dataset. To perform a use case, the given command needs to be run from a terminal with the Biobtree binary file, which can be downloaded from the GitHub page. Once the command runs, it processes the related data and opens the Biobtree web interface, where the use case can be performed. Use cases 1 and 2 rely on the same dataset, so it is enough to run the command once for both. For use case 3, the command needs to be run again to process specific bacteria datasets instead of the default human genome dataset.
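How the three inputs interact can be illustrated with a toy in-memory model in Python. This is only a sketch of the semantics described above, not Biobtree code: the Entry and ToyStore classes, the dataset names and every identifier are hypothetical, and Python lambdas stand in for the Boolean filter expressions.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    dataset: str
    identifier: str
    attributes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (dataset, identifier) pairs


class ToyStore:
    """Hypothetical stand-in for the database of entries and their mappings."""

    def __init__(self, entries):
        self.by_key = {(e.dataset, e.identifier): e for e in entries}

    def lookup(self, terms, dataset=None):
        """Resolve terms to entries, optionally restricted to one dataset."""
        return [e for e in self.by_key.values()
                if e.identifier in terms
                and (dataset is None or e.dataset == dataset)]

    def map(self, entries, target):
        """Follow the links of each entry into the target dataset."""
        seen, result = set(), []
        for entry in entries:
            for key in entry.links:
                if key[0] == target and key in self.by_key and key not in seen:
                    seen.add(key)
                    result.append(self.by_key[key])
        return result

    @staticmethod
    def filter(entries, predicate):
        """Keep only entries whose attributes satisfy the Boolean predicate."""
        return [e for e in entries if predicate(e.attributes)]


store = ToyStore([
    Entry("probes", "p1", links=[("genes", "g1"), ("genes", "g2")]),
    Entry("other", "p1", links=[("genes", "g3")]),  # same term, other dataset
    Entry("genes", "g1", {"genome": "human"}, [("terms", "t1"), ("terms", "t2")]),
    Entry("genes", "g2", {"genome": "mouse"}, [("terms", "t3")]),
    Entry("genes", "g3", {"genome": "human"}, [("terms", "t3")]),
    Entry("terms", "t1", {"type": "function"}),
    Entry("terms", "t2", {"type": "process"}),
    Entry("terms", "t3", {"type": "function"}),
])


def run_chain(terms, dataset=None):
    """Toy equivalent of a map, filter, map, filter chain over the store."""
    start = store.lookup(terms, dataset)
    genes = store.filter(store.map(start, "genes"),
                         lambda a: a["genome"] == "human")
    found = store.filter(store.map(genes, "terms"),
                         lambda a: a["type"] == "function")
    return [entry.identifier for entry in found]


restricted = run_chain({"p1"}, dataset="probes")  # only the "probes" record
unrestricted = run_chain({"p1"})                  # both records named "p1"
```

In this toy data the term p1 exists in two datasets, so the optional dataset input changes the result: with the restriction only t1 is reached, while without it t3 is also reached through the second p1 record.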
Use case 1: Map Affymetrix identifiers to Ensembl human genome identifiers and then map these to GO terms of the molecular function type
Command: biobtree start
Terms: 202763_at,209310_s_at
Mapping query: map(transcript).map(ensembl).filter(ensembl.genome=="homo_sapiens").map(go).filter(go.type=="molecular_function")
Dataset: affy_hg_u133_plus_2
Output: The query returns 21 GO terms. The terms that map to identifier 202763_at are GO:0002020, GO:0004190, GO:0004197, GO:0004861, GO:0005123, GO:0005515, GO:0008233, GO:0008234, GO:0016005, GO:0016787, GO:0044877, GO:0097153, GO:0097199 and GO:0097200. The terms that map to identifier 209310_s_at are GO:0004197, GO:0005515, GO:0008233, GO:0008234, GO:0016787, GO:0050700 and GO:0097199.
Use case 2: Map human Ensembl identifiers in a given genome location to reviewed UniProt identifiers
Command: biobtree start
Term: homo_sapiens
Mapping query: map(ensembl).filter(ensembl.start>100000000 && ensembl.end< 101000000 && ensembl.seq_region_name=="X").map(uniprot).filter(uniprot.reviewed)
Output: The query returns 9 reviewed UniProt proteins. Their identifiers are O43657, Q9H2S6, P33240, O60687, Q96C24, Q8TAB3, Q5H913, Q6PP77 and Q9Y5S8.
Use case 3: Map a given bacterium to all of its taxonomic children and then map these children to Ensembl genes that lie in a given genome location and whose description contains a given word
Command: biobtree -d +ensembl_bacteria -sp "serovar_infantis,serovar_virchow" start
Term: Salmonella enterica subsp. enterica
Mapping query: map(taxchild).map(ensembl).filter(ensembl.start<10000&&ensembl.description.contains("SopD"))
Output: The query returns 7 Ensembl Genomes Bacteria genes. Their identifiers are AEW14_05145, AEW14_15935, ACH54_23895, ACH56_04205, DE27_21250, DE87_06330 and LPMST02_21800.
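When mapping queries like the ones in these use cases are generated from scripts, it can help to assemble the query text step by step instead of concatenating strings by hand. The following Python helper is a hypothetical convenience, not part of Biobtree: the MappingQuery class is an assumption made for illustration, it only produces text in the map(...).filter(...) form shown in the use cases, and it does not validate dataset names or filter expressions. How the resulting text is submitted to the web services is not shown, because the endpoints are not described in this article.

```python
class MappingQuery:
    """Hypothetical builder for query text in the map(...).filter(...) form."""

    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def map(self, dataset):
        """Append a map step that targets the given dataset name."""
        return MappingQuery(self._steps + (f"map({dataset})",))

    def filter(self, expression):
        """Append a filter step; the expression is passed through unchanged."""
        return MappingQuery(self._steps + (f"filter({expression})",))

    def __str__(self):
        return ".".join(self._steps)


# Rebuild the mapping query and the terms input of use case 1.
use_case_1 = (
    MappingQuery()
    .map("transcript")
    .map("ensembl")
    .filter('ensembl.genome=="homo_sapiens"')
    .map("go")
    .filter('go.type=="molecular_function"')
)
query_text = str(use_case_1)
terms = ",".join(["202763_at", "209310_s_at"])
```

Each call returns a new object, so a shared prefix such as the first two map steps can be reused for several queries without the queries affecting each other.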
Mapping between bioinformatics datasets via identifiers or special keywords such as species names is often performed during genomic analyses and plays an essential role in molecular data integration and in gaining maximum biological insight from these datasets. There are several gene-centric, protein-centric, and combined gene- and protein-centric tools that address these mapping needs. These tools currently do not support the large Ensembl Genomes Bacteria dataset. In addition, they either provide only online services or require specific technical knowledge to install and adapt to new computing environments. Existing tools also provide only one-dimensional filtering in a single mapping query. Biobtree addresses these problems by providing a single executable file that requires no specific technical knowledge and by processing large datasets with its specialized MapReduce-based solution. From the processed data, it creates a uniform database and allows identifier searches as well as chained mapping and filtering queries through its web interface and web services.
More datasets, such as gene expression data, can be integrated into the existing system. In addition, the tool can be improved further by following and experimenting with advances in large-scale data processing techniques, databases and data structures.
All data underlying the results are available as part of the article and no additional source data are required.
Source code and binaries are available at: https://www.github.com/tamerh/biobtree.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.2547047
License: BSD 3-Clause “New” or “Revised” license.
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Scientific workflow tools, Cheminformatics, Semantic web.
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics
Version history:
Version 4 (revision): 20 Jan 20
Version 3 (revision): 07 Jan 20
Version 2 (revision): 16 Sep 19
Version 1: 04 Feb 19