Providing gene-to-variant and variant-to-gene database identifier mappings to use with BridgeDb mapping services.

Database identifier mapping services are important to make database information interoperable. BridgeDb offers such a service. The mappings available for BridgeDb link 1) gene and gene product identifiers, 2) metabolite identifiers and InChI structure descriptions, and 3) identifiers for biochemical reactions and interactions, across the multiple resources that use such IDs; the mappings themselves are obtained from multiple sources. In this study we created BridgeDb mapping databases for selections of genes-to-variants (and variants-to-genes) based on the variants described in Ensembl. Moreover, we demonstrated the use of these mappings in different software environments: R, PathVisio, Cytoscape, and a local installation using Docker. The variant mapping databases are now described on the BridgeDb website, are available from the BridgeDb mapping database repository, and are updated according to the regular BridgeDb mapping update schedule.
Database identifier mapping services are necessary to make information interoperable and to link it to other resources. In the present work, a new feature was added to BridgeDb: mapping databases for genes-to-variants and vice versa, for the variants described in Ensembl. The implementation stages are explained in detail, making the work reproducible. The use case scenario clearly demonstrates the value added by the service. The work is certainly scientifically sound.
This article describes the use of the BridgeDb framework and associated tools to map gene and gene variant identifiers. The authors produced and made available 5 ready-to-use mapping databases focused on different categories of human gene variants extracted from the Ensembl database (version 91). They also document in the article the use of these databases in 4 different environments. As stated by the authors, BridgeDb is integrated in or used by different tools and resources such as WikiPathways and PathVisio. In this context, it makes sense to enrich the BridgeDb ecosystem with additional mapping databases such as those produced by the authors and focused on gene/gene variant associations, and it makes sense to publish an article describing the newly available features.


Introduction
Many bioinformatics software tools rely on database identifier mapping, for instance 1) to recognize identifiers used in experimental data and map them to the corresponding identifiers present in secondary sources like pathways or ontology classes, or 2) simply to combine data from different sources that use different identifiers. BridgeDb is a database identifier mapping tool that is available as a Java framework and as an installable web service (van Iersel et al., 2010). Tools that integrate BridgeDb include the community curated pathway resource WikiPathways (Slenter et al., 2018), the modular pathway editor and pathway analysis tool PathVisio (Kutmon et al., 2015), and the network tool Cytoscape, which is used to visualize, extend and evaluate biological networks. Depending on the available mappings, BridgeDb can provide the mapping between identifiers from various data sources, also when these link to different molecular levels, e.g. gene to protein. Moreover, BridgeDb is available in a semantic web version, the Identifier Mapping Service (IMS), which can be used inside the Open PHACTS platform but can also be deployed from a software container (Gray et al., 2014) (link to tutorial and link to GitHub). Mappings for BridgeDb are already available for gene products for many species (produced from the respective Ensembl genome annotations (Aken et al., 2017)), for metabolite identifiers (produced from HMDB (Wishart et al., 2013) and ChEBI (Hastings et al., 2013)), and for reaction identifiers (produced from Rhea (Morgat et al., 2017)).
The BridgeDb mapping databases are linking pins between tools that support the analysis of genetic variants, genes, and pathways, helping to visualize a complex biological context such as those typical of multifaceted (genetic) diseases. Gene-to-variant mapping was not yet available for BridgeDb. Such mappings can be especially useful when working with genetic variations, for instance when evaluating traits with a complicated genetic background like blood pressure, susceptibility to heart failure, or the development of type 2 diabetes. Single nucleotide polymorphisms (SNPs) can be responsible for phenotypic variations. In extreme cases they can be the cause of rare genetic disorders. For example, several SNPs in the human DMD gene can cause Duchenne muscular dystrophy (DMD), a severe congenital disorder which leads to severe physical impairment (Magri et al., 2011). Since BridgeDb can stack mappings, the combination of the new gene-to-variant mapping database with the collection that was already available offers versatile mappings for variants to a large set of different human gene and gene product identifiers.
The main objective of this work was to provide mappings between gene identifiers and variant identifiers in both directions. The steps needed to achieve this were: 1) select the best source for the mappings, 2) collect data from the selected source, 3) annotate the result with provenance data about the process, the source, and the source version, and 4) finally to release the new BridgeDb mapping database and integrate that in the regular BridgeDb mapping database update schedule.
Target users for the resulting mappings are 1) bioinformaticians and developers, working on new approaches for data integration, if these use human genetic (variant) information; 2) members and users of ELIXIR data interoperability services, including the implementations in the tools mentioned that perform analyses based on human genetic variant data, for instance for the analysis of common multifaceted genetic diseases or in the rare disease field; and 3) researchers who access and query molecular data resulting from the analysis above.

Methods
The gene-to-variant database uses mappings between Ensembl and dbSNP (Kitts et al., 2013). The Ensembl gene-to-dbSNP variant mappings present in Ensembl were used as the source. The released database is based on Ensembl r91, dbSNP b150, and the human genome assembly GRCh38. Although Ensembl provides more genetic variation from different sources, we focused on dbSNP as this variation database is regularly updated and adjusted to the current Ensembl genome build. We compared both sources (Ensembl and dbSNP) and made sure that Ensembl provides all variants available in dbSNP. We are therefore able to rely only on the Ensembl API as a source for the extraction of the data necessary for creation of this mapping database. To prevent problems introduced by the user interfaces we used database dumps for this comparison.
The data dump was obtained from the Ensembl ftp server (link for download). For the first version, we used Ensembl 91, gene annotation with Gencode 27. The vcf (variant call format) file is the one relevant for our mapping. It contains the dbSNP identifier with its additional attributes and the associated Ensembl transcript identifier. By querying the Ensembl platform web service, we can access the gene identifier of the transcript. Combined, this leads to mappings between variants and genes. The size of the complete mapping database exceeded 150 Gb (for Ensembl 91), so we decided to create several different subsets: exonic variants, missense variants, protein truncating variants (PTV), PTV and missense variants, and variants with a PolyPhen score >0.908 indicating "Probably Damaging". Other selections can be created easily on individual demand.
The created database contains the link between the Ensembl gene identifiers and the dbSNP variant identifiers including a selection of attributes (MAF (minor allele frequency), chromosome, variant alleles, and chromosome position start/end).
For the rare cases where a variation is associated with more than one gene, the variant is also associated with each of these genes in the BridgeDb database. For example, rs199773918 overlaps the exons of two genes (ENSG00000173366 and ENSG00000239732), and in the exonic variant BridgeDb mapping both genes show up. Nevertheless, in our selections of variants it may happen that not all of the associated genes show up, because the variant effect classification can differ between genes. As an example, rs199773918 is a variant that overlaps the following genes: TPR (ENSG00000047410) and PRG4 (ENSG00000116690). This variant is a "3 prime UTR variant" of TPR and a "missense variant" of PRG4. It can be found in the variant tables of both genes, but due to our selection it will show up only once in the missense variant dataset.
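The effect of a per-gene consequence on subset membership can be sketched as follows. This is a minimal illustration in Python, not the code of the Java database creation tool: the record layout, the function name `select_by_consequence`, and the underscore-style consequence labels are assumptions made for the example, while the identifiers and consequences are the ones given in the text above.

```python
from typing import Iterable, List, NamedTuple


class VariantGeneRecord(NamedTuple):
    """One (variant, gene) pair with the variant's consequence in that gene.

    Hypothetical record layout, used only for this illustration.
    """
    variant_id: str   # dbSNP identifier
    gene_id: str      # Ensembl gene identifier
    consequence: str  # variant effect in this particular gene


def select_by_consequence(records: Iterable[VariantGeneRecord],
                          wanted: str) -> List[VariantGeneRecord]:
    """Keep only the (variant, gene) pairs whose consequence matches `wanted`."""
    return [r for r in records if r.consequence == wanted]


# The same variant is linked to two genes, with a different effect in each.
records = [
    VariantGeneRecord("rs199773918", "ENSG00000047410", "3_prime_UTR_variant"),  # TPR
    VariantGeneRecord("rs199773918", "ENSG00000116690", "missense_variant"),     # PRG4
]

# Only the PRG4 pair survives the missense selection.
missense = select_by_consequence(records, "missense_variant")
for record in missense:
    print(record.variant_id, "->", record.gene_id)
```

Because the selection is applied to each (variant, gene) pair rather than to the variant as a whole, a variant can be linked to fewer genes in a subset database than in the full Ensembl variant tables.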

Database creation.
An open-source Java program to create the gene-to-variant database is available on GitHub. After downloading the vcf file from Ensembl, users create a configuration file with several parameters. The database creation program then parses the vcf file, retrieves additional information through the Ensembl web service, and creates the BridgeDb mapping database. Because of the large number of mappings, the tool commits the mappings to the database in batches to keep the required memory low.
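The batched commit strategy mentioned above can be illustrated with a short sketch. It is written in Python with the standard-library `sqlite3` module purely for illustration; the actual tool is a Java program writing a Derby database, and the table name `link`, its columns, the batch size, and the placeholder variant names below are all assumptions.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items so the full input is never held in memory."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_mappings(conn, mappings, batch_size=10000):
    """Insert (gene_id, variant_id) pairs, committing once per batch."""
    # Hypothetical two-column table; the real BridgeDb schema is not reproduced here.
    conn.execute("CREATE TABLE IF NOT EXISTS link (gene_id TEXT, variant_id TEXT)")
    total = 0
    for chunk in batched(mappings, batch_size):
        conn.executemany("INSERT INTO link VALUES (?, ?)", chunk)
        conn.commit()  # flush this batch before reading the next one
        total += len(chunk)
    return total


conn = sqlite3.connect(":memory:")
# A generator of placeholder pairs stands in for the parsed vcf records.
pairs = (("ENSG00000198947", f"variant_{n}") for n in range(25))
written = write_mappings(conn, pairs, batch_size=10)
row_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(written, row_count)
```

Feeding the writer from a generator means that only one batch of mappings is held in memory at a time, which is the property the text describes.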
Operation. The database creation workflow is depicted in Figure 1. The vcf file can be downloaded from the Ensembl FTP. The "Homo_sapiens_incl_consequences.vcf.gz" file is used.
System requirements. The database creation tool runs with Java and requires more memory than is usually given to a Java process. We advise users to allocate at least 3-4 GB of memory when running the database creation tool (-Xmx4G).

Results
The resulting BridgeDb mapping databases are available as a Derby database from here. The new mappings are available for all the BridgeDb implementations mentioned above (PathVisio, Cytoscape, R package, web service, and the IMS). The mapping databases are freely available for download under CC-BY license. Application examples of the use of the variant BridgeDb database are given in the following section. We created gene-to-variant mapping databases for the variant classes given in Table 1. Any other subset of variant classes can be created on demand using the tool described in the Methods section.

Use cases
To test and demonstrate the application of the variant BridgeDb database, we downloaded the database from BridgeDb. The gene-to-variant (and variant-to-gene) queries are shown in four different tools: the R command line (Team, 2014), PathVisio (Kutmon et al., 2015), Cytoscape (Shannon et al., 2003) and a local IMS installation using Docker, in order to provide an overview of the flexibility of the mapping database in different environments. A genetic variant of the rare disease Duchenne muscular dystrophy (DMD) was selected from the gene-disease association database DisGeNET (Piñero et al., 2017). The SNP rs104894790 (Lenk et al., 1993) was chosen because it has a high number of citations and a damaging stop-gain effect on the gene's protein product.

R
The SNP, rs104894790, as described above was used to query the Ensembl identifier for the gene(s) in which it is located (variant-to-gene query). The query was performed in R command

PathVisio
We used PathVisio (version 3.3.0) (Figure 2), a biological pathway analysis tool that allows drawing, editing and analyzing biological pathways, to demonstrate how the new gene-variant database can be used to evaluate variants in a pathway context. PathVisio, like Cytoscape, has the BridgeDb functionality integrated in the core. For the purpose of the demonstration, we first selected pathways that contain the DMD gene from the R example. Five pathways were found: two striated muscle contraction pathways (WP3795 and WP383), Ectoderm differentiation (WP2858), Extracellular matrix organization (WP2703) and Arrhythmogenic right ventricular cardiomyopathy (WP2118). In principle, a new PathVisio plugin could now be developed that automatically searches for pathways containing genes with selected variants, or that shows all variants from an analysis set on a given pathway. For the example, one of the striated muscle contraction pathways (WP383) was selected and visualized. Next, the BridgeDb variant database was loaded using the BridgeDbConfig plugin. After selecting a gene in the pathway, the backpage tab of the right-hand side panel shows the list of hyperlinks obtained from the BridgeDb database that point to different information sources linked to the selected gene. Figure 2 shows the backpage with the list of the 720 SNPs (from the BridgeDb database with a PolyPhen score > 0.908, file SNP_r91_PolyPhen.bridge) for the selected DMD gene. All the SNPs in the backpage have a hyperlink to the corresponding dbSNP page.

Cytoscape
An alternative gene-to-variant visualization is provided using Cytoscape (version 3.6.1), a popular tool for (biological) network analysis and visualization (Figure 3). The BridgeDb app for Cytoscape is available here. A node with the Ensembl gene identifier of DMD was created and the 720 SNPs were mapped to the gene using the BridgeDb app interface. A gene-variant network was created using the list of variants mapped. Moreover, the app can be used to configure the selection of several attribute columns related to the variant nodes such as: chromosome location, minor allele frequency, and variant allele. In this example figure, we visualize the PolyPhen score as the node fill color of the variants. For simplicity, the rs-numbers are not displayed.
Figure 3. Using the BridgeDb app for Cytoscape, a gene-variant network for the DMD gene (blue rounded rectangle) and its 720 probably damaging variants (PolyPhen score > 0.908) was created. The node color of the variants represents the PolyPhen score as a gradient (white-red): the darker the red, the higher the PolyPhen score.

BridgeDb Identifier Mapping Service (IMS)
Finally, identifier mapping linking variants to genes and vice versa can also be done at a semantic web level. Here we demonstrate how an online BridgeDb Identifier Mapping Service (IMS) can be set up. The IMS technology, including a Docker image, was developed in the Open PHACTS project to link drug discovery related data sets (Batchelor et al., 2014; van Iersel et al., 2010; Williams et al., 2012). Here, identifier mappings are defined by link sets, which specify which identifiers are mapped. However, unlike traditional BridgeDb mapping files, these link sets also specify why the two identifiers are mapped, allowing them to be used as scientific lenses (Batchelor et al., 2014).
Because the IMS works at a semantic web level, identifiers are represented by uniform resource identifiers (URIs). Moreover, the IMS is aware of URI equivalence defined by the MIRIAM registry (Juty et al., 2012). This means that even when a mapping file does not provide mappings for a certain URI, one would still get a number of equivalent URIs, based on knowledge from the MIRIAM database. And when a single mapping is found in the link sets, equivalent URIs for the mapped URI are returned as well. The IMS provides a targetUriPattern parameter that allows you to restrict the number of mapped URIs.
We developed a tutorial explaining how to set up an IMS instance with the variant-gene mappings (available from GitHub). The instance is run locally using a Docker container developed by Open PHACTS, which is available from DockerHub. After the Docker image is started, it provides a web interface and an API. The web interface has a "Check Mapping for an URI" page where the user can give the URI to be mapped, the return format (XML, JSON, or HTML), and optionally a lensUri (see Batchelor et al., 2014) and the aforementioned targetUriPattern.
However, it is more convenient to use this API from other tools, as demonstrated with a second R script (Supplementary File 1). This R script uses the curl (Ooms, 2017) and jsonlite (Ooms, 2014) packages to interact with the IMS. The first package is used to call the IMS web service and the second to convert the returned JSON into a data model more easily handled in R. The example consists of two API calls: the first finds 603 variants for the DMD gene (Ensembl ID ENSG00000198947); the second takes a single variant (dbSNP ID rs769658853) and looks up the matching gene.
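The shape of such an API call can be sketched as follows, here in Python rather than R. This is a hedged illustration, not the supplementary script: the base URL and port, the `mapUri` path, the `Uri` and `format` parameter names, and the identifiers.org URI patterns are assumptions that should be checked against the running IMS instance. Only `targetUriPattern` and `lensUri` are parameter names taken from the text.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed location of a locally running IMS container (hypothetical host, port and path).
IMS_BASE = "http://localhost:8081/QueryExpander"


def build_map_uri_request(uri, target_uri_pattern=None, lens_uri=None, fmt="json"):
    """Build the URL of a mapping request for one source URI.

    The "Uri" and "format" parameter names are assumptions; targetUriPattern
    and lensUri are the optional parameters named in the article.
    """
    params = {"Uri": uri, "format": fmt}
    if target_uri_pattern is not None:
        params["targetUriPattern"] = target_uri_pattern
    if lens_uri is not None:
        params["lensUri"] = lens_uri
    return f"{IMS_BASE}/mapUri?{urlencode(params)}"


def fetch_json(url):
    """Call the service and decode the JSON body (needs a running IMS instance)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Gene-to-variant request for the DMD gene, restricted to dbSNP URIs.
gene_url = build_map_uri_request(
    "http://identifiers.org/ensembl/ENSG00000198947",
    target_uri_pattern="http://identifiers.org/dbsnp/$id",
)
# Variant-to-gene request for a single variant, without restriction.
variant_url = build_map_uri_request("http://identifiers.org/dbsnp/rs769658853")
print(gene_url)
print(variant_url)
```

The sketch only builds the request URLs; `fetch_json` is defined but not called, since the structure of the returned JSON depends on the IMS version and is not described in the text.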

Discussion
The BridgeDb toolset provides several apps and tools designed for different purposes, while mapping databases are available to link different database IDs for genes and gene products, metabolites, and reactions and interactions. A mapping database in the BridgeDb software environment, capable of linking genes to their variants and vice versa, was not yet available. The new database is expected to be useful to enhance the biological interpretation of genetic variant data (as shown with the example of the DMD gene) for instance when using apps that evaluate biological pathways, use the classification of genes according to ontology terms, or in the R environment when performing gene and variant related statistical evaluation.
With this newly created mapping database and the transitivity function of BridgeDb, the user can map between three different layers: e.g. variant-gene-protein. This approach can support multi-omics analysis for various biomedical applications, and tools like Cytoscape and PathVisio can be used immediately to benefit from this.
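The idea of stacking two mappings into a variant-gene-protein chain can be sketched as a simple composition. This is an illustration in Python only; BridgeDb implements transitivity in its Java framework, and the function name `stack_mappings` and the dictionary layout are assumptions. The variant and gene identifiers are those used in the use cases, and the UniProt accession for dystrophin is added here for illustration.

```python
def stack_mappings(first, second):
    """Compose a -> b and b -> c mappings into a -> c.

    Both inputs map an identifier to a set of identifiers; sources
    without any reachable target are left out of the result.
    """
    stacked = {}
    for source, intermediates in first.items():
        targets = set()
        for intermediate in intermediates:
            targets |= second.get(intermediate, set())
        if targets:
            stacked[source] = targets
    return stacked


# Variant -> gene, as provided by the new mapping database.
variant_to_gene = {"rs104894790": {"ENSG00000198947"}}
# Gene -> protein, as provided by an existing gene product mapping
# (UniProt accession added for illustration).
gene_to_protein = {"ENSG00000198947": {"P11532"}}

variant_to_protein = stack_mappings(variant_to_gene, gene_to_protein)
print(variant_to_protein)
```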
We intend to keep the content up to date through regular updates. The human variant mapping database is already incorporated into the quarterly BridgeDb mapping database update. Other variant sets, going beyond the currently included protein truncating and missense variants, can also be created on demand from the user community or from individuals.

Data availability
The License: Apache 2.0

Competing interests
No competing interests were disclosed.

Grant information
This work was funded by ELIXIR, the European research infrastructure for life-science data.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The number of variants found associated with the ENSG00000198947 gene using the IMS method (603) is smaller than the number of variants found with the methods relying on the PolyPhen bridge database (720). A greater, or at least an equal, number of SNPs was expected since no specific database is specified in the IMS query. Does it rely on the same information as that available in the bridge databases? The authors claim that the bridge databases they provide contain a selection of attributes that can be retrieved. I could only test this feature using the Cytoscape BridgeDb app, trying to reproduce Figure 3. I could install the SNP_r91_PolyPhen.bridge database and use it to retrieve 720 variants associated with the ENSG00000198947 gene. Then, I tried to get all the attributes for the 720 SNPs.
The query took more than 3 hours to run (Processor: Intel i5-6300U 2.40 GHz; RAM: 8 GB; OS: Windows 10 Enterprise 64-bit; Cytoscape 3.6.1; BridgeDb app 1.1.0.2). Such a long runtime should be mentioned in the article.
52225867;source=dbSNP;v=rs5927022;vdb=variation;vf=15937754). Also, it was not easy to find in which database the variant described in the article was available. I tried all of them and could find this variant in the "SNP_r91_Exon.bridge" and "SNP_r91_PTV.bridge" databases. The script should be in accordance with the article. The authors should also provide a strategy to identify the relevant database to map a SNP or several SNPs to a gene (as mentioned above). Is it possible to get SNP attributes from the R interface as it is from Cytoscape?
Finally, minor issues could also be addressed to improve the quality of the article.

Method
○ The authors write that they are able to rely on the Ensembl API. But they used files downloaded from the Ensembl site and not the API. This sentence should be modified accordingly.
○ The authors mention problems introduced by Ensembl user interfaces. What are these problems?

Implementation
○ How long does it take to create each bridge database? Why not create a complete database with more attributes for variant annotation?
○ The vcf file mentioned in the article is not available anymore. It has been split by chromosome since Ensembl release 93. Is the database creation workflow compatible with this new organization of the original files?
○ Figure 1 is not very informative. It does not describe the database creation workflow, which is only a box in this figure. I think it would be more informative to focus on this box and to explain the different steps inside it. Indeed, according to the information found on GitHub, it seems that there are 2 Java programs, "VariantReader" and "VariantCreator", which are called sequentially in order to produce the database.

Results
○ The dates in Table 1 are misleading. They probably refer to the date of the database creation in June and July 2018. However, the Ensembl version used as data source is from December 2017/April 2018. The authors should clarify this point in the table legend.

Use cases
○ It would be very useful to add the attributes of the SNPs in the PathVisio backpage in addition to the hyperlinks.
○ Being able to access SNP information from Cytoscape is a nice feature. However, I don't think that the use case provided by the authors is very relevant. Indeed, I don't know how a network with 1 gene linked to 720 variants can be used or interpreted as such (in this case a table with all the variants related to the gene and their attributes should be sufficient). Maybe an example of a network with more genes would be more interesting.
○ The link to the Cytoscape app is missing.