Keywords
dbGaP, TCGA, SRA, cancer, database, genome, The Cancer Genome Atlas, GDC
Many large funding organizations, including the National Institutes of Health (NIH), encourage researchers to make their data available in public databases. Policies like the NIH’s Genomic Data Sharing policy (https://gds.nih.gov/03policy2.html) and other incentives around data sharing have promoted the development of several public data repositories. However, in spite of the availability of data, it can still be challenging to harness the power of these public databases, and researchers are faced with a variety of barriers in accessing shared data (van Schaik et al., 2014).
A major obstacle to data discovery is the disconnectedness of various data sharing resources. Automated tools that can connect these databases and reduce the time that researchers spend on data discovery are critically needed (Dudley & Butte, 2008; Ruau et al., 2011). Such tools will promote reproducibility, increase the efficiency of research, and aid in solving the problem of small sample sizes. These issues are especially relevant to genomic data, which is typically expensive to gather.
Here, we focus on connecting two popular genomic data repositories: the Database of Genotypes and Phenotypes (dbGaP) (Tryka et al., 2014) and The Cancer Genome Atlas (TCGA), whose data are hosted by the Genomic Data Commons (GDC; https://gdc.cancer.gov/). Both resources house genomic datasets related to cancer, yet they have no direct connection that allows researchers to link them together; the only way to find projects with analogous metadata is to search each repository manually. The key contribution of this work is a tool that acts as an interface between the GDC and dbGaP, allowing researchers to discover dbGaP datasets whose metadata are similar to those of a TCGA dataset of interest.
GDC. The GDC (https://gdc.cancer.gov/) is a highly curated resource for datasets from cancer-related genomic studies from the National Cancer Institute (NCI). Its primary function is to provide a centralized repository for data from large-scale NCI programs, such as TCGA and its pediatric equivalent, Therapeutically Applicable Research to Generate Effective Treatments (TARGET). As of September 2016, the GDC held over 260,000 sequence files spanning different genomic data types (whole genome, RNA, etc.) from over 14,000 patients.
dbGaP. The National Center for Biotechnology Information (NCBI) dbGaP (https://www.ncbi.nlm.nih.gov/gap) is the largest collection of genomic data and is not limited to cancer or human data. Although its metadata fields are fixed, the entries in those fields, unlike those in the GDC, are not curated, which poses a challenge for harmonizing metadata across the two resources. The NCBI Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra) holds the sequence data associated with the studies in dbGaP.
As the tool was developed as part of a hackathon, we used a development methodology similar to the Rapid Application Development model, which is well suited to prototype development (Kerr & Hunter, 1994). This subsection is organized according to the steps of that methodology.
Defining the scope. We first identified the end users of our tool as molecular and computational biologists and bioinformaticians with limited programming experience. The tool therefore had to be easy to set up and execute. Next, we identified the following use-cases:
The tool should take TCGA study identifiers or study-level metadata values from the GDC and identify dbGaP studies with analogous data.
The tool should subsequently provide the capability of fetching the sequence level genomic data directly for these studies from the NCBI SRA data repository.
These use-cases defined the modules that needed to be developed.
Mapping the metadata. We first extracted the required metadata by parsing the raw XML data and scraping the websites of both TCGA (GDC) and dbGaP. This metadata is stored as mapping tables in CSV format. Based on the extracted metadata, we developed two mapping dictionaries to translate between 1) disease terms and 2) genomic data types, as defined separately within dbGaP and the GDC.
Accomplishing this mapping was challenging, as the allowable values for these fields are strictly controlled in the GDC but completely user-defined in dbGaP. We designed a rule-based mapper to generate an initial map between search values from each repository, then manually curated these mappings to refine and rank the mapped terms. These mappings are stored and used during the execution of our tool.
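To make this step concrete, the sketch below (Python 3 syntax) shows one way the curated CSV mapping tables could be loaded into translation dictionaries at run time. The file names and column headers (disease_mapping.csv, datatype_mapping.csv, gdc_term, dbgap_term) are hypothetical placeholders, not the exact schema used in our repository.

```python
import csv

def load_mapping(csv_path):
    """Load a curated two-column mapping table (CSV) into a dictionary.

    The column names 'gdc_term' and 'dbgap_term' are illustrative; the
    actual tables in the repository may use different headers.
    """
    mapping = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            mapping[row["gdc_term"].strip().lower()] = row["dbgap_term"].strip()
    return mapping

# Hypothetical file names for the two mapping dictionaries described above.
disease_map = load_mapping("disease_mapping.csv")
datatype_map = load_mapping("datatype_mapping.csv")

# Translate a GDC term into the corresponding dbGaP search term,
# falling back to the original term when no curated mapping exists.
dbgap_disease_term = disease_map.get("bronchus and lung", "bronchus and lung")
```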
Developing the required modules. Both TCGA data (through the GDC API; https://gdc.cancer.gov/developers/gdc-application-programming-interface-api) and dbGaP data (through NCBI Eutils; https://eutils.ncbi.nlm.nih.gov/entrez/eutils/) are accessible via APIs that allow metadata transfer in XML or JSON format. An API, or Application Programming Interface, provides an interface to data and services that other programs can use directly.
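As a minimal illustration of this kind of metadata transfer, the sketch below requests study-level metadata for one TCGA project from the GDC API in JSON. The fields printed are examples only and may differ from the fields our modules actually consume.

```python
import requests

# Fetch study-level metadata for one TCGA project from the GDC API.
response = requests.get("https://api.gdc.cancer.gov/projects/TCGA-LUAD")
response.raise_for_status()
project = response.json()["data"]

# Field names below are illustrative; the GDC response contains many more.
print(project.get("project_id"), project.get("name"), project.get("primary_site"))
```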
The SRA toolkit is a software package that allows researchers to obtain sequence data (with appropriate access rights) from the SRA database. Queries can be narrowed by various parameters, including the genomic region and the type of sequence (e.g., mRNA or whole-genome shotgun).
We used Python (version 2.7; https://www.python.org/) to develop our tool and aimed to keep it as platform-agnostic as possible. Because the SRA toolkit is Unix-based, only the final part of the implementation pipeline, discussed below, is a shell script (and is therefore not directly compatible with Windows environments).
We developed an easy-to-use tool for finding additional data in dbGaP (and the SRA) by expanding TCGA queries automatically. The first part of the pipeline allows researchers to query either repository by TCGA Project ID, File ID, Case ID, disease type, or experimental strategy via a metadata mapping dictionary. It returns not only a list of TCGA IDs but also a list of related dbGaP study IDs. For dbGaP studies with NCBI SRA data, the second part of the pipeline will return, via the SRA Toolkit, .sam files that contain reads aligned to a genomic region of interest. Our tool is divided into three modules, as illustrated in Figure 1. Below, each module is discussed in detail.
This component of the pipeline queries the GDC in multiple ways, including a direct ID search for projects, cases, samples, or files, or a custom search by cancer type or experimental method. Currently, the scope of the custom search is limited to the terms available in the GDC data portal (Table 1). The module fetches the metadata using the GDC API and extracts the metadata terms related to the specified ID (i.e., the cancer type and experimental method). It then translates these terms into the corresponding dbGaP search terms and returns the relevant dbGaP study IDs using the NCBI Eutils API. While the pipeline is executing, the XML/JSON outputs of the APIs are processed in memory behind the scenes, so end users are not exposed to the APIs directly.
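A simplified version of this translate-and-search step is sketched below. The helper assumes the disease mapping dictionary from the metadata-mapping step, and the GDC field it reads ('disease_type') is illustrative; the real module parses more of the GDC response than shown here.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def related_dbgap_study_uids(gdc_project_id, disease_map):
    """Translate a TCGA project's disease term and search dbGaP via Eutils.

    `disease_map` is the curated GDC-to-dbGaP disease dictionary built
    during the metadata-mapping step (see the earlier sketch).
    """
    meta = requests.get(
        "https://api.gdc.cancer.gov/projects/" + gdc_project_id
    ).json()["data"]

    disease = meta.get("disease_type", "")
    if isinstance(disease, list):  # the GDC may return a list of terms
        disease = disease[0] if disease else ""

    # Fall back to the raw GDC term if no curated mapping exists.
    term = disease_map.get(disease.lower(), disease)

    hits = requests.get(
        EUTILS + "/esearch.fcgi",
        params={"db": "gap", "term": term, "retmode": "json", "retmax": 100},
    ).json()
    return hits["esearchresult"]["idlist"]
```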
The mapping between the Disease and Primary Site fields can be found in our GitHub repository.
For custom searches, this module returns results from both the GDC and dbGaP simultaneously; thus, it also provides a consolidated search capability over the TCGA and dbGaP data. The output from this module consists of two files: a list of matching TCGA IDs and a list of related dbGaP study IDs.
The second component of the pipeline takes the list of dbGaP study IDs and, when available, returns the list of sequence read runs (SRRs) in the NCBI SRA that belong to those dbGaP studies. Users can specify a genomic region of interest as an additional parameter.
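One possible way to resolve a dbGaP study accession to its SRR run accessions through NCBI Eutils is sketched below. It assumes the SRA 'runinfo' report is available through efetch and that the study accession (e.g. phs000178 for TCGA) is indexed in the SRA Entrez database; our module's exact queries may differ.

```python
import csv
import io
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def srr_accessions_for_study(phs_accession):
    """List SRR run accessions associated with a dbGaP study accession."""
    search = requests.get(
        EUTILS + "/esearch.fcgi",
        params={"db": "sra", "term": phs_accession, "retmode": "json",
                "retmax": 500},
    ).json()
    uids = search["esearchresult"]["idlist"]
    if not uids:
        return []

    # Pull the run table for these SRA records in the 'runinfo' CSV format.
    runinfo = requests.get(
        EUTILS + "/efetch.fcgi",
        params={"db": "sra", "id": ",".join(uids), "rettype": "runinfo",
                "retmode": "text"},
    )
    rows = csv.DictReader(io.StringIO(runinfo.text))
    return [row["Run"] for row in rows if row.get("Run")]
```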
The final part of the pipeline takes a list of SRRs and uses the SRA Toolkit to return sequence-level genomic data for a genomic region of interest directly from the NCBI SRA data repository. This module assumes that the required authorization has been granted before the sequencing data are accessed.
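This shell-script step can be approximated with a thin Python wrapper around the SRA Toolkit's sam-dump command, as in the sketch below. The run accession and region shown are hypothetical, the chromosome naming must match the reference used for the run, and controlled-access runs still require prior dbGaP authorization configured for the toolkit.

```python
import subprocess

def dump_region(srr_accession, region, out_path):
    """Extract reads aligned to a genomic region from one SRR via sam-dump.

    Requires the SRA Toolkit to be installed and on PATH; writes SAM output
    to `out_path`.
    """
    with open(out_path, "w") as out:
        subprocess.run(
            ["sam-dump", "--aligned-region", region, srr_accession],
            stdout=out,
            check=True,
        )

# Example call with a hypothetical run accession and region.
# dump_region("SRR1234567", "7:140400000-140500000", "SRR1234567_region.sam")
```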
To our knowledge, this is the first easy-to-use tool for harmonizing TCGA and dbGaP study metadata for the purposes of data discovery and consolidated querying. We would like to continue working with the cancer biology community to develop this interface tool. Future improvements include extending our search capabilities to other metadata, the option to query multiple genomic regions simultaneously, and a user-friendly GUI. Feature requests and code contributions can be made through our GitHub repository, which we will continue to monitor.
Latest source code: https://github.com/NCBI-Hackathons/TCGA_dbGaP.
Archived source code at the time of publication: doi: 10.5281/zenodo.160551 (Kurata, 2016); https://zenodo.org/record/160551
License: CC0 1.0 Universal
BB, LA and SR conceived the idea. All authors participated in the background research, design and implementation of the software tool. JK, EW and SR developed the variable mappings across dbGaP and TCGA. JK, AB, YL, EW and SR were primarily involved in implementation of the software components. EW and LA prepared the first draft of the manuscript. SR contributed to subsequent drafts.
Funding was provided by the NIH intramural research program at the NLM. JK was supported in part by the Frances Berger Foundation Fellowship and her lab was supported in part by Ramesh Kesanupalli family and Beckman Research Institute Funds.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We would like to acknowledge Lisa Federer, NIH Library Writing Center, for manuscript editing assistance. The work described here was done as part of the NCBI Hackathon (August 2016). This research was supported in part by an appointment to the National Library of Medicine (NLM) Research Participation Program administered by ORISE through an interagency agreement between the U.S. Department of Energy and the NLM, and supported in part by the Intramural Research Program of the U.S. National Institutes of Health, NLM.
Open Peer Review
Reviewer Report 1
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Report 2
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Report 3
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? No
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: No competing interests were disclosed.