Software Tool Article

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive

[version 1; peer review: 2 approved with reservations]
PUBLISHED 19 May 2020

This article is included in the Hackathons collection.

Abstract

The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been challenged by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA in order to facilitate secondary analyses of the SRA’s human RNA-seq data. The first tool, called the Case-Control Finder, finds suitable case and control samples for a given disease or condition, where the cases and controls are matched by tissue or cell type. The second tool, called the Series Finder, finds ordered sets of samples for the purpose of addressing biological questions pertaining to changes over a numerical property such as time. These tools were the result of a three-day NCBI Codeathon held in March 2019 at the University of North Carolina at Chapel Hill.

Keywords

Hackathon, RNA-seq, Sequence Read Archive, MetaSRA, Metadata, Ontology, Jupyter

Introduction

The Sequence Read Archive (SRA; Leinonen et al., 2011) is a large public repository that stores next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been challenged by the heterogeneity and poor quality of the metadata that describe its biological samples (Gonçalves & Musen, 2019). Recently, the MetaSRA project (Bernstein et al., 2017) standardized these metadata by annotating each sample with terms from biomedical ontologies including Cell Ontology (Bard et al., 2005), Uberon (Mungall et al., 2012), Disease Ontology (Schriml et al., 2019), Cellosaurus (Bairoch, 2018), and the Experimental Factors Ontology (Malone et al., 2010). The MetaSRA also features an interface (http://metasra.biostat.wisc.edu) for querying human RNA-seq samples using these ontology term annotations. However, the MetaSRA web interface is not capable of producing structured datasets, such as those that match case samples associated with a target condition or disease with healthy control samples. Similarly, the MetaSRA web interface cannot search for samples associated with a particular condition and/or tissue type that are ordered according to a numeric property (e.g., age).

Construction of such datasets is non-trivial and requires further processing of the results provided by the MetaSRA website. For example, finding case and control samples for a given disease likely requires matching case samples to control samples according to their tissue or cell type. Furthermore, given these search results, users may wish to further filter samples according to whether they are poorly annotated (i.e., are missing cell type or tissue information), whether they are derived from a cell line, or whether they were experimentally treated. Moreover, given these results, the user may wish to explore other ontology terms associated with the search results within either the case or control samples to check for any variables that may confound downstream analyses. Finding longitudinal or time-series data presents similar challenges. To the best of our knowledge, no existing tool addresses these tasks.

To address these two tasks, we produced two Jupyter notebook-based tools. The first tool, called the Case-Control Finder, searches the SRA via the MetaSRA terms to produce matched case and control samples for a given disease or condition, where the cases and controls are matched by tissue and cell type. The second tool, called the Series Finder, finds ordered sets of samples for the purpose of answering biological questions pertaining to changes over a numerical property (e.g., time). More specifically, the Series Finder produces ordered sets of samples, where the order is determined by a numerical property in the metadata such as age, as standardized by the MetaSRA’s real-valued properties. These tools promise to facilitate the construction of suitable public datasets for secondary analyses.

Methods

The tools presented in this work were written in Python (v3.6) and make use of Python packages pandas (McKinney, 2011), Matplotlib (Hunter, 2007), and seaborn (https://seaborn.pydata.org). These notebooks are available ready-to-run in a Docker container.

Case-Control Finder

The Case-Control Finder implements the following steps to produce a dataset of matched case and control samples for a given disease (Figure 1A):

Figure 1. Data flows for hypothesis-driven query tools.

An overview of the backend processing functions called from the Jupyter notebooks.

  • 1. Generate candidate case and control samples. Generate the set of candidate case samples by querying for all samples associated with a user-specified condition or disease using the MetaSRA-mapped ontology terms. Also, find all candidate control samples that are not associated with the target condition/disease.

  • 2. Filter poorly annotated samples. Filter samples based on a metadata completeness threshold, which requires that all samples be associated with either a tissue term or a cell type term. The tissue/cell type information is required for downstream matching of case samples to control samples.

  • 3. Apply user-specified filters. Further filter samples according to user-specified filtering parameters. The user can filter out cell line samples, treated samples, and in vitro differentiated samples. The user can also remove all diseased samples from the candidate control samples for the purpose of generating a healthy control-set.

  • 4. Match by tissue and cell type. The candidate case samples are then matched with the candidate control samples by their tissue and cell type terms. Specifically, because each sample can be associated with multiple ontology terms in the MetaSRA, a set of case samples is matched with a set of control samples only when both sets are labeled with exactly the same set of tissue and cell type terms. For example, a set of case samples annotated with the terms “liver” and “epithelial cell” will be matched only to control samples labeled with exactly these terms (Figure 2A). This ensures that case samples are matched with maximally similar control samples and avoids matching samples from different tissue types. For instance, a set of case samples labeled with both “liver” and “epithelial cell” will not be matched with a set of samples labeled only as “epithelial cell,” as there is no guarantee that the latter samples originate from the liver.
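
The matching rule in step 4 can be illustrated with a minimal sketch. This is not the notebooks' actual code: the sample IDs, the toy annotations, and the function name match_cases_to_controls are all hypothetical, term names stand in for ontology term IDs, and the user-specified filters of step 3 are omitted. It only shows the idea of grouping samples by their exact set of tissue and cell type terms, after dropping samples that have none (step 2).

```python
from collections import defaultdict

# Hypothetical toy metadata: sample ID -> set of ontology term names.
# (Term names are used instead of ontology term IDs for readability.)
SAMPLE_TERMS = {
    "S1": {"liver cancer", "liver", "epithelial cell"},
    "S2": {"liver", "epithelial cell"},
    "S3": {"epithelial cell"},
    "S4": {"liver cancer", "liver"},
    "S5": {"liver"},
    "S6": {"liver cancer"},  # no tissue or cell type term: poorly annotated
}

# Hypothetical lookup of which terms are tissue or cell type terms.
TISSUE_OR_CELL_TYPE = {"liver", "epithelial cell"}


def match_cases_to_controls(sample_terms, disease, tissue_terms):
    """Group case and control samples by their exact tissue/cell type term set."""
    groups = defaultdict(lambda: {"case": [], "control": []})
    for sample, terms in sorted(sample_terms.items()):
        key = frozenset(terms & tissue_terms)
        if not key:
            # Step 2: skip samples lacking both tissue and cell type terms.
            continue
        role = "case" if disease in terms else "control"
        groups[key][role].append(sample)
    # Keep only term sets that have at least one case and one control.
    return {k: v for k, v in groups.items() if v["case"] and v["control"]}


matched = match_cases_to_controls(SAMPLE_TERMS, "liver cancer", TISSUE_OR_CELL_TYPE)
for term_set, group in matched.items():
    print(sorted(term_set), group)
```

In this toy example, S1 is paired with S2 (both labeled “liver” and “epithelial cell”) and S4 with S5 (both labeled only “liver”), while S3, labeled only “epithelial cell,” is not offered as a control for S1.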

Figure 2. Example results from the Case-Control Finder.

Results from running the Case-Control Finder for the query “liver cancer.” (A) The Case-Control Finder displays the number of case/control studies (left) and case/control samples (right) matched by each tissue and cell type. (B) The user can select either the case samples or control samples for a given tissue or cell type and display the most common ontology terms associated with those selected samples. Displayed here are the most common terms associated with the case samples labeled as “liver.” (C) The notebook also displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right).

Once the dataset is constructed, the notebook enables the user to explore the samples for other MetaSRA-mapped ontology terms within the data (Figure 2B and C). By inspecting the other common ontology terms in the data, the user may be able to identify variables that could confound downstream analyses.
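
As a rough illustration of this exploration step, the sketch below counts how often each remaining ontology term occurs among a selected set of samples. The data, the term names, and the function name most_common_terms are hypothetical and are not taken from the notebooks; ties are broken alphabetically so that the result is deterministic.

```python
from collections import Counter

# Hypothetical annotations for three selected case samples.
SELECTED_SAMPLE_TERMS = {
    "A": {"liver", "hepatocellular carcinoma", "adult"},
    "B": {"liver", "hepatocellular carcinoma", "fetal"},
    "C": {"liver", "adult"},
}


def most_common_terms(sample_terms, exclude=(), top_n=3):
    """Return the top_n most frequent terms, ignoring the excluded ones."""
    excluded = set(exclude)
    counts = Counter()
    for terms in sample_terms.values():
        counts.update(terms - excluded)
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Exclude the term used to select the samples, since every sample has it.
top_terms = most_common_terms(SELECTED_SAMPLE_TERMS, exclude={"liver"}, top_n=2)
print(top_terms)
```

A term that is frequent among the cases but rare among the matched controls (or vice versa) is a candidate confounder worth checking before any downstream comparison.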

Series Finder

The Series Finder finds RNA-seq data samples that are associated with a numerical property (e.g., age or time point) for a given tissue or cell type. To do so, the Series Finder utilizes the real-value property annotations provided by the MetaSRA where each real-value property in the MetaSRA is structured as a tuple consisting of a property name (e.g., age), numerical value, and unit (e.g., year).
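
For illustration, such a tuple could be represented in Python as follows. The class and field names are assumptions made for this sketch and do not reflect how the MetaSRA or the notebooks store these annotations.

```python
from typing import NamedTuple


class RealValueProperty(NamedTuple):
    """Hypothetical container for one real-value property annotation."""
    property_name: str  # e.g., "age"
    value: float        # the numerical value
    unit: str           # e.g., "year"


# A sample from a 63-year-old donor would carry an annotation like this.
age = RealValueProperty(property_name="age", value=63.0, unit="year")

# NamedTuple instances unpack like plain tuples.
name, value, unit = age
print(name, value, unit)
```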

To perform a query, the user provides an ontology term, such as a tissue or cell type, as well as a property name and unit. The Series Finder then finds all samples that are associated with the target ontology term and real-value property. The user can also provide a set of blacklist terms that can be used to filter the samples. Given a list of blacklist terms, the Series Finder will remove all samples annotated with any blacklist term. The Series Finder will then return all remaining samples ordered by their associated numerical values (Figure 1B).
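
The query logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Series Finder's implementation: the record layout, the sample IDs, and the function name find_series are hypothetical, and term names stand in for ontology term IDs.

```python
# Hypothetical toy metadata: each sample has ontology terms and a list of
# (property name, value, unit) tuples.
SAMPLES = {
    "S1": {"terms": {"brain"}, "properties": [("age", 63.0, "year")]},
    "S2": {"terms": {"brain", "glioblastoma"}, "properties": [("age", 40.0, "year")]},
    "S3": {"terms": {"brain"}, "properties": [("age", 2.0, "year")]},
    "S4": {"terms": {"liver"}, "properties": [("age", 30.0, "year")]},
    "S5": {"terms": {"brain"}, "properties": [("age", 18.0, "month")]},
}


def find_series(samples, term, property_name, unit, blacklist=()):
    """Return (sample ID, value) pairs ordered by the numerical value."""
    blacklisted = set(blacklist)
    hits = []
    for sample_id, record in samples.items():
        terms = record["terms"]
        # Require the target term and reject any blacklisted term.
        if term not in terms or terms & blacklisted:
            continue
        for name, value, prop_unit in record["properties"]:
            if name == property_name and prop_unit == unit:
                hits.append((value, sample_id))
                break
    return [(sample_id, value) for value, sample_id in sorted(hits)]


series = find_series(SAMPLES, "brain", "age", "year", blacklist={"glioblastoma"})
print(series)
```

Here S2 is removed by the blacklist, S4 lacks the target term, and S5 is skipped because its age is recorded in months rather than years; this sketch does not convert between units.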

Results and use cases

We used the Case-Control Finder to query for liver cancer RNA-seq samples matched with healthy control samples. This query resulted in six sets of samples representing different tissues or cell types, including epithelial cells, hepatocytes, stem cells, and liver tissue (Figure 2A). The Case-Control Finder identified common terms associated with the “liver cancer” case samples (Figure 2B), and categorized these samples by cell line status, sex, developmental stage, and treatment status (Figure 2C).

We used the Series Finder to find all brain samples in the SRA ordered by the age of the sample donor. This query resulted in samples spanning many ages (Figure 3A). This dataset could prove useful for exploring gene expression-based signatures of aging. For the samples at each age, the Series Finder also categorized the samples by cell line status, sex, developmental stage, and treatment status (Figure 3B) and identified the most common terms (Figure 3C).

Figure 3. Example results from the Series Finder.

Results from running the Series Finder for the query “brain” sorted by “age,” where the unit is specified as “year.” (A) The Series Finder displays the number of samples sorted by age. (B) The user can select samples associated with a given time point for further exploration. Here the samples annotated as “year = 63” are selected. The notebook then displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right). (C) Given the selected samples from (B), the notebook displays the most frequent terms associated with those selected samples. Displayed here are the most frequent terms associated with the samples annotated as “year = 63.”

Conclusion and future work

We implemented two Jupyter notebooks for performing hypothesis-driven queries of public RNA-seq samples in the SRA. These tools are built upon the standardized metadata provided by the MetaSRA project and enable querying of the metadata beyond what is natively possible via the MetaSRA website interface. Future work will entail either integrating these tools into a standard web interface, such as that of the MetaSRA website, or implementing a stand-alone web application for these tools using a platform such as R Shiny.

Data availability

The figures and datasets produced in the analyses can be found on GitHub: https://github.com/mbernste/hypothesis-driven-SRA-queries/tree/master/results

Software availability

All code is maintained on GitHub: https://github.com/mbernste/hypothesis-driven-SRA-queries

Archived code as at time of publication: https://doi.org/10.5281/zenodo.3807512 (Bernstein, 2020)

License: CC0

how to cite this article
Bernstein MN, Gladstein A, Latt KZ et al. Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive [version 1; peer review: 2 approved with reservations]. F1000Research 2020, 9:376 (https://doi.org/10.12688/f1000research.23180.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 05 Jun 2020
Shannon Ellis, Department of Cognitive Science, UC San Diego, La Jolla, CA, USA;  Department of Biostatistics, Johns Hopkins University School of Public Health, Baltimore, MD, USA 
Approved with Reservations
This paper describes the development of two Jupyter notebook-based tools (Case-Control Finder and Series Finder) for improving the ease with which researchers can identify cases within the SRA for further study.

While the paper does a nice ... Continue reading
HOW TO CITE THIS REPORT
Ellis S. Reviewer Report For: Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive [version 1; peer review: 2 approved with reservations]. F1000Research 2020, 9:376 (https://doi.org/10.5256/f1000research.25586.r63614)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 04 Aug 2020
    Matthew Bernstein, Morgridge Institute for Research, Madison, 53715, USA
    We greatly appreciate the reviewer's valuable suggestions and feedback. Please see our responses below:

    1. We agree that using the MetaSRA’s API would be a great idea; however, the API restricts queries ... Continue reading
Reviewer Report 01 Jun 2020
Zichen Wang, Sema4, Stamford, CT, USA 
Approved with Reservations
Bernstein et al. provide two Jupyter notebook-based tools to facilitate re-analysis of human RNA-seq data deposited to SRA. The tools were built on top of annotated metadata of RNA-seq samples from the MetaSRA, and provided some visualizations of the summary statistics ... Continue reading
HOW TO CITE THIS REPORT
Wang Z. Reviewer Report For: Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive [version 1; peer review: 2 approved with reservations]. F1000Research 2020, 9:376 (https://doi.org/10.5256/f1000research.25586.r63613)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 04 Aug 2020
    Matthew Bernstein, Morgridge Institute for Research, Madison, 53715, USA
    We greatly appreciate the reviewer's valuable feedback. Please find our responses to each point below:

    1. Within the abstract we now point the reader to the tools’ Github repository, which describes ... Continue reading