Keywords
RNA-seq, ideogram, javascript, next generation sequencing
This article is included in the Hackathons collection.
Interactive visualizations can yield insights from the deluge of gene expression data brought about by RNA-seq technology. Several genome browsers enable users to see such data conveniently plotted within a single chromosome in a web application (Broad Institute, 2014; Kent et al., 2002; National Center for Biotechnology Information: Genome Data Viewer (2016)). While such single-chromosome views excel at displaying local features, depicting RNA-seq data across all chromosomes in a genome, i.e. in an ideogram, has the potential to intuitively highlight global patterns of gene expression (such as in Figure 2a in Parker et al., 2016).
In this paper we describe RNA-Seq Viewer, a web application that enables users to visualize genome-wide expression data from the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA) (Kodama et al., 2012) and Gene Expression Omnibus (GEO) (Barrett et al., 2013) databases. The application consists of a backend data pipeline written in Python and a web frontend powered by Ideogram.js, a JavaScript library for chromosome visualization (Weitz, 2015).
The data pipeline, developed by a small team of software engineers in a three-day NCBI hackathon at Brandeis University, extracts aligned RNA-seq data from SRA or GEO and transforms it into a format used by Ideogram. Ideogram then displays the distribution of genes in chromosome context across the entire human genome and enables users to filter those genes by gene type or expression levels in the given SRA/GEO sample.
The primary task of the hackathon was to develop a prototype data pipeline to extract aligned RNA-seq data from SRA, determine genomic coordinates for the sampled genes, and transform the combined result into the JSON format used by Ideogram.js annotation sets. The formatted annotation data was then plugged into a lightly modified example from the Ideogram repository to provide an interactive, faceted search application for exploring genome-wide patterns of gene expression.
Ideogram.js uses JavaScript and SVG to draw chromosomes and associated annotation data in HTML documents. It leverages D3.js, a popular JavaScript visualization library, for data binding, DOM manipulation, and animation (Bostock et al., 2011). Faceted search in Ideogram is enabled by Crossfilter, a JavaScript library for exploring large multivariate datasets (Square Inc., 2012). By relying only on JavaScript libraries, HTML and CSS, Ideogram can function entirely in a web browser, with no server-side code required, which simplifies embedding ideograms in a web application.
Annotation data for Ideogram uses space-efficient data structures and the compact nature of JSON to minimize load time in web pages. For example, the gzip-compressed set of 31,148 human gene feature annotations output by our pipeline for SRA run SRR562646 (National Center for Biotechnology Information: Sequence Read Archive Run Browser), including data on expression level and gene type, is 399 KB in size and takes less than 285 ms to download on an average US Internet connection (14 Mb/s download bandwidth, 50 ms latency) (Belson et al., 2016), as measured using Chrome Developer Tools (Basques & Kearney, 2016). Under the same network-throttled conditions, using Chrome version 51 on a Mac OS X laptop with a 2.9 GHz Intel Core i5 CPU, the Chrome DevTools Timeline tab reports that Ideogram takes between 830 ms and 1044 ms after the start of navigation to completely load and render an uncached, interactive genome-wide histogram of expression for 31,148 gene features.
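As a rough plausibility check on the reported download time, the figure can be approximated from the file size, bandwidth, and latency given above. The sketch below assumes a simple model (one latency delay plus serialization time) and 1 KB = 1000 bytes; the function name is hypothetical, and real transfers also involve connection setup and protocol overhead that this ignores.

```python
def estimated_download_ms(size_bytes, bandwidth_bits_per_s, latency_ms):
    """Rough estimate: one latency delay plus time to push the bytes through the link."""
    transfer_ms = size_bytes * 8 / bandwidth_bits_per_s * 1000
    return latency_ms + transfer_ms


# Values from the text: 399 KB file, 14 Mb/s bandwidth, 50 ms latency.
# Assumption: 1 KB = 1000 bytes.
estimate = estimated_download_ms(399 * 1000, 14_000_000, 50)
print(f"{estimate:.0f} ms")  # prints "278 ms"
```

Under these assumptions the estimate is consistent with the measured value of less than 285 ms.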
Broadly, the pipeline developed to produce Ideogram annotation data works as follows:
1. Get data for an SRR accession from NCBI SRA (National Center for Biotechnology Information: Sequence Read Archive).
2. Count reads for each gene and normalize expression values to TPM units (Wagner et al., 2012).
3. Get coordinates and type for each gene from a GFF file in the NCBI Homo sapiens Annotation Release.
4. Format coordinates and TPM values for each gene into JSON used by Ideogram.js.
The data pipeline has two parts: one for data in SRA and one for data in GEO.
The tool reads a list of SRR accession numbers (National Center for Biotechnology Information, 2011; National Center for Biotechnology Information: SRA Handbook (2011)) and identifies those that include alignments. It then determines the reference genome used to create the BAM/SAM file and downloads the corresponding gene annotation for quantification. Only genome assemblies GRCh37 (GCA_000001405.1) and GRCh38 (GCA_000001405.15) are supported; the annotations used for them are NCBI Homo sapiens Annotation Release 105 and 107, respectively (National Center for Biotechnology Information, 2013; National Center for Biotechnology Information, 2015).
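The assembly-to-annotation pairing described above can be expressed as a small lookup. This is an illustrative sketch only, not the pipeline's actual code: the names `SUPPORTED_ASSEMBLIES` and `annotation_release_for` are hypothetical, while the accessions and release numbers are those given in the text.

```python
# Assembly accession -> (assembly name, NCBI Homo sapiens Annotation Release).
# Values are taken from the text; the structure itself is an assumption.
SUPPORTED_ASSEMBLIES = {
    "GCA_000001405.1": ("GRCh37", 105),
    "GCA_000001405.15": ("GRCh38", 107),
}


def annotation_release_for(assembly_accession):
    """Return (assembly name, annotation release) or raise for unsupported assemblies."""
    try:
        return SUPPORTED_ASSEMBLIES[assembly_accession]
    except KeyError:
        raise ValueError(f"Unsupported assembly: {assembly_accession}") from None
```

Failing early on an unsupported assembly avoids quantifying reads against a mismatched annotation.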
Alternatively, the tool can read a local BAM/SAM file. In a single command, the tool runs sam-dump version 1.3 (National Center for Biotechnology Information, 2011) and then quantifies gene expression using HTSeq-count version 0.6.1p1 (Anders et al., 2015). To avoid possible errors due to non-standard SAM files, intermediate filtering steps sort the BAM file and keep only properly paired reads. The output from HTSeq-count is a tabular file in which the first column is the gene symbol and the second is the read count. Finally, we normalize expression by the length of the mature transcript, using the longest transcript as the size of the gene.
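The normalization step can be illustrated with a minimal sketch of TPM computed from a two-column count table. It assumes the standard TPM definition (length-normalized rates scaled to sum to one million) and a precomputed mapping of gene symbol to longest-transcript length; the function names and the `lengths_bp` argument are hypothetical and not taken from the pipeline's source.

```python
def parse_htseq_counts(lines):
    """Parse HTSeq-count output (gene<TAB>count), skipping the '__' summary rows."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("__"):
            continue
        gene, count = line.split("\t")
        counts[gene] = int(count)
    return counts


def counts_to_tpm(counts, lengths_bp):
    """Convert read counts to TPM using per-gene transcript lengths in base pairs.

    Genes with no known (or zero) length are left out of the result.
    """
    rates = {g: counts[g] / lengths_bp[g] for g in counts if lengths_bp.get(g)}
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in rates}
    return {g: rate / total * 1e6 for g, rate in rates.items()}
```

For example, two genes with 100 reads over 1,000 bp and 300 reads over 3,000 bp have equal per-base rates and would each receive 500,000 TPM.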
After obtaining TPM values for each gene’s expression level (Step 2) as described above, the next step in the pipeline parses genomic coordinates (chromosome name, start and stop) and gene type (e.g. mRNA, ncRNA) from a GFF file in the NCBI Homo sapiens Annotation Release. These data are combined with each gene’s TPM value, formatted into a compressed JSON structure, and written to a file containing symbols, genomic coordinates, expression levels and gene types for every human gene. This file, e.g. SRR562646.json, represents the final output of the RNA-Seq Viewer data pipeline, and contains all the data used by the fast client-side faceted search in Ideogram.js.
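The coordinate-parsing and formatting steps might look roughly like the sketch below. It assumes tab-delimited GFF3 with nine columns and `Name` attributes on `gene` features, a caller-supplied mapping from sequence accession to chromosome name, and a compact layout with a shared `keys` list and per-chromosome row arrays. The key names, function names, and the `gene_types` argument are illustrative assumptions; the exact schema Ideogram.js expects should be checked against its documentation.

```python
import json
from collections import defaultdict


def parse_gff_attributes(field):
    """Turn 'ID=gene0;Name=XYZ' into {'ID': 'gene0', 'Name': 'XYZ'}."""
    return dict(pair.split("=", 1) for pair in field.split(";") if "=" in pair)


def gene_records(gff_lines, chrom_names):
    """Yield (symbol, chromosome, start, stop) for each 'gene' feature.

    chrom_names maps sequence accessions (GFF column 1) to chromosome names;
    features on unmapped sequences are skipped.
    """
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        chrom = chrom_names.get(cols[0])
        if chrom is None:
            continue
        attrs = parse_gff_attributes(cols[8])
        yield attrs.get("Name", attrs.get("ID")), chrom, int(cols[3]), int(cols[4])


def to_annotation_json(records, tpm, gene_types):
    """Group genes by chromosome and serialize without extra whitespace."""
    by_chrom = defaultdict(list)
    for name, chrom, start, stop in records:
        row = [name, start, stop - start + 1,
               round(tpm.get(name, 0.0), 2), gene_types.get(name, "unknown")]
        by_chrom[chrom].append(row)
    doc = {
        # Key names are assumptions made for this sketch.
        "keys": ["name", "start", "length", "expression-level", "gene-type"],
        "annots": [{"chr": c, "annots": rows} for c, rows in by_chrom.items()],
    }
    return json.dumps(doc, separators=(",", ":"))
```

Storing each gene as a positional array under one shared `keys` list, rather than as a per-gene object, avoids repeating field names tens of thousands of times, which is one way to keep the file compact.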
The resulting RNA-Seq Viewer web application prototype was demonstrated at the conclusion of the three-day hackathon at Brandeis University. The application provides an interactive data visualization in which users can filter genes by expression level and gene type across the entire human genome (Figure 1) or within a single chromosome (Figure 2).
The RNA-Seq Viewer prototype demonstrates a pipeline for transforming aligned RNA-seq data from SRA into a format used for genome-wide visualization.
Next steps for this data pipeline include supporting RNA-seq alignment and normalization when using multiple samples, such as from different tissues. Filters for those tissues could then be added to the display. The resulting genome-wide visualizations could be embedded in genome browsers, e.g. NCBI Genome Data Viewer (National Center for Biotechnology Information: Genome Data Viewer), or in any genomics-oriented application that supports HTML, CSS, and JavaScript.
The prototype implemented in the hackathon only supports RNA-seq datasets from SRA that are already aligned to a reference genome, e.g. GRCh37 or GRCh38. Salmon (Patro et al., 2015) and Kallisto (Bray et al., 2016) are two popular programs that could be used to handle unaligned reads. Both can generate gene expression estimates with low memory and CPU requirements.
Latest source code: https://github.com/NCBI-Hackathons/rnaseqview
Archived source code as at the time of publication: https://dx.doi.org/10.5281/zenodo.377055 (Weitz et al., 2017)
License: CC0 1.0 Universal
All of the authors participated in designing the study, carrying out the research, and preparing the manuscript. All authors were involved in the revision of the draft manuscript and have agreed to the final content.
Work on this project by Eric Weitz and Ben Busby was supported by the Intramural Research Program of the National Institutes of Health (NIH)/National Library of Medicine (NLM)/National Center for Biotechnology Information (NCBI).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors thank Francesco Pontiggia and Brandeis University for provisioning computing resources and facilities for development. The authors thank Lisa Federer, NIH Library Writing Center, for manuscript editing assistance. The authors thank Valerie Schneider, NCBI, for insightful comments and suggestions.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational biology, genomics