Software Tool Article

WTFgenes: What's The Function of these genes? Static sites for model-based gene set analysis

[version 1; peer review: 2 approved with reservations]
PUBLISHED 04 Apr 2017

Abstract

A common technique for interpreting experimentally-identified lists of genes is to look for enrichment of genes associated with particular ontology terms. The most common test uses the hypergeometric distribution; more recently, a model-based test was proposed. These approaches must typically be run using downloaded software, or on a server. We develop a collapsed likelihood for model-based gene set analysis and present WTFgenes, an implementation of both hypergeometric and model-based approaches, that can be published as a static site with computation run in JavaScript on the user's web browser client. Apart from hosting files, zero server resources are required: the site can (for example) be served directly from Amazon S3 or GitHub Pages. A C++11 implementation yielding identical results runs roughly twice as fast as the JavaScript version. WTFgenes is available from https://github.com/evoldoers/wtfgenes under the BSD3 license. A demonstration for the Gene Ontology is usable at https://evoldoers.github.io/wtfgo.

Keywords

Gene Ontology, Graphical Model, Gene Set Enrichment Analysis

Introduction

Term Enrichment Analysis (TEA) is a common technique for finding functional patterns, specifically overrepresented ontology terms, in a set of experimentally identified genes [1]. The most common approach, which we refer to as Frequentist TEA, is a one-tailed Fisher's Exact Test (based on the hypergeometric distribution, which models the number of term-associations if the gene set were chosen by chance), with a suitable correction for multiple hypothesis testing. Frequentist TEA has been implemented many times on various platforms [1–8].
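As an illustration of the one-tailed test described above, the following is a minimal sketch and not the WTFgenes code itself. The function names (logChoose, enrichmentPValue, bonferroni) and the argument layout are hypothetical, chosen only for this example. It computes the hypergeometric upper-tail probability for a single term, then applies a Bonferroni correction across the number of terms tested.

```javascript
// Log of the binomial coefficient C(n, k), accumulated as a sum of logs.
// Assumption: n is small enough that a simple loop is adequate.
function logChoose(n, k) {
  if (k < 0 || k > n) return -Infinity;
  let s = 0;
  for (let i = 1; i <= k; i++) s += Math.log(n - k + i) - Math.log(i);
  return s;
}

// One-tailed p-value P(X >= x), where X is hypergeometric:
//   n = genes in the background, K = background genes annotated to the term,
//   d = genes in the experimental set, x = set genes annotated to the term.
function enrichmentPValue(n, K, d, x) {
  const logDenom = logChoose(n, d);
  let p = 0;
  for (let i = x; i <= Math.min(K, d); i++) {
    p += Math.exp(logChoose(K, i) + logChoose(n - K, d - i) - logDenom);
  }
  return Math.min(1, p);
}

// Bonferroni correction: multiply by the number of terms tested, cap at 1.
function bonferroni(p, numTerms) {
  return Math.min(1, p * numTerms);
}
```

For example, with a background of 10 genes, 3 of them annotated to a term, and a gene set consisting of exactly those 3 genes, enrichmentPValue(10, 3, 3, 3) is 1/C(10,3) = 1/120.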

A model-based alternative to Frequentist TEA, which more directly addresses some of the multiple testing issues (for example, by modeling the ways that an observed gene list can be broken down into complementary gene sets), is Bayesian TEA. In contrast to Frequentist TEA, which just rejects a null hypothesis that genes are chosen by chance, Bayesian TEA explicitly models the alternative hypothesis that the gene set was generated from a few random ontology terms. This approach was introduced in [9] and further developed in [10], whose authors implemented model-based testing in Java and R [11]. However, the model-based approach remains significantly less well-explored than frequentist approaches.

The graphical model underpinning Bayesian TEA is sketched in Figure 1. For each of the $m$ terms there is a boolean random variable $T_j$ (“term $j$ is activated”). For each of the $n$ genes there is a directly observed boolean random variable $O_i$ (“gene $i$ is observed in the gene set”), and one deterministic boolean variable $H_i$ (“gene $i$ is activated”) defined by $H_i = 1 - \prod_{j \in G_i} (1 - T_j)$, where $G_i$ is the set of terms associated with gene $i$ (including directly annotated terms, as well as ancestral terms implied by transitive closure of the directly annotated terms). The probability parameters are $\pi$ (term activation), $\alpha$ (false positive) and $\beta$ (false negative), and the respective hyperparameters are $p = (p_0, p_1)$, $a = (a_0, a_1)$ and $b = (b_0, b_1)$.
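The deterministic relationship between term activation and gene activation amounts to a logical OR over the terms associated with each gene. A minimal sketch follows; the function name activatedGenes and the array-of-index-lists representation of the gene-term associations are assumptions made for this illustration, not the data structures used in WTFgenes.

```javascript
// termsOfGene[i] holds the indices of the terms associated with gene i
// (direct annotations plus ancestors); termActive[j] is 0 or 1.
// Gene i is activated (1) if at least one of its terms is activated,
// and not activated (0) otherwise, including when it has no terms.
function activatedGenes(termsOfGene, termActive) {
  return termsOfGene.map(terms => (terms.some(j => termActive[j]) ? 1 : 0));
}
```

With termsOfGene = [[0, 1], [1], [2]] and termActive = [0, 1, 0], the first two genes are activated and the third is not.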


Figure 1. Model-based explanation of observed genes ($O_i$) using ontology terms ($T_j$), following [10].

Other variables and hyperparameters are defined in the text. Circular nodes indicate continuous-valued variables or hyperparameters; square nodes indicate discrete-valued (boolean) variables. Dashed lines indicate deterministic relationships; shaded nodes indicate observations. Plates (rounded rectangles) indicate replicated subgraph structures.

The model is

$$P(T_j = 1 \mid \pi) = \pi$$
$$P(O_i = 1 \mid H_i = 0, \alpha) = \alpha$$
$$P(O_i = 1 \mid H_i = 1, \beta) = 1 - \beta$$

with $\pi \sim \mathrm{Beta}(p)$, $\alpha \sim \mathrm{Beta}(a)$ and $\beta \sim \mathrm{Beta}(b)$. The model of [10] is similar, but uses an ad hoc discretized prior for $\pi$, $\alpha$ and $\beta$.

Most Bayesian and Frequentist TEA implementations are designed for desktop use. Several Frequentist TEA implementations are designed for the web, such as DAVID-WS [6] and Enrichr [8,12,13], which has a rich dynamic web front-end. However, web-facing Frequentist TEA implementations generally require a server-hosted back end that executes code. Further, there are no JavaScript-based Bayesian TEA implementations, and no web-facing implementations other than the Java-based Ontologizer, which can be loaded via Java Web Start.

In order to further explore the model-based TEA and compare it to Frequentist TEA, and to make these investigations accessible to researchers in a way that would be easily embeddable in static websites, we developed WTFgenes, a JavaScript implementation of both approaches with (for time-sensitive applications) a parallel C++ implementation that is numerically identical.

We note in passing that Fisher's Exact Test, which we call Frequentist TEA, was originally motivated by a blind tea-tasting challenge [14].

Methods

Model

In developing our Bayesian TEA sampler, we introduce a collapsed version of the model in Figure 1 by integrating out the probability parameters. Let $c_p = \sum_{j}^{m} T_j$ count the number of activated terms, $c_g = \sum_{i}^{n} H_i$ the activated genes, $c_a = \sum_{i}^{n} O_i (1 - H_i)$ the false positives and $c_b = \sum_{i}^{n} (1 - O_i) H_i$ the false negatives.

Then

$$P(T, O \mid a, b, p) = Z(c_p; m, p) \, Z(c_a; n - c_g, a) \, Z(c_b; c_g, b)$$

where

$$Z(k; N, A) = \frac{B(N - k + A_0, \; k + A_1)}{B(A_0, A_1)}$$

is the beta-Bernoulli distribution for $k$ ordered successes in $N$ trials with hyperparameters $A = (A_0, A_1)$, using the beta function

$$B(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1} \, dt = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}$$

Integrating out probability parameters improves sampling efficiency and allows for higher-dimensional models where, for example, we observe multiple gene sets and give each term its own probability πj or each gene its own error rates (αi, βi). Our implementation by default uses uninformative priors with hyperparameters a = b = p = (1, 1), but this can be overridden by the user.
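To make the collapsed likelihood concrete, here is a minimal sketch that evaluates it in log space. It is not the WTFgenes implementation: the function names, the hyper argument, and the representation of gene-term associations as lists of term indices are all assumptions for this example. Instead of calling a log-gamma routine, the sketch uses the identity Γ(x+n)/Γ(x) = x(x+1)...(x+n-1), which holds for integer n and lets the ratio of beta functions be written as a ratio of rising factorials. False negatives are counted as activated genes that were not observed, consistent with P(O_i = 1 | H_i = 1, β) = 1 − β.

```javascript
// log of the rising factorial x (x+1) ... (x+n-1) = Gamma(x+n) / Gamma(x),
// for a non-negative integer n and real x > 0.
function logRising(x, n) {
  let s = 0;
  for (let i = 0; i < n; i++) s += Math.log(x + i);
  return s;
}

// log Z(k; N, A) = log[ B(N-k+A0, k+A1) / B(A0, A1) ] for integer k and N.
function logZ(k, N, A) {
  const [A0, A1] = A;
  return logRising(A0, N - k) + logRising(A1, k) - logRising(A0 + A1, N);
}

// Collapsed log-likelihood log P(T, O | a, b, p).
// termActive[j] and observed[i] are 0/1; termsOfGene[i] lists term indices.
// The default hyperparameters (1, 1) match the defaults stated in the text.
function collapsedLogLikelihood(termActive, observed, termsOfGene, hyper) {
  const { a = [1, 1], b = [1, 1], p = [1, 1] } = hyper || {};
  const m = termActive.length;
  const n = observed.length;
  let cp = 0, cg = 0, ca = 0, cb = 0;
  for (const t of termActive) cp += t ? 1 : 0;
  for (let i = 0; i < n; i++) {
    const activated = termsOfGene[i].some(j => termActive[j]);
    if (activated) {
      cg++;
      if (!observed[i]) cb++; // activated but not observed: false negative
    } else if (observed[i]) {
      ca++;                   // observed but not activated: false positive
    }
  }
  return logZ(cp, m, p) + logZ(ca, n - cg, a) + logZ(cb, cg, b);
}
```

For two terms and two genes with termsOfGene = [[0], [1]], termActive = [1, 0] and observed = [1, 0], the counts are c_p = 1, c_g = 1, c_a = 0 and c_b = 0, so under the default hyperparameters the likelihood is (1/6)(1/2)(1/2) = 1/24.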

The MCMC sampler uses a Metropolis-Hastings kernel [15]. Each proposed move perturbs some subset of the term variables. The moves include flip, where a single term is toggled; step, where any activated term and any one of its unactivated ancestors or descendants are toggled; jump, where any activated term and any unactivated term are toggled; and randomize, where all term variables are uniformly randomized. The relative rates of these moves can be set by the user.
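The flip and jump moves can be sketched as follows. This is an illustrative simplification under stated assumptions, not the WTFgenes sampler: the function names are hypothetical, the log-likelihood is passed in as a callback, the random source rand is assumed to return uniform values in [0, 1), and the two moves are chosen with equal probability. The step move is omitted because it needs the ontology graph, and its proposal is not symmetric in general, so it would require an explicit Hastings correction. Flip and jump are both symmetric proposals, so acceptance here depends only on the likelihood ratio.

```javascript
// "flip": toggle one uniformly chosen term.
function proposeFlip(state, rand) {
  const j = Math.floor(rand() * state.length);
  const next = state.slice();
  next[j] = next[j] ? 0 : 1;
  return next;
}

// "jump": deactivate one activated term and activate one unactivated term.
// Returns null when no such pair exists (all terms on, or all terms off).
function proposeJump(state, rand) {
  const on = [], off = [];
  state.forEach((t, j) => (t ? on : off).push(j));
  if (on.length === 0 || off.length === 0) return null;
  const next = state.slice();
  next[on[Math.floor(rand() * on.length)]] = 0;
  next[off[Math.floor(rand() * off.length)]] = 1;
  return next;
}

// One Metropolis step over the term variables.
// logLike(state) returns the log of the (collapsed) likelihood of a state.
function mcmcStep(state, logLike, rand) {
  const propose = rand() < 0.5 ? proposeFlip : proposeJump;
  const next = propose(state, rand);
  if (next === null) return state; // no valid proposal: stay put
  const logRatio = logLike(next) - logLike(state);
  return Math.log(rand()) < logRatio ? next : state;
}
```

A sampling run would call mcmcStep repeatedly, recording the state after each call, and estimate the posterior probability of each term as the fraction of recorded states in which it is activated.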

The sampler of [10] implemented only the flip move. To test the relative efficacy of the newly introduced moves, we measured the autocorrelation of the term variables for a dataset of 17 S. cerevisiae genes involved in mating. For purposes of reproduction, the gene IDs used in this evaluation were: STE2, STE3, STE5, GPA1, SST2, STE11, STE50, STE20, STE4, STE18, FUS3, KSS1, PTP2, MSG5, DIG1, DIG2, STE12. Other representative gene sets for yeast may be obtained from the Gene Ontology website at http://geneontology.org/experimental/enrichment-genesets/yeast/, and several of these are bundled with the example dataset in the WTFgenes repository. The results, shown in Figure 2, led us to set the MCMC defaults such that the flip, step, and jump moves are equiprobable, while randomize is disabled.
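The autocorrelation of a single term variable at a given lag can be estimated from its trace of sampled 0/1 values. The sketch below uses the standard sample autocorrelation estimator; the text does not specify which estimator was used for Figure 2, so this choice and the function name autocorrelation are assumptions.

```javascript
// Sample autocorrelation of a numeric trace at a non-negative integer lag.
// Returns NaN for a constant trace, whose variance is zero.
function autocorrelation(trace, lag) {
  const n = trace.length;
  const mean = trace.reduce((s, x) => s + x, 0) / n;
  let num = 0, den = 0;
  for (let t = 0; t < n; t++) {
    den += (trace[t] - mean) ** 2;
    if (t + lag < n) num += (trace[t] - mean) * (trace[t + lag] - mean);
  }
  return den === 0 ? NaN : num / den;
}
```

A trace that alternates on every sample, such as [0, 1, 0, 1], has autocorrelation 1 at lag 0 and a negative value at lag 1. For a well-mixing kernel the value should decay toward zero as the lag grows.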


Figure 2. Autocorrelation of term variables, as a function of the number of MCMC samples, for several MCMC kernels on a set of 17 S. cerevisiae mating genes.

A rapidly-decaying curve indicates an efficiently-mixing kernel. The kernel incorporating flip, step and jump moves (defined in the text) mixes most efficiently.

Implementation

We have implemented both Frequentist TEA (with Bonferroni correction) and Bayesian TEA (as described above), in both C++11 and JavaScript. The JavaScript version can be run as a command-line tool using node, or via a web interface in a browser, and includes extensive unit tests. The two implementations use the same random number generator and yield numerically identical results.

Operation

Our JavaScript software, when used as a web application, offers a “quick report” view using Frequentist TEA. For the slower-running, but more powerful, Bayesian TEA, the software plots the log-likelihood during an MCMC sampling run, for visual feedback. The repository includes setup scripts allowing the tool to be deployed as a “static site”, i.e. consisting only of static files (HTML, CSS, JSON, and JavaScript) that can be hosted via a minimal web server with no need for dynamic code execution. This has considerable advantages: static web hosting is generally much cheaper, and far more secure, than running server-hosted web applications.

An example WTFgenes static site, configured for the GO-basic ontology and GO-annotated genomes from the Gene Ontology website, can be found at https://evoldoers.github.io/wtfgo.

An earlier version of this article can be found on bioRxiv (doi: 10.1101/114785).

Results

When compiled using clang, the C++ version of WTFgenes is about twice as fast as the JavaScript version: a benchmark of Bayesian TEA on a late-2014 iMac (4 GHz Intel Core i7), using the above-mentioned 17 yeast mating genes and the relevant subset of 518 GO terms, run for 1,000 samples per term, took 37.6 seconds of user time for the C++ implementation and 79.8 seconds in JavaScript.

By contrast, the Frequentist TEA approach is almost instant. However, its weaker statistical power is apparent from Figure 3, which compares the recall vs specificity of Bayesian and Frequentist methods on simulated datasets (the full workflow for this simulation is available at http://doi.org/10.5281/zenodo.400608 [16]). For values of N from 1 to 4, we sampled N terms from the S. cerevisiae subset of the Gene Ontology, and generated a corresponding set of yeast genes with false positive rate 0.1% and false negative rate 1%. The MCMC sampler was run for 100 iterations per term, and this experiment was repeated 100 times. The model-based approach has vastly superior recall to the Fisher exact test, and the difference grows with the number of terms.
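The data-generating step of such a simulation can be sketched as follows. This is not the archived Makefile-driven workflow: the function name simulateGeneSet, its argument order, and the assumption that terms are drawn uniformly without replacement are all illustrative choices. The random source rand is assumed to return uniform values in [0, 1).

```javascript
// Simulate one observed gene set from the model.
//   termsOfGene[i]: indices of terms associated with gene i
//   numTerms: total number of terms; numActive: how many to activate (N)
//   fpRate, fnRate: false positive and false negative rates
function simulateGeneSet(termsOfGene, numTerms, numActive, fpRate, fnRate, rand = Math.random) {
  // Choose numActive distinct terms with a partial Fisher-Yates shuffle.
  const idx = Array.from({ length: numTerms }, (_, j) => j);
  for (let k = 0; k < numActive; k++) {
    const r = k + Math.floor(rand() * (numTerms - k));
    [idx[k], idx[r]] = [idx[r], idx[k]];
  }
  const active = new Set(idx.slice(0, numActive));
  // An activated gene is observed unless a false negative occurs;
  // a non-activated gene is observed only if a false positive occurs.
  const observed = termsOfGene.map(terms => {
    const activated = terms.some(j => active.has(j));
    return activated ? (rand() >= fnRate ? 1 : 0) : (rand() < fpRate ? 1 : 0);
  });
  return { activeTerms: [...active].sort((x, y) => x - y), observed };
}
```

Under these assumptions, the rates quoted in the text would correspond to a call such as simulateGeneSet(termsOfGene, numTerms, N, 0.001, 0.01) for each N from 1 to 4.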


Figure 3. ROC curves for Frequentist and Bayesian TEA.

The axes are scaled per term. There are 5,919 ontology terms annotated to S. cerevisiae genes, so (for example) a false discovery rate of 0.001 corresponds to about 6 falsely reported terms.

Discussion

JavaScript genome browsers, such as JBrowse [17], represent a broader web trend of producing static sites where possible, for reasons of security and performance. We have implemented such a static site generator for ontological term enrichment analysis of gene sets that offers both Bayesian and frequentist tests. In contrast with existing web services for Frequentist TEA, such as DAVID-WS or Enrichr, it requires no server resources and allows comparison of Bayesian and Frequentist approaches.

Model-based TEA is versatile: it can readily be extended to allow for datasets that are structured temporally [18], spatially [19], or by genomic region [20]; to use domain-specific biological knowledge [21]; or to incorporate additional lines of evidence such as quantitative data [22]. We hope our development of a collapsed likelihood, and evaluation of different MCMC kernels, will assist these efforts.

Software and data availability

Latest source code: https://github.com/evoldoers/wtfgenes

Archived source code as at time of publication: http://doi.org/10.5281/zenodo.400606 [23]

Software license: BSD3

A demonstration for the Gene Ontology is usable at https://evoldoers.github.io/wtfgo.

A Makefile-driven simulation study underpinning results reported in this paper is available at http://doi.org/10.5281/zenodo.400608 [16].

How to cite this article
Mungall CJ and Holmes IH. WTFgenes: What's The Function of these genes? Static sites for model-based gene set analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:423 (https://doi.org/10.12688/f1000research.11175.1)
Open Peer Review

Reviewer Report 23 Jun 2017
Ruth Isserlin, The Donnelly Centre, University of Toronto, Toronto, ON, Canada 
Approved with Reservations
In the article "WTFgenes: What's The Function of these genes? Static sites for model-based gene set analysis" Mungall and Holmes introduce a java script static site implementation of a model based Bayesian method to calculate functional enrichment. Included in this is an ...
Isserlin R. Reviewer Report For: WTFgenes: What's The Function of these genes? Static sites for model-based gene set analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:423 (https://doi.org/10.5256/f1000research.12058.r23167)
Reviewer Report 05 May 2017
Cedric Simillion, Interfaculty Bioinformatics Unit and SIB Swiss Institute of Bioinformatics, University of Bern, Bern, Switzerland;  Department of Clinical Research, University of Bern, Bern, Switzerland 
Approved with Reservations
The authors present a novel "Term Enrichment Analysis" algorithm, which expands on previous work by Bauer et al. (2010). The provided implementation of the algorithm as a stand-alone web interface is very well-designed and user-friendly. The availability of a command-line ...
Simillion C. Reviewer Report For: WTFgenes: What's The Function of these genes? Static sites for model-based gene set analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:423 (https://doi.org/10.5256/f1000research.12058.r22056)