Keywords
automated, genomic, analysis, datasets, DNA, methylation, expression, arrays
This article is included in the Bioinformatics gateway.
This article is included in the RPackage gateway.
While high-dimensional genetic data have increased in availability at reduced cost, robust analyses remain labor intensive and costly. Numerous automated software pipelines have been developed to increase the rate and decrease the cost at which analyses can be completed, including SVAw [11], Partek [4], and InSilicoDB [18]. Automated Genomics Analysis (AGA) provides a more dynamic experience, allowing the user to start with raw data and a text file containing corresponding sample annotations from either a single study or multiple studies. AGA performs all necessary normalization and batch correction, and then enables the user to interactively determine the samples to contrast in the analysis based on the sample annotations. AGA is implemented in R to facilitate adaptation of state-of-the-art genomics analysis techniques. Linking R to a web browser-based interface through RStudio’s shiny also facilitates collaborative analyses in research teams with diverse bioinformatics expertise.
AGA bridges the gap between interactive and reproducible analyses for several platforms, including expression arrays, methylation arrays, and processed RNAseq data. Through the interface, the user determines the size and scope of the analyses. AGA first performs data normalization, including the ComBat [7] and SVA [9] batch correction algorithms to enable comparison across multiple datasets for non-methylation platforms. The software then performs differential analysis [16] and gene set analyses [2,17] based upon defined sample groups. Users obtain standard visualizations of genomics data, including hierarchical clustering, boxplots and heatmaps, as part of the default analysis. The plots and tables summarizing the results from each analysis are interactive and customizable through the interface. In contrast to other point-and-click software, AGA logs the R code and exports the workspace with each figure and table, ensuring that each analysis can be reproduced and further customized. The runtime of analyses depends largely on the desktop hardware, but also on the data platform and the optional analyses selected. On a Mac Pro workstation with a 3.2 GHz Quad-Core Intel Xeon processor and 10 GB of 1066 MHz DDR3 RAM, analyses containing fewer than 100 samples were completed in under 30 minutes.
The AGA application runs in R and is used interactively through a web browser. AGA is implemented with RStudio’s shiny [13], integrating the R code used in the analysis with HTML and JavaScript for the interactive user interface. Usage requires R version 3.0.1 or higher, either Mozilla Firefox or Google Chrome, and the R packages described in the AGA User’s Manual. The program is divided into seven tabs. Clicking the respective Update button generates the results to be displayed in each tab, and clicking the Download buttons saves the plots and data.
AGA supports analyses of DNA methylation and gene expression data. Currently, AGA supports DNA methylation analysis on Illumina 450k arrays. It also supports gene expression analysis of any human Affymetrix expression platform, including exon arrays, and normalized gene counts from RNAseq data. Notably, the flexible format for normalized RNAseq data may be adapted to analyze normalized data from other platforms measuring continuous data, many of which we plan to incorporate in future versions of AGA.
Users of AGA load annotation files and high-throughput genomic data from files in a specified directory. AGA accepts raw CEL files for Affymetrix arrays and iDat files for DNA methylation arrays. Background correction is performed for both, and quantile normalization is also performed for the expression arrays. For gene expression microarrays, AGA performs RMA normalization as implemented in the Bioconductor package affy [5]. Probe-level estimates of DNA methylation are computed from iDat files using Illumina standards with the minfi package [1]. Normalized RNAseq data are assumed to be formatted as individual text files for each sample, containing gene names and normalized counts for that sample. More details about the format for each data type are provided in the User’s manual. Sample annotations are specified in a CSV file whose first column matches the names of the data files. By default, the annotation file itself is assumed to define the sample batch; however, this can be changed by editing the annotation files to contain a ‘Batch’ column with a unique identifier for each batch within the dataset. Further details about the sample annotations are also provided in the User’s manual.
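The annotation convention described above (first column matching the data file names, plus an optional ‘Batch’ column) can be checked before an analysis is started. The following is a minimal sketch in Python rather than AGA’s own R code; the function name `check_annotation`, the column names other than ‘Batch’, the file names, and the fallback label `batch1` are all hypothetical and chosen only for illustration.

```python
import csv
import io

def check_annotation(csv_text, data_files, default_batch="batch1"):
    """Compare the first column of an annotation CSV with the data file names.

    Returns (annotated files with no data file,
             data files with no annotation row,
             mapping of file name -> batch label).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    annotated = [row[0] for row in rows]

    missing_data = sorted(set(annotated) - set(data_files))
    unannotated = sorted(set(data_files) - set(annotated))

    if "Batch" in header:
        # Explicit batch labels supplied by the user.
        idx = header.index("Batch")
        batches = {row[0]: row[idx] for row in rows}
    else:
        # Assumption for this sketch: one annotation file is one batch.
        batches = {name: default_batch for name in annotated}
    return missing_data, unannotated, batches

# Hypothetical annotation file with an explicit Batch column.
example_csv = "File,HPV,Batch\na.CEL,Positive,study1\nb.CEL,Negative,study2\n"
missing, unannotated, batches = check_annotation(
    example_csv, ["a.CEL", "b.CEL", "c.CEL"]
)
print(missing, unannotated, batches)
```

In this example the data file `c.CEL` has no annotation row, so it is reported in the second list, while the two annotated samples are assigned to separate batches.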
After loading the annotation files, AGA users select categories from the annotation for differential expression analysis. AGA automatically groups samples that share common levels in each category for differential analysis. Each group may be further restricted to a subset of the complete dataset according to the criteria selected for that group. Once the criteria are selected, AGA updates the display to show the sample size of each group. Samples are set for analysis by clicking the “Run the Analysis!” button. When samples span multiple batches, the analysis automatically performs ComBat and SVA batch correction, protecting for the biological groups in the annotation selected by the user. Help boxes are available to clarify each input field, with further details in the User’s manual.
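The grouping step can be pictured as collecting samples that share the same combination of levels across the selected annotation categories, and then reporting the size of each group. The sketch below is an illustration in Python, not AGA’s implementation; the function `group_samples`, the sample names, and the level values are hypothetical, while the category names HPV and Tumor.Source.Type are taken from the worked example in this article.

```python
from collections import defaultdict

def group_samples(annotations, categories):
    """Group sample names by their combination of levels in the chosen categories."""
    groups = defaultdict(list)
    for sample, fields in annotations.items():
        key = tuple(fields[c] for c in categories)
        groups[key].append(sample)
    return dict(groups)

# Hypothetical annotations for three samples.
annotations = {
    "s1": {"HPV": "Positive", "Tumor.Source.Type": "Primary"},
    "s2": {"HPV": "Negative", "Tumor.Source.Type": "Primary"},
    "s3": {"HPV": "Positive", "Tumor.Source.Type": "Primary"},
}

groups = group_samples(annotations, ["HPV", "Tumor.Source.Type"])
sizes = {key: len(members) for key, members in groups.items()}
print(sizes)
```

Here the two HPV-positive primary tumors form one group and the single HPV-negative primary tumor forms another, mirroring the per-group sample sizes shown in the interface.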
The Dendrogram Plot tab displays unsupervised hierarchical clustering based upon the complete correlation between values of genes (rows) and samples (columns). The Heatmap Plot tab provides an interactive JavaScript heatmap of the genomic data, allowing users to customize the genes plotted and to color rows by sample annotations. For both dendrograms and heatmaps, an option is available to view the data before batch correction, showing both the batch effects and the efficacy of the correction. The Gene Box Plot tab creates boxplots to summarize values of a user-selected gene in the selected groups.
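Correlation-based clustering starts from a dissimilarity between samples. The text does not spell out the exact distance or linkage AGA uses, so the sketch below assumes one common choice, 1 minus the Pearson correlation, computed in plain Python. The function names and the toy expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_distance(columns):
    """Pairwise distance 1 - r between named samples (columns of the data matrix)."""
    names = list(columns)
    return {
        (a, b): 1.0 - pearson(columns[a], columns[b])
        for a in names
        for b in names
    }

# Toy expression values for four genes in three samples.
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.0],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
dist = correlation_distance(samples)
print(round(dist[("s1", "s2")], 6), round(dist[("s1", "s3")], 6))
```

Perfectly correlated samples such as `s1` and `s2` are at distance close to 0, and perfectly anti-correlated samples such as `s1` and `s3` are at distance close to 2. A hierarchical clustering routine would then merge samples starting from the smallest distances.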
The Differential Results tab displays the results from the differential analysis using empirical Bayes moderated t-statistics with the Bioconductor package limma [16]. Statistics are computed on data that have been batch corrected by combining ComBat with SVA, protecting for the biological groups selected for comparison [10]. The p-values are adjusted for multiple hypothesis testing using the Benjamini-Hochberg method [8]. Optionally, gene set statistics can be computed for each gene set defined in Biocarta and Gene Ontology using a Wilcoxon rank-sum test that compares the t-statistics from the most differentially expressed probe for genes in the set to similarly selected t-statistics for genes outside of the set. If selected, results from gene set analysis are displayed in the GSA Results tab.
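The multiple-testing adjustment named above can be written in a few lines. The following is a minimal Python sketch of the standard Benjamini-Hochberg step-up adjustment, given only as an illustration; AGA itself relies on R, where base R exposes the same adjustment as `p.adjust(p, method = "BH")`. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values grow.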
As an example, we perform an analysis on sample datasets containing gene expression of primary head and neck squamous cell carcinoma (HNSCC) tumors. We downloaded measurements from a combination of frozen tumor samples from two distinct studies in GEO, available under accession numbers GSE10300 [3] and GSE6791 [12], representing two distinct batches. Raw CEL files and annotation CSV files were obtained as described in the User’s manual. We initialize AGA by selecting the directory containing these data. Once loaded, we check the HPV and Tumor.Source.Type columns to group the samples into primary HPV-positive and HPV-negative tumors for differential expression analysis. We then click “Run the Analysis” to normalize the CEL files with RMA [6], batch correct the data with ComBat and SVA, and perform differential expression analysis. The plot in the Dendrogram Plot tab confirms that batch effects are apparent between these datasets but are removed after batch correction. The heatmap generated in the Heatmap Plot tab (Figure 1) demonstrates that the batch correction nonetheless preserves gene expression differences between HPV-positive and HPV-negative tumors. Moreover, performing differential expression analysis comparing HPV-positive and HPV-negative HNSCC in the “Differential Analysis” tab confirms the well-established overexpression (p=8.74e-9) of CDKN2A (p16) in HPV-positive HNSCC [14,15].
We note that sample names are truncated in the heatmap, but users can reduce the lengths of sample names or ensure that sample identity can be determined by the final characters in the name to associate specific samples with the heatmap.
AGA provides an interface that enables users who may be unfamiliar with R to perform reproducible genomics class comparison analysis. Unlike with other automated pipelines, experienced R users can reproduce, extend or modify preliminary analyses. Thus, AGA facilitates collaborations between novice and expert R users for genomics analysis. Future work will extend the AGA pipeline to encode normalization routines for DNA methylation, and analysis routines for other genomics platforms, including copy number data.
MFO and EJF conceived the software, and EJF and MC designed the web interface. MC designed, coded, and implemented the software application, and prepared the manuscript. HSP researched and composed cross-study normalization techniques. XXX standardized annotation files for the two example data sets. YW and LC assisted by providing the initial coding for alternative analyses. All authors helped prepare the manuscript.
Funding was provided from NIH/NIDCR P50DE019032, NIH/NCI P30CA006973, K25CA141053 to EJF, and NIH/NLM R01LM011000 to MFO.
I confirm that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We would like to thank Joe Cheng and Winston Chang of RStudio for their support with shiny. Alla Guseynova, Michael Fox and Louis Franceschi are very much appreciated for their technical support and implementations of various iterations of the project. We thank Thomas Considine for his assistance in proofreading this manuscript, and Bahman Afsari and Thomas Considine for testing the application and User Manual. Finally, we also thank Luigi Marchionni and Jean-Philippe Fortin for collaborative efforts.
Competing Interests: No competing interests were disclosed.
Version 1: 28 Jan 15. Version 2 (revision): 21 Oct 15.