GAC: Gene Associations with Clinical, a web-based application

We present GAC, a Shiny-based R tool for interactive visualization of clinical associations based on high-dimensional data. The tool provides a web-based suite to perform supervised principal component analysis (SuperPC), an approach that combines high-dimensional data, such as gene expression, with clinical data to infer clinical associations. We extended the approach to address binary outcomes, in addition to continuous and time-to-event data, thereby increasing the use and flexibility of SuperPC. Additionally, the tool provides an interactive visualization that summarizes results as a forest plot for both binary and time-to-event data. In summary, the GAC suite of tools provides a one-stop shop for conducting statistical analysis to identify and visualize the association between a clinical outcome of interest and high-dimensional data types, such as genomic data. Our GAC package has been implemented in R and is available via http://shinygispa.winship.emory.edu/GAC/. The developmental repository is available at https://github.com/manalirupji/GAC.

This article is included in the RPackage gateway.

Introduction
Heterogeneity in tumor characteristics, prognosis, and survival among cancer patients remains a persistent problem. It is well established that clinical factors alone are not sufficient to explain differences in prognosis. For example, based on clinical factors only, two patients may have the same prognosis but may not respond to the same treatment, because their tumors may have completely different molecular compositions [1][2][3][4][5]. Despite the introduction of a tumor's genomic profile to explain differences in prognosis, there remains unexplained heterogeneity in tumor response to treatment. One factor potentially contributing to such unexplained differences is inaccurate prognosis and prediction resulting from the analysis approach used to define prognostic markers of response.
For this purpose, Bair and Tibshirani [6] introduced a supervised principal component (SuperPC) method within the context of defining expression-based cancer subtypes of prognostic significance. The method uses both gene expression and clinical data for predicting patient prognosis. This approach was applied to several publicly available datasets, which demonstrated its ability to accurately predict the clinical outcome of interest based on a given gene expression profile. Since its inception, SuperPC has been introduced as a powerful tool for reducing dimensionality in selecting features (i.e., genes) of prognostic relevance in cancer.
Currently, the SuperPC method is available as an R package, 'superpc', but as it stands, the package does not provide the following: 1) clinical association based on a binary outcome (e.g., responders versus non-responders); 2) ease of use for clinicians and researchers with limited programming skills; and 3) a visual summary of results.
To address these limitations, we developed GAC: Gene Associations with Clinical, an interactive web-based application with a graphical user interface for analysis of gene associations with various clinical outcomes of interest. We developed GAC based on the R packages 'shiny' and 'superpc'. Our GAC tool enables the user to perform a SuperPC analysis for three types of outcomes (time-to-event, continuous, and binary) and provides a summary of results using forest plots that may be readily exported into a file.

Supervised principal component analysis
SuperPC is a generalization of principal component analysis, which generates linear combinations of the features or variables of interest that capture the directions of largest variation in a dataset. Instead of using the whole dataset directly, SuperPC first defines a list of genes based on their association with an outcome of interest. To select this list, a univariate score is calculated for each feature (i.e., gene), and the features whose scores exceed a threshold are retained as input to a principal component analysis. For details, refer to Bair and Tibshirani [6].
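The screening-then-projection idea described above can be illustrated with a minimal, language-neutral sketch. It is written in plain Python rather than R and is not the 'superpc' implementation: the absolute Pearson correlation stands in for the univariate score (an assumption; 'superpc' uses its own outcome-specific scores), and the function names pearson, first_pc_scores and superpc_sketch, as well as the toy data, are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def first_pc_scores(rows, n_iter=200):
    """Sample scores on the first principal component.

    rows holds one list per retained gene, each with one value per sample.
    The leading eigenvector of the gene-by-gene covariance matrix is found
    by power iteration (a simple stand-in for a full PCA routine).
    """
    centered = []
    for r in rows:
        m = sum(r) / len(r)
        centered.append([v - m for v in r])
    p, n = len(centered), len(centered[0])
    cov = [[sum(centered[i][k] * centered[j][k] for k in range(n)) / (n - 1)
            for j in range(p)] for i in range(p)]
    w = [1.0] + [0.0] * (p - 1)  # fixed start vector; a random one also works
    for _ in range(n_iter):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return [sum(w[i] * centered[i][k] for i in range(p)) for k in range(n)]

def superpc_sketch(expr, outcome, threshold):
    """Keep genes whose score exceeds the threshold, then project on PC1."""
    # Assumed univariate score: absolute correlation with the outcome.
    scores = {g: abs(pearson(v, outcome)) for g, v in expr.items()}
    kept = [g for g, s in scores.items() if s > threshold]
    if not kept:
        raise ValueError("no gene exceeds the threshold")
    return kept, first_pc_scores([expr[g] for g in kept])

if __name__ == "__main__":
    outcome = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    expr = {  # toy expression values, one list per gene
        "gene_a": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
        "gene_b": [6.2, 5.1, 3.9, 3.0, 2.2, 0.9],
        "gene_c": [2.0, 5.0, 1.0, 4.0, 2.0, 4.0],
    }
    kept, pc = superpc_sketch(expr, outcome, threshold=0.5)
    print(kept, [round(v, 2) for v in pc])
```

The sign of a principal component is arbitrary, so only the magnitude of its association with the outcome is meaningful in this sketch.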
Time-to-event outcome
SuperPC for time-to-event outcomes is implemented using the 'superpc' package in R. Depending on the sample size of the original dataset, the researcher selects the proportions of the dataset to use for training and testing. The researcher can also specify how many candidate threshold values to evaluate when searching for the optimal threshold, as well as the number of folds used in the cross-validation that determines this threshold. Fold assignments can either be generated randomly or be supplied as uploaded fold IDs to replicate a previous analysis. The association between the time-to-event outcome and the predicted principal component may be represented in a Kaplan-Meier (KM) plot by dichotomizing the principal component at the median (Figure 1).
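As a rough illustration of the options just described, the sketch below shows one way a training/testing split, reproducible fold IDs, and a median dichotomization could be produced. It is plain Python rather than the R code used by GAC, and the function names (split_train_test, assign_folds, dichotomize_at_median) and the seed argument are assumptions made for this example only.

```python
import random
import statistics

def split_train_test(n_samples, train_fraction, seed=None):
    """Randomly split sample indices into training and testing sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(n_samples * train_fraction)
    return sorted(idx[:n_train]), sorted(idx[n_train:])

def assign_folds(n_samples, n_folds, seed=None):
    """Return one fold ID (1..n_folds) per sample, with balanced fold sizes.

    Saving the returned list plays the role of the fold IDs that can be
    re-used to replicate an earlier cross-validation run.
    """
    ids = [(i % n_folds) + 1 for i in range(n_samples)]
    random.Random(seed).shuffle(ids)
    return ids

def dichotomize_at_median(scores):
    """Label each principal component score as above or not above the median."""
    cut = statistics.median(scores)
    return ["high" if s > cut else "low" for s in scores]

if __name__ == "__main__":
    train, test = split_train_test(10, 0.7, seed=1)
    print("train:", train, "test:", test)
    print("fold IDs:", assign_folds(len(train), 3, seed=1))
    print(dichotomize_at_median([0.2, -1.0, 3.5, 0.9]))
```

The two resulting groups ("high" and "low") are what a Kaplan-Meier plot would then compare.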

Continuous outcome
SuperPC for continuous outcomes is implemented using the 'superpc' package in R, with the same options as the time-to-event analysis. The predicted principal component is presented visually as continuous values through a scatter plot along with Pearson's correlation (Figure 2). The predicted principal component can also be presented as binary groups (cutoff at the median) through a box plot, with a t-test applied.
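The two summaries mentioned above can be written out explicitly. The following plain-Python sketch computes Pearson's correlation and a two-sample t statistic; the Welch (unequal variance) form is an assumption, since the text does not state which variant GAC applies, and the function names are hypothetical. Converting the statistic into a p-value requires the t distribution, which is left out here.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson's correlation coefficient between two sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def welch_t(group_a, group_b):
    """Welch t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

if __name__ == "__main__":
    pc = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]       # toy principal component scores
    outcome = [2.0, 2.9, 3.1, 3.8, 4.6, 5.2]    # toy continuous outcome
    print("r =", round(pearson_r(pc, outcome), 3))
    cut = statistics.median(pc)
    low = [o for s, o in zip(pc, outcome) if s <= cut]
    high = [o for s, o in zip(pc, outcome) if s > cut]
    print("t, df =", welch_t(high, low))
```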

Binary outcome
In the 'superpc' R package developed by Bair and Tibshirani (2004), SuperPC analysis can be performed on both continuous and survival outcomes. We have extended this tool to include SuperPC for binary outcomes (e.g., 'responders' vs 'nonresponders'). This extension follows a similar analysis workflow to the other two outcomes, in that a list of genes is defined based on a univariate score to which a threshold is applied, and the genes whose scores exceed the threshold are used as input to a principal component analysis. For modeling gene associations with binary outcomes, logistic regression has been implemented. The predicted principal component can be visualized either as a continuous variable through a box plot, with a t-test to summarize the statistical association (Figure 3), or as binary groups (cutoff at the median) using a bar plot, with a chi-square test to summarize the statistical association between the predicted and the observed outcome.
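To make the binary case concrete, the sketch below fits a one-predictor logistic regression by Newton's method and returns a Wald z statistic, which is one plausible choice of univariate score (an assumption; the text does not specify the exact score used by GAC). It also computes the chi-square statistic for a 2x2 table without continuity correction. The code is plain Python, not GAC's R code, and the names fit_logistic and chi_square_2x2 are hypothetical.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton's method.

    Returns (b0, b1, z), where z is the Wald statistic for the slope.
    Assumes the two classes are not perfectly separated by x.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the information matrix of the last iteration
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, b1 / se_b1

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

if __name__ == "__main__":
    gene = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # toy expression values
    response = [0, 0, 1, 0, 1, 0, 1, 1]               # 1 = responder
    print(fit_logistic(gene, response))
    print(chi_square_2x2(3, 1, 1, 3))
```

Genes would then be ranked by the absolute value of z, and those above the chosen threshold passed to the principal component step.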

Forest plot
A forest plot is a graphical display of point estimates of association widely used in meta-analysis. It has become popular for displaying the associations between clinical and genomic data. With our GAC tool, users have the option to generate a forest plot to display results (Figure 4).
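Each row of a forest plot is a point estimate with its confidence interval, usually drawn on a logarithmic axis around the null value of 1. The plain-Python sketch below shows how such a row could be derived from a log-scale estimate and its standard error and rendered as text. It assumes ratio-type estimates (for example odds ratios or hazard ratios) with a normal-approximation 95% interval; the function names, axis limits and example numbers are hypothetical and do not come from GAC.

```python
import math

def ratio_with_ci(log_estimate, std_error, z=1.96):
    """Back-transform a log-scale estimate into (ratio, lower, upper)."""
    return (math.exp(log_estimate),
            math.exp(log_estimate - z * std_error),
            math.exp(log_estimate + z * std_error))

def forest_bar(estimate, lower, upper, axis_min=0.25, axis_max=4.0, width=41):
    """Draw one forest plot row as text on a log-scaled axis.

    '|' marks the null value 1, '-' spans the interval, '*' is the estimate.
    Values outside the axis limits are clipped to the edges.
    """
    def col(value):
        value = min(max(value, axis_min), axis_max)
        frac = ((math.log(value) - math.log(axis_min))
                / (math.log(axis_max) - math.log(axis_min)))
        return round(frac * (width - 1))
    cells = [" "] * width
    cells[col(1.0)] = "|"
    for i in range(col(lower), col(upper) + 1):
        cells[i] = "-"
    cells[col(estimate)] = "*"
    return "".join(cells)

if __name__ == "__main__":
    rows = [("Model A", 0.60, 0.25), ("Model B", -0.35, 0.30)]  # toy values
    for label, log_est, se in rows:
        est, lo, hi = ratio_with_ci(log_est, se)
        print(f"{label:8s} {forest_bar(est, lo, hi)} {est:.2f} ({lo:.2f}, {hi:.2f})")
```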

Implementation
The GAC tool is written in R and tested using version 3.3.0. The interactive plots and data tables are made available using the shiny R package (www.rstudio.com/shiny).

Amendments from Version 3
Based on the reviewer's comments on our paper, we have updated Figures 1-3 to reflect the changes that were made in the previous version (version 3). In Figure 2, the user has a choice to display the continuous predicted principal component as a scatter plot, or to divide it into binary groups (using a median cut-off) and represent the association through a bar plot. Similar download options are available.

Discussion
GAC is a suite of tools that allows the user to conduct statistical analysis to identify and visualize the association between clinical outcomes of interest and genomic data using an interactive application in R.

I thank the authors for incorporating my suggestions into their work; however, the figures in the paper should be updated to reflect the changes in the shiny app, which were made in response to comment 2 from my initial review: "2. The numeric output of the GAC shiny app is often unformatted R output. For example, on the Super PC Binary Outcome tab below the box plots. It would be better to provide formatted results (e.g. using knitr::kable). Also, the p-values (and other numeric outputs) should likely be rounded to a number of significant digits that conveys the precision of the values."

Data and software availability
Competing Interests: No competing interests were disclosed.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Author Response 15 Feb 2018
Manali Rupji, Emory University, USA
We thank the referee for their suggestions, which have led to an improved paper. In the latest version, we have incorporated updates to Figures 1-3.

Cedric Simillion
Interfaculty Bioinformatics Unit and SIB Swiss Institute of Bioinformatics, University of Bern, Berne, Switzerland
I agree with the previous referee that the authors have developed a nice user interface to the SuperPC algorithm, which should make its application accessible to many experimental scientists. I read the manuscript and I have little to no comments to make on its contents.
However, I have tried to use the tool myself on my own dataset and was unable to get it to work. I had formatted my data exactly as in the supplied example input files, even with the exact same column names. After submitting my data, I only got an unhelpful error message: "ERROR: An error has occurred. Check your logs or contact the app author for clarification." I could not see any logs or any other means of finding out why the program didn't work.
If the problem lies with the formatting of the input files, the authors need to provide some more documentation on the format of the input data, including column names, newline convention, etc.
Given that I presently cannot verify that the tool is working properly, I therefore cannot accept the manuscript for indexing.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 21 Dec 2017
Manali Rupji, Emory University, USA
We thank the reviewer for their thoughtful comments.
Below is our response to the reviewer's comments: Thank you for pointing this problem out. As it turns out, there was indeed a bug with reading the input data that caused the problem described. This bug has been fixed in the new version of the tool, available at the same link: http://shinygispa.winship.emory.edu/GAC/. Additionally, we have included a sentence about data formatting in the Data and software availability section: 'User uploaded data should be in the same format as the example data provided.' The tutorial has also been updated to state the same.
The tool has also been updated to display error messages directly, as they would appear on the R console, to better inform the user about the possible nature of an error.
No competing interests were disclosed.

We thank the reviewer for their thoughtful comments that have led to both an improved paper and tool.
We have included in the introduction section a brief summary of the 'superpc' method.
Additionally, below, we respond individually to each comment.
1. The authors write, "Currently, prognosis and outcome predictions are solely based on clinical factors." This seems to dismiss a lot of recent work in this area. On a related note, the most recent paper cited is from 2004.
We agree with the reviewer and have updated the introduction. Considering that the main method, supervised principal components analysis, was published in 2004, the other papers referenced earlier reflect the cited results of the method.
2. The numeric output of the GAC shiny app is often unformatted R output. For example, on the Super PC Binary Outcome tab below the box plots. It would be better to provide formatted results (e.g. using knitr::kable). Also, the p-values (and other numeric outputs) should likely be rounded to a number of significant digits that conveys the precision of the values.
We have modified the output in both the SuperPC binary and continuous outcome sections to provide formatted results using pander, which also provides greater control over the display of results. The numeric output has also been modified as suggested by providing rounded results in the table.
3. There are several grammatical errors. For example: "The researcher can also specify how many numbers to test to check which the optimal threshold is."
We have corrected the grammatical errors in the manuscript.
4. The superPC method is referred to in a variety of ways; it would be good to pick one of these: "Our GAC tool enables the user to perform a SuperPC analysis..." "... super pc analysis can be performed on both continuous …" "We have extended this tool to include super PC…" "Figure 2. SuperPC continuous outcome"
We thank the reviewer for pointing out this inconsistent nomenclature. When specifically referring to the R package, we use the name of the package, 'superpc', and when referring to the tool, method or software, 'SuperPC' is used throughout the manuscript.
No competing interests were disclosed.

2. Build a user manual page to explain the details of each function and use an example to demonstrate the usage.

Is the rationale for developing the new software tool clearly explained? Partly
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 20 Sep 2017
Manali Rupji, Emory University, USA
We thank the reviewer for their thoughtful comments that have led to an improved paper. Below, we respond individually to each comment.
Since the example data is from TCGA and this work aims to facilitate clinicians and researchers with limited programming skills, it is necessary to cite some literature to tell readers how to retrieve and process the TCGA Data, such as TCGA-Assembler (Nature methods, 2014), and it would be better if such tools can be embedded into GAC.
Various tools are available for extracting TCGA data (now available through the GDC data portal). Data can be downloaded directly from the GDC data portal (https://gdc-portal.nci.nih.gov/) or by use of R packages. Some of the more recent packages include TCGA Biolinks (Colaprico et al., 2015), which is available through Bioconductor, and the TCGA Assembler (Zhu et al., 2014), available as an R package. Users may refer to the detailed tutorials of these tools for instructions on extracting TCGA data types.
The authors stated that 'superpc' is unable to address the limitation about clinical association based on a binary outcome in the Introduction. But in the Methods section, it said 'SuperPC for binary outcome follows the same analysis workflow as the other two