shinySISPA: A web tool for defining sample groups using gene sets from multiple-omics data

As opposed to genome-wide testing of several hundreds of thousands of genes on very few samples, gene panels target as few as tens of genes and enable the simultaneous testing of many samples. For example, some cancer gene panels test for 50 genes that can affect tumor growth and potentially identify treatment options directed against the genetic mutation. The increasing popularity of gene panel testing has spurred the technological development of panels that test for diverse data types such as expression and mutation. Once samples are tested, there is the desire to examine clinical associations based on the panel and for this purpose, one would like to identify, among the samples tested, which show support for a molecular profile (e.g., presence of mutation with increased expression) versus those samples that do not among the genes tested. With user-specified molecular profile of interest, and gene panel data matrices (e.g., gene expression, variants, etc.) that define the profile, shinySISPA (Sample Integrated Set Profile Analysis) is a web-based shiny tool to define two sample groups with and without profile support based on our previously published method from which clinical associations may be readily examined. The shinySISPA can be accessed from http://shinygispa.winship.emory.edu/shinySISPA/.


Introduction
Unlike gene set profiling, sample profiling is a challenge due to the heterogeneity between, and within the tumor patient samples. Identification of homogenous groups of samples or molecular subtypes is commonly approached using clustering methods (e.g., Handl et al., 2005;Kowalski et al., 2016;Monti et al., 2003;Șenbabaoğlu et al., 2014;Verhaak et al., 2010). Whether or not the sample groups are meaningful and the clustering stable, requires additional testing, is highly subjective, limited to examining changes in a single data type, and often require removal of genes or samples to obtain the desired results. While clustering tools such as TNBC subtyping (Chen et al., 2012;Lehmann et al., 2011) are convenient for subtype discovery and sample classification, they are restricted to studying a specific cancer tissue and data type, within the context of established expression signature profiles. Although these methods may prove useful in certain case, there is a need for a basic tool that can identify sample groups using any combination of genomic data types based on a gene or gene set and molecular profile of interest. Some examples of gene sets may be derived from a specific biological process, network, gene enrichment analysis, a gene panel, etc. A molecular profile is a series of increasing or decreasing changes among diverse data types operating on a given gene set. For example, a gene mutation with expression is a molecular profile of increased variant support with increased levels of expression. The shinySISPA is a web tool developed to implement the novel method, SISPA (Kowalski et al., 2016), for defining samples with similar molecular profiles based on a user input gene set and data types. SISPA does not impose analytical distribution assumptions on the data, and is scalable to define samples that support a general profile defined by any combination of genomic data types applied to any number of genes.

Methods
Implementation SISPA is written in the R programming language (R project) and the shiny web application framework is implemented using the Shiny R package (Chang et al., 2016).
The tool is hosted on a 64bit CentOS 6 server (http://shinygispa. winship.emory.edu/shinySISPA/) running the Shiny Server program designed to host R Shiny applications. This tool has been extensively tested on Windows 7 and Mac Pro 10 operating system with firefox and google chrome browser. Given a dataset of 377 samples and 16 genes under the two-feature analysis, it took three seconds to obtain shinySISPA defined sample groups and less than a second to generate the waterfall plot. The time it takes to generate sample profile diagnostic plots depends on the number of genes in a set; it took less than 10 seconds for 16 genes in both sets of a two-feature analysis. As a note, speed at which results are generated is also dependent on the internet connectivity.

Operation
The tool workflow consists of four basic inputs as shown in Figure 1: (1) Selecting the analysis type. User selects a single-, two-, or three-feature analysis, where a feature corresponds to a specific data type (e.g., expression, methylation, mutation, copy number variation) and thus, a single-feature analysis refers to use of a single data type, while a two-feature uses a combination of two data types and so forth.
(2) Uploading the data. User inputs the data for each feature containing the genes and samples of interest. The same samples are required for each feature, though the gene sets may differ between features.
(3) Specifying a molecular profile. A molecular profile is a series of increasing ("up") or decreasing ("down") genomic changes within each feature. In Figure 1, a profile of decreased expression with decreased copy number is input.
(4) Selecting the number of breaks to define sample groups. User can specify the change point detection method (Killick et al., 2016) for finding optimal break points in the distribution of computed composite (among features) z-scores within samples (Kowalski et al., 2016; see Supplementary File 1).
The results are output in four separate tabs: (1) Input Data. Summarizes the user input data in terms of the input number of genes, number of samples, and box plot distribution by data type.
(2) SISPA Results. Outputs the table of defined sample groups with their gene set enrichment score for the selected analysis type and molecular profile of interest. The scatter plot on the right displays all the change points detected within the data-set, samples falling in the topmost change point are the samples with the profile activity. The frequency plot at the rightmost bottom represents the distribution of the number of samples with and without the profile activity.
(3) Waterfall Plot. To visualize the sample groups that correlate with the profile of interest. Samples with the profile activity have the highest score and are shown in orange filled bars, while samples without the profile activity are shown with grey-filled bars.
(4) Sample Profile. Represents the diagnostic plots to visualize the distribution of the user-input data overall by the identified

Amendments from Version 1
This version includes revisions and response to referee report 16 Apr 2018. Mainly, we have changed the default parameter settings (under 'Sample Profile' and 'Changepoint Input' panel) on the online version of the shinySISPA web-tool to output results as shown in the manuscript. The example datasets are available for one-, two-, and three-feature analysis under 'Data Input' panel. We have updated Figure 1 and Figure legend pertaining to the reviewer comment. We have corrected the typo in the main text pertaining to the number of patients with and without profile activity. Figure 1. A schematic representation of shinySISPA workflow for a two-feature analysis. Here, we define samples supporting the molecular profile of decreased gene expression and copy loss. The tool requires user selection of analysis type, user upload of data types on samples and gene sets, and specification of a profile to output the samples supporting that profile. The samples are selected based on a change point model applied to composite (among features and genes), within-sample z-scores. A waterfall plot of profile activity is output with samples selected in orange as showing the most support for the profile.

REVISED
sample groups. It also allows the users to view data distribution for a selected gene in the set within each data type to assess what genes in particular satisfy the profile versus samples without profile.
All results generated during the process are directly downloadable on the user's local computer. A detailed manual with tool settings are provided in the Supplementary File 1. Upon forming such sample groups, one may readily examine the effect of a profile on various clinical and biological clinical outcomes.

Use case
We applied shinySISPA to profile newly diagnosed multiple myeloma (MM) patients for decreased gene expression and copy number based on a GISPA (Gene Integrated Set Profile Analysis)-derived gene set characterizing the IgH translocation in the MM cell lines (Kowalski et al., 2016). The t(14;16) translocation is known to be associated with poor prognosis in MM. By applying the shinySISPA tool to the t(14;16) characterized gene set profile, we were able to translate cell line profiles to patient profiles. Using the IA6 release of Multiple Myeloma Research Foundation (MMRF) CoMMpass study, we downloaded data from 377 newly diagnosed patients at pre-treatment with available clinical outcomes, RNA-Seq expression, and DNA-copy number variations from the MMRF Research Gateway portal. Based on our two-feature analysis, 7 of the 370 MM patients were defined with profile activity (Figure 1) by identifying changes in variance using change point v2.2.2. Furthermore, we used CASAS (Rupji et al., 2017) to compare survival curves of the identified two sample groups for downstream clinical interpretation. We found seven samples with profile activity to be significantly (P<0.0001) associated with poor survival as compared to the 370 samples without the profile activity (HR = 9.81; 95% CI = (3.39, 28.37)).

Conclusion
We have demonstrated the utility of our shinySISPA tool in translating cell line characterized gene sets molecular profile to patient profiling (Kowalski et al., 2016); however, one can use any a priori-defined gene sets with any combination of molecular data for identifying samples with a similar gene set profile. The introduction of a change point model to select samples with profile support offers a more objective approach than with clustering methods. With only a gene set and a combination of data types from the same samples, the tool is widely applicable to many settings. For example, shinySISPA may be used to define patients based on known drug targets and pathways, or to identify patients that may be at risk for poor prognosis based on known prognostic markers.

Data and software availability
The example sample data used to demonstrate shinySISPA workflow is available on the web-version and with the package source code at: https://github.com/BhaktiDwivedi/shinySISPA. This work proposes a web-based tool allowing to identify a subgroup of samples that share similar molecular profiles based on gene expression information and visualize the results in various plots. It is written in R program language. This tool would be useful for easily dealing with clinical data and genomic data.
One weakness is to only define two sample groups but not multiple groups.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes The authors introduce a Shiny application that are developed as a user-friendly tool for conducting omics-data analysis with their published methodology work. The article provides detailed description of the inputs and outputs for the shiny application, and shows working example for the use of this application. This article is index-able with some necessary modifications. The following are my major comments.
Before introducing the shiny application, the authors should provide a brief summary of their methodology work, including clear definitions of important terminologies and a description of their model. It would greatly help the readers to understand the contents followed subsequently. With a glimpse on the methodology paper, I found the terms such as "molecular profile", "feature" are defined misleadingly in this article. What is a change point model? For the input element (4), where to select the number of breaks? Is it the max Q allowed? What is the max Q? What are the options for "Changes Using" in the bottom right of Figure 1? For the output element (2), it is better to include a figure with the description. In the "Use case" section, Why GISPA is used? What's the relation of GISPA and SISPA? How many genes are used? Is it 16? Where are the 7 patients in Figure 1? Are they the orange bars? What is "using change point v2.2.2"? "We found seven samples with profile activity to be significantly ( <0.0001) associated with P poor survival as compared to the 300 samples without the profile activity (HR = 9.81; 95% CI = (3.39, 28.37))." Previously mentioned, there are 377 patients. Why are 70 samples missing? For the reproducibility of the work, please upload a default dataset in the shiny application, so that the readers/users can replicated the analysis for the first time. Also, I provide my minor comments below.
Please consider a thorough revision of the English writing for this scientific article. There are a 1.

2.
Please consider a thorough revision of the English writing for this scientific article. There are a number of grammatical errors, and sentences do not flow smoothly. Please avoid using colloquial words in scientific writing. Please modify the legend of Figure 1. In the brackets "(shown in grey)", it is hard to locate where it is referring to since there are many grey colors in the figure. Also, please add labels to the subfigures that are referred in the text.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly No competing interests were disclosed.

Competing Interests:
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Approved with Reservations
The authors introduce a Shiny application that are developed as a user-friendly tool for conducting omics-data analysis with their published methodology work. The article provides detailed , and shows working example description of the inputs and outputs for the shiny application for the use of this application. This article is index-able with some necessary modifications. The following are my major comments.
1) Before introducing the shiny application, the authors should provide a brief summary of their methodology work, including clear definitions of important terminologies and a description of their model. It would greatly help the readers to understand the contents followed subsequently.
We have provided a reference to our methods paper that includes detailed information Response:

1.
We have provided a reference to our methods paper that includes detailed information Response: on the SISPA approach. In this paper, our focus is upon introducing the application of the method in terms of tool development and implementation by providing detailed examples and information on data input and output/results. Considering this focus, along with space constraints, we have opted to not repeat the already published method description and instead, reference it.
With a glimpse on the methodology paper, I found the terms such as "molecular profile", "feature" are defined misleadingly in this article. We have intentionally used the terms "molecular profile" and "feature" in the same Response: context as in the published methods papers, whereby "molecular profile" refers to change of either increase ("up") or decrease ("down") and "feature" refers to a specific data type (e.g., expression, methylation, copy number change). We have also provided examples of the term "molecular profile" to further clarify the context. In this paper: "A "feature" corresponds to a specific data type (e.g., expression, methylation, mutation, copy number variation) and thus, a single-feature analysis refers to use of a single data type, while a two-feature uses a combination of two data types and so forth." (pg# 3 last paragraph, under ) Selecting the analysis type "A "molecular profile" is a series of increasing ("up") or decreasing ("down") genomic changes within each feature…" (pg# 4 second paragraph, under Specifying a molecular profile) In the published methodology paper: "The Cancer Genome Atlas (TCGA) nomenclature ( ) that references a specific data type https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm (RNA-Seq expression, DNA CpG methylation, etc.) as a feature." "A profile is defined by specifying , a change of either increase or decrease within each of a priori the features.."

What is a change point model?
A changepoint model is a method for identifying changepoints within data. It is a Response: published method. Please see below references to the changepoint method and the changepoint R package for details: Killick R and Eckley IA (2014). "changepoint: An R Package for Changepoint Analysis." ,  (4), where to select the number of breaks? Is it the max Q allowed? What is the max Q? What are the options for "Changes Using" in the bottom right of Figure 1? Yes, the allotted maximum number of breaks is specified using the "Max Q Allowed". Response: "In input element (4), users can modify the Changepoint Input (Killick R, ., 2016) to find the et al optimal break points within the estimated profile sample score (Kowalski, ., 2016). The et al changes can be found using mean("mean"), variance ("var") or both ("meanvar") with the user-specified changepoint method ("AMOC", "BinSeg", "PELT", or "SeqNeigh") given the allotted maximum number of change points ("Max Q allowed"). Note that the number of change points identified may differ for the same dataset depending on the change point R package version installed on the system. Currently we are running changepoint version 2.2.2 on our hosting server..." Please see pg# 6 of the supplementary file 1 for the details.

1.
Please see pg# 6 of the supplementary file 1 for the details.
3) For the output element (2), it is better to include a figure with the description.
The output elements ("Input Data" "SISPA results", "Waterfall Plot" and "Sample Response: Profile") screenshot including the figures are explained in detail in the supplementary file 1. Please see pg# 7-11 under "Result" Section. 4) In the "Use case" section, Why GISPA is used? What's the relation of GISPA and SISPA? Response: GISPA (Gene Integrated Set Profile Analysis) is a method designed to define gene sets with similar, a priori specified molecular profile. While, SISPA (Sample Integrated Set Profile Analysis) is a method designed to define sample groups with similar gene set a priori specified molecular profile. Both GISPA and SISPA method are published in Nucleic Acid Research (Kowalski et al., 2016;PMID: 26826710).
GISPA was used to identify genes with decreased expression and decreased copy change molecular profile in a multiple myeloma cell line with IgH translocation. This gene set is published in the methodology paper Nucleic Acid Research (Kowalski et al., 2016;PMID: 26826710).
Here, we extracted RNA-seq expression, and copy number change data for GISPA derived gene set characterizing the IgH translocation on 377 newly diagnosed patients enrolled in the coMMpass clinical trial to define samples with a similar gene set profile, i.e., decreased expression with copy loss. This example data is provided with this paper. Pg# 4 last paragraph under "Use case" describes the use of application of SISPA using GISPA derived gene sets.
2. How many genes are used? Is it 16?
Yes. The number of genes in the expression and copy number variation data is 16.

Response:
3. Where are the 7 patients in Figure 1? Are they the orange bars?
The patients with and without profile activity are highlighted in "Samples Supporting Response: the Gene Set Profile" labeled section of the Figure 1 5. "We found seven samples with profile activity to be significantly ( <0.0001) associated with P poor survival as compared to the 300 samples without the profile activity (HR = 9.81; 95% CI = (3.39, 28.37))." Previously mentioned, there are 377 patients. Why are 70 samples missing?
We have corrected the typo, please see page# 4 and 5, last paragraph. It is 7 of 370 Response: patients. "Based on our two-feature analysis, 7 of the 370 MM patients were defined with profile activity (  Figure 1) by identifying changes in variance using change point v2.2.2. Furthermore, we used CASAS ( Rupji , 2017) to compare survival curves of the identified two sample groups for et al. downstream clinical interpretation. We found seven samples with profile activity to be significantly ( <0.0001) associated with poor survival as compared to the 370 samples without the profile P activity (HR = 9.81; 95% CI = (3.39, 28.37))." 5) For the reproducibility of the work, please upload a default dataset in the shiny application, so that the readers/users can replicated the analysis for the first time.
All users can access and analyze the example dataset (i.e., default dataset) used in Response: the paper by choosing the "Example data" from the Upload Input option on the web-interface. The data is also available to download from GitHub ( . https://github.com/BhaktiDwivedi/shinySISPA) Users are able to obtain the same exact results, i.e., samples with and without profile activity using the current default settings implemented in shinySISPA web tool.
Also, I provide my minor comments below.
1. Please consider a thorough revision of the English writing for this scientific article. There are a number of grammatical errors, and sentences do not flow smoothly. Please avoid using colloquial words in scientific writing.
We have reviewed the manuscript and do not identify any such problem. If the Response: reviewer still feels strongly about it, we kindly request specific examples to be cited from the text.
2. Please modify the legend of Figure 1. In the brackets "(shown in grey)", it is hard to locate where it is referring to since there are many grey colors in the figure. Also, please add labels to the subfigures that are referred in the text.
We have addressed and incorporated these changes. Please see updated Figure  Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? No Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly No competing interests were disclosed.