TCGAbiolinksGUI: A graphical user interface to analyze cancer molecular and clinical data [version 1; referees: 1 approved, 1 approved with reservations]

The GDC (Genomic Data Commons) data portal provides users with data from cancer genomics studies. Recently, we developed the R/Bioconductor TCGAbiolinks package, which allows users to search, download and prepare cancer genomics data for integrative data analysis. Using this package requires advanced knowledge of R, which limits the number of users. To overcome this obstacle and make the package accessible to a wider range of users, we developed a graphical user interface (GUI) using Shiny, available through the package TCGAbiolinksGUI.

The National Cancer Institute's (NCI) Genomic Data Commons (GDC), a data sharing platform that promotes precision medicine in oncology, provides a rich resource of molecular and clinical data. As of 2018, almost 13,000 tumor patient samples across 38 different cancer types and subtypes are freely available for download and analysis. Currently, the platform includes data from The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET), with the expectation that many other cancer genomic repositories will be incorporated into GDC over the next few years. Researchers have used the publicly available data to make novel discoveries and/or validate important findings related to tumorigenesis, improvements in treatment and diagnosis, and refinement of tumor classifications. To enhance these findings, several important bioinformatics tools to harness cancer genomics data were developed, many of them belonging to the Bioconductor project [1].
TCGAbiolinks [2], an R/Bioconductor package, was developed to facilitate the analysis of cancer genomics data by incorporating the query, download and processing steps directly from GDC. This tool allows users to advance their analysis of cancer genomics data by harnessing additional Bioconductor packages, thereby giving them access to a wealth of statistical methodologies. In addition, it can perform integrative data analysis across different experimental data types, such as DNA methylation and gene expression data. A detailed comparison between TCGAbiolinks and other bioinformatics tools to analyze cancer genomics data was previously described [2]. Although TCGAbiolinks is a suitable R package for data analysts with a strong knowledge of and familiarity with R, specifically those who can comfortably write strings of common R commands, we developed TCGAbiolinksGUI to give users access to the methodologies offered in TCGAbiolinks with the flexibility of point-and-click style analysis, without the need to enter specific arguments. TCGAbiolinksGUI takes in all the important features of TCGAbiolinks and offers a graphical user interface (GUI), thereby eliminating any need to be familiar with TCGAbiolinks' key functions and arguments. In addition, we added new functions to import users' own raw data for further integrative analysis with GDC data. Tutorials via online documents and YouTube video instructions are available from the website to assist end-users in taking full advantage of TCGAbiolinks.
Here we present TCGAbiolinksGUI, an R/Bioconductor package which uses the R web application framework Shiny [3] to provide a GUI to process, query, download, and perform integrative analyses of GDC data.

Infrastructure
TCGAbiolinksGUI was created using Shiny, a web application framework for R. TCGAbiolinksGUI incorporates several packages that provide advanced features to enhance Shiny apps, such as shinyjs to add JavaScript actions [4], shinydashboard to add dashboards [5] and shinyFiles [6] to provide access to the server file system. The following R/Bioconductor packages are used as back-ends for the data retrieval and analysis:
• TCGAbiolinks [2], which allows users to search, download and prepare data from the NCI's Genomic Data Commons (GDC) data portal into an R object and perform several downstream analyses;
• ELMER (Enhancer Linking by Methylation/Expression Relationship) [7,8], which identifies DNA methylation changes in distal regulatory regions and correlates these signatures with the expression of nearby genes to identify transcriptional targets associated with cancer;
• ComplexHeatmap [9], to visualize data as oncoprints and heatmaps;
• pathview [10], which offers pathway-based data integration and visualization;
• maftools [11], to analyze, visualize and summarize genomics MAF (Mutation Annotation Format) files.
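The search, download and prepare steps handled by the TCGAbiolinks back-end can be sketched as follows. This is a minimal illustration rather than the GUI's internal code: the helper name `fetch_gdc_expression` is hypothetical, and the `data.category`, `data.type` and `workflow.type` values are assumptions that depend on the current GDC release and may need adjusting.

```r
# Minimal sketch of the query -> download -> prepare workflow with TCGAbiolinks.
# Assumes TCGAbiolinks is installed and the GDC server is reachable.
fetch_gdc_expression <- function(project = "TCGA-LUSC") {
  # Describe which files to retrieve from GDC (the values below are assumptions).
  query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )
  # Download the matching files into the local working directory.
  TCGAbiolinks::GDCdownload(query)
  # Read the downloaded files into a single R object.
  TCGAbiolinks::GDCprepare(query)
}
```

Calling `se <- fetch_gdc_expression()` triggers a sizeable download, so it is best tried on a small project or a restricted set of samples first.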

Graphical user interface design
The user interface has been divided into three main GUI menus. The first menu defines the acquisition of GDC data. The second, the 'Analysis' menu, is subdivided according to the molecular data types. The third is dedicated to integrative analyses. Each menu is described below (see Figure 1):
• Data: Provides a guided approach to search for published molecular subtype information and for clinical and molecular data available in GDC. In addition, it downloads and processes the molecular data into an R object that can be used for further analysis. For raw DNA methylation data obtained in the form of Intensity Data (IDAT) files, we provide a pipeline using the R/Bioconductor minfi package to prepare the data for subsequent bioinformatics analysis [12]. The pipeline performs a background and dye-bias correction with the preprocessNoob function, followed by detection P-value quality masking (sample-specific) [13] and masking of probes overlapping repeats or single nucleotide polymorphisms (non-sample-specific) [14] (Figure 3).
• Clinical analysis: Performs survival analysis to quantify and test survival differences between two or more groups of patients and draws survival curves with the 'number at risk' table, the cumulative number of events table and the cumulative number of censored subjects table using the R/CRAN package survminer [15].
• Epigenetic analysis: Performs a differentially methylated regions (DMR) analysis, visualizes the results through both volcano and heatmap plots, and visualizes the mean DNA methylation level by groups or subtypes. For certain tumor types, such as glioma, we have added a function to classify non-TCGA-derived DNA methylation data into one of the 7 published epigenomic subtypes [16] using a random forest (RF) model trained on DNA methylation signatures available from The Cancer Genome Atlas (Figure 4). A description of how the RF models were created can be found in the TCGAbiolinksGUI.data vignette.
• Transcriptomic analysis: Performs a differential expression analysis (DEA), and visualizes the results as either volcano or heatmap plots. Pathway analysis can be performed on a list of differentially expressed genes [10].
• Genomic analysis: Visualizes and summarizes the mutations from MAF (Mutation Annotation Format) files through summary plots and oncoplots using the R/Bioconductor maftools package [9,11] (Figure 2 and Figure 6).
• Integrative analysis: Integrates the DMR and DEA results through a starburst plot. Integrates clinical and mutation data by way of a Kaplan-Meier survival analysis comparing samples with and without a mutation in a given gene (Figure 5). DNA methylation and gene expression data can be further analyzed using the R/Bioconductor ELMER package to discover functionally relevant genomic regions associated with cancer [7,8].
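The survival comparison described under 'Clinical analysis' can be sketched with the survival package that ships with R. This is a minimal illustration on simulated data, not the GUI's own code: the data frame `clin` and its columns are hypothetical stand-ins for GDC clinical data, and the survminer call is only attempted if that package is installed.

```r
library(survival)  # recommended package distributed with R

set.seed(1)
n <- 100
# Simulated stand-in for clinical data: follow-up time, event flag, group label.
clin <- data.frame(
  time   = c(rexp(n / 2, rate = 0.10), rexp(n / 2, rate = 0.05)),
  status = rbinom(n, size = 1, prob = 0.8),  # 1 = event, 0 = censored
  group  = rep(c("A", "B"), each = n / 2)
)

# Kaplan-Meier estimate for each group.
fit <- survfit(Surv(time, status) ~ group, data = clin)

# Log-rank test for a difference between the survival curves.
lr <- survdiff(Surv(time, status) ~ group, data = clin)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)

# Optional: curves with number-at-risk, cumulative-event and
# cumulative-censoring tables, if survminer is available.
if (requireNamespace("survminer", quietly = TRUE)) {
  km_plot <- survminer::ggsurvplot(fit, data = clin, risk.table = TRUE,
                                   cumevents = TRUE, cumcensor = TRUE)
  # print(km_plot) displays the figure.
}
```

With real data, `group` would typically be a clinical field or a mutation status chosen by the user.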

Documentation
We provide a guided tutorial for users via an online vignette document which details each step and menu function.
Printable PDF and YouTube video instructions (http://bit.ly/TCGAbiolinksGUI_videoTutorials) are provided to help users utilize TCGAbiolinksGUI. A demonstration version of the tool is available at TCGAbiolinksGUI. To help improve and expand our tool over time, users are encouraged to file bug reports or feature requests via our GitHub repository.

Docker container
We recognize that one of the possible drawbacks of using our tool is the arduous process of installing the R/Bioconductor environment and all of the required dependencies. One solution would be to host the tool on a server; however, the high demand for space, computational processing, and access for multiple users would make this approach financially challenging. Instead, we encourage users to run it on their own computers. To further simplify the usability and accessibility of our tool, we provide a Docker image file that contains the complete R/Bioconductor environment configured to use TCGAbiolinksGUI. This file is compatible with most popular operating systems and is available online. The image can be easily downloaded and deployed through the Kitematic tool, a simple application for managing Docker containers on Mac, Linux, and Windows. Detailed documentation of how to obtain and use the Docker image via the Kitematic application is available in the TCGAbiolinksGUI help document. The Docker image will be updated to coincide with any regular updates of TCGAbiolinksGUI. We believe this will provide several types of access for end-users interested in analyzing cancer genomics data stored at GDC.

Operation
The package can run on any platform with R version 3.4 or higher and Bioconductor version 3.6 or higher.
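A quick way to check these requirements from an R session is sketched below. It assumes the BiocManager package is used to manage Bioconductor, which postdates the article; if it is not installed, the Bioconductor check is simply skipped.

```r
# Check the running R version against the stated minimum.
r_ok <- getRversion() >= "3.4.0"

# Check the Bioconductor version only if BiocManager is installed (an assumption;
# older setups used the biocLite script instead).
bioc_ok <- NA
if (requireNamespace("BiocManager", quietly = TRUE)) {
  bioc_ok <- tryCatch(BiocManager::version() >= "3.6", error = function(e) NA)
}

message("R >= 3.4: ", r_ok, "; Bioconductor >= 3.6: ", bioc_ok)
```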

Results and discussion
To give end-users insight into the application of TCGAbiolinksGUI, 1) we compare TCGAbiolinksGUI with other published bioinformatics tools and 2) we provide a use case with a step-by-step guide to analyzing their own cancer molecular data alongside TCGA data.

Comparison of alternative software
Web tools used for cancer data analysis can be classified into two broad groups. The first group only provides an interface to existing software analysis tools. The Galaxy project, an open, web-based platform for accessible, reproducible, and transparent computational biomedical research, is an example of a tool in this group. The other group is composed of exploratory tools mainly focused on the visualization of processed data and pre-computed results. The cBioPortal project [17,18], which provides several visualizations for mining the TCGA data, is an example of a tool in this category.
If one were to classify TCGAbiolinksGUI, it would belong to the first group. Compared to the Galaxy project, TCGAbiolinksGUI offers an open platform which improves the accessibility of R/Bioconductor packages, making it easier for users to integrate their own features with existing Bioconductor packages. Unlike the Galaxy project, which requires the interface elements to be structured through XML files [19], TCGAbiolinksGUI avoids this requirement because it was built within the R/Shiny framework. Both cBioPortal and TCGAbiolinksGUI provide users with access to raw and processed data; however, TCGAbiolinksGUI allows users to perform in-depth integrative analysis, a functionality which cBioPortal currently lacks. For example, if a user is interested in defining differentially expressed genes or DNA methylation events between two populations of tumors (e.g., FOXA1 mutants and FOXA1 wildtypes), with cBioPortal the user would have to download the gene expression, DNA methylation and mutation data, define the samples per group, and import the data into their favorite statistical tool to identify the list of differentially expressed or methylated genes. With TCGAbiolinksGUI, in contrast, the user can define the sample groups based on their mutation spectrum and perform a supervised analysis that yields a differentially expressed or methylated gene list, which can be ported directly into pathway analysis. In addition, if the user is interested in observing survival differences between the groups, this can also be done within TCGAbiolinksGUI, reducing the need to leave the platform or to transform the downloaded data to fit some other statistical platform to achieve the same goals.
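The group comparison in the example above can be sketched in base R. This is a generic illustration on simulated data, not the differential expression method implemented in TCGAbiolinks: a per-gene Welch t-test stands in for the supervised analysis, and the matrix `expr`, the `status` labels and the cut-offs are all hypothetical.

```r
set.seed(42)
n_genes <- 200
n_samples <- 40

# Simulated log-expression matrix (genes x samples); stands in for GDC data.
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("sample", seq_len(n_samples))))

# Hypothetical grouping of samples by mutation status of a gene of interest.
status <- factor(rep(c("mutant", "wildtype"), each = n_samples / 2))

# Spike a true difference into the first 10 genes for the mutant group.
expr[1:10, status == "mutant"] <- expr[1:10, status == "mutant"] + 3

# Per-gene Welch t-test and mean difference (mutant minus wildtype).
p_raw <- apply(expr, 1, function(x) t.test(x ~ status)$p.value)
delta <- rowMeans(expr[, status == "mutant"]) -
  rowMeans(expr[, status == "wildtype"])

# Benjamini-Hochberg correction, then keep genes passing both cut-offs.
fdr <- p.adjust(p_raw, method = "BH")
de_genes <- names(fdr)[fdr < 0.05 & abs(delta) > 1]
```

The resulting `de_genes` vector plays the role of the gene list that would then be passed on to pathway analysis.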

Use case
To illustrate an integrative analysis using TCGAbiolinksGUI, we provide a use case available at https://bioinformaticsfmrp.github.io/Bioc2017.TCGAbiolinks.ELMER/index.html. This use case provides a step-by-step guide to performing an integrative analysis using TCGA-LUSC (Lung Squamous Cell Carcinoma) data retrieved directly from the GDC server [20].

Conclusion
TCGAbiolinksGUI was developed to provide a user-friendly interface to our TCGAbiolinks package. TCGAbiolinksGUI is designed both for the least experienced R user, who can import GDC data and perform R/Bioconductor analyses, and for the most experienced R user, who can execute several of the R/Bioconductor functions without the need to write several lines of R code. For R/Bioconductor developers, the package has an extensible design that allows users 1) to add new features by modifying a few lines of the main code, 2) to add a file with user interface elements on the client side, and 3) to add a file with their control logic on the server side.
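The split between a user interface file and a server-side control file can be illustrated with a small Shiny example. This sketch uses Shiny's module convention (`NS` and `moduleServer`, available in shiny 1.5.0 and later) purely as an illustration; the text does not state that the package relies on Shiny modules, and the names `histogramUI` and `histogramServer` are hypothetical.

```r
library(shiny)

# Client side: the controls and output for one hypothetical menu.
histogramUI <- function(id) {
  ns <- NS(id)
  tagList(
    sliderInput(ns("bins"), "Number of bins", min = 5, max = 50, value = 20),
    plotOutput(ns("plot"))
  )
}

# Server side: the logic that drives the controls defined above.
histogramServer <- function(id, values) {
  moduleServer(id, function(input, output, session) {
    output$plot <- renderPlot(hist(values, breaks = input$bins))
  })
}

# Main application: wires the two pieces together.
ui <- fluidPage(histogramUI("demo"))
server <- function(input, output, session) {
  histogramServer("demo", values = rnorm(500))
}
app <- shinyApp(ui, server)  # start interactively with runApp(app)
```

In a larger application, each UI function and its server counterpart would typically live in separate files that the main code sources.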
Also, TCGAbiolinksGUI supports the most recent R/Bioconductor data structures (i.e., SummarizedExperiment and MultiAssayExperiment), which hold data and metadata in one single object and validate several integrity requirements. The TCGAbiolinksGUI package thereby makes data handling as efficient as possible and limits user errors in data manipulation, such as removing samples without removing the corresponding metadata.
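The benefit of keeping data and metadata in one object can be seen in a small SummarizedExperiment example. This is a toy sketch with made-up values, assuming the Bioconductor SummarizedExperiment package is installed.

```r
library(SummarizedExperiment)

# Toy assay: 4 genes x 6 samples.
counts <- matrix(1:24, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))

# Sample-level metadata, one row per column of the assay.
samples <- DataFrame(group = rep(c("tumor", "normal"), each = 3),
                     row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts), colData = samples)

# Subsetting samples removes the matching metadata rows in the same step.
tumor <- se[, se$group == "tumor"]
dim(assay(tumor))     # 4 genes, 3 samples
nrow(colData(tumor))  # 3 metadata rows
```

Because the assay and the sample annotation are subset together, the two cannot drift out of sync.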
Finally, several efforts to understand genomic and epigenomic alterations associated with tumor development have been made over the last few years, which present several bioinformatics challenges, such as data retrieval and integration with clinical data and other molecular data types. By creating a graphical interface to tools like TCGAbiolinks, whose relevance is seen in various articles [21-23], this package will help end-users mine the cancer data deposited in GDC, in the hope of aiding the analysis and discovery of new functional genomic elements and potential therapeutic targets for cancer.

Software installation
To install the stable version from the Bioconductor repository http://bioconductor.org/packages/TCGAbiolinksGUI/ please use the following code.
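A minimal sketch of the installation, assuming the current BiocManager installer rather than the biocLite script in use when the article was written; whether the package is available depends on the Bioconductor release in use. The launcher function `TCGAbiolinksGUI()` is the one named in the referee report.

```r
# Install the Bioconductor installer if it is missing (assumes a current R setup).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install the stable release of the package together with its dependencies.
BiocManager::install("TCGAbiolinksGUI")

# Load the package and launch the interface.
library(TCGAbiolinksGUI)
TCGAbiolinksGUI()
```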
Also, due to the number of libraries loaded, we had to increase the maximum number of DLLs that R can load; for more information, please check the vignette section "Increasing loaded DLL".
> sessionInfo() R version 3.4.

The major concern is that this seems like a GUI that is doing a lot of different things and thus should be more modular. The authors emphasize the connection to GDC, but many of the features are applicable to any similar dataset. Ideally, each panel would be its own Shiny module (or maybe even a separate package entirely), and expert users and app developers could select specific modules and integrate them with custom modules to create new applications. There is mention of extensibility in the text. Does that rely on the Shiny module system? It would be helpful for the reader to know that if true. The authors actually compare their tool to Galaxy, a highly modular system. GUIs have a tendency to be monolithic; we should resist that in Bioconductor. This package has 181 total dependencies!

Another point is that GUIs for exploratory data analysis should not only be useful for novices in a programming language. There are times when a GUI is more convenient than programming, even for an expert programmer. A GUI provides an alternative interface to the command line, thus opening the underlying functionality to other use cases, whether a bench biologist desperate for a way to see the data, or a computational biologist who wants to quickly explore the data visually while implementing a more sophisticated analysis.

Why not include the use case / workflow in the publication itself? It's useful to have an archive of that.

Some minor points:
• The abstract mentions a website but does not link to it (until maybe later).
• The use of the word "advanced" to describe the R users of TCGAbiolinks is ambiguous. What exactly do you mean by "advanced"? Later on, advanced appears to mean anyone who can write simple R code. Maybe just drop the ambiguous adjective?
• "Gene expression": do not capitalize "Gene".
• The phrase "specifically those who can comfortably write strings of common R commands" is awkward, especially since R does not really have "commands". Maybe say something like: "Although TCGAbiolinks is accessible to data analysts who are familiar with R programming, ...", although I would phrase it more positively, rather than as a deficiency of TCGAbiolinks, which is just playing the role that it is meant to play in a larger framework.
• "graphics user interface" should be "graphical user interface".
• "Web Application Framework": no need to capitalize.

Is the rationale for developing the new software tool clearly explained?

Yes
Besides taking care of downloading the TCGA datasets, the TCGAbiolinksGUI package also provides basic functions/plots to process mutation/expression/methylation datasets. Additionally, TCGAbiolinksGUI performs integrative analysis of methylation and gene expression and does motif finding on the inferred regulatory network.
TCGAbiolinksGUI has a very modern UI design, quite sophisticated programming and a very friendly user interface. The structure of the analysis workflow is very clear. Besides that, it also has very detailed documentation, even with videos. The functionalities TCGAbiolinksGUI has already implemented are basically nice, and I don't have major issues with them.

Following are my minor comments:
1. There are some low-level errors which cause the TCGAbiolinksGUI() function to crash and the webpage to close. E.g., when I tried to use "network inference" or "maftools plots", it gave the error "no minet function/no read.maf function". I would guess it's mainly because I was using the old version of this package. But it would be nice to capture these low-level errors and print them in the web interface without stopping `TCGAbiolinksGUI()`. Also, in DEA analysis, if the group column is not set, `TCGAbiolinksGUI()` stops.
2. In many analyses where a "group column" is needed, there are so many "clinical information fields" in the drop-down list. Is it possible to remove some of them? E.g., sample IDs, columns with too many missing values, or numeric columns, which I think would never be used in group comparisons.
3. When a file is selected, there is no information on the web interface to tell users whether the file is selected or which file is selected.
4. For DEA analysis, if the group levels used for comparison are not provided, the error message is not informative ("Each group should have at least one sample").
5. When I did enrichment analysis for differentially expressed genes, I directly used the file generated at the "DEA analysis" step. However, it gave the error "no Gene_symbol column", although I checked that there is a "Gene_symbol" column (the first column in the file). I guess it might be because I was using the old version of the package.
6. When making heatmaps for different genes or DMRs, is it possible to put the grouping information that was used for comparison as the default column annotation? I think it won't be too difficult because the group column and group levels are encoded in the file name of the DEA or DMR file.
7. Is it possible to export the R code for making each plot (or performing each analysis)? Users could use these R scripts as templates and customize them later. E.g., since there is no option to configure the annotation colors, with the R script, users could adjust this part by themselves later.

Figure 1. The volcano plot menu of TCGAbiolinksGUI. The panel on the left shows the menus divided into different analyses; the panel on the right shows the controls available for the selected menu. In the center is a volcano plot window from the analysis menu. It is possible to control the colors, to change cut-offs, to export results into a CSV document and to export the plot.

Figure 2. Visualizing mutation summary. This maftools plot shows a summary of the MAF file, highlighting the most mutated genes, SNV class, and variant classification distributions within a tumor type.

Figure 5. Integrating clinical data and mutation data. Performing Kaplan-Meier survival analysis for groups of samples with a mutation in gene USH2A vs WT.

Figure 6. Visualizing mutations as an oncoplot. Each column represents a sample and each row a different gene. The top barplot shows the frequency of mutations for each patient, while the right barplot shows the frequency of mutations in each gene. By default, samples are ordered by the most mutated genes.

F1000Research 2018, 7:439 Last updated: 30 JAN 2019

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Referee Report, 23 April 2018
https://doi.org/10.5256/f1000research.15443.r33002
Zuguang Gu, Division of Theoretical Bioinformatics (B080), Heidelberg Center for Personalized Oncology (DKFZ-HIPO), German Cancer Research Center (DKFZ), Heidelberg, Germany

Silva et al. developed a new package, TCGAbiolinksGUI, which provides an easy way to query, download and analyze TCGA datasets. It would be a useful tool for bioinformaticians and non-bioinformaticians alike to process and make use of the massive TCGA datasets.

For some reason, I only tested TCGAbiolinksGUI with version 1.4.7 (the release version on Bioc at the time of reviewing this paper) and R version 3.4.4. I had some errors when testing some of the functionalities, which I think are due to the older version I was using, and I would expect them to work OK with the development version.
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? No
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Referee Expertise: Bioinformatics, next generation sequencing, R packages development, visualization
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

TCGAbiolinksGUI is available at: https://github.com/BioinformaticsFMRP/TCGAbiolinksGUI. Detailed steps of the use case are available at: https://bioinformaticsfmrp.github.io/Bioc2017.TCGAbiolinks.ELMER/index.html.