Integration of EGA secure data access into Galaxy [version 1; peer review: 2 approved]

High-throughput molecular profiling techniques routinely generate vast amounts of data for translational medicine studies. Secure, access-controlled systems are needed to manage, store, transfer and distribute these data because of their personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access to, and management of, long-term archives of biomolecular data. Each data provider is responsible for ensuring that a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted.


Introduction
With the advent of high-resolution and high-throughput experimental platforms, the field of biomedical research has become more complex, with major shifts in data diversity and dimensions. Consequently, solutions for the increasing demands of data processing, storage and workflow management are required for translational research. Due to the privacy issues related to the clinical nature of translational research and the personal footprints in molecular data, there is a need for a secure framework to store and analyse data. The aim of the CTMM Translational Research IT (CTMM-TraIT) project is to provide a multidomain IT infrastructure as an end-to-end solution where researchers can capture, process, and share their study data. To achieve this, CTMM-TraIT makes use of large community-driven open-source software, including tranSMART 1-3 and Galaxy 4,5. In a collaboration between ELIXIR, CTMM-TraIT and the European Genome-phenome Archive (EGA), a full ecosystem was designed, as shown in Figure 1, to connect the storage of raw molecular profiling data with processed data and the computational workflows.

Storage
Facilitating the long-term storage and management of raw, interpreted and clinical data (patient and tissue information), supported by provenance of computational workflows, is a key aim of the CTMM-TraIT project; special attention to security is necessary due to the privacy-sensitive nature of the data. EGA is a service that provides long-term archiving and distribution of identifiable genetic and phenotypic data resulting from biomedical research projects. Data stored at the EGA are collected from individuals whose consent agreements authorise release only for specific research use to bona fide researchers. Strict protocols govern how information and data are managed, stored, transferred and distributed by the EGA project, and each data provider is responsible for ensuring a Data Access Committee is in place to grant access to their data. However, EGA only functions as a long-term storage facility and does not facilitate analysis. Within the CTMM-TraIT project, we agreed upon a workflow in which the interpreted data, such as BAM files, and the clinical-pathological data would be stored in tranSMART; the raw and uninterpreted data, such as FASTQ and BAM files, would be stored and archived in EGA. Figure 1 demonstrates how the clinical-pathological and interpreted data are managed by tranSMART, which links to the raw data in EGA, which in turn can be accessed and (re)analysed from within a Galaxy environment.

Figure 1. The clinical data of an experiment describe the clinical-pathological data, including tissue and patient information. Descriptors of the samples combined with these variables are stored in tranSMART. Molecular profiling data are derived from patient samples: these samples are processed in the laboratory to obtain tissue derivatives, such as isolated DNA, RNA and proteins, which are subsequently analysed by high-throughput experimental techniques to obtain the raw molecular profiling data; the descriptions of the performed experiments are also stored in tranSMART. The actual raw data produced by the high-throughput analysis are physically stored in repositories like EGA, while the interpreted data, processed by extensive computational workflows, and references to the raw data are stored in tranSMART. The ability to reanalyse the raw data is provided by Galaxy. The work described here, indicated by red arrows, implements a data connection allowing a user to retrieve raw data from EGA in Galaxy and run subsequent workflows constructed with tools from the Galaxy tool shed.
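The linkage described above, with interpreted data and clinical variables in tranSMART and pointers to the raw files in EGA, could be modelled roughly as follows. This is a hypothetical sketch: the field names are illustrative and do not reflect the actual tranSMART schema.

```python
# Hypothetical sketch of the linkage: interpreted data and clinical variables
# live in tranSMART, while raw files are referenced only by EGA accessions.
# Field names are illustrative, not the actual tranSMART schema.
from dataclasses import dataclass, field

@dataclass
class SampleRecord:
    patient_id: str
    clinical: dict                     # clinical-pathological variables
    interpreted_files: list           # e.g. BAM files stored in tranSMART
    raw_ega_accessions: list = field(default_factory=list)  # pointers into EGA

rec = SampleRecord(
    patient_id="P001",
    clinical={"tissue": "prostate"},
    interpreted_files=["sample1.bam"],
    raw_ega_accessions=["EGAF00001210838", "EGAF00001210839"],
)
print(all(a.startswith("EGAF") for a in rec.raw_ega_accessions))  # True
```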

Integrated analysis
Within EGA, data are separated into different layers: 1) raw data, produced by high-throughput platforms; 2) metadata describing the raw data, e.g. machines and protocols used and descriptions of treatments and tissues; and 3) interpreted data, produced by running analyses on the first two layers. Since the EGA is ideally placed to facilitate continued data access and management for funded projects after their completion, and the data from layer 3 should be reproducible using the data from the other layers, only data from layers 1 and 2 go to EGA for archival storage.
In the ecosystem we use Galaxy, a popular and user-friendly web-based bioinformatics platform that provides an intuitive user interface to run and design workflows, to perform integrated analysis across multiple domains (genomics, transcriptomics and proteomics), and to share and communicate both results and methodologies. It makes use of tools and libraries provided by the bioinformatics community, embedded as plugins in such a way that each becomes a modular block that can be plugged into the next block (tool or visualisation) 6. To directly import, and hence analyse, data stored in EGA within Galaxy, it is necessary to implement an interface from EGA to Galaxy as such a plugin (Galaxy tool wrapper).
Here we present an end-to-end interface to a framework which seeks to extend data accessibility, ensures long-term archival and facilitates downstream analysis by utilising EGA. The framework embeds EGA access into Galaxy, and allows subsequent workflows using (novel) Galaxy tools. An advantage of setting up an analysis in this way is that both the tools and the data are connected and centralised and can be shown, shared, and reproduced. We further demonstrate the setup with an RNA-Seq use case.

Implementation
We have embedded the EGA download client (https://ega-archive.org/using-ega-download-client) into Galaxy as a tool wrapper, including dependency management. The tool is named ega_download_streamer and can be installed on Galaxy systems from the main tool shed. Before the tool can be used, Galaxy needs to be configured with an EGA account, as explained further in Supplementary File 1. Hereafter, we call this tool the "Galaxy EGA download streamer". To allow access to EGA, the tool interfaces with the EGA download client, ensuring that data are transferred from EGA in encrypted form. The Galaxy EGA download streamer gets data from EGA directly into the user's history. On the Galaxy side, the server starts with a form requesting a unique EGA file identifier. After submission, it logs in with the configured credentials, creates an encryption key and sends this over a secure connection to EGA, requesting EGA to encrypt the file that corresponds to the identifier with the given key. After the request is made, the encrypted package becomes available and is downloaded; subsequently the connection with EGA is closed. The package is then locally decrypted and, if it is a file archive, extracted. Galaxy determines the file type (FASTQ, BAM, GTF, etc.) with its built-in sniffing system and finally puts the files into the user's history.
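The sequence of steps above can be sketched as a simulation. This is illustrative only: the real EGA API, client calls and encryption differ, and every name here (make_session_key, fetch_encrypted, FAKE_ARCHIVE, and so on) is a hypothetical stand-in, with a toy XOR cipher standing in for real encryption.

```python
# Illustrative simulation of the download streamer's steps; all names are
# hypothetical stand-ins, and a toy XOR cipher stands in for real encryption.
import secrets

def make_session_key() -> bytes:
    """Step 1: Galaxy creates a fresh symmetric key for this request."""
    return secrets.token_bytes(32)

def xor(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def fetch_encrypted(file_id: str, key: bytes) -> bytes:
    """Steps 2-3 (simulated): EGA encrypts the requested file with the
    submitted key and returns the encrypted package."""
    return xor(FAKE_ARCHIVE[file_id], key)

def sniff_type(data: bytes) -> str:
    """Step 4: a much-simplified stand-in for Galaxy's datatype sniffing."""
    if data.startswith(b"@") and b"\n+" in data:
        return "fastq"
    if data.startswith(b"##gff"):
        return "gtf"
    return "data"

# Toy 'archive' keyed by EGA-style file identifiers.
FAKE_ARCHIVE = {"EGAF00001210838": b"@read1\nACGT\n+\nFFFF\n"}

key = make_session_key()
package = fetch_encrypted("EGAF00001210838", key)   # encrypted in transit
data = xor(package, key)                            # decrypted locally
print(sniff_type(data))  # fastq
```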

Operation
Galaxy version 16.07 or above is required to use this tool, because only from this version onwards can the tool detect data types automatically within a workflow. In addition, at least 30 GB of RAM and 100 GB of hard disk space are required to run the use case in the next section. Other system requirements for installing Galaxy can be found in the official Galaxy documentation (https://wiki.galaxyproject.org/Admin/GetGalaxy).

Use case
As a proof of concept we show how the Galaxy EGA download streamer may be used in a workflow to detect fusion genes from RNA-seq data. To demonstrate this workflow we use cell line data that can be made publicly available: an RNA-Seq dataset of the VCaP cell line, used in the Galaxy workflow shown in Figure 2. We test for the presence of fusion genes using TMPRSS2-ERG as a positive control, since this recurrent fusion is found in more than 50% of diagnosed patients 7 and is present in VCaP 8. We use the tool STAR-Fusion (https://github.com/STAR-Fusion/STAR-Fusion/wiki), which can be used as a separate module after running the RNA-STAR aligner 9.
Besides the Galaxy EGA download streamer accessing EGA, this workflow also required adapting the RNA-STAR Galaxy wrapper from the IUC group (https://toolshed.g2.bx.psu.edu/repos/iuc/rgrnastar), by adding a preset with fusion-detection-specific settings, and creating a new Galaxy wrapper for STAR-Fusion. The workflow starts with obtaining the data from EGA, which for this study are the raw paired-end FASTQ sequencing reads. These files correspond to the EGA identifiers EGAF00001210838 (forward) and EGAF00001210839 (reverse) and are the input for the Galaxy EGA download streamer. Because we want to ensure a clean handshake with other tools, given the several sub-formats of FASTQ 10, it is desirable to proceed with a FASTQ-sanger encoded file, which is ensured by the tool FASTQ Groomer 11. Note that the search space for alignment is larger for fusion gene detection than for most other alignment purposes, such as determining expression levels; hence we want high base quality to avoid misalignments and unnecessary computation. We improve the base quality by trimming low-quality bases (Q<30) and discarding reads clipped to fewer than 25 bases with the tool Sickle (12; https://github.com/najoshi/sickle). These high-quality sequencing reads were aligned to the hg19 (GenBank Assembly ID: GCA_000001405.1) reference genome. As proposed by the authors of STAR-Fusion, we use fusion-detection-specific settings, available as the "Use parameters suggested for STAR-Fusion" preset in the IUC's Galaxy RNA-STAR wrapper. Besides a classical alignment file, this also produces an alignment file for the discordant reads and an equivalent junction file. STAR-Fusion uses the junction file to predict fusion genes and requires two additional reference files (Data and software availability). STAR-Fusion produces a list that contains many candidates, including predictions with a rather low confidence level (fewer than 3 split or spanning reads). Therefore, we end the workflow with two filters that only keep predictions with more than two split reads and more than two spanning reads.
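The final filtering step can be sketched as follows. The toy table and its column names are illustrative only; actual STAR-Fusion column names vary between versions.

```python
# Illustrative sketch of the two final filter steps: keep only predictions
# with more than two split (junction) reads AND more than two spanning reads.
# The column names below are made up for this toy table; real STAR-Fusion
# output columns differ by version.
import csv
import io

TSV = """fusion_name\tjunction_reads\tspanning_reads
TMPRSS2--ERG\t14\t9
GENE1--GENE2\t2\t5
GENE3--GENE4\t4\t1
"""

def filter_predictions(text: str, min_reads: int = 3):
    """Keep rows where both read counts are at least min_reads (i.e. > 2)."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [r for r in rows
            if int(r["junction_reads"]) >= min_reads
            and int(r["spanning_reads"]) >= min_reads]

kept = filter_predictions(TSV)
print([r["fusion_name"] for r in kept])  # ['TMPRSS2--ERG']
```

Only the positive-control fusion survives both filters in this toy example; the low-confidence rows fail one count each.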

Discussion and conclusions
The EGA-TraIT implementation study sets out to design an entire ecosystem for molecular profiling data in clinical research, with a focus on security. Here we demonstrate with a proof-of-concept study that it is possible to connect EGA and Galaxy as designed within this system. This study is part of an ongoing effort to make EGA data correspond to the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles 15, which will result in further recommendations on the EGA data model and ontologies in the near future. Here we highlight the implementation of the storage component and demonstrate how to use it in an analysis context. Its key value is that it allows tracking and redistributing the entire workflow and data jointly, from beginning to end, ensuring the provenance of all intermediate layers up to the final results. As a result, we have:
• shared molecular data via EGA;
• created new Galaxy tools;
• shared the workflow, including all parameters, via a URL for Galaxy (as a shared history) and via myExperiment 16;
• shared the interpreted data as a Galaxy history;
• shared a manual, as a Galaxy page, on how to set up such an experiment.
Further to the work described here, the implementation study continues until the end of 2016, and the complete outcomes from this project, with recommendations on structuring metadata, will be presented in a future report.

Figure 2. A workflow of multiple components to derive interpreted data. The forward and backward FASTQ sequencing reads are imported from EGA by ega_download_streamer; subsequently, the tool FASTQ Groomer performs a consistency check of the data formats; then, with Sickle, low-quality bases (Q<30) are trimmed and reads clipped to fewer than 25 bases are discarded, outputting only the high-quality sequencing reads. Afterwards, these reads are aligned to the hg19 (GenBank Assembly ID GCA_000001405.1) reference genome with RNA-STAR. STAR-Fusion is then used to predict the fusion genes, which also requires two reference files as auxiliary inputs. The output goes through two filters to keep only predictions having more than two split reads and more than two spanning reads.

Recommendations
A limitation of the working prototype of the Galaxy EGA download streamer is that it requires setting up a generic EGA account for the entire public Galaxy server. This means that any user can only access the data files that are available to that generic account, rather than to a personal account. We have considered several solutions:
• A secure input type for passwords. However, Galaxy currently does not support password input types, and textual input types are recorded in the database, which allows them to be exposed when history items are shared with other users.
• Adapt EGA so that it shares tokens that allow download of a particular file within a particular time window. However, rerunning the tool would require selecting a new token. For this setup, it would be ideal to have a non-memorised data type.
• An authentication management mechanism within Galaxy. If a user configures credentials within Galaxy, Galaxy can manage them and automatically connect to EGA on request (OAuth model).
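The time-limited token idea in the second bullet could, for example, be realised by signing a (file identifier, expiry) pair with an HMAC, so that a token authorises exactly one file for a limited window. The sketch below is purely hypothetical: it is not an existing EGA mechanism, and the secret, token layout and TTL are invented for illustration.

```python
# Hypothetical sketch of a time-limited download token: the issuer signs
# (file id, expiry) with HMAC-SHA256. NOT an existing EGA mechanism; the
# secret, layout and TTL here are invented for illustration.
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # would be known only to the issuer (EGA)

def issue_token(file_id, ttl_s=3600, now=None):
    """Return 'file_id|expiry|signature' valid for ttl_s seconds."""
    expiry = int((now if now is not None else time.time()) + ttl_s)
    msg = f"{file_id}|{expiry}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{file_id}|{expiry}|{sig}"

def check_token(token, now=None):
    """Accept only tokens with a valid signature that have not expired."""
    file_id, expiry, sig = token.rsplit("|", 2)
    expected = hmac.new(SECRET, f"{file_id}|{expiry}".encode(),
                        hashlib.sha256).hexdigest()
    fresh = (now if now is not None else time.time()) < int(expiry)
    return hmac.compare_digest(sig, expected) and fresh

tok = issue_token("EGAF00001210838")
print(check_token(tok))                          # True: valid and fresh
print(check_token(tok, now=time.time() + 7200))  # False: expired
```

Because the signature covers the file identifier, a token cannot be reused for a different file, and expiry makes a leaked token useless after the window closes.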
Due to the current limitations of data protection and access control in a public Galaxy service, a private Galaxy instance seems a practical solution to this problem, keeping data access limited to a small research group. This does, however, require extra expertise to establish the service properly in a secure environment.

Data and software availability

Tools
The software can be found in the main Galaxy tool shed at the following URL

Open peer review
This manuscript describes a proof of principle for accessing the data contained within the European Genome-phenome Archive (EGA) via the Galaxy analysis environment. This is a very timely and important development, as it would dramatically increase the utility of resources such as EGA: the continuous accumulation of new data and the development of new technologies will undoubtedly result in the need to reanalyze previously generated datasets and to combine them with newly acquired research outcomes. Having the ability to securely request and analyze the data via Galaxy makes such re-analyses straightforward and convenient, resulting in more researchers performing these tasks. My hope is that this will push other sites with similar data (e.g., dbGaP) to implement similar software solutions.
I have several comments:
To access the data, a user must have the necessary security credentials. Once the data are transferred into Galaxy, they leave the EGA filesystem. How do the authors secure this particular Galaxy instance? Other sites housing protected data will need this information to decide whether to adopt a similar strategy.
It is necessary to explain why the Galaxy instance for EGA access requires a minimum of 30 GB RAM and 100 GB of disk space. It seems that these requirements are dictated by data size and the types of tools used (i.e., STAR tools require a considerable amount of memory to perform analyses of RNA-seq data). Again, other sites adopting Hoogstrate et al.'s approach should be aware of the fact that Galaxy itself requires few resources; instead, the underlying tools and data dictate the hardware specifications.
The Galaxy Toolshed is introduced in the paper with little explanation of what it is. A couple of sentences describing this "AppStore" would be helpful to readers who are not familiar with Galaxy's ecosystem.
While it is a good starting example, it would be helpful to show that one can generate more concrete results with Galaxy beyond just filtering STAR-Fusion output. Is there any interesting bit of biological information that could be added to the paper?

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.