A guide and best practices for R/Bioconductor tool integration in Galaxy [version 1; peer review: 1 approved, 1 approved with reservations]

Galaxy provides a web-based platform for interactive, large-scale data analyses, which integrates bioinformatics tools written in a variety of languages. A substantial number of these tools are written in the R programming language, which enables powerful analysis and visualization of complex data. The Bioconductor Project provides access to these open source R tools and currently contains over 1200 R packages. While some R/Bioconductor tools are currently available in Galaxy, scientific research communities would benefit greatly if they were integrated on a larger scale. Tool development in Galaxy is an early entry point for Galaxy developers, biologists, and bioinformaticians, who want to make their work more accessible to a larger community of scientists. Here, we present a guide and best practices for R/Bioconductor tool integration into Galaxy. In addition, we introduce new functionalities to existing software that resolve dependency issues and semi-automate generation of tool integration components. With these improvements, novice and experienced developers can easily integrate R/Bioconductor tools into Galaxy to make their work more accessible to the scientific community.


Introduction
The Bioconductor Project (https://www.bioconductor.org/) provides one of the largest suites of open source bioinformatics tools for analyzing genomics and biomedical data from diverse high-throughput assays, including DNA microarrays, flow cytometry, and deep sequencing 1,2. Bioconductor has an active user community of researchers from across a spectrum of biomedical professions, who routinely analyze complex genomics datasets. Bioconductor tools and packages are primarily based on R (https://www.r-project.org/), a programming language and integrated environment developed for statistical computing and data visualization 3. By submitting R packages to Bioconductor, developers can easily distribute powerful statistical analysis and visualization tools, while supporting efforts towards reproducible scientific research.
To further increase their usability and distribution, some R/Bioconductor tools have been integrated into Galaxy (https://galaxyproject.org/), an open source, web-based platform for performing, reproducing, and sharing data analyses, which utilize a variety of bioinformatics tools 4,5. R/Bioconductor tool integration into Galaxy is a multi-step process that, although straightforward, poses unique challenges for both novice and advanced tool developers. The first major challenge is identifying and installing all of the dependencies needed for an R/Bioconductor tool. Each dependency needs to be available to Galaxy through its dependency management system, called the Tool Shed (https://toolshed.g2.bx.psu.edu/). If a dependency is not available in the Tool Shed, then it must be installed manually, which can often be difficult. The second major challenge is generating the required files/code for tool integration, which is a time-consuming process. Ensuring that the correct files are generated with correct syntax can be a frustrating task and is often the biggest hurdle for less-experienced developers. Given these issues, tool integration remains a daunting task for some developers, especially for those less familiar with command-line processes. Ideally, the R/Bioconductor tool integration process would be easy and intuitive for both novice and advanced developers.
Recognizing the need to improve the R/Bioconductor tool integration process, we outline here a guide and best practices (Table 1) for R/Bioconductor tool integration into Galaxy. Furthermore, we highlight the addition of new functionality to Planemo (http://planemo.readthedocs.io) 6, a suite of command-line utilities that assists in building and publishing Galaxy tools. To simplify generation of required tool integration files, we introduce a new Planemo command that creates template files based on an R/Bioconductor tool developed by the user. This new command also addresses dependency management issues by leveraging Bioconda (https://bioconda.github.io), a channel of the Conda package manager, which provides bioinformatics software 7. Bioconda locates specific versioned R/Bioconductor packages via BioaRchive (https://bioarchive.galaxyproject.org/) 8 and CRAN (https://cran.r-project.org/) 9 and includes these as requirements for using the R/Bioconductor tool. These new Planemo functionalities simplify R/Bioconductor tool integration and will enable developers to focus more attention on R/Bioconductor tool development and less on tool integration into Galaxy.

Methods
Here, we present a complete guide for developers to integrate an R/Bioconductor tool into Galaxy. We first describe the necessary components for wrapping an R/Bioconductor tool in Galaxy. Next, we outline the steps required for integrating the tool components into Galaxy, testing the tool, and executing the tool. Finally, we describe a simplified integration process using the new Planemo command, bioc_tool_init. The instructions provided in this guide assume that the developer has system access to Galaxy source code files (e.g. using a local instance or cloud-based instance of Galaxy), has an active Internet connection, and has installed the Planemo Python package (v0.35.0 or later; https://pypi.python.org/pypi/planemo/) 6. Furthermore, the user must have installed any packages that are dependencies for the R/Bioconductor tool being integrated.

Tool components and structure
An R/Bioconductor Galaxy tool is defined by four major components. The first component is a Tool definition file (Tool wrapper) in XML format, which provides the interface between Galaxy and the R/Bioconductor tool being integrated. The second component is a Custom R file, which calls the R/Bioconductor tool(s) to perform a particular analysis. The third component is a Tool dependency file, which tells Galaxy where to find the required tool dependencies. The fourth component is the Test data directory, which includes both input and output files that will be used to test the Custom R file. These four components should be organized using the following directory structure:
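As a quick sanity check before integration, the presence of the four components can be verified with a few lines of R. This is a minimal sketch only: the directory and file names (my_r_tool, my_r_tool.xml, my_r_tool.R, tool_dependencies.xml, test-data) are hypothetical placeholders, and the files are created in a temporary directory purely so that the example runs on its own.

```r
## Hypothetical layout; only the four component roles come from the text.
tool_dir <- file.path(tempdir(), "my_r_tool")
test_data_dir <- file.path(tool_dir, "test-data")
dir.create(test_data_dir, recursive = TRUE, showWarnings = FALSE)

components <- c(
  tool_definition = "my_r_tool.xml",         # Tool definition file
  custom_r_file   = "my_r_tool.R",           # Custom R file
  tool_dependency = "tool_dependencies.xml"  # Tool dependency file
)

## Create empty placeholder files so the check below has something to find.
invisible(file.create(file.path(tool_dir, components)))

## Report any component that is absent from the tool directory.
missing_components <- components[!file.exists(file.path(tool_dir, components))]
has_test_data <- dir.exists(test_data_dir)

if (length(missing_components) == 0 && has_test_data) {
  cat("All four tool components are present.\n")
} else {
  cat("Missing:", paste(names(missing_components), collapse = ", "), "\n")
}
```

Pointing tool_dir at a real tool directory and dropping the file.create() call turns the same check into a pre-integration test.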

Utilize descriptive sections in Tool definition file
Specifically, utilize the test section to identify which input parameters to use to test the R command and what output to expect as a result. Also, utilize the help section to provide text to describe the R/Bioconductor tool and any other useful information for the user.

Use CDATA to represent character data
In markup languages, such as XML, a CDATA section represents content that should be interpreted purely as character data. Using CDATA is highly recommended, for example, for text in the help and command sections of the Tool definition file.

Choose output formats recognized by Galaxy
Output generated by the Custom R file should be saved in a format that is present in the list of Galaxy-recognized datatypes. By using one of these datatypes, the user can view the file in Galaxy and perform manipulations on the data using other Galaxy tools. The list of available Galaxy datatypes can be seen in the Admin panel under "Datatypes Registry". Galaxy now also saves files as Rdata files, and the feature to visualize these files is under development.
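As an illustration, tab-separated text corresponds to Galaxy's tabular datatype and can be written with base R alone. The sketch below is a hedged example: the k-mer counts and the output file name are invented for demonstration and do not come from the Kmer_enumerate tool.

```r
## Hypothetical k-mer counts standing in for real tool output.
kmer_counts <- data.frame(
  kmer  = c("AAA", "AAC", "ACG"),
  count = c(12L, 7L, 3L),
  stringsAsFactors = FALSE
)

## Write tab-separated text without quotes or row names, so the file
## reads as a plain table in Galaxy and in downstream tools.
output_path <- file.path(tempdir(), "kmer_counts.tabular")
write.table(kmer_counts, file = output_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

## Read the file back to confirm that the table survives a round trip.
round_trip <- read.delim(output_path, stringsAsFactors = FALSE)
```

In a real Custom R file, output_path would be the output path passed in by Galaxy rather than a temporary file.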

Pass command-line arguments with flags
In the Custom R file examples presented here, the R package getopt is used to add command-line arguments with flags. It is good practice to pass arguments using flags (this is currently required by bioc_tool_init) because parameters are explicitly defined and not inferred.

Implement error handling
Handling error messages upon execution of an R script is particularly important. If an error message or any message generated by an R script is printed to stdout instead of stderr, Galaxy interprets this as the tool correctly producing an output file and does not know whether or not the tool failed. By specifically telling R to output error messages to stderr, Galaxy is able to distinguish between a failed job and a successful job. Include the following lines at the beginning of a Custom R file to send R errors to stderr and to handle UTF8 errors:

    ## Redirect R error handling to stderr.
    options(show.error.messages = FALSE,
            error = function() {
              cat(geterrmessage(), file = stderr())
              q("no", 1, FALSE)
            })

    ## Avoid crashing Galaxy with a UTF8 error on German LC settings.
    loc <- Sys.setlocale("LC_MESSAGES", "en_US.UTF-8")

Avoid flooding stdout
It is recommended to suppress messages to stdout when loading R/Bioconductor packages. To do this, include the following lines in the Custom R file:

    ## Load required libraries without seeing messages
    suppressPackageStartupMessages({
      library("getopt")
      library("affy")
    })

Allow verbose output for debugging
Toggle verbose outputs via the Custom R file and getopt package to allow for easier debugging. Intermittent status messages can also be written while the R script is processing large datasets. This allows the Custom R file to be usable outside of Galaxy as well. The example below includes a statement to print the defined parameters, which will be viewable in the history panel upon execution of the tool.

    ## Invoke by executing "Rscript my_r_tool_again.R --verbose TRUE"
    option_specifications <- matrix(
      c("verbose", "v", 2, "logical"),
      byrow = TRUE, ncol = 4)
    options <- getopt(option_specifications)

    ## Toggle verbose option; isTRUE() also covers the case
    ## where --verbose is not supplied and options$verbose is NULL.
    if (isTRUE(options$verbose)) {
      cat("Print something useful to stdout to show how script is running in Galaxy.\n")
    }

The help section should be used to describe the R/Bioconductor tool and will appear at the bottom of the Galaxy tool form. Finally, appropriate references for the tool can be provided using the citations section. References can be cited, for example, using a DOI or a BibTeX entry. An example Tool definition file for an R/Bioconductor tool that enumerates k-mers in a fastq file is available as Supplementary File 1. This tool will subsequently be referred to as "Kmer_enumerate" and will be referenced throughout the remaining sections of this guide.

Custom R file. The Custom R file establishes the R environment and informs Galaxy what R command(s) to execute. The first section of this file contains a header of information that handles error messages, loads required R libraries, and parses options. These requirements are needed for every R/Bioconductor tool being integrated; however, the list of imported R libraries will be specific to each tool. The next section defines the list of parameters to pass to the R command, including input and output parameters. Each parameter, if using the getopt command-line parsing library, requires a unique name, a unique single-letter designation, a flag indicating whether the parameter takes an argument (0=no argument; 1=required; 2=optional), and the parameter type (e.g. character, integer, double). Optionally, variable names and values can be printed to standard output (stdout), which can be viewed in Galaxy when the tool executes. While not required, these printed statements can assist in debugging and inform whether the R/Bioconductor tool was executed correctly. The final section contains the R command(s) needed to execute the R/Bioconductor tool. A Custom R file for the Kmer_enumerate tool is available as Supplementary File 2. This tool uses the R/Bioconductor package seqTools 10 to read in a fastq file of DNA sequences, count the number of k-mers in the sequences where the value k is supplied by the user, and output the k-mers and their counts.
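The parameter section described above can be sketched as a getopt specification matrix. This is a minimal, hedged example: the parameter names mirror the --input1 and --input2 arguments mentioned for Kmer_enumerate, but the short flags, the simulated command line, and the requireNamespace() guard (which skips parsing when getopt is not installed) are assumptions made for illustration and are not taken from Supplementary File 2.

```r
## Each row: long name, single-letter flag, argument mask
## (0 = no argument, 1 = required, 2 = optional), data type.
option_specification <- matrix(c(
  "input1",  "i", 1, "character",  # hypothetical: path to the fastq file
  "input2",  "k", 1, "integer",    # hypothetical: k-mer size
  "verbose", "v", 2, "logical"
), byrow = TRUE, ncol = 4)

## Simulated command line. In a real Custom R file, getopt() reads
## commandArgs() when no 'opt' argument is supplied.
example_args <- c("--input1", "test_input.fq.gz", "--input2", "3")

parsed <- NULL
if (requireNamespace("getopt", quietly = TRUE)) {
  parsed <- getopt::getopt(option_specification, opt = example_args)
  ## Print the parsed parameters to stdout for debugging.
  cat("input1:", parsed$input1, "\n")
  cat("input2:", parsed$input2, "\n")
}
```

Keeping the specification in one matrix makes the mapping between the flags in the Tool definition file and the variables in the R script easy to audit.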
Unlike running standalone R scripts in the command line or using a graphical interface, it is not necessary to define the working directory (e.g. using setwd()) in the Custom R file. By default, Galaxy executes the R script in the same directory where the files are located. Similarly, Galaxy writes output files to the same directory, which enables the results to be displayed in the Galaxy history panel. Before attempting to integrate an R/Bioconductor tool, it is strongly recommended to test the Custom R file as a standalone script. For example, the Kmer_enumerate Custom R file can be executed in the command line using the following (test input and output files are available as Supplementary File 3 and Supplementary File 4, respectively):

Test data. The Test data directory includes data file(s) intended as input to test the R script and any expected output data file(s). During testing, Galaxy runs the R/Bioconductor tool with the input files in the test data directory and compares the output with the output file(s) in the same directory to ensure that the tool is producing expected results. As an added benefit, including testing data for the tool provides an example for other users of the data formats needed to run the tool.
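A comparable check can be run locally before handing the tool to Galaxy. The sketch below is an approximation for illustration only: the helper outputs_match() and the file names are hypothetical, and the line-by-line comparison is not a description of how Galaxy itself compares files.

```r
## Return TRUE when two text files contain exactly the same lines.
outputs_match <- function(actual_path, expected_path) {
  identical(readLines(actual_path), readLines(expected_path))
}

## Hypothetical expected and actual output files in a temporary directory.
expected_path <- file.path(tempdir(), "expected_output.txt")
actual_path   <- file.path(tempdir(), "actual_output.txt")

writeLines(c("kmer\tcount", "AAA\t12"), expected_path)
writeLines(c("kmer\tcount", "AAA\t12"), actual_path)
same_result <- outputs_match(actual_path, expected_path)

## Change one count to show that a mismatch is detected.
writeLines(c("kmer\tcount", "AAA\t13"), actual_path)
changed_result <- outputs_match(actual_path, expected_path)

cat("Identical output detected:", same_result, "\n")
cat("Changed output detected as identical:", changed_result, "\n")
```

A byte-for-byte comparison of this kind is also why outputs that embed a creation time make poor test data.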

Avoid writing R code in <configfiles>
It is recommended to keep the Custom R file and the Tool definition file separate and avoid placing R code in a configfiles tag within the Tool definition file. If the R code is embedded inside the Tool definition file, it cannot be used outside of Galaxy due to syntax differences and it cannot be easily tested. The following example should be avoided:

    # In Tool definition file, after the <command> tag
    <configfiles>
      <configfile name="Extract expression"><![CDATA[
        # Read in data
        inputfile <- as.character(options$input)
        data <- ReadAffy(filenames = inputfile)
        # More R code...
      ]]></configfile>
    </configfiles>
For tools that output plots, generating test data becomes an issue. To test whether R figures and plots are generated correctly, they should be saved as PNG files instead of PDF. Saving plots and figures as PDF files is a common practice in R/Bioconductor packages, but when PDF files are generated, they are timestamped. Galaxy will not consider two PDFs identical, even if they display the same image, if they were generated at different times and thus have different timestamps.

Tool integration
The following steps outline how to integrate the new R/Bioconductor tool into Galaxy after the tool files have been generated:

Step 1: Assemble the Tool definition file, Custom R file, Tool dependency file, and Test data directory with test data files in a single directory. Update the Tool definition file to provide the full path where appropriate. Alternatively, if the tool directory is saved in the $GALAXY_ROOT/tools/ directory, a relative path is sufficient.

Step 2: If tool_conf.xml does not already exist, copy the sample Tool configuration file tool_conf.xml.sample and save it as tool_conf.xml.

    cp $GALAXY_ROOT/config/tool_conf.xml.sample \
       $GALAXY_ROOT/config/tool_conf.xml

Step 3: Modify tool_conf.xml by adding a new section under which the integrated tool will exist. The value given to "name" in the Tool configuration file will appear in the tool panel, and the value given to "name" in the Tool definition file will appear under this new section. Provide the full path to the Tool definition file if the tool directory is not in $GALAXY_ROOT/tools/. Otherwise, the relative path is sufficient.

Tool testing and execution
Including test cases for newly integrated tools, while not strictly required, is highly recommended because it enables easier debugging and ensures that the tool is working as expected. To test, for example, the Kmer_enumerate tool, first upload the testing input file (test_input.fq.gz) to Galaxy. Choose the Kmer_enumerate tool from the tool panel, update the input file and k-mer parameters, and execute the tool. In this example, when Kmer_enumerate is executed, test_input.fq.gz is passed by the --input1 argument and the k-mer value is passed by the --input2 argument in the Tool definition file to the Custom R file. The Custom R file executes and sends the results back to the Tool definition file. The output of the tool is then available for viewing in the Galaxy history panel. Since Galaxy interprets any output written to standard error (stderr) as a failed job, it is important that developers ensure that any output (e.g. files, messages) generated by the R script is sent to stdout and not stderr.
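The distinction between the two streams can be checked directly in R, where cat() writes to stdout by default and message() writes to stderr. The following is a minimal sketch; the helper names report_status() and report_problem() and the message texts are hypothetical.

```r
## Routine progress information: goes to stdout.
report_status <- function(text) {
  cat(text, "\n", sep = "")
}

## Genuine problems: message() writes to stderr.
report_problem <- function(text) {
  message(text)
}

## Capture each stream separately to confirm where the text ends up.
stdout_lines <- capture.output(report_status("Counting k-mers"))
stderr_lines <- capture.output(report_problem("Input file missing"),
                               type = "message")

cat("Captured from stdout:", stdout_lines, "\n")
cat("Captured from stderr:", stderr_lines, "\n")
```

Routing status text through a single helper such as report_status() makes it harder to write progress messages to stderr by accident.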
When the developer is satisfied that the integrated tool executes correctly, the tool is ready to be used for data analysis. At this point, the newly integrated R/Bioconductor tool can be published in the Galaxy Tool Shed, so that it will be available to the Galaxy community. Detailed instructions for how to submit tools for publishing in the Galaxy Tool Shed can be found online at http://planemo.readthedocs.io/en/latest/publishing.html.

Use cases
The above guide outlines a straightforward approach for integrating a relatively simple R/Bioconductor tool into Galaxy. However, generating the Tool definition XML file and ensuring that all required dependencies are available to Galaxy remain difficult tasks. To simplify R/Bioconductor tool integration and address these remaining challenges, we have added a new command to Planemo v0.34.1, a suite of command-line utilities to assist in building and publishing Galaxy tools. We also take advantage of the package manager Conda v4.2.4 and, specifically, the Bioconda channel 11 of Conda, which distributes bioinformatics-related software. The new Planemo command, bioc_tool_init, creates a Bioconda recipe of all dependencies for the given R/Bioconductor tool and writes the path to this recipe to the Tool definition file. This command eliminates the need for recursively parsing the R/Bioconductor dependency tree to create the Tool dependency file. Further, this approach has the added benefit of creating an artifact that, while created for use with a Galaxy tool, is potentially useful outside of Galaxy to anyone using Bioconda. Additional arguments to the bioc_tool_init command specifically address R/Bioconductor tool integration and are described in the two use cases below.
The bioc_tool_init command functions by invoking another new Planemo command, bioc_conda_recipe_init. In this command, Bioconda uses BioaRchive v2.35.0, a Bioconductor package version archive, to retrieve the correct package versions if they exist. The bioconductor_skeleton.py script has been modified to not only find missing R/Bioconductor package dependencies, but also create them in the local Bioconda repository specified by the user. BioaRchive improves reproducibility of the Bioconda recipes of different versions of the same Bioconductor package.
Here, we describe how to generate a Tool definition file using bioc_tool_init for a Custom R script that uses the R/Bioconductor package affy v1.52.0 12. We assume that the user has installed Planemo (v0.35.0 or newer) along with the requirements for the software and has git configured with the appropriate ssh keys (https://help.github.com/articles/generating-an-ssh-key/). We first describe the simplest integration scheme (Case 1) requiring only one parameter, followed by more complex integration schemes (Case 2) with multiple parameters.

Case 1: Generating a Tool definition file and Bioconda recipe with a single parameter
The following command is the simplest way to generate a Tool definition file and Bioconda recipe for integrating a Custom R file in Galaxy (Supplementary File 6-Supplementary File 9). This tool, subsequently referred to as "Extract_expression", implements the affy package to extract probe expression levels from an Affymetrix CEL file generated by a microarray experiment.
Only the --command option is required here because all of the necessary information for building and running the tool is present in the Rscript call. The full path to the Custom R file should be given so that the Tool definition file can correctly locate it. In addition, extensions for all --input and --output files are also required, as they are needed to populate the format parameter in the inputs and outputs tags in the Tool definition file. The key to this usage of the bioc_tool_init command is that the R command given to --command must execute successfully on the command line. The example input 13 and output files used here are available as Supplementary File and Supplementary File 11, respectively.

In a similarly simple example, each parameter in the Rscript command can be given to bioc_tool_init using the following arguments:

Using this command, the tool dependencies and requirements in my_affy_tool.R are automatically written to the Tool definition file (Supplementary File 12). As in Case 1, appropriate input and output data formats are inferred from the --input and --output arguments given to --command. It is important to note that the Tool definition file and the Bioconda recipe generated by the planemo bioc_tool_init command are meant to be working, usable code for tool integration. However, the files are not 100% complete in terms of following best practices for Galaxy tool development, and may require additional work to reach the standards to which Galaxy tools published to the Tool Shed are held. Galaxy R/Bioconductor tool developers are strongly encouraged to meet best practice standards for any tool.

Discussion
Integrating R/Bioconductor tools into Galaxy can be challenging for both novice and advanced tool developers, but it is an important part of increasing the availability and reproducibility of research tools for the scientific community. We provide here a complete guide for R/Bioconductor tool integration that includes: (1) a description of the components needed to integrate the tool, (2) step-by-step instructions for incorporating the tool components into Galaxy, (3) examples of how to use the new bioc_tool_init Planemo command for easier tool integration, and (4) best practices for R/Bioconductor tool integration. A more detailed guide for R/Bioconductor tool integration into Galaxy is available on GitHub at https://github.com/nturaga/bioc-galaxy-integration/blob/master/README.md. By providing a way to semi-automate the integration process, we hope that R/Bioconductor tool developers can focus more on developing new and essential tools rather than on how to integrate them into Galaxy.
A key feature of the simplified tool integration method described in this work is the addition of the bioc_tool_init command to Planemo. This new capability specifically improves tool integration for developers in two ways. First, the bioc_tool_init command generates nearly complete Tool dependency files and Bioconda recipes by directly parsing the Custom R script being integrated, eliminating the need for developers to manually update the correct tool names and versions in all the tool files. Second, the bioc_tool_init command alleviates dependency management issues by recursively identifying and installing all required tool dependencies using Bioconda. This ensures that tool dependencies are compatible with and accessible across different platforms and eliminates the need for developers to manually install all required dependencies. We hope that these improvements will encourage more R/Bioconductor tool developers to share and publish their tools on Galaxy.
Future work for improving ease of R/Bioconductor tool integration into Galaxy includes improving the bioc_tool_init command to automate more tasks. For example, we are currently working on functionality that automatically generates example test cases to include in the Tool definition XML file. We are also working on extending the bioc_tool_init command to handle integration of multiple R/Bioconductor functions by passing a formatted text file to bioc_tool_init. We are also working on a Planemo command that automatically submits a wrapped R/Bioconductor tool to the public Tool Shed. Finally, development of a Planemo command that can automatically wrap an entire R/Bioconductor package based on a published vignette would be ideal for quickly integrating and publishing Galaxy-wrapped tools. These and other improvements are currently undergoing development.

Table 1. Description of best practices for integrating R/Bioconductor tools into Galaxy. Described in this table are eight key "best practices" for developing R/Bioconductor tools that will be integrated into Galaxy.
The Tool dependency XML file informs Galaxy where to find the required tool dependencies and should explicitly reference each of the requirements listed in the Tool definition file. It is important to differentiate between the requirements tag in the Tool definition file and the Tool dependency file: the Tool definition file identifies what dependencies are needed, and the Tool dependency file identifies where to get the dependencies. For example, many of the required tools, R/Bioconductor or otherwise, are available in the Galaxy Tool Shed. In the Tool dependency file, each requirement is listed under its own repository tag with its "name" and "owner" parameters as they appear in the Tool Shed. Available Tool Shed tools can be viewed at https://toolshed.g2.bx.psu.edu/. A Tool dependency file for Kmer_enumerate is supplied as Supplementary File 5.
In both examples, the bioc_tool_init command first clones the Bioconda repository in the home directory and creates a new recipe for each dependency in $HOME/bioconda-recipes/recipes/ if it does not exist. This allows the user to have a local copy of the recipe and all of the Bioconda package dependencies for that recipe. The bioc_tool_init command then creates a new Tool definition file, my_affy_tool.xml, in the current directory with the newly generated Bioconda recipe as a requirement. The bioc_tool_init command is used to create, not update, the dependency requirements and provide a suitable blueprint for the Tool definition file.

Semi-automated creation of the Tool definition XML file enables users to quickly generate usable code for R/Bioconductor tool integration. For example, the --name option sets the name that will appear in the Galaxy tool panel (defaults to the name of the Tool definition file) and should be a brief statement of what the tool does. The --description option provides additional information about the tool's function and appears immediately after the tool name in the tool panel. The --help_text option populates a field in the Galaxy tool form that provides additional information about what the tool does. Finally, developers should utilize the --doi option to include citation information for the tool in the tool form. An exhaustive list of available arguments can be found by using planemo bioc_tool_init --help. The following is an example command to generate the Tool definition file and Bioconda recipe for the Extract_expression tool: