Keywords
Interoperability, Bioconductor, R, Galaxy, Open Source, Bioinformatics
This article is included in the Galaxy gateway.
This article is included in the Bioconductor gateway.
This article is included in the RPackage gateway.
This article is included in the Bioinformatics Education and Training Collection.
The Bioconductor Project (https://www.bioconductor.org/) provides one of the largest suites of open source bioinformatics tools for analyzing genomics and biomedical data from diverse high-throughput assays, including DNA microarrays, flow cytometry, and deep sequencing1,2. Bioconductor has an active user community of researchers from across a spectrum of biomedical professions, who routinely analyze complex genomics datasets. Bioconductor tools and packages are primarily based on R (https://www.r-project.org/), a programming language and integrated environment developed for statistical computing and data visualization3. By submitting R packages to Bioconductor, developers can easily distribute powerful statistical analysis and visualization tools, while supporting efforts towards reproducible scientific research.
To further increase their usability and distribution, some R/Bioconductor tools have been integrated into Galaxy (https://galaxyproject.org/), an open source, web-based platform for performing, reproducing, and sharing data analyses, which utilize a variety of bioinformatics tools4,5. R/Bioconductor tool integration into Galaxy is a multi-step process that, although straightforward, poses unique challenges for both novice and advanced tool developers. The first major challenge is identifying and installing all of the dependencies needed for an R/Bioconductor tool. Each dependency needs to be available to Galaxy through its dependency management system, called the Tool Shed (https://toolshed.g2.bx.psu.edu/). If a dependency is not available in the Tool Shed, then it must be installed manually, which can often be difficult. The second major challenge is generating the required files/code for tool integration, which is a time-consuming process. Ensuring that the correct files are generated with correct syntax can be a frustrating task and is often the biggest hurdle for less-experienced developers. Given these issues, tool integration remains a daunting task for some developers, especially for those less familiar with command-line processes. Ideally, the R/Bioconductor tool integration process would be easy and intuitive for both novice and advanced developers.
Recognizing the need to improve the R/Bioconductor tool integration process, we outline here a guide and best practices (Table 1) for R/Bioconductor tool integration into Galaxy. Furthermore, we highlight the addition of new functionality to Planemo (http://planemo.readthedocs.io)6, a suite of command-line utilities that assists in building and publishing Galaxy tools. To simplify generation of required tool integration files, we introduce a new Planemo command that creates template files based on an R/Bioconductor tool developed by the user. This new command also addresses dependency management issues by leveraging Bioconda (https://bioconda.github.io), a channel of the Conda package manager, which provides bioinformatics software7. Bioconda locates specific versioned R/Bioconductor packages via BioaRchive (https://bioarchive.galaxyproject.org/)8 and CRAN (https://cran.r-project.org/)9 and includes these as requirements for using the R/Bioconductor tool. These new Planemo functionalities simplify R/Bioconductor tool integration and will enable developers to focus more attention on R/Bioconductor tool development and less on tool integration into Galaxy.
Table 1. Eight key “best practices” for developing R/Bioconductor tools that will be integrated into Galaxy.
Here, we present a complete guide for developers to integrate an R/Bioconductor tool into Galaxy. We first describe the necessary components for wrapping an R/Bioconductor tool in Galaxy. Next, we outline the steps required for integrating the tool components into Galaxy, testing the tool, and executing the tool. Finally, we describe a simplified integration process using the new Planemo command, bioc_tool_init. The instructions provided in this guide assume that the developer has system access to Galaxy source code files (e.g. using a local instance or cloud-based instance of Galaxy), has an active Internet connection, and has installed the Planemo Python package (v0.35.0 or later; https://pypi.python.org/pypi/planemo/)6. Furthermore, the user must have installed any packages that are dependencies for the R/Bioconductor tool being integrated.
An R/Bioconductor Galaxy tool is defined by four major components. The first component is a Tool definition file (Tool wrapper) in XML format, which provides the interface between Galaxy and the R/Bioconductor tool being integrated. The second component is a Custom R file, which calls the R/Bioconductor tool(s) to perform a particular analysis. The third component is a Tool dependency file, which tells Galaxy where to find the required tool dependencies. The fourth component is the Test data directory, which includes both input and output files that will be used to test the Custom R file. These four components should be organized using the following directory structure:
example_tool/
├── my_tool.R # Custom R file
├── my_tool.xml # Tool definition file
├── tool_dependencies.xml # Tool dependency file
├── test_data/ # Test data directory
│ ├── test_input.fq # Example fastq input file
│ ├── test_output.txt # Example output text file
│ └── ... # Additional input or output files
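The layout above can be scaffolded from the shell before any content is written. The following is a minimal sketch, assuming the placeholder file names shown in the tree; the TOOL_DIR variable is a hypothetical convenience and is not part of Galaxy or Planemo.

```shell
#!/bin/sh
# Scaffold the directory layout shown above with empty placeholder files.
# TOOL_DIR is a hypothetical variable; it defaults to the article's example name.
tool_dir="${TOOL_DIR:-example_tool}"

mkdir -p "$tool_dir/test_data"
touch "$tool_dir/my_tool.R" \
      "$tool_dir/my_tool.xml" \
      "$tool_dir/tool_dependencies.xml"

# List what was created so the layout can be checked by eye.
find "$tool_dir" | sort
```

The empty files are then filled in as described in the component sections that follow.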
Tool definition file. The Tool definition file informs Galaxy how to handle parameters in the Custom R file. The value given to “name” in the file header appears in the Galaxy tool panel and should be set to a meaningful short description of what the tool does. The minimal structure of the Tool definition file contains seven key sections. The requirements section defines the tool dependencies needed to run the R script, and includes the version of R used to develop the tool. The command section defines the R command that is executed in Galaxy via the R interpreter. Importantly, input and output parameters are denoted as $input1, $input2, ... and $output1, $output2, ..., respectively, and the full path to the Custom R file is essential. The inputs section establishes how input parameters given to the Custom R file appear in Galaxy, while the outputs section defines the name and format of output files generated by the Custom R file. Each input and output parameter requires its own entry in the Tool definition file, and the values assigned to “name” should match those in the command section. The tests section defines the input parameters needed to test the R command and what output to expect as a result. This section is important for tool testing and debugging. The help section should be used to describe the R/Bioconductor tool and will appear at the bottom of the Galaxy tool form. Finally, appropriate references for the tool can be provided using the citations section. References can be cited, for example, using a DOI or a BibTeX entry. An example Tool definition file for an R/Bioconductor tool that enumerates k-mers in a fastq file is available as Supplementary File 1. This tool will subsequently be referred to as “Kmer_enumerate” and will be referenced throughout the remaining sections of this guide.
Custom R file. The Custom R file establishes the R environment and informs Galaxy what R command(s) to execute. The first section of this file contains a header of information that handles error messages, loads required R libraries, and parses options. These requirements are needed for every R/Bioconductor tool being integrated; however, the list of imported R libraries will be specific to each tool. The next section defines the list of parameters to pass to the R command, including input and output parameters. Each parameter, if using the getopt command line parsing library, requires a unique name, a unique single letter designation, a flag indicating whether the parameter takes an argument (0 = no argument; 1 = required argument; 2 = optional argument), and the parameter type (e.g. character, integer, double). Optionally, variable names and values can be printed to standard output (stdout), which can be viewed in Galaxy when the tool executes. While not required, these printed statements can assist in debugging and inform whether the R/Bioconductor tool was executed correctly. The final section contains the R command(s) needed to execute the R/Bioconductor tool. A Custom R file for the Kmer_enumerate tool is available as Supplementary File 2. This tool uses the R/Bioconductor package seqTools10 to read in a fastq file of DNA sequences, count the number of k-mers in the sequences where the value k is supplied by the user, and output the k-mers and their counts.
Unlike running standalone R scripts from the command line or using a graphical interface, it is not necessary to define the working directory (e.g. using setwd()) in the Custom R file. By default, Galaxy executes the R script in the same directory where the files are located. Similarly, Galaxy writes output files to the same directory, which enables the results to be displayed in the Galaxy history panel. Before attempting to integrate an R/Bioconductor tool, it is strongly recommended to test the Custom R file as a standalone script. For example, the Kmer_enumerate Custom R file can be executed from the command line using the following (test input and output files are available as Supplementary File 3 and Supplementary File 4, respectively):
cd /path/to/example_seqTools/
Rscript my_seqTools_tool.R \
    --input1 test_data/test_input.fq \
    --input2 2 \
    --output test_data/test_output.txt
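Once the standalone run succeeds, its result can be checked against the expected output stored with the test data. The sketch below assumes the script is rerun with --output pointing at a scratch file; same_file is a hypothetical helper, and a byte-for-byte comparison may be stricter than necessary for some output formats.

```shell
#!/bin/sh
# Hypothetical helper: report whether a freshly generated file is
# byte-identical to the expected output kept in the test data directory.
same_file() {
    if cmp -s "$1" "$2"; then
        echo "match: $1 == $2"
    else
        echo "differ: $1 != $2"
        return 1
    fi
}

# Intended use after a standalone run (the scratch path is an assumption):
#   Rscript my_seqTools_tool.R --input1 test_data/test_input.fq \
#       --input2 2 --output /tmp/kmer_out.txt
#   same_file /tmp/kmer_out.txt test_data/test_output.txt
```

Writing to a scratch file rather than to test_data/test_output.txt keeps the expected output from being overwritten by the run it is meant to verify.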
Tool dependency file. The Tool dependency XML file informs Galaxy where to find the required tool dependencies and should explicitly reference each of the requirements listed in the Tool definition file. It is important to differentiate between the requirements tag in the Tool definition file and the Tool dependency file: the Tool definition file identifies what dependencies are needed, and the Tool dependency file identifies where to get the dependencies. For example, many of the required tools - R/Bioconductor or otherwise - are available in the Galaxy Tool Shed. In the Tool dependency file, each requirement is listed under its own repository tag with its “name” and “owner” parameters as they appear in the Tool Shed. Available Tool Shed tools can be viewed at https://toolshed.g2.bx.psu.edu/. A Tool dependency file for Kmer_enumerate is supplied as Supplementary File 5.
Test data. The Test data directory includes data file(s) intended as input to test the R script and any expected output data file(s). During testing, Galaxy runs the R/Bioconductor tool with the input files in the test data directory and compares the output with the output file(s) in the same directory to ensure that the tool is producing expected results. As an added benefit, including testing data for the tool provides an example for other users of the data formats needed to run the tool.
For tools that output plots, generating test data becomes an issue. To test whether R figures and plots are generated correctly, they should be saved as PNG files instead of PDF. Saving plots and figures as PDF files is a common practice in R/Bioconductor packages, but when PDF files are generated, they are time-stamped. Galaxy will not consider two PDFs identical - even if they display the same image - if they were generated at different times and thus have different timestamps.
The following steps outline how to integrate the new R/Bioconductor tool into Galaxy after the tool files have been generated:
Step 1: Assemble the Tool definition file, Custom R file, Tool dependency file, and Test data directory with test data files in a single directory. Update the Tool definition file to provide the full path where appropriate. Alternatively, if the tool directory is saved in the $GALAXY_ROOT/tools/ directory, a relative path is sufficient.
Step 2: If tool_conf.xml does not already exist, copy the sample Tool configuration file tool_conf.xml.sample and save it as tool_conf.xml.
cp $GALAXY_ROOT/config/tool_conf.xml.sample \
    $GALAXY_ROOT/config/tool_conf.xml
Step 3: Modify tool_conf.xml by adding a new section under which the integrated tool will exist. The value given to “name” in the Tool configuration file will appear in the tool panel, and the value given to “name” in the Tool definition file will appear under this new section. Provide the full path to the Tool definition file if the tool directory is not in $GALAXY_ROOT/tools/. Otherwise, the relative path is sufficient.
<section id="testingtools" name="Testing Tools">
    <!-- Full path: -->
    <tool file="/path/to/example_seqTools/my_seqTools_tool.xml" />
    <!-- Relative path: -->
    <tool file="example_seqTools/my_seqTools_tool.xml" />
</section>
Step 4: Restart Galaxy to integrate the modified tool_conf.xml file.
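Step 2 can be made safe to repeat by copying the sample file only when tool_conf.xml is absent, so that an existing configuration is never overwritten. A minimal sketch follows; ensure_tool_conf is a hypothetical helper name, and the only assumption about Galaxy is the config/ location already used in Step 2.

```shell
#!/bin/sh
# Hypothetical helper: copy the sample tool configuration only when
# no tool_conf.xml exists yet. $1 is the Galaxy root directory.
ensure_tool_conf() {
    conf_dir="$1/config"
    if [ -f "$conf_dir/tool_conf.xml" ]; then
        echo "tool_conf.xml already present; leaving it untouched"
    else
        cp "$conf_dir/tool_conf.xml.sample" "$conf_dir/tool_conf.xml"
        echo "created $conf_dir/tool_conf.xml from the sample file"
    fi
}

# Intended use:
#   ensure_tool_conf "$GALAXY_ROOT"
```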
Including test cases for newly integrated tools - while not strictly required - is highly recommended because it enables easier debugging and ensures that the tool is working as expected. To test, for example, the Kmer_enumerate tool, first upload the testing input file (test_input.fq.gz) to Galaxy. Choose the Kmer_enumerate tool from the tool panel, update the input file and k-mer parameters, and execute the tool. In this example, when Kmer_enumerate is executed, test_input.fq.gz is passed by the --input1 argument and the k-mer value is passed by the --input2 argument in the Tool definition file to the Custom R file. The Custom R file executes and sends the results back to the Tool definition file. The output of the tool is then available for viewing in the Galaxy history panel. Since Galaxy interprets any output written to standard error (stderr) as a failed job, it is important that developers ensure that any output (e.g. files, messages) generated by the R script is sent to stdout and not stderr.
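One shell-level way to keep informational messages from reaching stderr is to merge the two streams when the script is invoked, for example by appending 2>&1 to the Rscript call. The sketch below demonstrates the redirection with a stand-in function rather than a real R script; noisy_step is hypothetical. Merging the streams also hides genuine error text, so the exit status of the command should still be checked.

```shell
#!/bin/sh
# Stand-in for a script that prints a result to stdout and a
# progress message to stderr (hypothetical; not a real tool).
noisy_step() {
    echo "result line"
    echo "progress message" >&2
}

# Without redirection, only the stdout line is captured.
stdout_only=$(noisy_step 2>/dev/null)

# With 2>&1, the stderr line is merged into stdout and captured too.
merged=$(noisy_step 2>&1)

printf 'stdout only:\n%s\n\nmerged:\n%s\n' "$stdout_only" "$merged"
```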
When the developer is satisfied that the integrated tool executes correctly, the tool is ready to be used for data analysis. At this point, the newly integrated R/Bioconductor tool can be published in the Galaxy Tool Shed, so that it will be available to the Galaxy community. Detailed instructions for how to submit tools for publishing in the Galaxy Tool Shed can be found online at http://planemo.readthedocs.io/en/latest/publishing.html.
The above guide outlines a straightforward approach for integrating a relatively simple R/Bioconductor tool into Galaxy. However, generating the Tool definition XML file and ensuring that all required dependencies are available to Galaxy remain difficult tasks. To simplify R/Bioconductor tool integration and address these remaining challenges, we have added a new command to Planemo v0.34.1, a suite of command-line utilities to assist in building and publishing Galaxy tools. We also take advantage of the package manager Conda v4.2.4 and, specifically, the Bioconda channel11 of Conda, which distributes bioinformatics-related software. The new Planemo command, bioc_tool_init, creates a Bioconda recipe of all dependencies for the given R/Bioconductor tool and writes the path to this recipe to the Tool definition file. This command eliminates the need for recursively parsing the R/Bioconductor dependency tree to create the Tool dependency file. Further, this approach has the added benefit of creating an artifact, which, while created to use with a Galaxy tool, is potentially useful outside of Galaxy by anyone using Bioconda. Additional arguments to the bioc_tool_init command specifically address R/Bioconductor tool integration and are described in the two use cases below.
The bioc_tool_init command functions by invoking another new Planemo command, bioc_conda_recipe_init. In this command, Bioconda uses BioaRchive v2.35.0, a Bioconductor package version archive, to retrieve the correct package versions if they exist. The bioconductor_skeleton.py script has been modified to not only find missing R/Bioconductor package dependencies, but also create them in the local Bioconda repository specified by the user. BioaRchive improves reproducibility of the Bioconda recipes of different versions of the same Bioconductor package.
Here, we describe how to generate a Tool definition file using bioc_tool_init for a Custom R script that uses the R/Bioconductor package affy v1.52.0 [12]. We assume that the user has installed Planemo (v0.35.0 or newer) along with the requirements for the software and has git configured with the appropriate SSH keys (https://help.github.com/articles/generating-an-ssh-key/). We first describe the simplest integration scheme (Case 1) requiring only one parameter, followed by more complex integration schemes (Case 2) with multiple parameters.
The following command is the simplest way to generate a Tool definition file and Bioconda recipe for integrating a Custom R file in Galaxy (Supplementary File 6–Supplementary File 9). This tool, subsequently referred to as “Extract_expression”, uses the affy package to extract probe expression levels from an Affymetrix CEL file generated by a microarray experiment.
planemo bioc_tool_init \
--command "Rscript /path/to/my_affy_tool.R \
--input test_input.CEL \
--output test_output.txt"
Only the --command option is required here because all of the necessary information for building and running the tool is present in the Rscript call. The full path to the Custom R file should be given so that the Tool definition file can correctly locate it. In addition, extensions for all --input and --output files are also required, as they are needed to populate the format parameter in the inputs and outputs tags in the Tool definition file. The key to this usage of the bioc_tool_init command is that the R command given to --command successfully executes in the command line. The example input13 and output files used here are available as Supplementary File 10 and Supplementary File 11, respectively. In a similarly simple example, each parameter in the Rscript command can be given to bioc_tool_init using the following arguments:
planemo bioc_tool_init \
    --rscript /path/to/my_affy_tool.R \
    --input test_input.CEL \
    --output test_output.txt
In both examples, the bioc_tool_init command first clones the Bioconda repository in the home directory and creates a new recipe for each dependency in $HOME/bioconda-recipes/recipes/ if it does not exist. This allows the user to have a local copy of the recipe and all of the Bioconda package dependencies for that recipe. The bioc_tool_init command then creates a new Tool definition file, my_affy_tool.xml, in the current directory with the newly generated Bioconda recipe as a requirement. The bioc_tool_init command is used to create, not update, the dependency requirements and provide a suitable blueprint for the Tool definition file. Semi-automated creation of the Tool definition XML file enables users to quickly generate usable code for R/Bioconductor tool integration.
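After the command finishes, it can be useful to confirm which requirements were written to the generated Tool definition file. The following sketch extracts the requirement elements with grep; list_requirements is a hypothetical helper, and it assumes each requirement element sits on a single line, which may not hold for every generated file.

```shell
#!/bin/sh
# Hypothetical helper: print every <requirement ...>...</requirement>
# element found in a Galaxy Tool definition file.
# Assumes each element is written on a single line.
list_requirements() {
    grep -o '<requirement [^>]*>[^<]*</requirement>' "$1"
}

# Intended use (names taken from the examples above):
#   list_requirements my_affy_tool.xml
#   ls "$HOME/bioconda-recipes/recipes/"
```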
Additional options are available for bioc_tool_init and are strongly recommended for generating a tool that follows Galaxy tool development best practices. For example, the --name option sets the name that will appear in the Galaxy tool panel (defaults to the name of the Tool definition file) and should be a brief statement of what the tool does. The --description option provides additional information about the tool’s function and appears immediately after the tool name in the tool panel. The --help_text option populates a field in the Galaxy tool form that provides additional information about what the tool does. Finally, developers should utilize the --doi option to include citation information for the tool in the tool form. An exhaustive list of available arguments can be found by using planemo bioc_tool_init --help. The following is an example command to generate the Tool definition file and Bioconda recipe for the Extract_expression tool:
planemo bioc_tool_init \
--command "Rscript /path/to/my_affy_tool.R \
--input test_input.CEL \
--output test_output.txt" \
--name "Extract Expression" \
--description " values from CEL Agilent microarray data" \
--help_text "This tool reads in Agilent CEL data and \
outputs probe expression values." \
--doi "10.1093/bioinformatics/btg405" \
--tool my_affy_tool_Case2.xml
Using this command, the tool dependencies and requirements in my_affy_tool.R are automatically written to the Tool definition file (Supplementary File 12). As in Case 1, appropriate input and output data formats are inferred from the --input and --output arguments given to --command. It is important to note that the Tool definition file and the Bioconda recipe generated by the planemo bioc_tool_init command are meant to be working, usable code for tool integration. However, the files are not 100% complete in terms of following best practices for Galaxy tool development, and may require additional work to reach the standards for which Galaxy tools are published to the Tool Shed. Galaxy R/Bioconductor tool developers are strongly encouraged to meet best practice standards for any tool.
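Planemo's lint subcommand offers one way to find where a generated file falls short of these standards. The sketch below wraps the call so that it is skipped cleanly when Planemo or the file is unavailable; lint_tool is a hypothetical helper, and the file name is the one passed to --tool in the command above.

```shell
#!/bin/sh
# Hypothetical helper: run `planemo lint` on a Tool definition file,
# or skip with a message if Planemo or the file cannot be found.
lint_tool() {
    if [ ! -f "$1" ]; then
        echo "skipping: $1 not found"
    elif ! command -v planemo >/dev/null 2>&1; then
        echo "skipping: planemo is not on PATH"
    else
        planemo lint "$1"
    fi
}

lint_tool my_affy_tool_Case2.xml
```

Warnings reported by the linter can then be addressed by hand in the generated XML before the tool is submitted to the Tool Shed.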
Integrating R/Bioconductor tools into Galaxy can be challenging for both novice and advanced tool developers, but it is an important part of increasing the availability and reproducibility of research tools for the scientific community. We provide here a complete guide for R/Bioconductor tool integration that includes: (1) a description of the components needed to integrate the tool, (2) step-by-step instructions for incorporating the tool components into Galaxy, (3) examples of how to use the new bioc_tool_init Planemo command for easier tool integration, and (4) best practices for R/Bioconductor tool integration. A more detailed guide for R/Bioconductor tool integration into Galaxy is available on GitHub at https://github.com/nturaga/bioc-galaxy-integration/blob/master/README.md. By providing a way to semi-automate the integration process, we hope that R/Bioconductor tool developers can focus more on developing new and essential tools rather than on how to integrate them into Galaxy.
A key feature of the simplified tool integration method described in this work is the addition of the bioc_tool_init command to Planemo. This new capability specifically improves tool integration for developers in two ways. First, the bioc_tool_init command generates nearly complete Tool dependency files and Bioconda recipes by directly parsing the Custom R script being integrated, eliminating the need for developers to manually update the correct tool names and versions in all the tool files. Second, the bioc_tool_init command alleviates dependency management issues by recursively identifying and installing all required tool dependencies using Bioconda. This ensures that tool dependencies are compatible with and accessible across different platforms and eliminates the need for developers to manually install all required dependencies. We hope that these improvements will encourage more R/Bioconductor tool developers to share and publish their tools on Galaxy.
Future work for improving ease of R/Bioconductor tool integration into Galaxy includes improving the bioc_tool_init command to automate more tasks. For example, we are currently working on functionality that automatically generates example test cases to include in the Tool definition XML file. We are also working on extending the bioc_tool_init command to handle integration of multiple R/Bioconductor functions by passing a formatted text file to bioc_tool_init. We are also working on a Planemo command that automatically submits a wrapped R/Bioconductor tool to the public Tool Shed. Finally, development of a Planemo command that can automatically wrap an entire R/Bioconductor package based on a published vignette would be ideal for quickly integrating and publishing Galaxy-wrapped tools. These and other improvements are currently under development.
Automated build available from: https://hub.docker.com/r/nitesh1989/bioc-galaxy-integration/
Latest source code: https://github.com/nturaga/bioc-galaxy-integration
Archived source code as at the time of publication: DOI, 10.5281/zenodo.166551 [14]
License: Academic Free License version 3.0
More information on tool building in Galaxy and additional best practices are available at http://planemo.readthedocs.io/en/latest/writing.html. Planemo documentation can be found at https://github.com/galaxyproject/planemo.
NT and MAF designed and implemented the features presented with advice from DB, JC, AN, and JT. JC implemented the Planemo tool on which this work is built. NT, MAF, and DB wrote the paper. Members of the Galaxy Team developed the Galaxy framework on which this work relies and provided advice on the project. All authors have agreed to the final content.
This project is funded in part by the National Institutes of Health [U41 HG006620].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We are grateful to the Galaxy Team (https://wiki.galaxyproject.org/GalaxyTeam) and the developer community for the Galaxy Framework for their guidance on this project. We would also like to thank the developers of Bioconda for providing a great service to the community. We give special mention to Ryan Dale for developing the script to create Bioconda recipes for Bioconductor packages. We also thank the contributors of BioaRchive.
Supplementary File 1: Tool definition file for Kmer_enumerate. An example Tool definition file for the Kmer_enumerate tool, which enumerates k-mers in a fastq file. This file can be opened with any text editor.
Supplementary File 2: Custom R file for Kmer_enumerate. An example Custom R file for the Kmer_enumerate tool. This file can be opened with any text editor.
Supplementary File 3: Input for Kmer_enumerate. An example input fastq file for the Kmer_enumerate tool. This fastq formatted file has been compressed with gzip. The file must be uncompressed (using gunzip) before it can be opened with any text editor.
Supplementary File 4: Output for Kmer_enumerate. An example output text file for the Kmer_enumerate tool. This file can be opened with any text editor.
Supplementary File 5: Tool dependencies for Kmer_enumerate. An example Tool dependencies file for the Kmer_enumerate tool. This file can be opened with any text editor.
Supplementary File 6: Tool definition file for Extract_expression Case 1. An example Tool definition file for the Extract_expression tool created using Case 1. This file can be opened with any text editor.
Supplementary File 7: Yaml file for Extract_expression. An example Bioconda recipe yaml file for the Extract_expression tool created using Case 1. This file can be opened with any text editor.
Supplementary File 8: Bash file for Extract_expression. An example Bioconda recipe bash file for the Extract_expression tool created using Case 1. This file can be opened with any text editor.
Supplementary File 9: Custom R file for Extract_expression. An example Custom R file for the Extract_expression tool created using Case 1. This file can be opened with any text editor.
Supplementary File 10: Input CEL file for Extract_expression. An example input CEL file for the Extract_expression tool13. This file is a binary data file created by Affymetrix DNA microarray image analysis software, so it cannot be viewed with a text editor. The contents of this CEL file can be accessed in R via the readAffy() function of the affy Bioconductor package12.
Supplementary File 11: Output for Extract_expression. An example output text file for the Extract_expression tool. This file can be opened with any text editor.
Supplementary File 12: Tool definition file for Extract_expression Case 2. An example Tool definition file for the Extract_expression tool created using Case 2. This file can be opened with any text editor.
Competing Interests: No competing interests were disclosed.