Using bio.tools to generate and annotate workbench tool descriptions

Workbench and workflow systems such as Galaxy, Taverna, Chipster, or Common Workflow Language (CWL)-based frameworks facilitate access to bioinformatics tools in a user-friendly, scalable and reproducible way. Still, the integration of tools in such environments remains a cumbersome, time-consuming and error-prone process. A major consequence is incomplete or outdated tool descriptions that often lack important information, including parameters and metadata such as publications or links to documentation. ToolDog (Tool DescriptiOn Generator) facilitates the integration of tools that have been registered in the ELIXIR tools registry (https://bio.tools) into workbench environments by generating tool description templates. ToolDog includes two modules. The first module analyses the source code of the bioinformatics software with language-specific plugins, and generates a skeleton for a Galaxy XML or CWL tool description. The second module is dedicated to the enrichment of the generated tool description, using metadata provided by bio.tools. This second module can also be used on its own to complete or correct existing tool descriptions with missing metadata.

This article is included in the gateway.

Introduction
Over the last few years, bioinformatics has played a major role in the field of biology, raising the issue of best practices in software development for the members of the bioinformatics community [1-3]. These practices include facilitating the discovery, deployment, and usage of tools, and several helpful solutions are available.
Tool discovery is facilitated by various online catalogs and registries [4-6]. The ELIXIR Tools and Data Services Registry, bio.tools [7], describes bioinformatics software using extensive metadata descriptions, supported by the EDAM ontology [8].
For software deployment, distribution systems are available [9-13] that let users locally install the tools that they need in convenient, portable and reproducible ways. Workbench and workflow systems such as Galaxy [14,15], Taverna [16] or Chipster [17] allow the execution and composition of bioinformatics tools in integrated environments which aim at improved usability, interoperability and reproducibility. Finally, the Common Workflow Language (CWL) [18] is a recent project that defines a standardized and portable tool and workflow description format, usable across different platforms.
All of the above systems rely on components that provide the necessary information to describe, install, or run a specific piece of software. Gathering this information and formatting it into tractable tool descriptions is often a complex and time-consuming task for developers, because it requires a deep knowledge of both the tool itself and the description format. A significant part of the metadata stored in the descriptions is, however, common to registries and workbench environment systems [19], and strategies relying on a mapping between these different description formats can help avoid redundancy and mislabeling of tools (Figure 1). The ReGaTE utility [20] illustrates this by using tool descriptions from Galaxy to publish available services on bio.tools. Another application is to facilitate workbench environment integration by reusing tool descriptions from registries. Here we present "ToolDog" (Tool DescriptiOn Generator), an application that enables workbench integration for tools registered in the bio.tools registry.

Tool descriptions
Bioinformatics tools are described in various formats and levels of detail, befitting different systems and use-cases. A bio.tools entry provides tool descriptions for tool end-users, primarily for search and discovery purposes. The metadata provides a basic description including the tool type, what task it performs, the main input and output data, who created it, where it is available, and its license. This description, based on the BiotoolsSchema model, can be accessed through the bio.tools API and retrieved in JSON format. Conversely, Galaxy and CWL tool descriptions must support tool discovery, execution, and integration into homogeneous environments. This requires an extensive description of their command line syntax (or other type of API). Galaxy tool descriptions are written in XML, and the corresponding XSD is available. CWL tool descriptions are written in the YAML-based SALAD format.
All three of these tool description formats allow EDAM terms to be specified. In bio.tools this can be done directly. CWL supports these annotations through the addition of bioschemas mark-up, and Galaxy supports EDAM through specific tags mapping to its internal typing system [21]. The EDAM ontology helps with the description of tools by providing a common vocabulary. It includes topics, which specify the domains of bioinformatics that the tool serves; operations, which describe what the tool does; and data and formats, which specify the type and format of the inputs and outputs.
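To make the shape of a bio.tools entry more concrete, the following minimal sketch pulls the EDAM topics, EDAM operations and publication DOIs out of a JSON entry. It is illustrative only: the endpoint pattern in `BIOTOOLS_API`, the field names (`topic`, `function`, `operation`, `publication`) and the hand-written `SAMPLE` entry are assumptions modelled on the schema described above, not details taken from this article, and no network call is made.

```python
import json
from urllib.parse import quote

# Assumed endpoint pattern for the bio.tools REST API (not stated in the article).
BIOTOOLS_API = "https://bio.tools/api/tool/{tool_id}?format=json"


def entry_url(tool_id: str) -> str:
    """Build the (assumed) API URL for a bio.tools identifier."""
    return BIOTOOLS_API.format(tool_id=quote(tool_id, safe=""))


def summarize_entry(entry: dict) -> dict:
    """Collect EDAM topics, EDAM operations and DOIs from one entry.

    The field names used here are assumptions about the JSON layout.
    """
    topics = [t["uri"] for t in entry.get("topic", [])]
    operations = [
        op["uri"]
        for function in entry.get("function", [])
        for op in function.get("operation", [])
    ]
    dois = [p["doi"] for p in entry.get("publication", []) if p.get("doi")]
    return {
        "name": entry.get("name"),
        "topics": topics,
        "operations": operations,
        "dois": dois,
    }


# Hypothetical, hand-written entry used instead of a live API response.
SAMPLE = json.loads("""
{
  "name": "ExampleTool",
  "topic": [{"term": "Genomics", "uri": "http://edamontology.org/topic_0622"}],
  "function": [
    {"operation": [{"term": "Sequence analysis",
                    "uri": "http://edamontology.org/operation_2403"}]}
  ],
  "publication": [{"doi": "10.1000/example"}, {"pmid": "123"}]
}
""")

if __name__ == "__main__":
    print(entry_url("signalp"))
    print(summarize_entry(SAMPLE))
```

Keeping the summary as a plain dictionary makes it straightforward to reuse the same structure when mapping the metadata onto a Galaxy or CWL description.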

Completeness of Workbench tool description
Tool descriptions for workbench systems are expensive to create and maintain, because they require exhaustive knowledge of both the described tool and the syntax used for the description [19]. Consequently, tool descriptions are sometimes incomplete or out of date. For instance, in the case of Galaxy, the analysis of the main server and the server of the Institut Pasteur [22] shows that some tools are not adequately described (see Figure 2). Specifically, although most of the tools have a help section and a description, important elements such as citation information are often missing. The evolution of the Galaxy framework itself also generates a need for maintenance, through changes in the tool description format. With the recent addition of EDAM annotation tags to the format, tools had to be updated to support this new feature. The users of such graphical workbench platforms do not typically handle tool discovery and deployment tasks. Thus, detailed tool descriptions are fundamental, because they are the main source of information for the scientists who use them.
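A completeness check of this kind can be sketched in a few lines. The example below is not the analysis script used for Figure 2; it is a minimal illustration, using only the Python standard library, that counts how many Galaxy tool descriptions contain a non-empty description, help section, citation and EDAM annotation. The selection of elements in `CHECKED` and the two tiny tool descriptions in `EXAMPLE_TOOLS` are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative selection of elements to check (an assumption, not the
# exact list used for the analysis reported in the article).
CHECKED = {
    "description": "description",
    "help": "help",
    "citations": "citations/citation",
    "edam_topics": "edam_topics/edam_topic",
    "edam_operations": "edam_operations/edam_operation",
}


def coverage(tool_xmls):
    """Return, per checked element, the fraction of tools that provide it."""
    counts = dict.fromkeys(CHECKED, 0)
    for xml_text in tool_xmls:
        root = ET.fromstring(xml_text)
        for label, path in CHECKED.items():
            element = root.find(path)
            # Count the element only if it exists and carries some text.
            if element is not None and (element.text or "").strip():
                counts[label] += 1
    total = len(tool_xmls)
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Two hypothetical tool descriptions: one fairly complete, one minimal.
EXAMPLE_TOOLS = [
    '<tool id="a" name="A"><description>Does A</description>'
    '<help>Usage notes</help>'
    '<citations><citation type="doi">10.1000/a</citation></citations></tool>',
    '<tool id="b" name="B"><description>Does B</description></tool>',
]

if __name__ == "__main__":
    for label, fraction in coverage(EXAMPLE_TOOLS).items():
        print(f"{label}: {fraction:.0%}")
```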
Different approaches exist to help improve the quality of the corpus of tool descriptions. (1) Tooling facilitates the creation and validation of the tool descriptions, using Planemo [23] in the case of Galaxy.
(2) Community approaches such as the Intergalactic Utilities Commission design and promote best practices. ToolDog complements all of these approaches. It leverages the information available in bio.tools to simplify the integration of bioinformatics software into workbench environments.

Figure 1. The objective is to integrate the bio.tools registry with workbench environments in two ways: (1) "ReGaTE", a utility for en masse registration of services from Galaxy instances; (2) the "ToolDog" utility, to translate the description of any tool or service that is registered in bio.tools into the format required by the existing major workbench environments.

Methods
ToolDog is a command-line utility written in Python. It consists of two modules, which handle (1) the generation of a skeleton for the tool description, based on the analysis of the source code of the tool, and (2) the enrichment of the tool description, using the bio.tools metadata. The tool description generation pipeline (Figure 3) leverages bio.tools and includes both a module to generate a tool description using only the registry and a module to enrich an existing tool description with information from the registry.

Source code analysis
For a number of bioinformatics tools, a significant part of their description can be extracted from an analysis of the source code. The source code analysis module of ToolDog does this, currently only for Python-based tools that use the argparse library for parsing command-line arguments. This module uses the argparse2tool package to retrieve the list of parameters and generate Galaxy or CWL tool description skeletons. To generate such skeletons, ToolDog runs a Docker software container that downloads and installs the tool, analyzes its source code, generates the tool description and then retrieves it. This strategy avoids polluting the user's local environment and provides a completely preconfigured, ready-to-use installation of ToolDog.
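The idea of recovering parameters from an argparse-based tool can be illustrated with a short sketch. This is not how argparse2tool is implemented; it is a simplified stand-in that inspects the parser object directly through the private `_actions` attribute of `argparse.ArgumentParser`. The `list_parameters` and `build_example_parser` functions and the example tool are hypothetical.

```python
import argparse


def list_parameters(parser: argparse.ArgumentParser) -> list:
    """Describe each argument of a parser as a plain dictionary.

    Relies on the private ``_actions`` attribute, so this is a sketch
    rather than a stable interface.
    """
    params = []
    for action in parser._actions:
        if action.dest == "help":
            continue  # the automatic -h/--help option is not a tool parameter
        if action.nargs == 0 and isinstance(action.const, bool):
            type_name = "boolean"  # store_true / store_false flags
        elif action.type is None:
            type_name = "str"  # argparse keeps values as strings by default
        else:
            type_name = getattr(action.type, "__name__", repr(action.type))
        params.append({
            "name": action.dest,
            "flags": list(action.option_strings),
            "type": type_name,
            "required": action.required,
            "help": action.help,
        })
    return params


def build_example_parser() -> argparse.ArgumentParser:
    """Hypothetical command-line interface used for the illustration."""
    parser = argparse.ArgumentParser(prog="example_tool")
    parser.add_argument("sequence", help="input FASTA file")
    parser.add_argument("--threads", type=int, default=1, help="number of threads")
    parser.add_argument("--verbose", action="store_true")
    return parser


if __name__ == "__main__":
    for param in list_parameters(build_example_parser()):
        print(param)
```

A list of this form is the raw material for a skeleton: each dictionary would become one input parameter in the Galaxy XML or CWL description, with the help text carried over as its label or documentation.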

Tool description enrichment
Galaxy and CWL tool descriptions, whether they were manually authored or automatically generated by source code analysis, can be improved by the description enrichment module. This module retrieves additional metadata from the corresponding bio.tools entries and, when available, fills in the information missing from the workbench tool description.
Internally, the input tool description is parsed into an object model of the tool. The metadata from bio.tools are then mapped onto this object model, which is later exported to Galaxy or CWL formats. ToolDog relies on the galaxyxml and cwlgen libraries to import the original descriptions and export the updated ones.
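The enrichment step can be sketched as follows for the Galaxy case. This example deliberately uses the standard library `xml.etree.ElementTree` module instead of galaxyxml, so it should be read as an illustration of the principle rather than as ToolDog's code. The `enrich` function and the layout of the `metadata` dictionary (`topics`, `operations`, `dois`) are assumptions; the sketch only adds values that are not already present and does not validate the result against the Galaxy XSD.

```python
import xml.etree.ElementTree as ET


def _ensure_child(parent, tag):
    """Return the first child with this tag, creating it if absent."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


def _add_missing(container, item_tag, values, attrib=None):
    """Append one element per value unless that value is already listed."""
    existing = {(e.text or "").strip() for e in container.findall(item_tag)}
    for value in values:
        if value not in existing:
            item = ET.SubElement(container, item_tag, attrib or {})
            item.text = value
            existing.add(value)


def enrich(tool_xml: str, metadata: dict) -> str:
    """Add EDAM topics, EDAM operations and DOI citations to a Galaxy tool.

    ``metadata`` is a hypothetical dictionary with the keys "topics",
    "operations" (EDAM URIs) and "dois".
    """
    root = ET.fromstring(tool_xml)
    # Keep only the short EDAM identifier, e.g. "topic_0622".
    topics = [uri.rsplit("/", 1)[-1] for uri in metadata.get("topics", [])]
    operations = [uri.rsplit("/", 1)[-1] for uri in metadata.get("operations", [])]
    dois = list(metadata.get("dois", []))
    if topics:
        _add_missing(_ensure_child(root, "edam_topics"), "edam_topic", topics)
    if operations:
        _add_missing(_ensure_child(root, "edam_operations"), "edam_operation", operations)
    if dois:
        _add_missing(_ensure_child(root, "citations"), "citation", dois, {"type": "doi"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = '<tool id="example" name="Example"><description>demo</description></tool>'
    print(enrich(original, {
        "topics": ["http://edamontology.org/topic_0622"],
        "operations": ["http://edamontology.org/operation_2403"],
        "dois": ["10.1000/example"],
    }))
```

Because existing values are checked before anything is added, running the function twice with the same metadata leaves the description unchanged, which is the behaviour one would want when enriching a collection repeatedly.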

Generation of a tool description from a bio.tools entry
Here we illustrate the generation of a tool description with the example of IntegronFinder [24], an analysis tool dedicated to the identification of integrons in bacterial genomes. Launching ToolDog in "generation mode" on the IntegronFinder entry in the bio.tools registry allows the generation of a significant portion of the tool description (Figure 4), either in CWL or Galaxy format. Some manual corrections and additions are still necessary to complete the tool description and to make it functional. For instance, software requirements, which specify what software needs to be installed for the tool to run correctly, cannot be automatically generated, because this information is currently not available in bio.tools. Additionally, the mapping between inputs and the generated command line, as well as between outputs and the file names they refer to, is not present.

Enrichment of an existing collection of tool descriptions
In addition to generating new tool descriptions, ToolDog can also perform the automated enrichment of existing tool descriptions with bio.tools metadata. To test this approach, we ran ToolDog on the tool descriptions available on the Galaxy main instance that lack EDAM annotations. All of the Galaxy descriptions from the main instance were retrieved, and mapped to bio.tools entries using the citation identifiers (DOI). The goal was to add EDAM terms describing the topic of application and the operation(s) performed by the tools. To avoid linking unrelated entries, we took a conservative approach, by default mapping two entries only when they both referred to the same publication and to no other. The results (Figure 5) show that whenever this linking can be reliably done, the enrichment can easily be performed, with a total of 217 Galaxy tool descriptions being enriched out of 224 being initially mapped to bio.tools. A detailed description of this analysis, including the original and annotated tool descriptions, is available at https://github.com/khillion/galaxyxml-analysis/annotate_usegalaxy.
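The conservative DOI-based linking described above can be sketched as follows. This is one possible reading of the rule, not the script used for the analysis: a Galaxy tool and a bio.tools entry are linked only if each cites exactly one DOI, the two DOIs are identical after normalization, and no other bio.tools entry qualifies. The function names and the input format (dictionaries from identifier to a list of DOIs) are hypothetical.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common resolver prefixes."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def map_by_doi(galaxy_tools: dict, biotools_entries: dict) -> dict:
    """Link Galaxy tools to bio.tools entries through a single shared DOI.

    Both arguments map an identifier to the list of DOIs it cites.
    Tools or entries citing zero or several DOIs are left unmapped.
    """
    # Index bio.tools entries that cite exactly one publication.
    by_doi = {}
    for entry_id, dois in biotools_entries.items():
        normalized = {normalize_doi(d) for d in dois}
        if len(normalized) == 1:
            by_doi.setdefault(next(iter(normalized)), []).append(entry_id)

    mapping = {}
    for tool_id, dois in galaxy_tools.items():
        normalized = {normalize_doi(d) for d in dois}
        if len(normalized) != 1:
            continue
        candidates = by_doi.get(next(iter(normalized)), [])
        if len(candidates) == 1:  # skip ambiguous matches
            mapping[tool_id] = candidates[0]
    return mapping


if __name__ == "__main__":
    # Hypothetical identifiers and DOIs, for illustration only.
    galaxy = {
        "tool_a": ["doi:10.1000/A"],
        "tool_b": ["10.1000/b", "10.1000/c"],
        "tool_c": [],
    }
    registry = {
        "entry_a": ["https://doi.org/10.1000/a"],
        "entry_b": ["10.1000/b"],
    }
    print(map_by_doi(galaxy, registry))
```

Requiring a unique, exact match trades recall for precision: some correct links are missed, but a tool is never annotated with the EDAM terms of an unrelated entry.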

Discussion
The ToolDog utility allows a developer to generate new tool descriptions for tools which are compatible with the code analysis module, and to reuse the metadata provided by bio.tools to enrich existing tool descriptions. There are some limitations to this approach:
1. The "plugin" libraries used for code analysis are specific to the programming languages, libraries or frameworks used to build the command line interface. To date, they do not cover most of these.
2. The generation of the tool descriptions through code analysis must assume certain coding practices, such as the use of specific functions to define input or output parameters, which are not uniformly adopted.
3. Some of the input/output operations performed by some programs are much more difficult to detect through code analysis because they are typically not handled by command line parsing frameworks, such as web service and database queries and submissions, or in-place file modifications.
The automated enrichment of existing tool descriptions provides a convenient way to improve them, especially if they lack most of the metadata provided by bio.tools. Performing this enrichment efficiently en masse, however, would require the wide adoption of an identification system for bioinformatics software. Such a mechanism would make it possible to avoid the complex and sometimes ambiguous mapping procedures based on publication identifiers that we performed when testing it on the Galaxy tools. A recent update to bio.tools has added stable and unique tool identifiers, based on registered tool names, yielding persistent references to tools, for example https://bio.tools/signalp. Future work will make use of these identifiers to improve the generation of tool descriptions. For instance, linking of the bioconda and biocontainers repositories to bio.tools will enable ToolDog to generate software requirements compatible with workbench platforms [25].

Conclusions
In recent years, the integration of various tools has been eased by the use of workbench systems such as Galaxy, and frameworks using the Common Workflow Language. Still, adapting resources to such environments remains time-consuming and far from straightforward. ToolDog lays the foundation for future work that will provide a Workbench Integration Enabler for the bio.tools registry as an online service. Furthermore, integration with Planemo, the main utility to develop Galaxy and CWL tools, will be further developed in order to make the simple, bio.tools-based metadata enrichment of ToolDog available to the widest possible audience.

Data availability
The scripts and results of the analysis performed to motivate and test our approach are available at https://github.com/khillion/galaxyxml-analysis, and are archived at the time of publication at https://doi.org/10.5281/zenodo.1038005 [26].

Software availability
"Using bio.tools to generate and annotate workbench tool descriptions" is an article that describes a tool descriptor program known as ToolDog. It was designed to generate Galaxy XML or CWL from the source code of particular bioinformatics tools, as well as from metadata annotations on bio.tools. The idea is great, since the issue is a real one in the community. Namely, there are a lot of tools out there, but they typically lack descriptors in Galaxy or CWL format, which makes them harder to use in "workbench" and workflow systems. Creating a tool that tool authors can use to help create descriptors is awesome. The source is available on GitHub and the tool can be installed via pip.

Feedback/Questions
Can the authors rename the article? I think it should include ToolDog in the article title.
What are the plans for other languages (if any)? Do the authors see ToolDog as something that others will extend for, say, WDL generation?
I think it would be interesting to hear more about future plans. Specifically, how will the authors expand this to a Workbench Integration Enabler? Do they see this as being an automated process? How will they leverage the work of bioconda and biocontainers (they did mention this briefly) and will the goal be to generate CWL/GalaxyXML for everything in bio.tools + bioconda/biocontainers?
Alternatively, if the goal is not to automatically export CWL/GalaxyXML for everything in bio.tools, is it, instead, to provide a tool for tool authors to use when building their tool to jumpstart their descriptor creation? Some clarification on the intended audience would, I think, be helpful.
The authors described generating CWL/Galaxy XML for IntegronFinder. Did they try other tools and, if so, how successful was that? What about generation in bulk?
Can they comment on what a tool author should do with the generated CWL or Galaxy XML? They mention in the results that some work is required to make the tool run correctly. Is the tool author then suggested to check in the CWL/Galaxy XML to their source repo and maintain it? What is the recommendation here?
Is the rationale for developing the new software tool clearly explained?

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

The article 'Using bio.tools to generate and annotate workbench tool descriptions' describes the software tool ToolDog. ToolDog improves the interoperability of bio.tools-deposited entries within workbenches by converting their descriptions into formats that are compatible with workflow standards.
ToolDog is a convenient addition to the existing capabilities for the integration of bio.tools entries with workbench environments. I found Figure 2 particularly interesting, describing the metadata coverage descriptions from two of the main Galaxy servers. Do you have the raw data with which this figure was created? It would be good to have it openly shared. Figure 2 illustrates the problem of the significant lack of completeness in crucial metadata descriptions of Galaxy tools.
My main recommendation for this article would be to provide a step-by-step guide on how to run ToolDog using a self-contained example. I feel unable to test the tool because I do not know how to download the metadata from a bio.tools entry and need to set up my Python environment, download the code and make it work. Although this article is geared toward a programmer audience, it would be hard to test or reproduce for someone who is not a seasoned Python programmer. I would thus recommend a beginner's guide for those of us who are not so technical.
Other than that, I am glad to see all the source code adequately deposited both on GitHub and on Zenodo, with a snapshot image for this publication. The MIT license is also commendable as it allows free reuse and modification.
Finally, some minor corrections:
The link in the first paragraph of the results section ('a significant portion of the tool description') is broken.
The link in the second paragraph of the results section ('https://github.com/khillion/galaxyxml-analysis/annotate_usegalaxy') is broken.
Discussion section, bullet point #3: 'such web service' ==> 'such as web services'.

Is the rationale for developing the new software tool clearly explained? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.

that tooling information is consistently described but also updatable. In my opinion this should be accepted, with some minor suggested revisions.

Speaking of 'suggestions':
The current title 'Using bio.tools to generate and annotate workbench tool descriptions' suggests the paper will talk more generally about bio.tools, whereas the text focuses primarily on the specific component ToolDog. The title should be modified to reflect this.
The graphs in Fig.2 would be more effective if they were displayed in an integrated manner (single bar chart?), so that the improvements that ToolDog makes are more easily compared to one another.
The account of the challenges in autogenerating tool documentation (language, code practices, etc.) in the discussion is spot-on. However, not much is said about whether or how ToolDog might address some of these challenges, though there are suggestions on how to more readily map existing tool descriptions in order to add to or update them. Maybe this could be elaborated on, even if only to indicate that the problems may not be easily overcome?
I'm wondering whether the information in Fig. 5 might be better displayed (or augmented) as a before / after comparison to more readily demonstrate how ToolDog could automatically improve tool descriptions. Another option is whether this information could be somehow connected to the data in Fig. 2 to show how ToolDog improves the overall documentation.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly

Competing Interests: No competing interests were disclosed.
Referee Expertise: Computational biology
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Referee Expertise: Bioinformatics, big data genomics, immunogenomics
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.