Explain your data by Concept Profile Analysis Web Services

The Concept Profile Analysis technology (overlapping co-occurring concept sets based on knowledge contained in biomedical abstracts) has led to new biomedical discoveries, and users have been able to interact with concept profiles through the interactive tool “Anni” (http://biosemantics.org/anni). However, Anni provides no way for users to save their procedures, results, or related provenance. Here we present a new suite of Web Service operations that allows bioinformaticians to design and execute their own Concept Profile Analysis workflow, possibly as part of a larger bioinformatics analysis. The source code can be downloaded from ZENODO at http://www.dx.doi.org/10.5281/zenodo.10963.

This article is included in the International Society for Computational Biology Community Journal gateway.

Corresponding author: Kristina Hettne (K.M.Hettne@lumc.nl)
Competing interests: The authors declare that they have no competing interests.
Grant information: The work in this paper was funded by the Seventh Framework Programme of the European Commission (Digital Libraries and Digital Preservation area ICT-2009.4.1 project reference 270192) (Wf4Ever), and grant agreement No. 305444 (RD-Connect).
Copyright: © 2014 Hettne K et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
How to cite this article: Hettne K, van Schouwen R, Mina E et al. Explain your data by Concept Profile Analysis Web Services [version 1; peer review: 2 approved with reservations]. F1000Research 2014, 3:173 (https://doi.org/10.12688/f1000research.4830.1)
First published: 25 Jul 2014, 3:173 (https://doi.org/10.12688/f1000research.4830.1)


Introduction
Concept Profile Analysis (CPA) has proven a powerful tool for interpreting and prioritizing the results of bioinformatics analyses, and for linking data sets based on the best "educated guess" when precise links are not available. The technology uses the vector space model to relate concepts (such as genes and biological processes) mined from the literature to each other. Vectors can be compared efficiently and transparently [1], and the model yields a measure of the strength of the relationship between concepts. We call these vectors "concept profiles". The CPA algorithms have, for example, been successfully applied to compare microarray studies [2], to predict proteins putatively associated with muscular dystrophy pathways [3], and to associate chemical structures with gene expression data [4].
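To make the vector comparison concrete, the following is a minimal sketch of how two concept profiles could be compared. It assumes that a profile is a sparse mapping from concept to weight and that cosine similarity is the measure; the text above only states that the vector space model is used, so both the measure and the toy weights are illustrative assumptions rather than the published CPA algorithm.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two sparse concept profiles.

    Each profile maps a concept label to a non-negative weight.
    Cosine similarity is an assumption made for this sketch.
    """
    shared = profile_a.keys() & profile_b.keys()
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical toy profiles; real profiles are mined from the literature.
gene_profile = {"muscle contraction": 0.8, "calcium signalling": 0.5, "apoptosis": 0.1}
process_profile = {"muscle contraction": 0.6, "calcium signalling": 0.7}

score = cosine_similarity(gene_profile, process_profile)
print(f"similarity: {score:.3f}")
```

Profiles that share no concepts score 0, and profiles that point in the same direction score 1, which is what makes the resulting number usable as a relationship strength.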
The standalone application Anni [10] supports a number of standard CPA operations. For example, to perform pathway analysis for a gene expression experiment, a user first provides a list of gene database identifiers for the most significantly expressed genes. Anni uses these identifiers to query the concept profile database for the corresponding concept profiles, and subsequently constructs a "concept set" of these profiles. To match the list of genes with pathways, the user performs the operation "match concept sets" for the gene concept set with a predefined concept set of the category "Gene Ontology (GO) biological process". Note that we refer here to GO concept profiles. Anni calculates the concept profile matching scores between the two concept sets, resulting in a ranked list of GO biological processes for the gene list. Finally, Anni can retrieve literature evidence for an association from a supporting documents database. This evidence consists either of documents in which the gene and the biological process are mentioned together, or of documents that provide enough statistical evidence to support the gene-biological process association without mentioning the gene and the biological process together in one abstract.
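The "match concept sets" step described above can be sketched as follows. This is an illustration under stated assumptions, not Anni's implementation: the similarity measure (cosine), the aggregation over the query set (mean of the pairwise scores), and all gene and process names are hypothetical.

```python
import math


def similarity(p, q):
    """Cosine similarity between two sparse profiles (assumed measure)."""
    dot = sum(w * q[c] for c, w in p.items() if c in q)
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def match_concept_sets(query_set, match_set, cutoff):
    """Rank every concept in match_set against the whole query set.

    The score of a match concept is the mean of its similarities to all
    query profiles (an assumption); the top `cutoff` concepts are returned.
    """
    if not query_set:
        return []
    ranked = []
    for name, profile in match_set.items():
        scores = [similarity(q, profile) for q in query_set.values()]
        ranked.append((name, sum(scores) / len(scores)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:cutoff]


# Hypothetical gene concept set and predefined "GO biological process" set.
genes = {"GENE_A": {"x": 1.0, "y": 0.2}, "GENE_B": {"x": 0.9}}
processes = {"process_1": {"x": 1.0}, "process_2": {"z": 1.0}, "process_3": {"y": 1.0}}

top = match_concept_sets(genes, processes, cutoff=2)
for name, score in top:
    print(name, round(score, 3))
```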
Here we present a new suite of Web Service (WS) operations that allows bioinformaticians to design and execute their own CPA workflow outside the Anni Web tool, possibly as part of a larger bioinformatics analysis. The WS was designed according to the outcome of an Anni usage analysis, where the common user and machine operations were identified.

Technical specifications
We implemented the CPA WS in Java, using the Spring Model-View-Controller (MVC) framework and Apache Tomcat, and following the Java API for XML Web Services (JAX-WS) specification. We compiled the Anni Java code for the different operations into separate libraries, for which wrappers were written in Java. Spring MVC was used as a WS interface to remote applications. Following the JAX-WS standard enabled an auto-generated WSDL specification and the use of Java annotations to specify operations. Apache Tomcat was used for deployment. The CPA WS uses a database of indexed PubMed records. The thesaurus behind the Anni Web application was converted to Simple Knowledge Organization System (SKOS), and the SKOS concept IDs were implemented as resolvable Uniform Resource Identifiers leading to a Virtuoso Universal Server triple store.

User and machine operations as Taverna workflows
As an example of how to work with the CPA WS, we implemented several workflows in the workflow management system Taverna workbench v2.4 [5], following the best practices for workflow design [6]. The whole suite of CPA workflows consists of 11 workflows collected in a myExperiment pack [http://www.myexperiment.org/packs/368]. These workflows are of two different types: 1) nine workflows calling one WS operation, and 2) two pipelines of nested workflows calling more than one WS operation. The workflows of type 1 are the building blocks for the pipelines of type 2, and were implemented with re-usability in mind.
Here we describe the workflow "Match concept profiles with predefined set" (Figure 1) in order to illustrate the design and use of the WS and workflows. The workflow invokes the WS operation "getSimilarConceptProfilesPredefined". The operation takes three input parameters, which can be accessed using the XML splitter function in Taverna. The user specifies the concept(s) to be matched ("Query concept IDs"), the concept set to match against ("Match concept set"), and a cutoff number of matched concepts to return ("Cutoff").
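For readers who want to call the operation outside Taverna, the sketch below shows how a SOAP 1.1 request for "getSimilarConceptProfilesPredefined" could be assembled with the Python standard library. Only the operation name comes from the text above. The service namespace, the parameter element names (queryConceptIds, matchConceptSet, cutoff), and the example values are hypothetical placeholders; the actual names must be taken from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/cpa"  # hypothetical; read the real one from the WSDL


def build_request(query_concept_ids, match_concept_set, cutoff):
    """Return a SOAP envelope (as a string) carrying the three input parameters."""
    ET.register_namespace("soapenv", SOAP_NS)
    ET.register_namespace("cpa", SERVICE_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}getSimilarConceptProfilesPredefined")
    # Element names below are assumptions made for illustration only.
    for concept_id in query_concept_ids:
        ET.SubElement(operation, "queryConceptIds").text = str(concept_id)
    ET.SubElement(operation, "matchConceptSet").text = match_concept_set
    ET.SubElement(operation, "cutoff").text = str(cutoff)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical input values.
request_xml = build_request(["concept:1", "concept:2"], "GO biological process", 10)
print(request_xml)
```

The resulting string would be sent as the body of an HTTP POST to the service endpoint. A WSDL-aware SOAP client library can generate the same request from the auto-generated WSDL, which is the role the XML splitter plays in Taverna.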
Opening the "Run workflow" window in Taverna displays the structured annotations for the whole workflow and for the input parameters, as well as the example values (Figure 2). WS functional annotations can be accessed via the "Details" tab in Taverna (Figure 3). When the workflow is run, it produces a ranked list of concepts associated with the query concept(s), together with their similarity scores.
The workflow described above executes the core functionality of concept profile matching. The other WS operations implement functionality such as explaining the association found (by listing the common concepts contributing most to the score) and showing the literature evidence (by retrieving the links to the abstracts in PubMed). Workflows implementing these WS operations can be coupled to the "Match concept profiles with predefined set" workflow to form a pipeline of nested workflows. Examples of such pipelines are the "GWAS to biomedical concept" nested workflow, which performs Single Nucleotide Polymorphism (SNP) annotation, and the "Annotate gene list with top ranking concepts" nested workflow for gene annotation (Figure 4).
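The idea of explaining an association by its most contributing common concepts can be illustrated with a short sketch. It assumes that the matching score is built from the products of the weights of shared concepts, so that each shared concept's share of the total can be reported. This scoring assumption, the function name, and the toy data are hypothetical and do not describe the WS operation's actual implementation.

```python
def explain_association(profile_a, profile_b, top_n=3):
    """List the shared concepts that contribute most to an inner-product score.

    Returns (concept, fraction_of_total_score) pairs, largest first.
    """
    shared = profile_a.keys() & profile_b.keys()
    contributions = {c: profile_a[c] * profile_b[c] for c in shared}
    total = sum(contributions.values())
    if total == 0:
        return []  # no overlap, nothing to explain
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [(concept, value / total) for concept, value in ranked[:top_n]]


# Hypothetical profiles for a gene and a biological process.
gene = {"x": 0.5, "y": 0.5, "z": 0.1}
process = {"x": 0.8, "y": 0.2, "w": 0.9}

explanation = explain_association(gene, process)
for concept, share in explanation:
    print(f"{concept}: {share:.0%} of the score")
```

Concepts that occur in only one of the two profiles ("z" and "w" here) contribute nothing, which is why only the overlap needs to be inspected.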

Discussion
Compared to Anni, the CPA WS and workflows raise the level of reproducibility of bioinformatics experiments that make use of CPA, and the CPA WS can more easily be used together with other tools. For example, CPA-based SNP annotation can be performed with the CPA WS by coupling an external WS that maps the SNP identifiers to Entrez gene identifiers [7]. With Anni, the mapping from SNP to Entrez gene identifiers would have to be performed separately, which decreases reproducibility.
Some of the functionalities in Anni have not been migrated to the WS. For example, Anni provides a function for hierarchical clustering of the results. Clustering is not a CPA function by itself, but we are considering implementing workflows that perform this function. We are also working on a workflow implementation of the process that creates the data underlying the Anni WS, possibly using the recently developed text-mining workbench Argo [8], allowing for more flexibility in performing CPA [9]. Specialization of the underlying resources for services to use in specific research domains, such as plant breeding or metabolomics, is a topic for future work.

Competing interests
The authors declare that they have no competing interests.

Acknowledgements
We would like to thank Peter-Bram 't Hoen for his comments about the WS design and functionality.

Conclusions
By creating a WS building upon the Anni interactive tool, we made available the CPA technology in a way that users can more easily integrate the technology with other software and save their procedures, results and related provenance.

Software availability
The source code can be downloaded from ZENODO at http://www.dx.doi.org/10.5281/zenodo.10963.

Reviewer report 1
The motivation for promoting the Web services presented in this paper is unclear to me. The abstract says, "However, Anni provides no way for users to save their procedures, results, or related provenance." I guess that the authors wrote this sentence in order to explain the advantage of using the Web services over Anni. However, I am not sure whether bioinformaticians prefer the Web services to Anni in their analyses. Some may find Anni more useful than the Web services because Anni was designed to perform pre-defined workflows. For this reason, I am also not sure whether the conclusions are justified sufficiently on the basis of the developed Web services. The impact of this paper would be greater if the authors could explain the selling point(s) of this software (especially to bioinformaticians, the real users) more concretely.
The explanation of how the Web services work on Taverna workbench is helpful for us to understand what can be achieved by the Web services. However, this paper lacks comprehensive detail about the Web services, e.g., what kinds of services were implemented, what the functionality of each service is, and what the underlying research technologies for implementing each service are (although software technologies such as Apache Tomcat are explained). These details may be useful for readers to imagine possible ways to integrate the Web services with other tools.
Minor comment: It would be better if this paper could introduce Concept Profile Analysis (CPA) with references.
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer report 2
Figure 2 might be more interesting if it showed an example of a "Match concept set" rather than the "Cutoff" value. How are such concept sets defined? On the other hand, parameters such as "cutoff" should be explained, at least in the figure caption, for completeness.
There are several details provided that are not fully explained. Specifically, there is a mention of "the XML splitter function in Taverna" for input parameters. Again, please consider whether this is important for a user to understand, or rather for a developer to understand the code. Also, what are "WS functional annotations"? (cf. also the reuse of the term "annotation" with possibly a different sense in "Single Nucleotide Polymorphism annotation" and "gene annotation"). What do you refer to with "the data underlying the Anni WS"?
The conclusions are justified, if a bit unspecific. A more targeted summary of the functionality would be preferable, rather than "we made available the CPA technology".
Note a few small English usage issues. (1) "considering to implement" should be "considering implementing" (2) "users can easier integrate" should be "users can more easily integrate".
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.