autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network

Identification of novel disease-gene and disease-disease associations is an important task in biomedical research. Recently, we developed a Cytoscape app, HGPEC, which uses a state-of-the-art network-based method for this task. This paper describes an upgraded version of HGPEC, named autoHGPEC, with added automation features. With these features, autoHGPEC can be used as a component of other complex analysis pipelines and can make use of other data resources. We demonstrate the use of autoHGPEC by predicting novel breast cancer-associated genes and diseases. Further investigation, by visualizing and collecting evidence for associations between the top 20 ranked genes/diseases and breast cancer, illustrates the capabilities of autoHGPEC.


Introduction
One of the challenging tasks in biomedicine is to prioritize candidate genes and diseases by the degree of their relevance to a disease of interest. This is the starting point for identifying novel disease-gene and disease-disease associations. A large number of computational methods, including network- and machine learning-based ones, have been proposed for this task 1,2. State-of-the-art network-based methods often integrate diseases and genes into a heterogeneous network and then apply a propagation algorithm that exploits the similarity between diseases/genes and known disease-gene associations to predict novel associations 3-7. Some tools have also been developed to facilitate the use of these state-of-the-art methods. However, most of them focus only on predicting novel disease-gene associations 8-10, including some tools that were developed as Cytoscape apps 11. Recently, we developed a Cytoscape app, HGPEC 12, to predict both disease-gene and disease-disease associations based on a state-of-the-art method on a heterogeneous network of diseases and genes 3. HGPEC was shown to be better than two other network-based Cytoscape apps for prediction of novel disease-gene associations, GPEC 13 and PRINCIPLE 14, in terms of prediction performance 12. In addition, HGPEC can prioritize candidate genes of diseases without a known molecular basis and collect evidence to support novel predictions from various data resources such as Gene Ontology 15, Disease Ontology 16, KEGG pathway 17, GeneRIF 18, PubMed 19, protein complexes 20 and OMIM 21. Being developed as a Cytoscape app, HGPEC can exploit advanced features of Cytoscape such as data visualization and integration. However, Cytoscape is a desktop-based tool, so HGPEC cannot link flexibly to other analysis tools such as R and Python. This limits the use of HGPEC, because it cannot be run automatically as a component of a complex analysis pipeline built with these tools. It also prevents Cytoscape from integrating data from other data resources. Recently, automation features (CyREST functions and commands that can be called from external environments) have been added to Cytoscape to facilitate these tasks.
In this study, we upgrade HGPEC by adding automation features to it and name the new app autoHGPEC. Basically, autoHGPEC has the same functions as HGPEC. However, these functions are exposed as both CyREST functions and commands, and can therefore be called from external environments. To use autoHGPEC, a heterogeneous network of diseases and genes, composed of a disease similarity network, a gene/protein network and known disease-gene associations, has to be given. Then, a disease of interest must be selected from the disease similarity network. After that, the disease and its known associated genes (if any) are used as training/seed data. A set of candidate genes then has to be defined by selecting from the gene network or from a chromosome. These candidate genes and all remaining diseases are then ranked by an RWRH-based method (see the Methods section). Finally, users can select top-ranked genes/diseases for further analyses such as visualization and evidence collection. We show the ability of autoHGPEC to predict novel genes and diseases associated with breast cancer.

RWRH-based method
autoHGPEC was implemented using a ranking algorithm, random walk with restart on a heterogeneous network (RWRH) 12. Briefly, this network-based algorithm propagates the disease information embedded in a disease of interest and its known associated genes (also known as seed/training nodes) to other diseases and genes in the heterogeneous network. This propagation is performed by a random walk starting from the seed nodes. At each node, the random walker either moves to an adjacent node or jumps back to the seed nodes with a prior probability. This process is repeated iteratively until a steady state is reached. The score assigned to each node at this state represents its degree of relevance to the seed nodes, and thus to the disease of interest. Finally, candidate genes and diseases are ranked by these scores, and top-ranked candidates can be selected as promising genes and diseases for further investigation.
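The propagation described above can be sketched on a toy network. The following is a minimal illustration, not the autoHGPEC implementation: it runs a plain random walk with restart on a single small undirected graph rather than on a full heterogeneous disease-gene network, and the node names, the function `random_walk_with_restart` and the restart probability of 0.5 are assumptions made for the example.

```python
def random_walk_with_restart(adjacency, seeds, restart_prob=0.5,
                             tol=1e-10, max_iter=1000):
    """Return approximate steady-state visiting probabilities for every node.

    adjacency: dict mapping each node to a list of its neighbours
               (undirected graph, every node has at least one neighbour).
    seeds:     nodes the walker restarts from (uniformly at random).
    """
    nodes = sorted(adjacency)
    seeds = list(seeds)
    # Initial distribution: all probability mass spread evenly over the seeds.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability restart_prob the walker jumps back to the seeds.
        nxt = {n: restart_prob * p0[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen adjacent node.
        for u in nodes:
            share = (1.0 - restart_prob) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:  # stop once the distribution no longer changes
            break
    return p


# Hypothetical toy network: a path seed - n1 - n2 - n3.
toy_network = {
    "seed": ["n1"],
    "n1": ["seed", "n2"],
    "n2": ["n1", "n3"],
    "n3": ["n2"],
}
scores = random_walk_with_restart(toy_network, seeds=["seed"])

# Rank the non-seed (candidate) nodes by decreasing score.
ranked = sorted((n for n in scores if n != "seed"),
                key=lambda n: scores[n], reverse=True)
print(ranked)
```

In this toy case the candidates closest to the seed receive the highest scores, which mirrors the intuition behind ranking candidate genes and diseases by their steady-state relevance to the seed nodes.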
Implementation
autoHGPEC is an upgraded version of HGPEC 12 with added automation features. Therefore, the main functions of HGPEC, such as prioritization, visualization and evidence collection, were kept. In addition, as in HGPEC, a number of databases are preinstalled in autoHGPEC to facilitate its use. These include disease similarity networks, gene/protein networks and known disease-gene associations, as well as annotation data such as Gene Ontology 15, Disease Ontology 16, KEGG pathways 17, GeneRIF 18, and protein complexes 20. However, users can also supply other networks themselves. To provide automation features for HGPEC, we first refactored the source code of HGPEC to use Cytoscape Tunable annotations, replacing the control panels of HGPEC in the west panel with a menu system. All the functions of HGPEC are therefore accessed through the menu system. In addition, the workflow of HGPEC is exposed to users through the CyREST Command API (which can be explored in the Swagger UI under the autoHGPEC menu). The CyREST API was developed with the appropriate functions as well. Thus, the result of each step in the workflow can be passed back to the caller in JSON format for further analysis in R or Python.
Operation
autoHGPEC is designed to predict novel disease-gene and disease-disease associations and to collect supporting evidence, based on a random walk on a heterogeneous network, with added automation features. It therefore follows the same workflow as HGPEC 12. However, in addition to the menu system of desktop-based Cytoscape, its functions can be called through the CyREST Command API and from other analysis tools such as R. Figure 1 shows the workflow of autoHGPEC in the three running environments (see the user manual in Supplementary File 1). As a Cytoscape app with automation features, autoHGPEC can be run on any computer that satisfies the minimal requirements for running Cytoscape.

Use cases
To demonstrate the functions of autoHGPEC with automation features, we show its ability to predict novel genes and diseases associated with breast cancer (OMIM ID: 114480). Here, we briefly describe this case study by following the 5-step workflow in Figure 1 (see the user manual in Supplementary File 1 for more detail):
- First, a heterogeneous network of genes and diseases was constructed by connecting a preinstalled disease similarity network (i.e., Disease_Similarity_Network_5) including 5,080 diseases and 19,729 interactions, a preinstalled human protein interaction network (i.e., Default_Human_PPI_Network) including 10,486 genes and 50,791 interactions, and known disease-gene associations collected from OMIM 21. This step can be accomplished with the following command from within R:
> commandRun('autoHGPEC step1_construct_network DiseaseGene="Disease-gene from OMIM" diseaseNetwork="Disease_Similarity_Network_5" geneNetwork="Default_Human_PPI_Network"')
- Second, breast cancer (OMIM ID: 114480) was selected for investigation. This disease is known to be associated with 21 genes that are also present in the human protein interaction network. The training set was then built from these genes and the disease of interest. Two commands can be run within R for this task.
- Third, we selected all 10,465 remaining genes in the protein interaction network as candidate genes. This can be done with the following command: … Remaining')
- Fourth, all genes and diseases in the heterogeneous network were ranked by applying the RWRH-based method, with the back-probability, jumping probability and subnetwork importance weight set to 0.5, 0.6 and 0.7, respectively. The following command can be used to accomplish this task: … backProb=0.5 jumpProb=0.6 subnetWeight=0.7')
- Finally, we visualized and collected evidence for the associations between the 20 highest-ranked candidate genes/diseases and breast cancer. Users must highlight the diseases and genes of interest in the corresponding network. These two tasks can each be performed with a corresponding command.
Visualization results (Figure 2a and b) show that most of the top-ranked candidate genes are directly connected to known breast cancer-associated genes. In addition, highly ranked candidate diseases are directly connected to either the known/training genes or the disease of interest. For evidence collection, we annotated and searched for evidence of promising associations between the top-ranked candidate genes/diseases and breast cancer. The evidence collection results showed that each of the promising associations is supported by at least two data sources. More detail on the interpretation of the visualization and evidence collection results for these associations can be found in the HGPEC study 12.
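For callers working outside R, the Step 1 command quoted above can be assembled as a plain string before it is handed to a Cytoscape automation client. The sketch below relies only on the command name and parameters given in Step 1; the helper `build_command` is hypothetical and is not part of autoHGPEC or Cytoscape.

```python
def build_command(namespace, command, **params):
    """Assemble a Cytoscape command string: namespace command key="value" ...

    Hypothetical helper for illustration; parameters are emitted in the
    order in which they are passed.
    """
    args = " ".join(f'{key}="{value}"' for key, value in params.items())
    return f"{namespace} {command} {args}".strip()


# Step 1 of the case study: construct the heterogeneous network.
step1 = build_command(
    "autoHGPEC",
    "step1_construct_network",
    DiseaseGene="Disease-gene from OMIM",
    diseaseNetwork="Disease_Similarity_Network_5",
    geneNetwork="Default_Human_PPI_Network",
)
print(step1)
```

The resulting string is the same text that is passed to commandRun in the R example; building it programmatically makes it easier to swap in user-provided network names within a larger pipeline.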
Besides the fact that almost all commands of autoHGPEC return results in JSON format, the results of autoHGPEC are exposed via the CyREST API as well (menu Help/Automation/CyREST API). For example, the R command used in Step 2, commandRun('autoHGPEC step2_1_select_disease diseaseName="breast cancer"'), can be performed directly through the CyREST API with the request URL http://localhost:1234/autohgpec/v1/selectDisease/breast%20cancer (this URL is available after the heterogeneous network has been successfully constructed in Step 1). The request returns a list of OMIM IDs associated with "breast cancer" in JSON format. Users can therefore easily call this CyREST API and use the result in their own workflows as needed.
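The request URL above can also be built and called from Python. This is a sketch under stated assumptions: Cytoscape is running locally with autoHGPEC installed, Step 1 has already been completed, and the endpoint answers a plain GET request with a JSON body. The helper names `select_disease_url` and `select_disease` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address taken from the request URL quoted in the text.
BASE_URL = "http://localhost:1234/autohgpec/v1"


def select_disease_url(disease_name, base_url=BASE_URL):
    """Build the selectDisease request URL, percent-encoding the disease name."""
    return f"{base_url}/selectDisease/{quote(disease_name, safe='')}"


def select_disease(disease_name, base_url=BASE_URL, timeout=30):
    """Call the endpoint and decode its JSON reply.

    Assumes a running Cytoscape session and a GET endpoint; the structure
    of the decoded reply is not specified here.
    """
    url = select_disease_url(disease_name, base_url)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = select_disease_url("breast cancer")
print(url)
```

Only the URL construction runs without Cytoscape; `select_disease("breast cancer")` would need a live session, after which the decoded reply can be passed on to the remaining steps of a pipeline.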

Discussion and conclusions
The random walk with restart algorithm on a heterogeneous network of diseases and genes has been shown to be a state-of-the-art method for predicting novel disease-gene and disease-disease associations compared to other network-based algorithms 3,12. However, its prediction performance depends strongly on the heterogeneous network used, which is a combination of a disease similarity network, a gene/protein interaction network and known disease-gene associations. Indeed, a study showed that the prediction performance can be improved by using a gene ontology-based gene similarity network instead of the human protein interaction network 22. In addition, we have recently shown that using a disease similarity network constructed from the Human Phenotype Ontology 23 improved the prediction of disease-associated genes 24 as well as disease-associated non-coding RNAs 25,26. Therefore, to facilitate the use of such similarity networks of diseases/genes, we enable users to provide these networks themselves. For gene/protein networks, users can import the network from various molecular interaction data sources or from other analysis pipelines. Similarly, disease similarity networks can be imported from other analysis tools such as DOSim 27 and HPOSim 28. Moreover, the ranked candidate genes can be used as inputs to other annotation and enrichment toolkits, such as DAVID 29 and GSEA 30, to provide further support for their associations with the disease of interest. Taken together, with the added automation features, autoHGPEC can be more useful and can reach a wider range of users.

Figure 2. Visualization of highly ranked candidate genes and diseases in topological relationships with breast cancer. (a) Topological relationships between highly ranked candidate genes and known breast cancer-associated genes. (b) Topological relationships between highly ranked candidate diseases and breast cancer and its known associated genes.

Software and data availability

Competing interests
No competing interests were disclosed.

Grant information
The author(s) declared that no grants were involved in supporting this work.

Thanh Le Van
Janssen Pharmaceutica NV, Beerse, Belgium
Summary: The paper "autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network" presents an enhanced implementation of HGPEC, previous work by one of the co-authors of the paper. The new implementation allows users to integrate the data analysis steps in Cytoscape with other data analysis pipelines through R or the CyREST API. This new feature would indeed be useful, as users can now take advantage of the network-based data analysis and visualization in Cytoscape together with, for example, the statistical data analysis power of R.
Below are my detailed comments:
1. Is the rationale for developing the new software tool clearly explained?
In my opinion, the "automatic features" are not well explained in the paper. The first place where the authors introduce the concept of "automatic features" is the last sentence of the first paragraph of the Introduction section. However, there is no further explanation of this concept. Hence, people in the machine learning community could easily confuse it with automatic feature selection in the automated machine learning field.
To clear up the possible confusion, the authors can do two things: 1) add a citation of the paper/website where Cytoscape originally introduced this concept; 2) briefly explain how Cytoscape provides this type of feature and how HGPEC can leverage the facilities provided by Cytoscape.

2. Is the description of the software tool technically sound? Yes

3. Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
The user manual is quite detailed. However, there is room for improvement in the presentation; for example, the spacing between pictures and paragraphs and the paragraph indentation are not always consistent or pleasant to read. I highly recommend using LaTeX to produce the manual.
There is a Vietnamese sentence on page 10 of the manual, which should be removed.

4. Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
- Please add a citation when mentioning that breast cancer is known to be associated with 12 genes (first paragraph, page 5).
- Please briefly explain why the results of the demonstration make sense.

5. Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Overall, the authors should further explain what the automatic features are and why they are worthy of investigation.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
however testing these cases could give evidence for which is more reliable and in what circumstances.

Minor comments
1. "Recently, we have developed a Cytoscape app, HGPEC, to predict both disease-gene and disease-disease associations based on a state-of-the-art method on a heterogeneous network of diseases and genes." Be specific. I understand what the paper is trying to say here, with "state of the art" meaning either network- or machine learning-based approaches; however, it took a few readings to get a clear understanding. Mention how pertinent these approaches are, but afterwards mention which state-of-the-art approach is being used. It will get the point across quickly without introducing any confusion.

2. Technical and grammatical writing errors:
- "…we first refactor source code of HGPEC to implement Cytoscape Tunable annotations to replace control panels of HGPEC in the west"
- "Beside the fact is that almost commands of autoHGPEC return results in JSON format…"
- "autoHGPEC is an upgradED version of HGPEC…"
- "For gene/protein networkS, user can import the network from various molecular interaction"
There are more throughout the paper and especially the supplementary document. Read through and correct carefully. Pay attention to your font formatting and spacing to keep things consistent.

3. Change the node colors on page 8 of the supplementary document. It is a little confusing to see red as top ranked, green as bottom ranked, and white and light green as middle ranked. Maybe provide a legend.

4. Bottom of page 10: change the column names that abbreviate "association". Pick something other than "ass", maybe "assoc".

Overall, good job. Two crucial benefits of the new app are that it can take in data from multiple sources rather than only one, possibly lowering the chance of error and bias, and that it can be integrated with R and Python as a component of more complex analysis pipelines. However, it seems this app can only be used if known disease-gene/protein associations and disease similarity networks are given. Why doesn't the paper mention protein interactions any further? Specifically, what pertinent information is being taken from these protein interactions?
Is the rationale for developing the new software tool clearly explained? Yes

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?