PPInfer: a Bioconductor package for inferring functionally related proteins using protein interaction networks [version 3; peer review: 1 approved, 1 approved with reservations]

Interactions between proteins occur in many, if not most, biological processes. This fact has motivated the development of a variety of experimental methods for the identification of protein-protein interaction (PPI) networks. Leveraging PPI data available in the STRING database, we use network-based statistical learning methods to infer the putative functions of proteins from the known functions of neighboring proteins on a PPI network. The package identifies proteins that are likely involved in the same or similar biological functions. The package is freely available at the Bioconductor web site (http://bioconductor.org/packages/PPInfer/).

This article is included in the RPackage gateway.
This article is included in the Bioconductor gateway.


Introduction
The function of many proteins remains unknown, which is a major challenge in functional genomics. As proteins belonging to the same protein complex are often involved in the same cellular process, the pattern of protein-protein interactions (PPIs) can give information regarding protein function. Thus, PPI databases can be useful for predicting protein function, complementing conventional approaches based on protein sequence analyses.
Many databases store information on protein interactions and complexes. For example, the Biological General Repository for Interaction Datasets (BioGRID) includes over 500,000 manually annotated interactions 1 . The STRING database aims to provide a critical assessment and integration of protein-protein interactions, including direct (physical) as well as indirect (functional) associations. The basic interaction unit in STRING is the functional association, i.e. a specific and productive functional relationship between two proteins 2 .
Once a PPI network has been obtained experimentally, there are numerous methods to analyze it. A neighbor counting method for protein function prediction was developed 3,4 . The theory of Markov random fields was used to infer a protein's functions using PPI data and the functional annotations of its interacting partners 5 . Many algorithms integrate multiple sources of data to infer functions 6-8 . We propose a method that infers the putative functions of proteins by solving a classification problem, thus identifying closely connected proteins known to be involved in a certain process.
We use a graph kernel as the similarity measure between proteins, instead of using only first- or second-level neighbors; thus, the proposed method provides scores or distances derived from a graph kernel. The main advantage of the proposed method is that the neighbors of a set of proteins can be ranked by these scores. Generally, two or more classes are needed for a classification problem, but here only the target class is known. The main idea of this method is therefore to construct a second class, so that the two classes can be used for classification, by applying semi-supervised learning techniques to the PPI network. In this way, the package classifies proteins to identify those closely related to the target.
Finally, functional enrichment analyses such as over-representation analysis (ORA) and gene set enrichment analysis (GSEA) are incorporated to predict the protein function from the closely related proteins. Although various functional annotations could be used to categorize genes, gene ontology (GO) is one of the most popular function categorizations. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is commonly used for categorization in pathway analysis. Both provide annotations for diverse organisms.

Methods
The support vector machine (SVM) is one of the most widely used methods for classification 9 . Suppose we have a dataset in a real vector space and that each point in the dataset has a corresponding class label. An SVM solves a convex optimization problem that separates the data points according to their class by simultaneously maximizing the distance between the classes and minimizing a penalty for misclassification. Unfortunately, graph data do not lie in a real vector space. Cover's theorem 10 provides the idea behind the nonlinear SVM: find an optimal separating hyperplane in a high-dimensional feature space, mapped via a suitable kernel function, just as for the linear SVM in the original space.
Graph (network) data is ubiquitous, and graph mining tries to extract novel and insightful information from such data. Graph kernels are defined in the form of kernel matrices, based on the normalized Laplacian matrix of a graph. The best-known graph kernel is the diffusion kernel 11 . The motivation is that it is often easier to describe the local neighborhood than to describe the structure of the whole space 12 . Another method, called the regularized Laplacian matrix 13 , is widely used in areas such as spectral graph theory, where properties of graphs are studied in terms of the eigenvalues and eigenvectors of their adjacency matrices 14 . Broadly speaking, kernels can be thought of as functions that produce similarity matrices 15 . In this package, we choose the regularized Laplacian matrix as the graph kernel for the PPI network. The kernel K is the symmetric N × N matrix given by

K = (I + γL)⁻¹,

where I is an identity matrix, L is the normalized Laplacian matrix, and γ is an appropriate decay constant.
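As a concrete illustration (not code from the package, which is written in R), the regularized Laplacian kernel for a small toy graph can be computed with NumPy; the graph, decay constant, and all variable names below are illustrative:

```python
import numpy as np

# Toy undirected graph on 4 nodes (a path graph), given by its adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
I = np.eye(A.shape[0])

# Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
L = I - D_inv_sqrt @ A @ D_inv_sqrt

# Regularized Laplacian kernel: K = (I + gamma * L)^{-1}; gamma = 0.5 is an
# arbitrary illustrative choice of the decay constant.
gamma = 0.5
K = np.linalg.inv(I + gamma * L)

# K is symmetric and positive definite, so it is a valid kernel.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > 0)

# Nearby nodes are more similar under K than distant ones.
print(K[0, 1] > K[0, 3])  # prints True: similarity to a neighbor exceeds that to a far node
```

The same construction applies to a PPI network of any size; only the adjacency matrix changes.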
In many biological problems, datasets are compounded by an imbalanced class distribution, known as the imbalanced data problem, in which one class is significantly larger than the other. Many classification algorithms, including the SVM, are sensitive to class imbalance, leading to suboptimal classification, so it is desirable to compensate for the imbalance during model training. A possible solution to this problem is the one-class SVM (OCSVM), which learns from the target class only 16 . In one-class classification, it is assumed that only information about one of the classes, the target class, is available, and no information is available about the other class, known as the background. The OCSVM can be applied here because we have only one class, the target. However, one-class classifiers seldom outperform two-class classifiers when data from both classes are available 17 .
In the SVM, the training data set contains m observations x_1, …, x_m with corresponding target values y_1, …, y_m, where y_i ∈ {−1, 1}. Consider the linear classifier w^T x + b, with w, x ∈ ℝ^n and b ∈ ℝ. The distance between the two support lines, or margin, is 2/‖w‖, so maximizing the margin is equivalent to minimizing ‖w‖/2. The optimization problem is defined as

minimize (1/2)‖w‖² + C Σ_i ξ_i
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, m.

Here, the ξ_i are slack variables that allow each observation to be on the wrong side of the margin or the hyperplane, while adding a penalization term to the minimization problem. In the SVM, the cost for the misclassification error is controlled by the margin parameter C. For a large value of C, misclassification is suppressed, while for a small value of C, misclassification is allowed for observations that are away from the gathered data 18 . Consider the dual problem. For the dual variable α ∈ ℝ^m, we have

maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
subject to 0 ≤ α_i ≤ C, Σ_i α_i y_i = 0.

In the feature space, the linear decision function is given by w^T φ(x) + b, using a nonlinear vector function φ. Using the kernel K(x_i, x_j) = φ(x_i)^T φ(x_j), the decision function is given by

f(x) = sign( Σ_i α_i y_i K(x_i, x) + b ).

Unlike the SVM, which is based on a separating hyperplane, the one-class SVM used here is based on the hypersphere approach 19 . The hypersphere has center a and radius R. When constructing the hypersphere, its volume should be minimized so that it tightly encompasses the observations:

minimize R² + (1/(νm)) Σ_i ξ_i
subject to ‖φ(x_i) − a‖² ≤ R² + ξ_i, ξ_i ≥ 0, i = 1, …, m.

Here, the ξ_i are slack variables that allow data to lie outside of the hypersphere; the trade-off between the volume and the errors is controlled by the regularization parameter ν between 0 and 1. Its dual problem is

maximize Σ_i α_i K(x_i, x_i) − Σ_i Σ_j α_i α_j K(x_i, x_j)
subject to 0 ≤ α_i ≤ 1/(νm), Σ_i α_i = 1.

Given a new observation z, the decision function is given by

f(z) = sign( ‖φ(x_s) − a‖² − ‖φ(z) − a‖² )

for any support vector x_s on the boundary. If the value of the decision function is greater than zero, z is classified as a target; otherwise, it is an outlier 17 .
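The kernel form of the SVM decision function can be checked on a tiny worked example. The following NumPy sketch uses a two-point, one-dimensional training set whose dual solution α_1 = α_2 = 1/2 can be derived by hand (maximizing 2α − 2α² under α_1 = α_2 = α); all data and names here are illustrative, not part of the package:

```python
import numpy as np

# Toy 1-D training set: one point per class, linearly separable.
X = np.array([[-1.0], [1.0]])
y = np.array([-1.0, 1.0])

def kernel(a, b):
    # Linear kernel K(a, b) = a^T b; any positive-definite kernel would do.
    return a @ b

# Closed-form dual solution for this symmetric two-point problem.
alpha = np.array([0.5, 0.5])

# w = sum_i alpha_i y_i x_i; b is chosen so the support vector x_2 lies
# exactly on the margin: y_2 (w^T x_2 + b) = 1.
w = sum(alpha[i] * y[i] * X[i] for i in range(2))
b = y[1] - kernel(w, X[1])  # equals 0 here

def f(x):
    # Kernel form of the decision function: sign(sum_i alpha_i y_i K(x_i, x) + b)
    s = sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(2)) + b
    return np.sign(s)

print(f(np.array([-0.3])), f(np.array([2.0])))  # prints -1.0 1.0
```

The same decision function applies unchanged when the linear kernel is replaced by a graph kernel such as the regularized Laplacian matrix.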
The strategy of this package is to apply the OCSVM and the classical SVM sequentially. The regularized Laplacian matrix for the whole graph is used as their input, while only nodes with class labels are used for training; unlabeled nodes are scored as output. First, we apply the OCSVM, training a one-class classifier using data from the known class only. Let n be the number of proteins in the target class, and let I_1 be the index set of rows or columns of the matrix K for the target class. The OCSVM is then trained on K*, where K* is the n × n matrix equal to K_{p,q} for p, q ∈ I_1. This model is used to identify distantly related proteins among the remaining N − n proteins in the background, using the matrix K_{p,q} for p ∉ I_1 and q ∈ I_1. Note that we do not know whether each protein in the background interacts with the target proteins; membership in the background does not imply the absence of interaction, only that no association with the target has been observed to date. Proteins with zero similarity to the target class are extracted and defined as a putative second class by pseudo-absence selection methods 20 from spatial statistics; the target class can be seen as real presence data. The main idea of the proposed method is to adopt this pseudo-absence class. For the data to be balanced, the two classes are assumed to contain the same number of proteins. Next, the classical SVM trained on these two classes is used to identify closely related proteins, namely those whose scores (ranging from −1 to 1) are close to 1 among the remaining N − 2n proteins. Let I_2 be the index set of rows or columns of K for these two classes. The corresponding optimization problem is the SVM problem described above with the kernel K**, where K** is the 2n × 2n matrix equal to K_{p,q} for p, q ∈ I_2. The matrix K_{p,q} for p ∉ I_2 and q ∈ I_2 is used for scoring. Cross-validation can be used to prevent overfitting and to evaluate the above procedure.
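The pseudo-absence step can be sketched as follows. This NumPy toy example scores background proteins by their summed kernel similarity to the target class, a simplified stand-in for the OCSVM scoring used by the package, and then samples a balanced pseudo-absence class; the kernel matrix, index sets, and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical kernel matrix for N = 8 proteins; in practice this would be the
# regularized Laplacian kernel of the whole PPI network.
N = 8
K = np.eye(N)
# Proteins 0-2 form the target class; protein 3 is also similar to the target.
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    K[i, j] = K[j, i] = 0.5

target = [0, 1, 2]                        # index set I_1 (target class)
background = [p for p in range(N) if p not in target]

# Similarity of each background protein to the target class: rows p not in I_1,
# columns q in I_1 (a stand-in for the OCSVM decision values).
sim_to_target = K[np.ix_(background, target)].sum(axis=1)

# Proteins with zero similarity to the target are candidate pseudo-absences;
# sample n = |target| of them so the two classes are balanced.
candidates = [p for p, s in zip(background, sim_to_target) if s == 0]
pseudo_absence = rng.choice(candidates, size=len(target), replace=False)

print(sorted(candidates))   # prints [4, 5, 6, 7]: protein 3 is excluded
print(len(pseudo_absence))  # prints 3, matching the target class size
```

The target class and the sampled pseudo-absence class would then be passed to the two-class SVM for scoring the remaining N − 2n proteins.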
Semi-supervised learning can be applied to exploit a large amount of unlabeled data together with a small amount of labeled data. Some of these methods directly try to label the unlabeled data. Eventually, the proteins found by this procedure can be functionally linked to the target proteins. This is usually based on the assumption that unannotated proteins have functions similar to those of their interacting proteins.

Use cases
We need a list of proteins and the kernel matrix to infer functionally related proteins. Using the STRING database for mouse, suppose that the target is the set of proteins in the RAS signaling pathway from the KEGG pathway database (http://www.genome.jp/kegg-bin/show_pathway?mmu04014).

Discussion
The proposed method is highly dependent on the protein interaction network. There are many databases containing millions of predicted protein interactions, as well as interactions reported in thousands of publications, so the choice of data can be critical. Also, the one-class support vector machine can be sensitive to changes in its parameters. Although the classical support vector machine has good performance, overall performance may be dominated by the OCSVM, since the SVM is trained on the pseudo-absence class selected by the OCSVM. Nevertheless, this method offers the potential to infer the putative functions of proteins from PPI networks, complementing conventional methods based on protein sequence analyses.

Data availability
License: Artistic-2.0

Competing interests
No competing interests were disclosed.

Grant information
This material is based upon work supported by the National Science Foundation/EPSCoR Grant Number IIA-1355423 and by the State of South Dakota. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The method described in the paper returns a set of proteins that are similar in function to a target set. The way I would expect to see this evaluated is to compare the results for the RAS pathway against a gold standard set which would be determined beforehand and would include all and only the proteins that should be annotated the same as the RAS pathway. The paper should justify how this gold standard was determined, whether it was hand annotated or otherwise. I would then expect to see the recall and precision of the method calculated against this gold standard, along with an error analysis to show what types of proteins are false positives and negatives, and to discuss why these proteins are returned/missed. Sorry for not making this clearer in the original report, but this is what I intended by questions 5-7. The discussion of the results shown in Figures 1-4 centres on the true positives, but it is discussion and quantification of the false positives and false negatives that will provide credibility to your method. As it stands now, the evaluation that is provided in the paper is not sufficient for me to feel comfortable about using the results for a biological analysis.

Competing Interests:
No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
together with the top 20 predicted proteins. The authors claim that because there are STRING interactions between these two sets of proteins that this is support for them having the same function. To support this, I would at least like to see this number of interactions compared to those between a random set of 20 proteins and the proteins of the Ras pathway, if not additionally some error analysis of proteins that are also connected to the pathway but not picked by the algorithm. It is also not clear to me if the training data for the SVMs make use of the network interactions; if it does, then providing STRING interactions as support for validating the output is circular.
Since the paper leaves open some fundamental questions in the design of the machine learning approach, I cannot approve this article without major reservations. I hope that the authors will add these additional details, since this article addresses an interesting problem, and I am curious to see their method described in detail.
Some minor comments on the article follow: All acronyms such as ORA should be defined in the text.

The Figure 1 generated by the example code is not exactly the same as the plot included in the manuscript, but is very close. The figure caption for Figure 1 should specify what the size of the points means.
8. What parameters were used for each SVM, and how/why were they selected?
The parameter for the misclassification cost is used for the SVM; this is the constant of the regularization term in the Lagrangian function. Similarly, the 'nu' parameter, which regularizes the hypersphere, is needed for the OCSVM. A convergence parameter is also used for both models. In this example, we used the default parameters of the function 'ksvm' in the package 'kernlab'; these are also the defaults in this package.
For the results of the functional enrichment analysis, categories related to GTPase activity may not be shown in the figures. Since only part of the results is reported due to limited space, such categories can be seen in the full list of significant categories. Also, the results may not be exactly reproducible, because functional categories are frequently updated and there is still randomness in the function 'fgsea', even when setting a seed. Finally, the number of interactions we obtained is compared to the number between a random set of 20 proteins and the proteins of the RAS pathway by using bootstrapping; please see the text and figure for further discussion.
Error in `[.data.frame`(EnrichTab, , c(category, size, count, pvalue)): undefined columns selected

This seems to be because there is no column in resORA called "Category"; the GO terms are the rownames. We could work around this problem by adding: resORA$Category <- rownames(resORA)

Is the rationale for developing the new software tool clearly explained? Partly

Is the description of the software tool technically sound? Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly