Computational identification of signaling pathways in protein interaction networks

The knowledge of signaling pathways is central to understanding the biological mechanisms of organisms since it has been identified that in eukaryotic organisms, the number of signaling pathways determines the number of ways the organism will react to external stimuli. Signaling pathways are studied using protein interaction networks constructed from protein-protein interaction data obtained from high-throughput experiments. However, these high-throughput methods are known to produce very high rates of false positive and false negative interactions. To construct a useful protein interaction network from this noisy data, computational methods are applied to validate the protein-protein interactions. In this study, a computational technique to identify signaling pathways from a protein interaction network constructed using validated protein-protein interaction data was designed. A weighted interaction graph of Saccharomyces cerevisiae was constructed. The weights were obtained using a Bayesian probabilistic network to estimate the posterior probability of interaction between two proteins given the gene expression measurement as biological evidence. Only interactions above a threshold were accepted for the network model. We were able to identify some pathway segments, one of which is a segment of the pathway that signals the start of the process of meiosis in S. cerevisiae.


Introduction
For biologists and scientists in the life sciences, the successful sequencing of the genome is only one step out of many involved in understanding organisms. Sequencing has produced a lot of information that will not be useful unless refined. Biologists are interested in understanding the intricacies of the workings of the cells of an organism, that is, the activities of the organism and its reactions to its environment. This information is useful in designing necessary interventions in order to modify the biological mechanisms of an organism or its reactions to external stimuli.
According to the central dogma of molecular biology, genes are composed of DNA which is transcribed into RNA and the RNA is then translated into protein. Ultimately, all organisms are composed mainly of proteins in different forms and quantity.
Proteomic data and protein-protein interaction data from organisms form a key component in understanding an organism due to the major role played by proteins in cellular mechanisms. Protein-protein interactions are the foundation of biological mechanisms such as signal transduction, cell cycle control, DNA replication and transcription, and enzyme-mediated metabolism 1,2.
As a result of these interactions, understanding of organisms is facilitated by modeling the Protein Interaction Network (PIN) with a network constructed using the protein-protein interaction data. With a model such as this, a lot can be learned of the organism from its reaction to external stimuli and the effects of interventions on the biological mechanisms of the organism. For instance, it has been shown that the phenotypic effects of the deletion of a single gene depend on the position of that gene in the complex web of protein interactions 3 .
Apart from their importance in studying the machinery of the proteome and the cellular behaviour of an organism, protein-protein interaction maps are also practically important in the creation of interventions aimed at producing desired phenotypic outcomes such as new drug designs or disease prevention 4,5.
Protein-protein interaction data from organisms are obtained on a large scale using a number of high-throughput techniques such as Yeast Two-Hybrid (Y2H), Co-Immunoprecipitation (Co-IP) and Mass Spectrometry. These high-throughput techniques have, however, been found to have high rates of false positives and false negatives. False positive interactions are protein-protein interactions that are reported by an experimental technique but do not exist in reality, while false negative interactions are true interactions that an experimental technique fails to report. Rates of false positives in protein interaction data have been reported to be as high as 50% 6-8. Analyses that integrate gene expression level measurements with protein-protein interaction data suggest that only about 30-50% of the reported interactions are biologically relevant. Reference 9 reported 47% true protein-protein interactions where a Paralogous Verification Method (PVM) was applied. The PVM may have performed better owing to its incorporation of information on paralogs from other organisms to strengthen the biological evidence.
These high rates of inaccuracy in protein-protein interaction data are due to peculiarities of the techniques used to generate them. For instance, unlike other affinity-based methods, Tandem Affinity Purification (TAP-tag) methods can detect transiently interacting proteins; these proteins are, however, lost during the purification process 10.
Furthermore, these new high throughput methods of detecting protein interactions have no doubt rapidly generated much more data than have been collected by traditional methods in small scale experiments. This thus makes it impractical to start verifying each of these interactions by the traditional methods used in small scale experiments 11 .
In order to make sense of the vast data and obtain insightful information, these data need to be subjected to analytical procedures that will extract signal from the noise. This task of analyzing genomic data takes a computational approach due to the magnitude of the information involved. To reduce the level of noise in the protein interaction data, different computational techniques aimed at improving the reliability of the data are applied. To predict true interactions between protein pairs, many authors have suggested methods for estimating and assigning reliabilities to the interactions in the experimental data. These methods include using a logistic regression distribution function over a number of parameters to assign confidence scores to the interactions 17,18, the use of expression profiles and paralogs to assign reliability scores to already observed interactions 9, and the use of a maximum likelihood technique for the estimation of domain-domain interactions in order to infer protein-protein interactions 7. For computational biologists, the challenge is the development of methods for transforming the high-throughput data obtained from these different sources into biological insights.
In this paper, we seek to bridge the gap between protein-protein interaction data and other biological data in constructing useful signaling pathway models that will lead to insightful knowledge of biological processes. We propose a probabilistic approach using Bayesian networks to assign weights to protein-protein interactions. These weighted interactions are then used to construct the weighted PIN from which signaling pathways are predicted. Refer to Figure 1 for the schematic of the computational approach used. This work thus describes a computational means to clean up the background noise inherent in the various methods of proteomic data acquisition in order to better understand bio-molecular mechanisms.

Data sources
Protein interaction data was obtained from the publicly available Saccharomyces Genome Database (SGD). The protein interaction data is an amalgamation of the interactions obtained using eight different high-throughput experimental procedures: Yeast Two-Hybrid, Affinity Capture Mass Spectrometry, Co-purification, Affinity Capture Western, Biochemical Activity, Reconstituted Complex, Protein-Peptide and Far Western. This data contained 22,650 interactions between 2554 different proteins.
The file contains proteins with their expression measurements used in this study, obtained from DNA microarrays of yeast. Data was obtained from the work of Spellman et al. (1998) and is available from http://genome-www.stanford.edu/cellcycle/data/rawdata/CellCycle98.xls 41.

ORF: Open Reading Frame
The S. cerevisiae expression measurement data was obtained from the Yeast Cell Cycle Analysis Project of Stanford University. The data is housed in a publicly available database maintained by the Saccharomyces Genome Database at the Department of Genetics, School of Medicine. The Yeast Cell Cycle Analysis Project aimed at identifying all the genes whose mRNA levels are regulated by the cell cycle 12. This data is available at the Yeast Cell Cycle Analysis Project site. The data contained the expression profiles of 800 proteins of S. cerevisiae.
The file contains pairwise protein interaction data used in this study, obtained using high-throughput experimental techniques such as Y2H, Affinity Capture-MS, Co-purification, Biochemical Activity and Reconstituted Complex, available from the publicly available SGD database 42.

Data processing
The yeast protein-protein interaction data and the yeast expression measurements were processed by first filtering for the proteins that have expression level measurements. Only the proteins in the protein-protein interaction dataset that were also present in the gene expression measurement data were used. This was done based on the hypothesis that proteins that occur in the same complex and are known to physically interact have more highly correlated expression than proteins that are not known to interact directly. This hypothesis is supported by 13-16, where it has been observed that the proteins involved in true interactions show highly correlated mRNA expression. Filtering the 22,650 protein interactions against the 800 gene expression measurements produced 306 protein-protein interactions that have expression level measurements. These 306 protein-protein interactions represent the intersection of the two datasets, as depicted in Figure 2. Relative to the yeast protein interaction data and the expression measurement data, these correspond to about 1.35% (306 of 22,650) and 38.25% (306 of 800) of the original datasets respectively.
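The filtering step can be illustrated with a minimal Python sketch. It assumes the interactions are held as pairs of ORF names and the expression data as a mapping from ORF name to a profile; the function name, the toy pairs and the expression values are hypothetical and are not taken from the study's data files.

```python
def filter_interactions(interactions, expression):
    """Keep only the pairs in which both proteins have an expression profile."""
    return [(a, b) for a, b in interactions if a in expression and b in expression]


# Toy data for illustration only; the profiles are invented numbers.
interactions = [
    ("YMR163C", "YOR326W"),
    ("YOR326W", "YCL063W"),
    ("NO_PROFILE", "YOR326W"),  # hypothetical protein without expression data
]
expression = {
    "YMR163C": [0.2, 0.5, 0.1],
    "YOR326W": [0.3, 0.6, 0.2],
    "YCL063W": [-0.1, 0.4, 0.0],
}

kept = filter_interactions(interactions, expression)
print(kept)
```

Only the first two pairs survive, because the third involves a protein with no expression profile. The same intersection logic, applied to the full datasets, is what shrinks the interaction list so sharply.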

Interaction data validation
In this work, the probability estimation of protein interactions was done using a Bayesian probabilistic model. According to Bayes' theorem, the posterior probability density is proportional to the product of the prior probability density and the likelihood function.
Our interest is in drawing inference about the parameter f of a probability model p(ɣ | f) that gives rise to the observed data ɣ. Treating the parameter as uncertain and allocating it a prior probability π(f), we obtain the posterior probability according to Bayes' theorem (Equation 1), where p(ɣ), the marginal density of ɣ, is obtained by integrating the likelihood over the prior.

$$\pi(f \mid \gamma) = \frac{p(\gamma \mid f)\,\pi(f)}{p(\gamma)}, \qquad p(\gamma) = \int p(\gamma \mid f)\,\pi(f)\,df \tag{1}$$

Since π(f | ɣ) is a function of f for the observed ɣ, and p(ɣ) does not depend on f, Equation (1) can be rewritten as Equation (2), which shows the direct proportionality between the posterior probability and the product of the likelihood and the prior probability 19.

$$\pi(f \mid \gamma) \propto p(\gamma \mid f)\,\pi(f) \tag{2}$$
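The proportionality between posterior, likelihood and prior can be made concrete with a small numerical sketch in Python. It is only an illustration under stated assumptions: the likelihood is taken to be Bernoulli, the parameter is restricted to a hypothetical five-point grid, and the prior is uniform. None of these choices, nor the function name, comes from the study itself.

```python
def grid_posterior(observations, grid, prior):
    """Posterior over a discrete grid of parameter values.

    Multiplies the prior by the Bernoulli likelihood at each grid point and
    normalizes by the sum, which plays the role of the marginal density.
    """
    k = sum(observations)   # number of successes (1s)
    n = len(observations)   # number of observations
    unnormalized = [p * (q ** k) * ((1 - q) ** (n - k)) for q, p in zip(grid, prior)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


grid = [0.1, 0.3, 0.5, 0.7, 0.9]   # hypothetical candidate parameter values
prior = [0.2] * 5                   # uniform prior, an assumption
posterior = grid_posterior([1, 1, 0, 1], grid, prior)
most_probable = grid[posterior.index(max(posterior))]
```

With a uniform prior the posterior is driven entirely by the likelihood, so the most probable grid value is the one closest to the observed proportion of successes (three out of four).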

Likelihood function
In order to make use of the Bayesian model, there must be an approximation or full specification of the prior probability distribution and the likelihood function. The first step in the determination of the likelihood function, which is based on the probability of observing the data, is to fix a probability distribution f(q), where q is the parameter defining the probability distribution.
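In outcome space, for a given dataset (Y_1, Y_2, Y_3, ..., Y_n), the probability of observing the dataset given q is written as

$$f(Y_1, Y_2, \ldots, Y_n \mid q) \tag{3}$$

In parameter space, the likelihood function in terms of the probability of observing the dataset given q is

$$L(q \mid Y_1, Y_2, \ldots, Y_n) = f(Y_1, Y_2, \ldots, Y_n \mid q) \tag{4}$$

For a Bernoulli distribution, the probability distribution is

$$f(Y \mid q) = q^{Y}(1-q)^{1-Y}, \qquad Y \in \{0, 1\} \tag{5}$$

Therefore, for a sample of n independent observations (Y_1, Y_2, Y_3, ..., Y_n), the joint distribution is given by Equation 6 and can be rewritten as Equation 7.

$$f(Y_1, Y_2, \ldots, Y_n \mid q) = \prod_{i=1}^{n} q^{Y_i}(1-q)^{1-Y_i} \tag{6}$$

$$f(Y_1, Y_2, \ldots, Y_n \mid q) = q^{\sum_{i=1}^{n} Y_i}\,(1-q)^{\,n-\sum_{i=1}^{n} Y_i} \tag{7}$$

The likelihood function determines what value of q makes the dataset (Y_1, Y_2, Y_3, ..., Y_n) most probable.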
To obtain the Maximum Likelihood Estimate of the parameter q, we differentiate the likelihood function (or, equivalently, its logarithm) with respect to q and set the derivative to zero; the value of q that solves the resulting equation and maximizes the function is the estimate 20.
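As a worked illustration, assume n independent Bernoulli observations Y_1, ..., Y_n with success probability q, so that the joint likelihood is the product of the terms q^{Y_i}(1-q)^{1-Y_i}. The standard textbook derivation of the estimate, which is not a result specific to this study, then proceeds on the log-likelihood:

$$
\begin{aligned}
\ell(q) &= \ln \prod_{i=1}^{n} q^{Y_i}(1-q)^{1-Y_i}
        = \Big(\sum_{i=1}^{n} Y_i\Big)\ln q + \Big(n - \sum_{i=1}^{n} Y_i\Big)\ln(1-q) \\
\frac{d\ell}{dq} &= \frac{\sum_{i=1}^{n} Y_i}{q} - \frac{n - \sum_{i=1}^{n} Y_i}{1-q} = 0 \\
\hat{q} &= \frac{1}{n}\sum_{i=1}^{n} Y_i
\end{aligned}
$$

Under these assumptions the estimate is simply the observed proportion of successes. The second derivative of the log-likelihood is negative for 0 < q < 1, so the stationary point is a maximum.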

Weighted graph construction
We formalize the problem of constructing a weighted graph that is instrumental in building a PIN. Let (V, E, w) be the protein-protein interaction network, where V = {p_0, p_1, ..., p_n} is the set of all proteins, E = {e = (p_i, p_j) | p_i, p_j ∊ V} is the set of interactions among the proteins in V, and w is the weight of each edge in E. The weight w of the interaction between two proteins (p_i, p_j) is a measure of the reliability of that interaction, obtained using the Bayesian probabilistic approach described above.
The vertices of the interaction graph are contained in the set of the unique proteins obtained after the computation of the reliability of interaction between the protein pairs i.e. | V | = 306, and the edges of the graph are the set of interactions between these proteins in V.
The protein-protein interaction graph is constructed with an undirected sparse graph due to the sparse nature of biological networks.
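The construction of an undirected, weighted, sparse graph from scored interactions can be sketched in Python as an adjacency mapping. This is a minimal illustration and not the JUNG-based implementation used in the study; the function name, the toy protein labels, the scores and the threshold value are hypothetical, and treating a score equal to the threshold as accepted is an assumption.

```python
from collections import defaultdict


def build_weighted_graph(scored_pairs, threshold):
    """Build an undirected weighted graph as {protein: {neighbour: weight}}.

    Pairs whose weight falls below the threshold are rejected.
    """
    graph = defaultdict(dict)
    for a, b, weight in scored_pairs:
        if weight >= threshold:
            # Store the edge in both directions because the graph is undirected.
            graph[a][b] = weight
            graph[b][a] = weight
    return dict(graph)


# Toy scores for illustration only.
scored_pairs = [("A", "B", 0.9), ("B", "C", 0.4), ("C", "D", 0.7)]
graph = build_weighted_graph(scored_pairs, threshold=0.5)
print(graph)
```

An adjacency mapping stores only the edges that exist, which suits sparse biological networks where most protein pairs do not interact.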

Graph implementation
The implementation of the algorithm for this computational technique was done in Java programming language using the Java Universal Network Graphics (JUNG) framework for graphs. The JUNG framework is an open-source collection of libraries providing common language for modeling, analyzing and visualizing any data that can be represented as a graph or network. The JUNG framework is extensible in order to tailor it to specific needs and also includes implementation of a host of algorithms for network analysis, graph theory and data mining.
The graph implementation in JUNG supports the representation of the different types of graphs such as directed and undirected graph, multimodal graphs, graphs with parallel edges and hypergraphs.
For this work, we used the JUNG 2.0.1 API released in January, 2010 which can be found at http://jung.sourceforge.net.

Simple path-finding
A pathway is an ordered list of distinct proteins in V such that each consecutive pair is found in E 21.
Given an undirected sparse graph G = (V, E, w) and a pair of nodes p_i, p_j ∊ V corresponding to the starting and ending proteins respectively, we wish to find a simple path from p_i to p_j which will be a segment of a pathway.
With the graph constructed, which is the PIN of S. cerevisiae based on the data supplied, we queried the graph with a pair of proteins p_i, p_j ∊ V, which are respectively the starting and ending proteins of the path of interest. We are interested in having the search algorithm return a simple path corresponding to the signal transduction path from the starting protein to the ending protein.
The search is done using a Depth First Search (DFS) algorithm. The start protein becomes the root node of the search. The algorithm examines the nodes adjacent to it, expands the first child node of the resulting search tree and continues progressively deeper until the target node (the ending protein) is found. If the DFS algorithm encounters a node that has no unexplored children, it backtracks to the previous node and continues exploring that node's remaining children.
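The search described above can be sketched in Python as an iterative depth-first search that returns one simple path, if any exists. This is an illustrative sketch rather than the study's Java code: the function name and the adjacency-mapping representation are assumptions, the toy graph merely echoes the pathway segment discussed later in the paper, and DEAD_END is a hypothetical node added to show backtracking.

```python
def find_simple_path(graph, start, end):
    """Return one simple path from start to end found by DFS, or None."""
    stack = [(start, [start])]          # each entry: (current node, path so far)
    while stack:
        node, path = stack.pop()
        if node == end:
            return path
        # Push neighbours in reverse sorted order so that the alphabetically
        # first unvisited neighbour is expanded first.
        for neighbour in sorted(graph.get(node, ()), reverse=True):
            if neighbour not in path:   # keep the path simple (no repeats)
                stack.append((neighbour, path + [neighbour]))
    return None


# Toy undirected graph for illustration only.
toy_graph = {
    "YMR163C": {"YOR326W"},
    "YOR326W": {"YMR163C", "YCL063W", "DEAD_END"},
    "DEAD_END": {"YOR326W"},
    "YCL063W": {"YOR326W", "YER150W"},
    "YER150W": {"YCL063W"},
}
path = find_simple_path(toy_graph, "YMR163C", "YER150W")
print(path)
```

The search first descends into DEAD_END, finds no unvisited neighbour there, and falls back to the remaining branch from YOR326W, which is the backtracking behaviour described in the text. Note that a plain DFS returns the first path it finds, not necessarily the shortest or the highest-weighted one.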

Protein mapping to functional annotation
In order to understand and make meaning of the pathway segments that are obtained from the PIN, we compared the proteins to their functional annotation. Mapping proteins in known signaling pathways and PINs to their functional annotations serves an important purpose. If proteins in an organism have biological functions similar enough that one protein can effectively replace another in a pathway, then such proteins should share the same set of gene annotation terms. The Gene Ontology annotation, which is a functional annotation scheme, provides this basis for identifying the functional description of proteins and their interactions with other proteins and other molecules.
In this work, we used the Gene Ontology (GO) annotations to interpret the pathway segments that have been identified from the protein-protein interaction network constructed for the S. cerevisiae organism.

Results
To validate the protein-protein interaction data that we used in this work, we applied the method described in the Data processing section to first filter the data. As described there, the data was filtered by integrating the gene expression measurements of the regulated yeast cell cycle in order to obtain a dataset that is the intersection of both datasets.
This step was taken based on the hypothesis that there is a high correlation between the expression levels of truly interacting proteins 13-16,22 and also using the gene expression measurement as a source of biological information 23-25 for the computational inference.
Using the protein-protein interaction data comprising 22,650 interactions between 2554 unique proteins and the gene expression levels of 800 genes, we applied the computational approach based on Bayesian probability described earlier. Further in the validation process, the application of the Bayesian probabilistic model on the data to estimate the posterior probability of an interaction existing between two proteins given the biological evidence produced the weight estimate for the interactions. With the estimation of the interaction weight and the rejection of interaction weights below the threshold obtained from the mean expression level measurements, we obtained a dataset containing 306 protein-protein binary interactions. This dataset was used in constructing the PIN of S. cerevisiae.
The 306 protein pairs represented the proteins that had expression profile measurements, which corresponds to the intersection of the two datasets. Relative to the protein-protein interaction data and the gene expression level measurements, this is only about 1.35% (306 of 22,650) and 38.25% (306 of 800) of the original datasets respectively.
We applied the method described in the Simple path-finding section to identify pathway segments in the constructed PIN for S. cerevisiae. Given a graph G = (V, E), a pathway has been described as an ordered list of distinct proteins in V such that each consecutive pair is found in E 21. With a starting protein and an ending protein of interest, a simple path between these two corresponds to a pathway. Due to the size limitation of the expression measurement dataset used and the resulting reduction in the overall number of proteins used to construct the graph, we were only able to identify pathway segments. A pathway segment is a chain of interacting proteins which is a part of a larger pathway. Some of the pathway segments identified with this technique are presented in Table 1 and Table 2. These tables list the protein description, the GO function and the GO process of the proteins involved in the pathway segments, as obtained from the AmiGO website http://www.geneontologyproject.org/go.

Myosin binding
Interacting selectively and non-covalently with any part of a myosin complex; myosins are any of a superfamily of molecular motor proteins that bind to actin and use the energy of ATP hydrolysis to generate force and movement along actin filaments. [GO:0017022]

Peroxisome inheritance
The acquisition of peroxisomes by daughter cells from the mother cell after replication.
In Saccharomyces cerevisiae, the number of peroxisomes cells is fairly constant; a subset of the organelles are targeted and segregated to the bud in a highly ordered, vectorial process. Efficient segregation of peroxisomes from mother to bud is dependent on the actin cytoskeleton, and active movement of peroxisomes along actin filaments is driven by the class V myosin motor protein, Myo2p.

ATPase activity
Catalysis of the reaction: ATP + H2O = ADP + phosphate + 2 H+. May or may not be coupled to another reaction.

Meiotic sister chromatid cohesion
The cell cycle process in which sister chromatids of a replicated chromosome are joined along the entire length of the chromosome during meiosis. [GO:0051177]

Table 1 and Table 2 present some of the pathway segments identified using the computational approach proposed in this paper. The understanding of the paths is facilitated by using Gene Ontology associations to understand the biological processes the proteins are involved in. A signaling pathway is characterized by a starting protein that is a receptor at the membrane and an ending protein that is a transcription factor.

Discussion
From Table 1 we identified a pathway segment {YMR163C-YOR326W-YCL063W-YER150W} along with the genes coding for each of the proteins using GO annotation.
The pathway segment starts with the protein YMR163C, identified to be a receptor important for peroxisome inheritance. Signaling pathways are often characterized by an activator at the membrane of the cell binding to a receptor to initiate the chain of signal transduction. These peroxisomes are organelles that metabolize fatty acids and are numerous in the S. cerevisiae organism. By blocking peroxisome transport through point mutants in the MYO2p gene that binds to it, the levels of MYO2p gene expression increased 26 . The implication of this is that signal is transmitted to the mother cell to stop further peroxisome transfer by lowering INP2 gene expression. The next protein, YOR326W, in the pathway segment is coded for by the MYO2p gene whose level of expression is modified in the activation of signal that is relayed to alter the level of the INP2 gene that codes for the YMR163C receptor protein. The next protein in the chain, YCL063W, coded for by the gene VAC17 has been identified to be a vacuole-specific receptor for myosin MYO2P and is involved in vacuole inheritance, a molecular anchoring function.
The last protein in the pathway segment, YER150W, coded for by the gene SPI1 contributes to transcriptional regulation induced under conditions of stress during the diauxic shift 27 .
It is observed that this pathway which signals the start of the process of meiosis suddenly breaks off to a gene (SPI1) that participates in catalysis at the molecular level. This is not abnormal as these are pathway segments and not the full transduction pathway activated by the receptor protein.
In a similar approach, Reference 28 used protein-protein interaction data and expression data to model pathways. They ranked candidate signaling pathways of interacting proteins using expression data, based on the rationale that proteins in the same signaling network must exist simultaneously when the pathway is activated; the genes coding for these proteins must therefore be transcribed at about the same time and under the same environmental conditions required for the signaling network.
Their approach to predicting pathways included specifying the starting protein, a membrane protein, and an ending protein of interest, such as a DNA-binding protein, based on a prior knowledge of genetic relationship between them. In their findings, the pathways that the algorithm identified were not complete pathways owing to incomplete maps.
Reference 21 also applied a computational approach similar to ours by assigning scores to protein-protein interaction data, creating a PPI network from the data and mining signaling pathways from the network. The parameters for the search on the network included a starting protein and an ending protein as well as the length of the pathway segment. Although their approach involved training the algorithm using association rules mined from known pathways, they too were only able to mine pathway segments.
The incompleteness of pathways mined from using computational techniques on protein-protein interaction data can be attributed to false negative interactions that were not detected by the high throughput experiments that generated the data.
Furthermore, a number of the computational techniques that have been applied to cleaning the noise in protein-protein interaction data often entail eliminating some data presumed to be noise from the dataset. The proteins removed in this manner could be important proteins that would then be missing from the modeled PIN. Our own approach involved filtering the protein-protein interaction data with the gene expression measurement data such that only the proteins with expression level measurements were used in the construction of the protein interaction network. Because gene expression measurements were available for just 800 proteins, this reduced the 22,650 pairwise interactions to 306 pairwise interactions. This reduction in the size of the data used to construct the protein interaction network was a constraint on the number of pathways identified using this approach.

Conclusion
In this paper, we proposed a simple computational approach to identify signaling pathways in PINs by first estimating true interactions within protein-protein interaction data obtained from high throughput experimental techniques which are susceptible to generating high rates of false positive and false negative interactions. We proposed a technique using Bayesian Probability to estimate the probability of true interactions between two proteins and assigned weights to the pair wise interaction based on this. Using the validated protein-protein interaction data, we constructed a PIN of the S. cerevisiae organism from where simple paths between two proteins of interest were mined. Using the Gene Ontology annotation to understand the biological process taking place within the pathway, we were able to identify a pathway which signals the start of the process of meiosis, albeit broken off for want of more data.
Knowledge of signaling pathways is generally useful in designing biological interventions on an organism aimed at producing specific desired outcomes such as new drug design and disease prevention and control.

Author contributions
AUM conceived the study and supervised it. TAO carried out the study. Both authors interpreted the results, were involved in the revision of the final draft manuscript and agreed to its content.

Competing interests
No competing interests were disclosed.

Grant information
The author(s) declared that no grants were involved in supporting this work.

Lynn Fink
Diamantina Institute, University of Queensland, Brisbane, Australia
This paper describes a computational method for extracting information from a large variety of inherently noisy biological data describing protein-protein interactions and purports to be able to discover signalling pathways, or at least segments of signalling pathways.
Not being an expert on Bayesian modeling, I can't comment directly on the method although it seems to be predicated on well-supported hypotheses and aims to be conservative in the interests of decreasing noise and increasing biological validity. Significantly, the pathway segments suggested by the model are annotated with existing functional information from Gene Ontology annotations. The authors claim to validate their results by correlating the proposed interactions with existing gene expression, relying on the hypothesis that highly co-expressed genes are true interactors. It should also be noted that this study was performed on S. cerevisiae, a highly studied model organism for which the authors had access to 8 different types of high-throughput methods aimed at inferring protein-protein interactions (PPIs).
I've personally always struggled with the validity of using computational methods to amalgamate high-throughput PPI data for the purposes of pathway discovery. PPI networks are dynamic and I'm not convinced that we can measure them completely (in every condition, cell type, tissue type, etc.) or that we can always assume that PPIs observed in one cell, organism, or condition can be extrapolated to others, so any attempt we make to catalog PPIs is necessarily vastly incomplete. Furthermore, given the wealth of data necessary to attempt applying a computational method I wonder how generally applicable these methods can be. For example, this paper relied on data from 8 different methods. How often can we expect to have that much information about a cell or organism of interest?
Is it appropriate to validate computationally-derived PPIs with gene expression data? Would it not be more appropriate to perform an assay that directly or indirectly interrogates the actual interaction between proteins?
I also wonder how circular the logic behind these computational methods is. The authors used data from SGD, a well-known public resource, in order to generate PPI networks and then bootstrapped these networks by using GO, another well-known public resource. I would be surprised if GO annotation was performed without knowledge from SGD, so can we believe that the networks derived in this paper are based solely on the Bayesian model? Or are we just re-discovering information we partially already knew?
And if we believe the signalling pathway segments reported here are newly and independently discovered, how widely applicable is the proposed method? Can we use it for other organisms or for yeast under changed conditions, for example? How much PPI data do we need before a computational method is more efficient and informative than well-designed biochemical experiments? Were the two reported pathway segments the only ones that could be inferred from 22,650 interactions between 2554 proteins (roughly half of the entire proteome)? Is there anything exciting (and new) to be found if the model is allowed to be less conservative?
What contribution to biology do the authors expect from this method?
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.

Winston A. Hide
Sheffield Institute of Translational Neuroscience, University of Sheffield, Sheffield, USA
This work is appropriately applied in principle with adequate application of methods. Measured against the criterion "work has been well designed, executed and discussed", it has not quite yet been adequately designed and would benefit from a more synthetic discussion that explores the results in the context of existing work.
Of concern is that the aim of the project appears to be to improve the delivery of signal over noise in PINs. But there is no means to judge if there has been an improvement: no tests, validation or comparison over a start state. Instead there is provision of results that show some interactions that are already known, which is promising, but no ability to judge if this is an improvement over just the use of protein-protein interaction data, or just the use of gene expression data.
There could be some more reference to existing work: reference to, and comparison with, that which is current in the field (see examples of refs below).