An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity [version 1; peer review: 2 approved with reservations]

Disease processes are usually driven by several genes interacting in molecular modules or pathways that lead to the disease. The identification of such modules in gene or protein networks is at the core of computational methods in biomedical research. With this pretext, the Disease Module Identification (DMI) DREAM Challenge was initiated as an effort to systematically assess module identification methods on a panel of 6 diverse genomic networks. In this paper, we propose a generic refinement method, based on merging and splitting the hierarchical tree obtained from any community detection technique, for constrained DMI in biological networks. The only constraint is that the size of each community must lie in the range [3, 100]. We propose a novel model evaluation metric, called F-score, computed from several unsupervised quality metrics, namely modularity, conductance and connectivity, to determine the quality of a graph partition at a given level of the hierarchy. We also propose a quality measure, namely Inverse Confidence, which ranks and prunes insignificant modules to obtain a curated list of candidate disease modules (DM) for each biological network. The predicted modules are evaluated on the basis of the total number of unique candidate modules that are associated with complex traits and diseases from over 200 genome-wide association study (GWAS) datasets. During the competition, we identified 42 modules, ranking 15th at the official false discovery rate (FDR) cut-off of 0.05 for identifying statistically significant DM in the 6 benchmark networks. However, for the more stringent FDR cut-offs of 0.025 and 0.01, the proposed method identified 31 (rank 9) and 16 (rank 10) DM respectively. In additional analysis, our proposed approach detected a total of 44 DM in the networks, compared to 60 for the winner of the DREAM Challenge. Interestingly, for several individual benchmark networks, our performance was better than or competitive with the winner.


Motivation & background
A variety of genomic data has been used to construct biological networks. Biological networks are scale-free by nature 1 and it is well known that scale-free networks exhibit community-like structure [2][3][4][5]. Community-like structure in networks is equivalent to the presence of a high degree of modularity 5. In biological networks, the modules often comprise genes or proteins that are involved in the same biological functions. Network module identification methods, commonly known as community detection [4][5][6][7][8][9] and graph partitioning methods [10][11][12], attempt to reveal these functional units 2,13,14, which is key to deriving biological insights from genomic networks [15][16][17][18]. However, the performance of different community detection methods, using diverse parameter settings, in uncovering biologically relevant modules in myriad networks remains poorly understood, because there has been no community effort to transparently evaluate module identification methods on common benchmarks and across diverse types of genomic networks. Thus, it is very difficult to objectively compare the strengths and limitations of alternative approaches. Evaluation of module identification methods has typically relied either on random graphs 13, which do not allow for assessment of the biological relevance of modules, or on pre-annotated functional gene sets 18 (e.g., gene ontology or molecular pathway databases such as KEGG), which are still largely incomplete and biased towards well-studied pathways.
To address these issues, an open community DREAM challenge enabling comprehensive and rigorous assessment of module identification methods across a broad range of gene and protein networks was initiated. The task in sub-challenge 1 was to identify functional modules in 6 individual benchmark networks such that the module size satisfied the constraint 3 ≤ module size ≤ 100.
The predicted modules were evaluated based on data from disease-relevant genome-wide association studies (GWAS). GWAS have successfully identified thousands of genetic loci associated with a broad range of complex traits and diseases. The variants are mapped to genes, allowing one to ask whether specific network modules are enriched in these genes 19. The DREAM challenge organizers employed a comprehensive collection of over 200 GWAS datasets, thereby covering a broad spectrum of functional units, many of which have not been annotated previously.
In this paper, we focus on sub-challenge 1, where the goal is to predict functional modules for individual anonymized networks across a broad range of gene and protein networks. Our proposed pipeline requires as input a hierarchical tree from any state-of-the-art hierarchical community detection technique. The pipeline first identifies the optimal level of the hierarchy using an F-score comprising quality metrics like conductance 13, modularity 2 and connectivity 1. It then traverses the hierarchy bottom-up from the optimal level, merging smaller communities based on a weighted connectivity criterion as long as they fit the size constraint. Further, it splits giant connected components (module size > 100).
For each giant connected component, we re-build the hierarchical tree using a linkage-based agglomerative hierarchical technique and identify the optimal cut (number of clusters k) using the proposed F-score criterion. Finally, we propose a metric that indicates the confidence in each module among the final set of detected modules, and develop a method to automatically select the right confidence threshold to prune less meaningful modules. Figure 1 depicts the proposed pipeline for the constrained disease module identification problem.

Data
The disease module identification methods were evaluated using 6 benchmark networks. Details of the networks are provided in Table 1.

Preprocessing
Several preprocessing steps are performed before the input network can be processed by the pipeline. The node IDs are mapped to a continuous set of integers starting from 1; without this step, the network would end up with several isolated nodes and missing IDs. All the edge weights in each network are normalized between 0 and 1.
The input networks are treated as weighted and undirected throughout our pipeline.
We experimented with the removal of edges with a weight lower than a threshold t = 0.05, but observed that the corresponding results deteriorated. Hence, we recommend keeping all the edges in the network.
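As an illustration, the preprocessing described above (consecutive node IDs starting at 1, min-max normalization of edge weights into [0, 1]) can be sketched in Python. The function name and the edge-list representation are our own choices, not from the paper:

```python
def preprocess(edges):
    """Remap node IDs to consecutive integers starting at 1 and
    min-max normalize edge weights into [0, 1].

    edges: list of (u, v, w) tuples with arbitrary sortable node IDs.
    Returns (remapped_edges, id_map). Hypothetical helper, not the
    authors' actual implementation.
    """
    # Map every node ID to a consecutive integer starting from 1
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    id_map = {n: i for i, n in enumerate(nodes, start=1)}

    # Min-max normalize weights; constant weights all map to 1.0
    weights = [w for _, _, w in edges]
    lo, hi = min(weights), max(weights)
    span = hi - lo

    def norm(w):
        return 1.0 if span == 0 else (w - lo) / span

    remapped = [(id_map[u], id_map[v], norm(w)) for u, v, w in edges]
    return remapped, id_map
```

After this step the network can be handed to any of the community detection methods discussed below.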

Preliminary experiments
In the initial submission rounds, we ran several out-of-the-box state-of-the-art community detection techniques, including the Order Statistics Local Optimization Method (OSLOM) 4, Louvain 5, Multi-level Hierarchical Kernel Spectral Clustering (MHKSC) 6,7, Dynamic Tree Cut 20 and METIS 10. We also used the results obtained from these methods as input to the consensus-clustering-based method PCAgglo 21 and the ensemble-clustering-based method Ensemble-Clue 22, which were evaluated using complex traits and disease modules in 76 European GWAS datasets.
OSLOM is based on the local optimization of a fitness function expressing the statistical significance of communities with regard to random fluctuations, which is estimated with tools of extreme and order statistics. The Louvain method is a greedy optimization method that attempts to optimize the modularity of a network partition. The MHKSC technique uses a kernel spectral clustering formulation related to random walks and exploits the structure of the projections in the eigenspace to automatically determine a set of increasing distance thresholds. It then uses these distance thresholds in a test phase to obtain multiple levels of hierarchy using principles of agglomerative hierarchical clustering. The Dynamic Tree Cut method implements a novel dynamic branch cutting technique for hierarchical clustering, detecting clusters in a dendrogram depending on their shape; it is capable of identifying nested clusters of various shapes and is suitable for automation. METIS is a set of serial programs for multilevel recursive partitioning of a graph to produce fill-reducing orderings for sparse matrices. PCAgglo performs logistic PCA on the concatenated node membership matrix formed from k different methods, and agglomerative hierarchical clustering is then performed on the principal components. For METIS, Dynamic Tree Cut, PCAgglo and Ensemble-Clue, we selected the level of hierarchy for which the average module size was closest to the best as per the exploratory data analysis provided by the DREAM Challenge organizers. The results obtained by direct application of out-of-the-box state-of-the-art community detection methods are depicted in Table 2.

Insights gained
The Best of All results were not submitted during the preliminary rounds of the Challenge; the Best of All method depicts the maximum number of enriched modules that can be identified by a simple 'max' combination of these techniques at default settings. However, as per our understanding, the goal of the challenge is to develop a method or a generic framework which can optimally identify disease modules from various gene and protein interaction networks at different parameter settings. We gained several insights from these preliminary results, including:
• Methods like OSLOM, MHKSC and PCAgglo generated a set of clusters whose cluster size distribution is nearly a power law.
• For most of these methods there were several giant connected components which were ignored due to the strict upper bound constraint on the module size.
• For most of these methods nearly half of the nodes in each network were part of giant connected components that were removed due to size constraint.
• METIS generated uniformly sized clusters and included most of the nodes in each network, hence it could not be optimized further.

Quality metrics
We provide a summary of the quality metrics used and definitions of the proposed quality metrics below:

1. Modularity: Modularity is a global metric which takes a value between −1 and 1. It measures the density of links inside communities compared to links between communities. For a weighted graph, the modularity of a network partition S is defined as Q(S) = (1/m) Σ_{s∈S} (m_s − E(m_s)), where each term is the difference between m_s, the number of edges between nodes in s, and E(m_s), the expected number of such edges in a random graph with an identical degree sequence, and m is the total number of edges. A modularity value ≤ 0 indicates that the corresponding partition behaves worse than a random partition of the network. A modularity score can only be obtained for graph partitions.
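The modularity computation above can be sketched as follows; this is the standard weighted Newman modularity in its per-community form, and the paper's exact normalization may differ slightly:

```python
def modularity(edges, membership):
    """Weighted modularity of a partition.

    edges: list of (u, v, w); membership: dict node -> community label.
    Computes Q = sum_s [ m_s/m - (d_s / 2m)^2 ], i.e. observed minus
    expected intra-community edge weight under the configuration model.
    """
    m = sum(w for _, _, w in edges)   # total edge weight
    deg = {}                          # weighted degree per node
    m_s = {}                          # intra-community edge weight
    for u, v, w in edges:
        deg[u] = deg.get(u, 0.0) + w
        deg[v] = deg.get(v, 0.0) + w
        if membership[u] == membership[v]:
            c = membership[u]
            m_s[c] = m_s.get(c, 0.0) + w
    d_s = {}                          # total weighted degree per community
    for n, d in deg.items():
        c = membership[n]
        d_s[c] = d_s.get(c, 0.0) + d
    return sum(m_s.get(c, 0.0) / m - (d_s[c] / (2 * m)) ** 2 for c in d_s)
```

For two disjoint unit-weight triangles split into their natural communities, this yields Q = 0.5.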
2. Connectivity: We use a p-step random walk kernel K = (aI − L)^p, with L = I − D^{−1/2} W D^{−1/2} and a ≥ 2, to define pairwise connectivity between the nodes. Here I is an identity matrix, W is the weighted adjacency matrix and D is the weighted diagonal degree matrix. We set p = 4 for our biological networks, as this captures all meaningful interactions for paths of length ≤ 4. The connectivity of a node i within a community s is estimated as CN(i) = |{j ∈ s : K_{ij} > 0}| / |s|. Here |⋅| represents the cardinality function.
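A minimal sketch of this kernel and the node connectivity is given below. The specific kernel form K = (aI − L)^p with the normalized Laplacian is an assumption on our part; the text only states that K is a p-step random walk kernel built from I, W and D, and the exact reading of CN(i) is likewise reconstructed:

```python
import numpy as np

def pstep_kernel(W, p=4, a=2.0):
    """p-step random walk kernel K = (aI - L)^p with the normalized
    Laplacian L = I - D^{-1/2} W D^{-1/2}. Kernel form is an assumption."""
    d = W.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - Dinv_sqrt @ W @ Dinv_sqrt
    return np.linalg.matrix_power(a * np.eye(len(W)) - L, p)

def node_connectivity(K, members, i):
    """Fraction of community members j with K[i, j] > 0; a hypothetical
    reading of the node connectivity CN(i)."""
    others = [j for j in members if j != i]
    if not others:
        return 0.0
    return sum(1 for j in others if K[i, j] > 1e-12) / len(others)
```

With a ≥ 2 the matrix aI − L is positive semi-definite, so the kernel is well defined for any p.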

F-score:
We need a quality metric that evaluates the quality of a partition using modularity, conductance and connectivity. While modularity captures global information, conductance and connectivity capture local information. The proposed quality metric is defined as F(S) = CC(S) / HM(Q(S), CN(S)) = CC(S)(Q(S) + CN(S)) / (2 Q(S) CN(S)). A higher value of modularity indicates better quality clusters, a lower value of conductance leads to good quality communities, and a higher value of connectivity indicates better quality modules. We therefore need to maximize modularity and connectivity while minimizing conductance. Hence, we take the harmonic mean (HM) of modularity and connectivity in the denominator of the F-score metric to give importance to both quality metrics. Thus, with conductance in the numerator, the minimum value of the F-score corresponds to the partition S with the best quality clusters. However, if the modularity value is ≤ 0, we set the F-score to a very large value, reflecting the poor quality of the partition.
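The F-score described above can be sketched directly from its definition; the "very large value" used for non-positive modularity is a placeholder of our choosing:

```python
def f_score(Q, CC, CN, big=1e12):
    """Proposed F-score: conductance over the harmonic mean of
    modularity (Q) and connectivity (CN); lower is better.
    Partitions with Q <= 0 receive a very large score."""
    if Q <= 0:
        return big
    hm = 2 * Q * CN / (Q + CN)   # harmonic mean of modularity and connectivity
    return CC / hm
```

The level of hierarchy (or the cut k) with the minimum positive F-score is then selected, as in Steps 4 and 8 of the framework below.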

Inverse Confidence:
We need a metric to rank all the modules generated by the proposed framework. We first considered the average connectivity metric CN(s) for a community s. However, the connectivity criterion prefers smaller modules, which tend to be more cliquish than bigger modules. We also considered using the conductance CC(s) of a community s to rank all the modules in partition S. However, the conductance value decreases as the size of the community increases, due to the larger volume of the module (which is the denominator of CC(⋅)). We propose an inverse confidence metric to rank all the communities in a partition S as:

IC(s) = CC(s) / (CN(s) · n_s)

where n_s is the size of community s.
We use the Inverse Confidence metric in conjunction with modularity to filter out less meaningful communities, as illustrated in Figure 2 and explained within the proposed framework.
We finally convert the inverse confidence value of each module into a confidence score as C(s) = (1/IC(s)) / Σ_{s′∈S} (1/IC(s′)), where the denominator is used for normalization.
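Both quantities can be sketched as follows; the exact formulas are reconstructed from the text above and should be treated as assumptions:

```python
def inverse_confidence(CC_s, CN_s, n_s):
    """IC(s) = CC(s) / (CN(s) * n_s): low conductance, high connectivity
    and larger size all lower IC (lower IC means higher rank).
    Formula reconstructed from the text, not quoted from the paper."""
    return CC_s / (CN_s * n_s)

def confidence_scores(ics):
    """Normalize inverted IC values so the confidences sum to 1."""
    inv = [1.0 / ic for ic in ics]
    total = sum(inv)
    return [v / total for v in inv]
```

Dividing by n_s counteracts the size biases of connectivity and conductance noted above, so neither very small nor very large modules dominate the ranking.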

Proposed generic framework
We followed the steps indicated in Figure 1 to build the proposed framework for constrained disease module identification.
1. Given an input network, we perform the preprocessing step to create a modified input network in which the node IDs are monotonically increasing, edge weights are normalized, and the network is weighted and undirected.
2. Run a state-of-the-art hierarchical community detection technique to generate the hierarchical tree structure.
3. Estimate quality of each level of hierarchy using modularity, conductance and connectivity.
4. Select that level of hierarchy for which the F-score is minimum.
5. For communities of size > 100, return to Step 2 until either the constraint-exceeding communities cannot be split further or the modularity of the resulting cluster memberships becomes very poor.
6. In the merge step, we start with the partition (S) at the best level of hierarchy and traverse the hierarchical tree from that level in a bottom-up fashion. We iteratively merge those communities whose weighted mean connectivity score is less than the connectivity score of the module at the next level of hierarchy that contains them, i.e. we merge p and q into s when (n_p CN(p) + n_q CN(q)) / (n_p + n_q) < CN(s). Here p and q are modules at level h − 1 and s is a community at level h such that p, q ∈ s. This results in an intermediate partition set, or a set of modules.
7. We then consider all communities s such that n_s > 100. For each such community s, we consider the sub-graph comprising only the nodes from that community and transform the corresponding weighted adjacency matrix into a dissimilarity matrix. We then build the agglomerative hierarchical tree using linkage clustering with Ward's distance.
8. For each community s (n_s > 100), once we obtain the agglomerative hierarchical tree, we cut the tree for different values of k, i.e. the number of clusters. We evaluate each such partition using the F-score and select the partition with the minimum positive F-score.

9. Applying Steps 6-7 to these bigger modules, together with the smaller communities that satisfy the size constraint, we generate another set of intermediate clusters.
10. We rank this intermediate set of communities using the inverse confidence score, i.e. IC(s), ∀s ∈ S. A lower inverse confidence corresponds to a higher rank. We then remove all modules that violate the size constraint, i.e. those with n_s < 3 or n_s > 100.
11. In this final step, we propose a mechanism to select the best set of modules for evaluation in an automated fashion, independent of the network. We calculate the maximum and minimum values of inverse confidence (IC) from the IC scores of all the communities in the intermediate partition S. We then iteratively decrease the inverse confidence threshold from the maximum to the minimum, thereby pruning clusters.
At each such threshold θ, we calculate the modularity Q(S′, θ) of the remaining partition S′ using the corresponding subgraph G_{S′}. We select the threshold for which the difference |Q(S′, θ) − Q_prev| is minimal, where |⋅| represents the absolute value and Q_prev is the modularity of the partition obtained at Step 2 and calculated in Step 3. For the final submission, we consider all the modules in the optimal partition, i.e. s ∈ S′, obtained by pruning communities for which IC(s) ≥ θ.
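The merge criterion of Step 6 and the threshold selection of Step 11 can be sketched as below. Both are reconstructions from the descriptions above; `modularity_of` stands in for the paper's modularity computation on the retained subgraph and is an assumed callback:

```python
def should_merge(n_p, cn_p, n_q, cn_q, cn_s):
    """Step 6: merge children p and q into parent s when their
    size-weighted mean connectivity is below the parent's connectivity,
    so merging improves connectivity. Reconstructed criterion."""
    weighted_mean = (n_p * cn_p + n_q * cn_q) / (n_p + n_q)
    return weighted_mean < cn_s

def select_threshold(modules, ic, modularity_of, q_prev):
    """Step 11: sweep the IC threshold from max to min, pruning modules
    with IC >= theta, and keep the threshold whose retained-set
    modularity is closest to the reference modularity q_prev."""
    thresholds = sorted({ic(s) for s in modules}, reverse=True)
    best_theta, best_diff, best_kept = None, float("inf"), modules
    for theta in thresholds:
        kept = [s for s in modules if ic(s) < theta]
        if not kept:
            break
        diff = abs(modularity_of(kept) - q_prev)
        if diff < best_diff:
            best_theta, best_diff, best_kept = theta, diff, kept
    return best_theta, best_kept
```

In the sweep, lowering θ prunes more modules; the loop stops once nothing remains, and the θ whose pruned partition best preserves the reference modularity wins.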

Results
For our final submission, we utilized the method which is the fastest and most suitable for hierarchical graph partitioning, i.e. the Louvain method 5, as we were allowed to make only 1 submission. We formulated a recursive version of the Louvain method where communities of size greater than 100 were recursively partitioned. We also designed a constraint-satisfying version of MHKSC 6,7 and compared its performance with the recursive Louvain method within the proposed generic framework. The evaluation criterion used in the Challenge was the total number of significant modules identified in the 6 benchmark networks on a hold-out set of 104 GWAS datasets at a false discovery rate (FDR) cut-off 23 of 0.05 for multiple testing. We compare the results obtained from the proposed generic framework, using both the Louvain and MHKSC methods, with the winners of the DREAM Challenge in Table 3.
From Table 3, we observe that the winners (Double Spectral Clustering and Resolution Adjusted Clustering) perform far better than the Constrained Louvain method on the protein-protein interaction networks (Networks 1 and 2) and the homology network (Network 6). However, for the signaling, co-expression and cancer networks (Networks 3, 4 and 5), the proposed Constrained Louvain method has comparable performance with the winners of the challenge. To gain a sense of the robustness of the ranking with respect to the final GWAS data, the challenge organizers sub-sampled the hold-out set by drawing 76 GWASs (the same number as during the preliminary phase) out of the 104 GWAS datasets. They created 1,000 subsamples of the hold-out set, and the methods were then scored on each subsample (sub-sampling was done without replacement). The performance of each competing method t for a given network was compared to the highest scoring method across the sub-samples by the paired Bayes factor B_t, i.e. the method with the highest score on this network in the hold-out set (all 104 GWASs) was defined as the reference. The score n_s(t, b) of method t in subsample b was thus compared with the score n_s(ref, b) of the reference method in the same subsample b. The Bayes factor B_t is defined as the number of times the reference method outperforms method t, divided by the number of times method t outperforms or ties the reference method over all subsamples. Methods with B_t < 4 were considered a tie with the reference method (i.e., method t outperforms the reference in more than 1 out of 5 subsamples). For Networks 3, 4 and 5, the Bayes factor of the proposed Constrained Louvain method was less than 4. This indicates that the proposed generic framework, though not the winner, is useful, generic and robust enough for the identification of statistically significant disease modules in biological networks.
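The paired Bayes factor described above reduces to a simple win count over subsamples. A minimal sketch, following the definition in the text (the handling of the degenerate zero-denominator case is our assumption):

```python
def bayes_factor(scores_ref, scores_t):
    """Paired Bayes factor B_t: number of subsamples in which the
    reference method outperforms method t, divided by the number in
    which t outperforms or ties the reference. B_t < 4 is treated as
    a tie with the reference in the challenge scoring."""
    ref_wins = sum(r > t for r, t in zip(scores_ref, scores_t))
    t_wins_or_ties = sum(t >= r for r, t in zip(scores_ref, scores_t))
    return ref_wins / t_wins_or_ties if t_wins_or_ties else float("inf")
```

For example, if the reference beats method t in 3 of 5 subsamples and t ties or wins in the other 2, B_t = 1.5 < 4, so the two methods are scored as tied on that network.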
With the availability of the de-anonymized version of the networks, along with the scoring tools used during the competition, we were able to perform additional experiments for the Constrained Louvain method. After the challenge, we identified an error in the labeling of nodes in the significant disease modules that we submitted for the homology network (Network 6) during the competition. After correcting the labeling error, we identified 2 significant disease modules from Network 6.
Moreover, we performed additional analysis using 5 different FDR cut-offs (multiple testing) for each of the 6 benchmark networks to obtain trends in the number of significant disease modules identified by the proposed generic framework at these cut-offs. This result is depicted in Figure 3. The FDR cut-off used as the evaluation criterion during the competition was 0.05.

Discussion
The DREAM Challenge organizers made the GWAS datasets, along with the de-anonymized networks, available to the challenge participants. This allowed us to further analyze our results. For each benchmark network, we identified the proteins or genes that make up the significant disease modules.

Open Peer Review
Current Peer Review Status:

Yunpeng Liu
Massachusetts Institute of Technology (MIT), Cambridge, MA, USA

In this paper, the authors describe a new pipeline for identifying disease modules from large-scale biological networks in the DREAM challenge. The pipeline builds upon off-the-shelf hierarchical community detection methods and first generates an initial partitioning of the network using a given community detection algorithm. It then integrates multiple properties of the network, including modularity, conductance and connectivity, into an F-score to benchmark the partitioning at different levels of the hierarchy and selects the best partitioning. Next, the pipeline merges modules in a way that increases connectivity, and resulting modules that exceed the size threshold are partitioned again using hierarchical clustering and split at a level corresponding to the F-score minimum. After a second round of merging similar to the previous steps, a final set of modules is generated and ranked using a score termed inverse confidence. Using known disease-gene associations obtained from GWAS datasets, the authors verified the modules identified by their pipeline with multiple community detection algorithms, and compared performance across different networks with that of the top team in the challenge. The authors conclude that despite the top-performing team scoring highest overall, there are several cases where the number of modules identified in this paper is at least comparable to that from the top-scoring method. Additionally, the authors claim that their pipeline is a generic framework for identifying statistically significant disease modules from biological networks. The methodology put forward in this paper seems novel. However, neither the utility of the module identification pipeline nor its generalizability is adequately demonstrated. This is likely due to vagueness in the description of methods and a lack of theoretical justification and supporting computational experiments to validate the procedures and scoring metrics devised by the authors. Specific comments are listed in detail below.
The description of the module identification framework seems elusive and lacks detail, rendering it unclear whether it is based on sound theoretical foundations. The framework works through a series of merge-split-merge steps that seems to hint at an iterative procedure to refine network partitioning. However, the method stops at the second merging step and discards all modules whose sizes exceed the thresholds. The authors need to provide a rationale for this: is it because of empirical results showing that the modules identified by the pipeline do not change much after these steps and thus do not require further iterations in general? In addition, the paper seems to switch between methods and scoring schemes in different steps of the pipeline; for example, the merging step is performed using increasing connectivity as the criterion, whereas the splitting step uses the F-score.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

In their paper, "An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity", Mall et al. present an approach to identifying modules of genes from different types of networks, where their approach uses a novel quality metric to evaluate the quality of the partitions based on a number of network metrics such as modularity, conductance and connectivity. Through a series of steps defining the module detection pipeline employed by the authors, they identify modules for the different types of networks and assess the "truth" of the modules using enrichment metrics based on GWAS findings, as defined by the DMI DREAM competition to which the authors had submitted their approach and results.
The approach detailed by the authors seems reasonable, and the idea of the DMI to test a great variety of methods against one another is excellent. The processing of networks to identify biologically meaningful modules is still an important area of research, and competitions such as DMI have helped to assess progress and identify best practices. However, while the work presented is potentially interesting, as written the general DMI approach and specific results are difficult to understand, and so hard to evaluate in light of this (as detailed below in the specific comments). Further, this paper represents a detailing of one of many approaches that were used in the DMI challenge, and shows results only compared to the winner of the challenge. The approach detailed apparently ranked 15th in the competition, and so was beaten by 14 other approaches. There is no discussion around this, no discussion on why a reader should care to know about one approach that ranked 15th compared to, say, the 19 other approaches that ranked in the top 20 (14 of which would have beaten the described method). There is no motivation provided on why knowledge of the authors' approach should be considered in light of 14 other approaches that beat it in the DMI competition. Do the authors believe the DMI competition was the best way in which to assess module identification methods, and that the field should adopt the top-scoring methods as the state of the art? Do the authors believe that, for the types of networks such as co-expression where their approach was comparable to the winners, their approach has broader utility? Were the other top 13 methods similar with respect to performance across network types?

Specific Comments:

1. The paper is somewhat oddly written in that the methods section contains some methods along with some results, and generally fails to really describe the methods employed by DMI to compare methods. Without this, the only way to understand the results in the paper is to invest much time going through the challenge description, the results, and so on. That is a huge burden placed on the reader. The components necessary to understand the results given in the paper should be described in the paper, and if there are references that describe fuller details, then those could be summarized in the methods so that the reader understands what was done and where to go for more details. (Further details on what is missing are given in the following comments.) The authors give preliminary experiments in the methods section, which ostensibly drove thinking and refinement of the approach they ultimately settled on. The preliminary experiments are not really methods; they are more results. And then there is an "insights gained" section in the methods, which again is not really a method but rather details learnings from these earlier results. While the authors do detail their own module identification process, the way in which the validity of the modules was assessed is not clearly articulated. What were the criteria set forth by DMI? How were the genes identified given a GWAS finding? There is error associated with identifying the vast majority of genes associated with a GWAS finding, so how was this handled? Was an enrichment score used for genes from GWAS being identified in the module? Was it per disease and combined over all diseases? Did effect sizes come into play? Etc. There should at least be a summary of this so that the reader can understand what it means to be able to count a module in the accuracy score for the competition. But there is nothing on this. The results speak to the paired Bayes factor that was used to compare methods, but you can't really understand the appropriateness of that without having an understanding of the above questions. There is a link to the Synapse platform regarding the challenge, with many scores of pages of material, and then a paper posted on bioRxiv that provides details on the challenge and the findings. But it is not yet peer reviewed, it does not appear to be published yet, and so all of the missing detail in this present paper simply points to other papers that are not peer reviewed. It's again a pretty tall order to ask a reviewer to sift through endless pages of material to understand the context of a paper they have been asked to review, and to then review on top of that the papers upon which the paper they were asked to review is based.

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? No

Competing Interests: I note that I am co-founder and a board member of Sage Bionetworks, the institution that ran the challenge upon which the results of this paper are based.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Figure 1. Steps involved in the proposed generic constrained disease module identification framework.

Figure 2. Modularity values for different partitions obtained at various inverse confidence thresholds for network 3_signal. The optimal inverse confidence threshold value is also highlighted.

Figure 3. Number of disease modules identified by the Constrained Louvain method at different false discovery rate (FDR) cut-offs for the 6 benchmark networks.
doi.org/10.5256/f1000research.15518.r32972 © 2018 Schadt E. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Eric E Schadt
Department of Genetics & Genomic Sciences, Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Table 2. Preliminary results using several state-of-the-art hierarchical module identification techniques.
Comparison of several out-of-the-box community detection methods, along with one consensus-based and one ensemble-based clustering method, for disease module identification on 6 different biological networks. Here N represents the total number of candidate disease modules and n_s represents the number of significant/detected disease modules in the 76 genome-wide association study (GWAS) datasets. OSLOM - Order Statistics Local Optimization Method, MHKSC - Multi-level Hierarchical Kernel Spectral Clustering.

Table 3. Final submission results comparing the winners with the proposed generic framework.
Here the proposed generic frameworks are referred to as Constrained Louvain and Constrained Multi-level Hierarchical Kernel Spectral Clustering (MHKSC), and we use * to represent the winners of the competition. Here N represents the total number of candidate disease modules and n_s represents the total number of significant disease modules identified in the 104 genome-wide association study (GWAS) datasets. In the final round of the challenge, we submitted the results corresponding to the Constrained Louvain method.

Table 9. Significant disease/trait modules identified for the 6_homology network by the proposed Constrained Louvain method after the challenge.