Enhancing gene set enrichment using networks

Differential gene expression (DGE) studies often suffer from poor interpretability of their primary results, i.e., thousands of differentially expressed genes. This has led to the introduction of gene set analysis (GSA) methods that aim at identifying interpretable global effects by grouping genes into sets of common context, such as, molecular pathways, biological function or tissue localization. In practice, GSA often results in hundreds of differentially regulated gene sets. Similar to the genes they contain, gene sets are often regulated in a correlative fashion because they share many of their genes or they describe related processes. Using these kind of neighborhood information to construct networks of gene sets allows to identify highly connected sub-networks as well as poorly connected islands or singletons. We show here how topological information and other network features can be used to filter and prioritize gene sets in routine DGE studies. Community detection in combination with automatic labeling and the network representation of gene set clusters further constitute an appealing and intuitive visualization of GSA results. The RICHNET workflow described here does not require human intervention and can thus be conveniently incorporated in automated analysis pipelines.


Introduction
Interpretation of whole-transcriptome differential expression studies is often difficult because the sheer volume of the differentially expressed genes (DEGs) can be overwhelming. It is common place in designed experiments with more than just a marginal biological effect to find several thousands of differentially expressed genes (DEGs). One way to handle the vast numbers and to identify the biological consequences of gene expression changes is to associate them with overarching processes involving a whole set of genes, such as Gene Ontology (GO) terms or Kyoto Enzyclopedia of Genes and Genomes (KEGG) pathways.
Curated genesets have been designed or discovered for a wide range of common contexts, such as, a biological process, molecular pathway, or tissue localization 1,2 . They have been introduced in the past not only to reduce complexity and to improve interpretability but also to increase statistical power by reducing the number of performed tests. As it turns out, this often results in finding hundreds of differentially regulated pathways 1 .
As with co-expressed genes, many of the pathways exhibit strong mutual correlation because they contain a large proportion of shared genes which is in turn a result of the fact that many of them describe closely related aspects of an overarching biological theme. Therefore, to further increase interpretability of differential geneset regulation and to capture the global change of a biological phenotype, it would be desirable to identify possibly existing umbrella organizations among genesets.
Networks are ideal to model dependencies, interactions, and similarities among individuals 3-5 , be it people, computers, genes, or genesets. The degree of connectivity between them can have an influence on information flow and defines communities or cliques, i.e., clusters of highly connected nodes within and infrequent connections between them.
In order to construct a geneset network, a similarity measure is required and can be defined as the fraction of common genes, also called the Jaccard index 6 . Other ways to measure similarity among genesets include, for instance, coexpression strength as implemented in WGCNA 7,8 .
Community detection based on network topology is a standard problem in the analysis of social networks 9, 10 . Well-established algorithms allow for computationally efficient clustering of genesets and can be used to identify highly connected sub-networks. There is no unique or optimal method available but many options exist. Popular methods to define clusters include the edge-betweenness criterion, the Infomap or the Louvain algorithm (igraph), as well as hierarchical or kmeans clustering.
Once geneset clusters are defined they can be characterized by their size and connectivity and thus prioritized and ranked. In particular, the clusters can be categorized as singletons, doublets, medium and large or dense and loose clusters.
Network analysis not only allows for detection of clusters and performance of measurements on them, networks are also straightforward and appealing visualizations of similarities among genesets. There are a couple of interactive visualization software tools available, of which Cytoscape is probably the most popular 11 . In some cases interactivity is useful but the emphasis here is to provide some of Cytoscape's features without any human intervention for easy integration into automatic analysis pipelines. For instance, automatic labeling of communities using the n most frequent terms was adopted here, similar as in Kucera et al. 12 .
The purpose of this step-by-step workflow is to provide a fully automated and reproducible procedure for downstream analysis and visualization of differential geneset analysis results in R 13 . The focus is on supporting scientists in result interpretation by bringing order into the list of differentially regulated genesets based on biological rather than pure statistical arguments. The workflow is suitable for any kind of geneset library including new or custom sets and any kind of geneset analysis method.
Starting with differential expression analysis of a model dataset, geneset analysis is performed based on the MSigDB library. A geneset network is constructed to identify isolated genesets (singletons) and geneset pairs (doublets). Larger connected sub-networks are then split into smaller clusters of closely related genesets describing similar processes. The effect of each modification step on the network topology is visually documented in Figure 1-Figure 4. Using the most frequently occurring terms in the geneset names of a cluster, an attempt to automatically assign cluster labels is made. Finally, all labeled clusters of genesets are plotted to provide a one page overview of the results.

Mapping Ensembl IDs to ENTREZ IDs
We are using the popular org.Hs.eg.db package based on the UCSC annotation database and keep only genes with a unique mapping.

Network construction
We construct a gene set network based on the proportion of common genes as the inverse distance measure. The nodes are gene sets which are connected by edges if the Jaccard index

Number of common genes
Number of all genes J = is larger than a preset threshold, J > 0.2. While this threshold is somewhat arbitrary it has proven to be a reasonable one in many projects. As a guide for finding a reasonable threshold a broad distribution of disjoint cluster sizes is desired. Network analysis does not help if the cutoff is too large (no connections) or too small (all sets are connected with each other). In any case, it is strongly recommended to investigate its effect on the quality of the results. length(intersect(unlist(x), unlist(y) )) ) ) diag(m.adj) = 0 # Jaccard index matrix NGenes = sapply(gset, length) m.union = outer(NGenes, NGenes, "+") -m.adj m.jacc = m.adj / m.union The Jaccard matrix, or adjacency matrix, can be conveniently used to construct a network object using the function igraph::graph_from_adjacency_matrix(). In this example geneset, similarity is measured using all member genes irrespective of whether they were detected and present in the data. Alternatively, one could include only genes present in the data depending on whether the current data seem more relevant and trustworthy or the prior information given by the geneset definition. Graphical display is achieved here using ggnet::ggnet2() (Figure 1).

Network modifications
In the following, components of the network for which network analysis does not improve interpretability are identified and put to aside. This includes singletons, i.e., genesets not connected to any other geneset, and doublets, also termed binary systems or dumbbells, i.e., pairs of genesets connected with each other but isolated from the rest.

Identify binary systems (2 sets)
Next we also want to separate clusters with less than 3 gene sets. To do so, we separate disjoint subnets as individual objects, count their members, and delete all vertices belonging to clusters of size smaller than 3.

Automatic annotation of gene set clusters
In analogy to the popular interactive network visualization tool cytoscape 12 , we attempt to generate automatic labels for gene set clusters. Gene set names are split into individual words and counted within each cluster. The four most frequent terms occurring at least twice are used as labels. The function clust_head() is defined for this purpose and contains an exclusion list of words not used.

Discussion
We have presented an automated workflow based on a small number of R packages for prioritization and visualization of gene set analysis results using networks, which we call RICHNET. We demonstrated how community detection facilitates categorization of differentially regulated gene sets into singletons and clusters of different size ranges. Automated label generation allowed to associate these clusters with biological themes or processes of which the member gene sets are part of.
The RICHNET workflow could be altered or extended quite naturally in a number of ways but the version presented here is the one we typically apply in our research service projects. One advantage over other approaches is that it does not depend on a particular geneset library or geneset analysis method. Any means of selecting genesets of interest can be used. Specific hierarchically constructed genesets, such as GO terms, would offer a straightforward way to arrive at a more global process description using higher levels in their tree structure. A second advantage is that it does not depend on the existence of a good quality gene or protein interaction network for the particular organism or disease state which is often not feasible. Only very few genesets are network-based (e.g. KEGG pathways) and would thus offer a straight-forward way to use an a priori network topology. Thirdly, similar as in reference 8, a geneset similarity network could be constructed in the form of a co-enrichment network from GSVA enrichment scores 20 using weighted co-expression network analysis (WGCNA) 7 . However, this approach relies on a relatively large sample size whereas the sample size requirement of RICHNET is not more than the GSA it relies on.
As an alternative to the networks of genesets described here, networks of genes could be created in a reciprocal way. The underlying similarity metric between genes could be defined as the proportion of common genesets among all genesets they are part of. This approach would be equivalent to a STRING-DB network with "databases" as the only interaction allowed 21 .
One possible future extension of the RICHNET workflow could be the introduction of a consensus similarity metric from multiple initial networks and different community detection or cluster algorithms to improve stability against noise. A second avenue forward could be the introduction of interactive graphics in 2D or 3D 17 to allow moving, pulling, rotation or zoom and display of node specific or edge specific information.
Some may argue in favor of encapsulating the RICHNET workflow in an R or Bioconductor package. However, it is our strong believe that for the sake of transparency and given the straightforward nature of the code it serves better to publish it openly. This way we encourage the users to adapt it to their specific requirements, to improve and expand on it.

Data availability
The data used in this workflow is included in the airway R-package 19 .

License: Creative Commons CC BY license.
Packages used This workflow depends on various packages from version 3.7 of the Bioconductor project, running on R version 3.5.0 or higher. A complete list of the packages used for this workflow is shown below:   Regretfully, many of our previous comments have not been adequately addressed in the current revision and the author's response, particularly in regard to comparing the proposed method to existing work and justifying its ability to prioritize gene sets in a biologically relevant way. We also have concerns about design decisions in the algorithm and workflow, including the choice of Jaccard cut off and the use of categorical enrichment scoring, that are not adequately explained in the manuscript and responses and continuing technical problems in running the code.
The author acknowledges the similarity to the Enrichment Map method (Merico et al. 2010 ) and notes that it is cited in the manuscript. However, the method is never explicitly mentioned in the text nor is its similarity to the author's method mentioned even though the similarity is mentioned in the response to review. Given this similarity we feel the text should include a comparison or a discussion of benefits or drawbacks of the respective methods. While the author responded that Merico et al. is cited, the citation only refers to the use of the Jaccard method. The response to review contains the following sentence, "The present workflow implements a similar functionality in R and allows for integration in automated analysis pipelines." Why not just edit or expand on this, specifying that it refers to Enrichment Map, and insert into main text?
The author has not clarified whether "umbrella" is meant as a synonym for hierarchy or if it is describing some other structure. As the author does note that a collection need not have an explicitly defined hierarchy for this method to be useful, how would a potential user determine if such an "umbrella" structure was present and whether or not a given gene set collection would be a good candidate for this approach?
In discussing the Jaccard cut off, the author mentions the general properties preferred in an optimal network, but it is still not clear how a user would robustly evaluate their choice of Jaccard coefficient. Is it sufficient to simply assure that there are at least two distinct components in the graph or is another heuristic more appropriate? Again, some indication of quantitation would be helpful to the user.
helpful to the user.
The author notes that additional validation with respect to different parameter settings and input datasets is planned for future work. We are not clear why the author cannot indicate how the results in Figure 5 recapitulate known biology in the airway dataset or if they would help the user draw new conclusions. It is unclear from the manuscript whether or not these clusters are in fact useful for further biological interpretation downstream of the proposed method. Wouldn't this be the main point of using such a method?
With regard to the choice of gene set collection (Hallmarks, KEGG, Reactome, etc.), we would encourage discussion as to the merits or drawbacks of different collections. We note that the Hallmark collection is the most frequently accessed gene set collection in MSigDB and was built using a method that included, in part, clustering of existing gene sets. The paper could be significantly strengthened by comparison of the results of the gene set clustering and prioritization based on enrichment scores in RICHNET to the existing Hallmark collection. We also note that collections of gene sets used to create each individual Hallmark gene set are available on MSigDB.
The author notes that a continuous coloring scheme for output visualizations could be developed using the negative log10 of the p-values. This is certainly possible, but the author has not addressed our concerns involving the coarse, discrete categorization of gene sets as "Up" or "Down" when many popular algorithms for calculating continuous enrichment scores exist. This design decision should be justified in a bit more detail.
Technical problems in running the code: The author has fixed the original bugs in the analysis. However, we encountered the following technical issues when running the new version of the R markdown notebook: a.The EGSEA package is not included in the setup code. Thus, if a user did not have it previously installed, then an error would be triggered later in the analysis. b. The R markdown notebook itself as accessed from the DOI contains notes to comment or delete code "for real submission". We are therefore uncertain if we are running the correct code for this analysis. c. We encountered the error "Error in `diag<-`(`*tmp*`,value=0): only matrix diagonals can be replace" at line 185 of the notebook and were unable to continue with the analysis. There is currently a high burden on the users of this code to identify and solve package dependencies and other errors whose origins are not immediately clear to those not familiar with the underlying packages. These concerns should be addressed to make the workflow distribution more robust and user-friendly. No competing interests were disclosed.

Competing Interests:
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

10.
Jaccard index in the visualization to allow the user to see the effect of various thresholds? In fact, Enrichment Map (cited above) uses the Jaccard Index itself to determine the weight of the line connecting two overlapping gene sets. These approaches need to be compared and the choice of threshold better explained and justified.
Prioritizing and ranking the active pathways/gene sets by number of gene sets in a network hub and degree of connectivity seems inadequate if one is looking for biological insights. A biological measure of prioritization would be preferable which includes the levels of activation. Often a study is not done in a vacuum and so in fact some signals may be expected -this may serve as an additional measure for prioritization and even validation of the method. This issue is not addressed and should be and some validation of the results of the only test set analysed should be supplied.
It is important to include and describe the numerous network-based enrichment methods that have been published and specifically Enrichment Map (cited above) which is a plugin to both Cytoscape and GSEA and does exactly what the author is describing (downstream analysis of gene set enrichment results) but is never mentioned or referenced. At the very least its performance should be compared to the author's method. This is a requisite for any newly proposed method.
The goal of this method is to "focus is on supporting scientists in result interpretation by bringing order into the list of differentially regulated gene sets based on biological rather than pure statistical arguments." But one might ask why do this based on network measures of association and the visualization is very sparse in what it represents as noted above. Also, if this is the goal -it is incumbent on the author to show why this works better than other existing methods (see above) or even using a more sensitive/specific collection of gene sets, e.g., the Hallmark collection in MSigDB (Liberzon , 2015 ). This provides a collection of sets for which essentially this work et al. has been done with additional biological curation. Comparison with other methods and with the use of other gene set collections should be included.
The author's example on the "airway" data set employs an analysis with KEGG, Reactome, and BioCarta. Using these 3 databases together means there will be a large amount of redundancy and overlap in the gene sets he uses -again it would be important to compare his results using these collections against the Hallmark collection. Furthermore -why just the results from one data set?
The method would be better tested against multiple data sets -some where the signal is very strong and some where the signal is weaker.
The author uses the CAMERA method for testing gene sets. This method produces only a binary "Up" or "Down" measurement of enrichment which is used to color the nodes in the resulting network visualizations. This is a very coarse way of testing gene sets. The interpretability of the network visualizations could be improved by using a method such as GSEA, which gives a continuous enrichment score, and coloring the nodes with a gradient to compare degrees of up-or down-regulation and weighting the edges as is done in Enrichment map as noted above.
Finally -after application of RICHNET the author only describes and discusses the nature of his resulting networks. There is no discussion that we could see of the biological insights gained, how realistic they were, whether they recapitulated known signals in the data set, etc.
Technical concerns with the code as presented: The analysis fails with an unintuitive error 2 10. Technical concerns with the code as presented: The analysis fails with an unintuitive error ("could not coerce net to a network object") if the library "Intergraph" is not installed. While this library is listed in the sessionInfo() printout in the manuscript, this library should be included in the first library() cell of the notebook since its absence is not immediately obvious given this error. No competing interests were disclosed.

Competing Interests:
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above. The remark on the similarity of this work with prior work by others is fully justified and is discussed in the manuscript. The mentioned work by Merico is included as reference 6. et al.
The open nature of this workflow script allows for a straightforward implementation of any additional graphical feature should the user of this workflow wish to do so. additional graphical feature should the user of this workflow wish to do so. By using the term "often" I was hoping to emphasize that while the statement is true for more than a small number of cases there may be cases where the statement is wrong. How often the statement is true or whether it depends on the choice of geneset or enrichment algorithm or FDR cutoff is less important as long as there exist enough cases for which the workflow presented here may be useful.
Most geneset collections consist of largely overlapping sets which describe very similar processes or properties. The workflow presented here was made for such cases. For the workflow to be useful, an explicit hierarchy, such as with GO-terms, is not required.
The network as constructed here has indeed edges weighted by the Jaccard index. Visualization by varying edge thickness has proven to be difficult to discriminate in my hands. The open nature of this workflow script should allow for an easy implementation of this feature should a user wish to do so. To address the choice of the Jaccard index cutoff the following sentence has been added at the end of section : "As a guide for finding a reasonable threshold a broad Network construction distribution of disjoint cluster sizes is desired. Network analysis does not help if the cutoff is too large (no connections) or too small (all sets are connected with each other)." One measure of quality of the network construction and community detection is the semantic purity of the clusters. This can be easily seen for the binary systems (and is discussed there) and to a lesser extent for the larger clusters. Whether biological insight can be gained depends strongly on the choice of the genesets in relation to the field of study. RICHNET provides a means of grouping and organizing results and does not generate them. Enrichment Map is indeed very similar and a perfectly fine interactive tool (see reference 6 of the manuscript). The present workflow implements a similar functionality in R and allows for integration in automated analysis pipelines.
Many of our clients find the hallmark geneset too generic and prefer more diverse geneset collections, such as, KEGG or Reactome.
Your suggestion to include more test results with respect to different geneset libraries, Jacquard score cutoffs, community detection algorithms, and datasets is well received. We are planning to do this in the future.
Using the negative log of the enrichment p-value as enrichment score has been demonstrated in numerous cases in the literature and replacing the categorical colour scale by a continuous one is possible.
Biological interpretation of the geneset clusters produced here is beyond the scope of this work.
The workflow is tailored towards the camera analysis. What if users want to use this workflow for other GSA methods? A more generic example is needed. Any possibility to wrap the network analysis code and make it easier for users to invoke as a function? Probably, an R package named RICHNET can be developed along with this wonderful workflow. 1.

, Swiss Institute for Bioinformatics, Zurich, Switzerland Michael Prummer
Thank you for reviewing our manuscript and for your constructive comments. Below are point-by-point responses to the individual comments.
The missing packages were included in the initial package installation and loading part.
The source of the error in ggnet2 is fixed.
Uni-and bi-directional tests are preformed in parallel. Afterwards, priority is given to the uni-directional test as it is more stringent (either up-or down-regulated genes). The bi-directional case is biologically meaningful as well as an upregulation of and inhibitory member of a pathway has the same effect as the downregulation of an activatory member. But all these details are related to one particular choice of enrichment analysis whereas the extent of this work starts after a list of candidate genesets was generated by any appropriate method. Biological interpretation of the geneset clustering results produced here is beyond the scope of this work.

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes No competing interests were disclosed. Competing Interests: Reviewer Expertise: computational biology, systems biology, network biology, network medicine I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. While a discussion on the influence of different distance metrics on network structure is interesting it is not in the scope of this work. Sidenote: a discussion on the influence of different community detection algorithms would be interesting as well. These aspects are critical to any attempt at clustering data but it is assumed here that the qualified reader and user of this workflow is aware of it. The issue is briefly mentioned in paragraph four of the discussion.
Avoiding the use of GO terms was an attempt to avoid having to open up the discussion about the influence of their hierarchical structure on results. I am convinced this would make the discussion unnecessarily complicated and that it is better done elsewhere.
The equation defining the Jaccard index is formatted correctly in the PDF.
In the manuscript it is emphasized that nothing is removed or deemed irrelevant and that putting aside singletons and doublets is just a means of sorting. Indeed, an unusually large proportion of singletons may indicate an unexplored area of biology. In such a situation, relying on common knowledge in the form of published genesets may not be the wisest way to go at all.
There are a number of different possibilities to obtain a representative label for a cluster and in this manuscript a relatively simple one was chosen. It may not be the most sophisticated but it is straight forward to understand.
The following text is included in section on : "This Lattice of annotated networks missing titles may be indicative for a semantically mixed cluster or for sparse prior knowledge." No competing interests were disclosed. Competing Interests:

Comments on this article
Version 1