A unified GenomeSpace recipe to identify essential genes and associated subnetworks from Genome-Scale CRISPR-Cas9

We present a unified GenomeSpace recipe that combines the results of a high-throughput CRISPR genetic screen with a biological network to return a subnetwork that suggests a mechanistic explanation of the screen’s results. The explanatory subnetwork is found by network propagation, a popular systems biology approach. We demonstrate our pipeline on an alpha-toxin screen, revealing a subnetwork that is both highly interconnected and highly enriched for hits in the screen.


Introduction
The rise of next-generation sequencing and CRISPR gene editing has opened up new opportunities for high-throughput genetic screens. Systems biology and molecular networks are becoming increasingly important in analyzing the mechanisms implicated in these screens. Here we present a GenomeSpace recipe that provides a standardized pipeline for combining the analysis of a screen with networks; it represents a logical next step in providing user-friendly bioinformatics workflows for these types of screens.
This recipe provides a way to process the results of a CRISPR-Cas9 genome-wide knockout screen. In such a screen, single guide RNAs (sgRNAs) are designed to knock out genes by directing the Cas9 nuclease to the target gene, where it introduces double-strand DNA breaks (Koike-Yusa et al., 2014). Error-prone repair of these breaks typically disrupts the coding sequence, so the targeted gene no longer yields a functional product. If a cell receives an sgRNA targeting a gene that is essential for its survival, that cell will die, and the sgRNA will be depleted from the population. Thus, by sequencing the sgRNAs and looking for depletion of the sgRNAs targeting a particular gene, we can infer the essentiality of that target gene. Since a large number of sgRNAs can be introduced in a single screen, the essentiality of many genes can be tested at once. However, challenges arise in the normalization and processing of the read counts: often more than one sgRNA targets a gene, each with a different efficiency, and significant biases exist in sequencing different sgRNAs. For these reasons the MAGeCK method (Li et al., 2014) was developed to handle data resulting from such a screen.
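To make the depletion logic concrete, the following is a minimal sketch of how per-gene depletion could be scored from sgRNA read counts. It is not the MAGeCK algorithm: it assumes simple reads-per-million scaling, a log2 fold change per sgRNA, and the median across sgRNAs as the gene-level score. The function name, sgRNA identifiers, gene names, and counts are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def gene_log_fold_changes(counts, sgrna_to_gene, pseudocount=1.0):
    """Score each gene by the median log2 fold change of its sgRNAs.

    counts: {sgrna_id: (control_reads, treated_reads)}
    sgrna_to_gene: {sgrna_id: gene_name}
    """
    total_control = sum(c for c, _ in counts.values())
    total_treated = sum(t for _, t in counts.values())
    per_gene = defaultdict(list)
    for sgrna, (control, treated) in counts.items():
        # Scale to reads per million so differences in library depth cancel.
        control_rpm = control / total_control * 1e6
        treated_rpm = treated / total_treated * 1e6
        lfc = math.log2((treated_rpm + pseudocount) / (control_rpm + pseudocount))
        per_gene[sgrna_to_gene[sgrna]].append(lfc)
    # The median is robust to a single inefficient or off-target sgRNA.
    return {gene: median(lfcs) for gene, lfcs in per_gene.items()}


# Hypothetical toy data: two sgRNAs per gene.
counts = {
    "sgA_1": (500, 40), "sgA_2": (450, 60),    # strongly depleted
    "sgB_1": (480, 520), "sgB_2": (510, 490),  # roughly unchanged
}
sgrna_to_gene = {
    "sgA_1": "GeneA", "sgA_2": "GeneA",
    "sgB_1": "GeneB", "sgB_2": "GeneB",
}
scores = gene_log_fold_changes(counts, sgrna_to_gene)
most_depleted = min(scores, key=scores.get)
```

In this toy example GeneA receives a strongly negative score and would be the candidate essential gene. A dedicated method such as MAGeCK additionally models count variance and sgRNA efficiency, which this sketch ignores.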
On the systems biology side of the analysis, we have chosen to employ network propagation to identify subnetworks representing inferred mechanisms that are implicated by the CRISPR screen. Network propagation has become an essential tool in many network applications; it has been used to identify mechanisms of cancer (Leiserson et al., 2015), to implicate genes in GWAS (Qian et al., 2014), and to find functional modules (Vanunu et al., 2010). Network propagation considers genes as nodes on the graph of a biological network. It performs a random walk along the edges of the graph from a set of query nodes. We expect that genes implicated in a phenotype will occur in regions of the network that represent mechanisms relevant to the screen conditions, and so the random walk will be likely to land on relevant genes. Genes that are near query nodes are therefore implicated by association. For a review of the many flavors and applications of network propagation, see Cowen et al. (2017).
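The random-walk view of propagation can be illustrated with a small sketch of a random walk with restart, one common flavor of network propagation. This is an illustration only and is not the diffusion variant used by the recipe; the restart probability of 0.5, the function name, and the toy path network are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at the seeds.

    adjacency: {node: set of neighboring nodes} for an undirected graph
    seeds: set of query nodes (all must be keys of adjacency)
    """
    nodes = list(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                # An isolated node keeps its own mass.
                nxt[n] += (1.0 - restart) * prob[n]
                continue
            # Otherwise the walker moves to a uniformly chosen neighbor.
            share = (1.0 - restart) * prob[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


# Hypothetical path network A - B - C - D - E with a single query node A.
path = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}
visit_prob = random_walk_with_restart(path, {"A"})
```

On this path the visiting probability falls off with distance from the query node, which is the sense in which nearby genes are implicated by association.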

Methods
An overview of the pipeline appears in Figure 1. The recipe begins with the raw read counts of sgRNAs as input to the MAGeCK module in GenePattern (www.genepattern.org). After normalizing the data, MAGeCK detects differential read counts for each sgRNA using an over-dispersed Poisson model. Next, it detects statistical underrepresentation of the sgRNAs corresponding to particular genes to infer that a gene is essential for the survival of a cell. The reasoning behind this is that if an sgRNA targets an essential gene, the cells that contain that sgRNA will not replicate and the sgRNA will be underrepresented compared to other sgRNAs.
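As an illustration of the normalization step, the sketch below computes median-of-ratios size factors, a scheme widely used for count data. Whether the GenePattern MAGeCK module is configured to use exactly this scheme is not stated here, so this should be read as an assumption; the function names and the toy count table are hypothetical.

```python
import math
from statistics import median


def median_ratio_size_factors(count_table):
    """One size factor per sample from a {sgrna_id: [count per sample]} table."""
    n_samples = len(next(iter(count_table.values())))
    # Log geometric mean per sgRNA; sgRNAs with a zero count are skipped.
    log_geo_means = {
        sgrna: sum(math.log(c) for c in row) / n_samples
        for sgrna, row in count_table.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(count_table[sgrna][j]) - log_mean
            for sgrna, log_mean in log_geo_means.items()
        ]
        # The median ignores the minority of sgRNAs that truly change.
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(count_table):
    """Divide every count by the size factor of its sample."""
    factors = median_ratio_size_factors(count_table)
    return {
        sgrna: [c / f for c, f in zip(row, factors)]
        for sgrna, row in count_table.items()
    }


# Hypothetical table: sample 1 is sequenced twice as deeply as sample 0,
# and sgD is the only sgRNA that is genuinely depleted in sample 1.
raw = {"sgA": [100, 200], "sgB": [50, 100], "sgC": [10, 20], "sgD": [30, 6]}
size_factors = median_ratio_size_factors(raw)
normalized = normalize(raw)
```

After normalization the unchanged sgRNAs have equal counts in both samples, while sgD remains clearly lower in sample 1, so the depth difference is removed without masking the real depletion.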
After we determine a set of essential genes, we pass their identity via GenomeSpace to Cytoscape (Shannon et al., 2003). All analysis in Cytoscape is based on networks, thus a relevant reference molecular network must be imported; the recipe uses the NDEx database (Pratt et al., 2015) to identify such a network. In this case, we choose the National Cancer Institute's Pathway Interaction Database (Schaefer et al., 2009). The set of essential genes, i.e., the hits from the genetic screen, are imported from GenomeSpace as a table, then used as the seed nodes for network propagation.
The propagation process starts with a single unit of "heat" on each of the nodes that represent the genes found to be underrepresented in the screen, and therefore essential for the growth of the cell. We use a heat diffusion process, treating the network as an unweighted, undirected graph. Heat diffusion smooths the original signal over the network, iteratively passing the signal on each node to its neighbors. It identifies regions of the network that have a high concentration of hits.
Here we use a time parameter (which represents the amount of time that the heat is allowed to diffuse over the graph) of 0.1. This is a common choice for the time parameter (see Paull et al., 2013). The recipe employs the network diffusion service built into Cytoscape natively (Carlin et al., 2017).
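The diffusion step can be sketched in a few lines. The sketch assumes the standard heat-kernel formulation, in which heat h evolves as dh/dt = -L h with L the unnormalized graph Laplacian, and approximates the solution with small explicit Euler steps rather than a matrix exponential. It is not the Cytoscape implementation; the function name, step count, and toy network are assumptions.

```python
def heat_diffusion(adjacency, seeds, time=0.1, steps=1000):
    """Diffuse one unit of heat from each seed over an undirected graph.

    adjacency: {node: set of neighboring nodes}, unweighted and symmetric
    seeds: set of hit genes that each start with one unit of heat
    time: total diffusion time (0.1 is the value used in the text)
    """
    heat = {n: (1.0 if n in seeds else 0.0) for n in adjacency}
    dt = time / steps
    for _ in range(steps):
        # (L h)[n] = degree(n) * h[n] - sum of the neighbors' heat
        outflow = {
            n: len(nbrs) * heat[n] - sum(heat[m] for m in nbrs)
            for n, nbrs in adjacency.items()
        }
        heat = {n: heat[n] - dt * outflow[n] for n in adjacency}
    return heat


# Hypothetical network: hits A and B share neighbor C; D and E lie further away.
network = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}
final_heat = heat_diffusion(network, {"A", "B"})
```

With a short diffusion time most heat stays on the seeds, the shared neighbor C picks up heat from both hits, and distant nodes receive very little. Total heat is conserved, so the final values can be compared directly across nodes.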
Next, applying a cutoff of the top 200 genes with the most heat after diffusion, we choose a subnetwork that has a high concentration of hits. Finally, in order to understand the composition of the subnetwork, we apply the GeneMANIA Cytoscape plugin. For any network, GeneMANIA (Montojo et al., 2010) shows what functional Gene Ontology categories are enriched in that network.
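The cutoff and subnetwork selection can be sketched as follows: keep the k nodes with the most heat, retain only the edges among them, and extract the largest connected component, which is the kind of view typically displayed. The function names, heat values, and toy network are hypothetical, and k is set to 4 here rather than 200 only to keep the example small.

```python
from collections import deque


def top_k_subnetwork(adjacency, heat, k):
    """Induced subgraph on the k nodes with the highest heat."""
    keep = set(sorted(heat, key=heat.get, reverse=True)[:k])
    return {n: adjacency[n] & keep for n in keep}


def largest_connected_component(adjacency):
    """Node set of the largest connected component, found by breadth-first search."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical network and post-diffusion heat values.
network = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"},
}
heat = {"A": 0.9, "B": 0.9, "C": 0.2, "D": 0.01, "E": 0.001, "F": 0.5}
subnetwork = top_k_subnetwork(network, heat, k=4)
core = largest_connected_component(subnetwork)
```

Here F passes the heat cutoff but is disconnected from the other retained nodes, so the largest component contains only A, B, and C.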

Use case
We used a previously published CRISPR study (Koike-Yusa et al., 2014) to illustrate the use of this pipeline. In this study, the authors use mouse embryonic stem cells grown in the presence of alpha-toxin. This screen was therefore designed to expose the genes involved in the mechanism of resistance to the toxin. The largest connected network component of the top 200 genes after propagation appears in Figure 2.
The results of the GeneMANIA enrichment suggest that DNA repair is the single most important gene set in handling alpha-toxin. This is consistent with the findings of Bantel et al. (2001) that alpha-toxin causes an influx of monovalent ions that can cause DNA fragmentation. In the absence of DNA repair machinery, the cells cannot recover from this stress and therefore die. The complete table of the Gene Ontology terms that were significantly enriched in the subnetwork appears in Table S1.

Variations
There are several variations that can be used depending on the preferences of the user. For example, different tools such as DESeq (Anders & Huber, 2010) and edgeR (Robinson et al., 2010) can be used to identify the hits. Also, the final biological interpretation of the subnetworks was performed with the GeneMANIA plugin, but the gene list can also be exported and interpreted by another annotation tool. Another approach is to export the gene list and corresponding heats using GenomeSpace and apply the Molecular Signatures Database gene set overlap tool (http://software.broadinstitute.org/gsea/msigdb/index.jsp; Liberzon et al., 2011) to the genes in the identified subnetwork.
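For users who export the gene list and interpret it outside Cytoscape, a gene set overlap can be assessed with a hypergeometric test, a common statistic for this kind of analysis. The sketch below assumes that test and is not a reimplementation of any particular tool; the function name and all numbers in the example are hypothetical.

```python
from math import comb


def hypergeometric_overlap_pvalue(universe_size, gene_set_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from a universe containing gene_set_size annotated genes."""
    upper = min(gene_set_size, query_size)
    tail = sum(
        comb(gene_set_size, i) * comb(universe_size - gene_set_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(universe_size, query_size)


# Hypothetical numbers: a 200-gene subnetwork shares 12 genes with a
# 150-gene annotation set, in a universe of 20,000 genes.
p_value = hypergeometric_overlap_pvalue(20000, 150, 200, 12)
```

Only about 1.5 shared genes would be expected by chance with these numbers, so an overlap of 12 yields a very small p-value. In practice the p-values for all tested gene sets should also be corrected for multiple testing.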

Data and software availability
The recipe and Koike-Yusa et al. datasets are publicly available at http://recipes.genomespace.org/view/75. GenomeSpace, an open-source bioinformatics tool, serves as the data highway allowing for seamless transfer of information between tools, and can be found at http://www.genomespace.org/. The MAGeCK algorithm has been wrapped as a GenePattern module, which can be run locally or on the public GenePattern servers at http://genepattern.org. Additionally, GenePattern has added Jupyter Notebook compatibility through GenePattern Notebook (http://genepattern-notebook.org/). Finally, Cytoscape and all associated plugins (i.e., GeneMANIA and NDEx) can be found at http://www.cytoscape.org/.

Supplementary material
Table S1: Enriched Gene Ontology categories for the network appearing in Figure 2.

Grant information
This work is supported by the National Institutes of Health and National Human Genome Research Institute, project number 5U41HG007517-05.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Open Peer Review

If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics; Genomics; miRNA; CRISPR
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

The CRISPR gene editing technology followed by next generation sequencing technology has brought up new means for high throughput genetic screening. However, a streamlined and systematic method for downstream data analysis is required to examine these high throughput screenings. In this work, the authors provide a pipeline for combining the analysis of a screen with network illustrations that facilitate the assay of this type of screens. The work is timely and applicable for a lot of related research.
I am not an expert in bioinformatics. Therefore, I will provide several comments from the biological side.
In a CRISPR-mediated high throughput genetic screening, people frequently apply several sgRNAs to target a single gene, which exhibit different knockout efficiencies. In addition, these sgRNAs may target distinct isoforms of a gene, which may lead to different knockout phenotypes. How the pipeline deals with these sophisticated cases requires elaboration. The authors used one example, the alpha-toxin resistance screen, to validate their method. The analysis of another independent screen is needed to test its validity. During the analysis of the alpha-toxin screen, the authors conclude that the result is consistent with previous findings. This argument requires more detailed comparison with previous work. In addition, if possible, at least some of the new hits coming out of the work could be validated by wet lab experiments.
Is the rationale for developing the new method (or application) clearly explained?

Is the description of the method technically sound? Yes
Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.