Exploiting the DepMap cancer dependency data using the depmap R package [version 1; peer review: 1 approved with reservations]

The depmap package facilitates access in the R environment to data from the DepMap project, a multi-year collaborative effort by the Broad Institute and the Wellcome Sanger Institute that maps genetic and chemical dependencies, along with other molecular biological measurements, of over 1700 cancer cell lines. The depmap package formats these data to simplify the use of popular R data analysis and visualization tools such as dplyr and ggplot2. In addition, the depmap package utilizes ExperimentHub to store versions of the DepMap data in the Cloud, which may be selectively downloaded, providing a reproducible research framework to support exploiting these data. This paper describes a workflow demonstrating how to access and visualize the DepMap data in R using this package.



Introduction
Genomic alterations in cancer cells change the molecular biological landscape of the cell and may result in differential vulnerabilities, or "dependencies", compared to those of healthy cells. An example of a genetic dependency is a gene that is not necessary for survival in healthy cells but, due to perturbations of the metabolic networks caused by cancer mutations, becomes essential for the vitality of a particular cancer cell line. However, due to the complex nature of metabolic networks, the exact mechanistic nature of many genetic dependencies of cancer is not completely understood [1]. A map illustrating the relationships between the genetic features of cancer and those of cancer dependencies is therefore desirable.
The Cancer Dependency Map, or "DepMap", a collaborative initiative between the Broad Institute and the Wellcome Sanger Institute, aims to map genetic dependencies in a broad range of cancer cell lines. Over 1700 cancer cell lines have been selected for testing in this effort, intended to reflect the overall distribution of cancer diseases in the general population. The stated aim of the DepMap project is to develop a better understanding of the molecular biology of cancer and to exploit this knowledge to develop new therapies in precision cancer medicine [2]. The DepMap initiative is, as of the date of this publication, an ongoing project, with new releases of select datasets every 90 days. As of the 20Q4 DepMap release, 1812 human cancer cell lines have been mapped for dependencies [2].
The DepMap project utilizes CRISPR gene knockout as the primary method to map genomic dependencies in cancer cell lines [2-5]. The resulting genetic dependency score displayed in the DepMap data is calculated from the observed log fold change in the amount of shRNA detected in pooled cancer cell lines after gene knockout [6,7]. To correct for potential off-target effects of gene knockout, which can lead CRISPR screens to overestimate dependency, the DepMap initiative utilized the CERES algorithm to moderate the final dependency score estimate [3]. It should be noted that, due to improvements in the CERES algorithm for estimating genetic dependency while accounting for CRISPR seed effects, the RNAi dependency measurements have been rendered redundant, and further data releases for RNAi dependency measurements were discontinued as of the 19Q3 release [2,4].
In addition to genomic dependency measurements of cancer cell lines, chemical dependencies were also measured by the DepMap PRISM viability screens which, as of the 20Q4 release, had tested 4,518 compounds against 578 cancer cell lines [2,8]. A new proteomic dataset was added with the 20Q2 release, providing normalized quantitative profiling of proteins of 375 cancer cell lines by mass spectrometry [9]. The DepMap project has also compiled additional datasets detailing the molecular biological characterization of cancer cell lines, including whole exome sequencing (WES) genomic copy number, Reverse Phase Protein Array (RPPA) data, transcripts per million (TPM) gene expression data for protein-coding genes and genomic mutation call data. Core datasets such as CRISPR viability screens, TPM gene expression, WES copy number and genomic mutation calls are updated quarterly on a release schedule. All datasets are made publicly available under the CC BY 4.0 licence [2]. A table of the datasets available for the depmap package (as of the 20Q4 release) is displayed in Table 1.
The depmap Bioconductor package was created to exploit these rich datasets efficiently and to promote reproducible research by importing the data into the R environment. The value added by the depmap Bioconductor package includes cleaning and converting all datasets to long-format tibbles [10], as well as adding the unique key depmap_id to all datasets. The unique key depmap_id aids the comparison and benchmarking of multiple molecular features and streamlines the datasets for use with common R packages such as dplyr [11] and ggplot2 [12].
As new DepMap datasets are released on a quarterly basis, it is not feasible to include all dataset files in binary form directly within the depmap R package. To keep the package lightweight, the depmap package utilizes and fully depends on the ExperimentHub package [13] to store and retrieve all versions of the DepMap data (as of this publication, versions 19Q1 through 20Q4) in the Cloud using AWS. The depmap package contains accessor functions to directly download and cache the most current datasets from the Cloud into the local R environment. Specific datasets (such as those from older releases) can be downloaded separately, if desired. The depmap package was designed to enhance reproducible research by ensuring that datasets from all releases remain available to researchers. The depmap R package is available as part of Bioconductor at: https://bioconductor.org/packages/depmap.
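Retrieving one specific release through ExperimentHub, as described above, might be sketched as follows. The helper name fetch_depmap_release is hypothetical (it is not part of the depmap package), and the record-title convention "<dataset>_<release>" (for example "crispr_19Q1") is an assumption that should be checked against the output of a plain query for "depmap" before relying on it.

```r
# Minimal sketch, not the package's own API: `fetch_depmap_release` is a
# hypothetical helper wrapping the standard ExperimentHub query workflow.
fetch_depmap_release <- function(dataset, release) {
  # Connect to ExperimentHub (creates or reuses a local cache)
  eh <- ExperimentHub::ExperimentHub()

  # ASSUMPTION: record titles follow "<dataset>_<release>", e.g. "crispr_19Q1"
  title_pattern <- paste0(dataset, "_", release)

  # Keep only records matching both "depmap" and the assumed title pattern
  hits <- AnnotationHub::query(eh, c("depmap", title_pattern))
  if (length(hits) == 0L) {
    stop("No ExperimentHub record found matching: ", title_pattern)
  }

  # Download (or load from cache) the first matching record by its EH identifier
  eh[[names(hits)[1]]]
}

# Usage (requires network access and downloads data, so it is not run here):
# crispr_19Q1 <- fetch_depmap_release("crispr", "19Q1")
```

Because ExperimentHub caches downloads locally, repeated calls for the same release should not trigger a new download.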

Use cases
Dependency scores are the features of primary interest in the DepMap Project datasets. These measurements can be found in the datasets crispr and rnai, which contain information on genetic dependency, as well as the dataset drug_sensitivity, which contains information pertaining to chemical dependency. The genetic dependency can be interpreted as an expression of how vital a particular gene is for a given cancer cell line. For example, a highly negative dependency score is derived from a large negative log fold change in the population of cancer cells after gene knockout or knockdown, implying that a given cell line is highly dependent on that gene for maintaining metabolic function. Genes that are not essential for non-cancerous cells but display highly negative dependency scores for cancer cell lines may be interesting candidates for research in targeted cancer medicine. In this workflow, we will describe exploring and visualizing several DepMap datasets, including those that contain information on genetic dependency. Below, we start by loading the packages needed to run this workflow.
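To make the sign convention concrete, the following toy calculation shows how a drop in abundance after perturbation translates into a negative log fold change. The gene labels and counts are invented for illustration, and the raw log2 ratio shown here is only the starting quantity: it is not the corrected dependency score reported in the DepMap data.

```r
# Invented abundance values before and after gene perturbation
# (illustrative only; not taken from any DepMap dataset)
abundance_before <- c(GENE_A = 1000, GENE_B = 1000)
abundance_after  <- c(GENE_A = 125,  GENE_B = 1000)

# Log2 fold change: negative values mean cells carrying the perturbation
# were depleted from the pool, suggesting the gene is needed for viability
log_fold_change <- log2(abundance_after / abundance_before)

print(log_fold_change)
```

In this toy case GENE_A is depleted eightfold and therefore receives a strongly negative value, whereas GENE_B is unchanged and receives a value of zero.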
    library("depmap")
    library("ExperimentHub")
    library("dplyr")
    library("ggplot2")
    library("stringr")
The depmap datasets are too large to be included in a typical package; these data are therefore stored in the Cloud. There are two ways to access the depmap datasets. The first calls on dedicated accessor functions that download, cache and load the latest available dataset into the R workspace. Examples for all available data are shown below:
    # Each accessor downloads (or loads from cache) the latest release
    rnai <- depmap_rnai()
    crispr <- depmap_crispr()
    copyNumber <- depmap_copyNumber()
    TPM <- depmap_TPM()
    RPPA <- depmap_RPPA()
    metadata <- depmap_metadata()
    mutationCalls <- depmap_mutationCalls()
    drug_sensitivity <- depmap_drug_sensitivity()
    proteomic <- depmap_proteomic()
The second way retrieves a specific dataset, such as one from an older release, through ExperimentHub.
By importing the depmap data into the R environment, the data may be mined more effectively utilizing R data manipulation tools. For example, the molecular dependencies of all cell lines pertaining to soft tissue sarcomas, sorted by the genes with the greatest dependency, can be listed using functions from the dplyr package: the crispr dataset is filtered for cell lines with "SOFT_TISSUE" in the CCLE name, and the result is sorted so that the strongest dependencies (the most negative scores) are listed first.
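The soft tissue query described above could be sketched as follows. So that the example runs without downloading anything, it uses a small invented stand-in for the crispr tibble; the column names cell_line, gene_name and dependency are assumed to match those of the real dataset, and all identifiers and values are made up.

```r
library("dplyr")
library("stringr")

# Toy stand-in for the `crispr` tibble (all values invented for illustration).
# ASSUMPTION: the real dataset has columns cell_line, gene_name and dependency.
crispr_toy <- tibble(
  depmap_id  = c("ACH-A", "ACH-A", "ACH-B", "ACH-C"),
  cell_line  = c("LINE1_SOFT_TISSUE", "LINE1_SOFT_TISSUE",
                 "LINE2_LUNG", "LINE3_SOFT_TISSUE"),
  gene_name  = c("GENE1", "GENE2", "GENE1", "GENE1"),
  dependency = c(-1.8, 0.1, -0.4, -0.9)
)

soft_tissue <- crispr_toy %>%
  # Keep only the columns needed for the listing
  dplyr::select(cell_line, gene_name, dependency) %>%
  # Retain cell lines whose CCLE name contains "SOFT_TISSUE"
  dplyr::filter(stringr::str_detect(cell_line, "SOFT_TISSUE")) %>%
  # Ascending order puts the most negative (strongest) dependencies first
  dplyr::arrange(dependency)

print(soft_tissue)
```

Replacing crispr_toy with the tibble returned by depmap_crispr() should apply the same steps to the full dataset, provided the column names match.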

Discussion and outlook
We hope that this package will be used by cancer researchers to dig deeper into the DepMap data and to support their research in precision oncology and the development of targeted cancer therapies. Additionally, we highly encourage current and future depmap users to combine depmap data with other datasets, such as those found through The Cancer Genome Atlas (TCGA) and the Cancer Cell Line Encyclopedia (CCLE).
The depmap R package will continue to be maintained in line with the biannual Bioconductor release schedule, in addition to quarterly releases of DepMap data.
We welcome feedback and questions from the community. We also highly appreciate contributions to the code in the form of pull requests on GitHub.
All packages used in this workflow are available from the Comprehensive R Archive Network (https://cran.r-project.org) or Bioconductor (http://bioconductor.org). The specific version numbers of R and the packages used are shown below.
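A version listing of this kind is conventionally produced with base R's sessionInfo(). The sketch below shows one way to generate it for the current session; its output depends on the local installation and will not necessarily match the versions used by the authors.

```r
# Capture the R version, platform and attached/loaded package versions
session_details <- sessionInfo()

# Print the full listing for inclusion in a report or workflow
print(session_details)

# The R version string alone, for a compact record
print(R.version.string)
```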