ChIPdig: a comprehensive user-friendly tool for mining multi-sample ChIP-seq data

6 Background: In recent years, epigenetic research has enjoyed explosive growth as high-throughput 7 sequencing technologies become more accessible and affordable. However, this advancement has not 8 been matched with similar progress in data analysis capabilities from the perspective of experimental 9 biologists not versed in bioinformatic languages. For instance, chromatin immunoprecipitation followed 10 by next-generation sequencing (ChIP-seq) is at present widely used to identify genomic loci of 11 transcription factor binding and histone modifications. Basic ChIP-seq data analysis, including read 12 mapping and peak calling, can be accomplished through several well-established tools, but more 13 sophisticated analyzes aimed at comparing data derived from different conditions or experimental 14 designs constitute a significant bottleneck. We reason that the implementation of a single 15 comprehensive ChIP-seq analysis pipeline could be beneficial for many experimental (wet lab) 16 researchers who would like to generate genomic data. 17 Results: Here we present ChIPdig, a stand-alone application with adjustable parameters designed to 18 allow researchers to perform several analyzes, namely read mapping to a reference genome, peak 19 calling, annotation of regions based on reference coordinates (e.g. transcription start and termination 20 sites, exons, introns, 5' UTRs and 3' UTRs), and generation of heatmaps and metaplots for visualizing 21


Background
Interactions between nuclear proteins and DNA are vital for cell and organism function.They control DNA replication and repair, safeguard genome stability, and regulate chromosome segregation and gene expression.Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) is a powerful method for assessing such interactions and has been widely used in recent years to map the location of post-translationally modified histones, transcription factors, chromatin modifiers and other non-histone DNA-associated proteins in a genome-wide manner.This progress has been fostered by the increasing technical feasibility and affordability of this technology, with more documentation and technical support available to researchers, including commercial library preparation kits, aided by the plummeting costs of sequencing and the ease of multiplexing samples.In light of this progress, ChIP-seq data sets are continuously deposited in publicly-accessible databases, such as the National Center for .CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (which was this version posted February 27, 2018.; https://doi.org/10.1101/220079doi: bioRxiv preprint Biotechnology Information's (NCBI) Gene Expression Omnibus (GEO) and the ENCODE consortium portal [1].Therefore, there is an unprecedented wealth of epigenomic data in the public domain that can be used for integrative and correlative analyses.This important advancement had not been accompanied with similar progress in user-friendly postsequencing data analysis pipelines, which is still a significant bottleneck often handled by skilled bioinformaticians.A key step in ChIP-seq data analysis is to map reads to a reference genome assembly.
Programs such as BWA [2] or Bowtie/Bowtie2 [3,4] are frequently employed for this purpose.Aligned data are then processed to find regions of enrichment along the genome, thereby identifying potential loci of DNA binding by the target protein or of deposition of the histone post-translational modifications of interest.This process is known as peak calling and can be performed by using algorithms such as MACS/MACS2 [5,6] and SICER [7].SAMtools is another popular software used in next-generation sequencing (NSG) analysis that provides various utilities for manipulating alignments, including sorting, merging, indexing and generating alignments in a per-position format [8].
Following the initial and common steps in ChIP-seq data analysis mentioned above, downstream analysis is often more customizable and may present a hurdle, especially in the case of comparing data derived from different conditions or experimental settings.Artifacts arising from bias of DNA fragmentation, variation of immunoprecipitation efficiency, as well as polymerase chain reaction (PCR) amplification and sequencing depth bias, result in ChIP-seq experiments with distinct signal-to-noise ratios and impose great challenges to the computational analysis [9].Several methods addressing the differential enrichment analysis problem, i.e. the detection of genomic regions with changes in ChIP-seq profiles between two distinct samples or sets of replicate samples, have been proposed [10][11][12].
Generally, these methods rely on the initial detection of candidate peak regions by a conventional peak calling algorithm, and then this peak-defining information is applied to analysis with methods tailored for the differential expression analysis of RNA-seq data such as edgeR [13] or DESeq [14].Downstream .CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (which was this version posted February 27, 2018.; https://doi.org/10.1101/220079doi: bioRxiv preprint analysis of ChIP-seq data may also involve the annotation of genomic regions based on reference coordinates (e.g.distance from nearest transcription start site) and visualization of normalized coverage by means of heatmaps and metaplots.
Importantly, ChIP-seq data analysis typically relies, at least in part, on tools that have been designed to run primarily on Linux/Unix-based systems, while biologists who need to work with NGS may be unfamiliar with such operating systems.In the past few years, several packages designed to handle NGS data have been released for R, a popular platform-independent programming language and software environment for computing and graphics.Here we present ChIPdig, an open-source application that leverages on several R packages to enable comprehensive and modular analysis of multi-sample ChIPseq data sets through a user-friendly graphical user interface (GUI), allowing any experimental biologist with minimal computational expertise to use it easily.

Implementation
ChIPdig is developed in R using package Shiny and relies on multiple R/Bioconductor packages, including QuasR and BSgenome [15], for mapping reads to a reference genome assembly [16], BayesPeak for peak calling [17], csaw [12] and edgeR [13] for differential enrichment analysis, ChIPseeker for annotation of genomic regions of interest [18], and EnrichedHeatmap for generating coverage heatmaps [19].
Packages used in several analysis modules implemented in ChIPdig are GenomicRanges [20], valr [21], shinyFiles, GenomicFeatures, ggplot2, ggsignif, reshape2 and circlize.A few command lines provided in https://github.com/rmesse/ChIPdigsuffice to launch the GUI from within R Studio, allowing integrative and interactive usage of these powerful libraries without requiring programming or statistical experience from the user.
. CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (

Results
ChIPdig has the following capabilities: (1) alignment of reads to a reference genome; (2) normalization and comparison of distinct ChIP-seq data sets; (3) annotation of genomic regions; (4) generation of comparative heatmaps and metaplots for visualization of normalized coverage in user-defined specific regions.Each capability corresponds to a specific analysis module which can be loaded separate from other modules and using different input files.To illustrate each module, raw single-end ChIP-seq data for H3K4me3 and H3K36me3 in the model organism Caenorhabditis elegans were downloaded from NCBI GEO with series IDs GSE28770 and GSE28776, respectively.These two histone modification marks have different distribution profiles, namely promoter-proximal enrichment with punctuated peaks for H3K4me3, and gene body enrichment with broad peaks for H3K36me3 [22].Figure 1 shows the GUI displayed upon launching ChIPdig from R Studio.The left pane displays four radio buttons, each corresponding to one of the four analysis modules, as well a clickable box for selection of the folder containing the input files for the analysis.

• Alignment of ChIP-seq reads
The aligning module of ChIPdig relies primarily on the QuasR package, which supports the analysis of single-read and paired-end deep sequencing experiments [16].If desired, read preprocessing can be performed to prepare the input sequence files prior to alignment, e.g.removal of sequence segments corresponding to adapters and low quality reads.Sequence files in FASTQ format for each input DNA and ChIP sample corresponding to the H3K4me3 and H3K36me3 ChIP-seq experiments were aligned to the WS220/ce10 reference genome (available through the BSgenome package [15]) (Fig. 2), producing an average of 10 million mapped reads per sample.A report with several quality metrics for each alignment was automatically generated and is available in Additional file 1.
. CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (which was this version posted February 27, 2018.; https://doi.org/10.1101/220079doi: bioRxiv preprint • ChIP-seq data normalization, peak calling and differential enrichment analysis A recurrent problem in ChIP-seq data analysis lies at the comparison of multiple coverage profiles generated from different experiments or corresponding to different conditions.This problem is addressed in the second analysis module of ChIPdig.Upon its selection, the user is prompted to provide a tab-delimited text file with the names of each ChIP sample (treatment) mapped reads file in BAM format, the matching input DNA (control) file, a sample ID, the condition or target name, and a color designation for the output peak and coverage files (Fig. 3a).Any of the 657 R built-in colors can be chosen (Additional file 2).A bin size parameter set to 50 bp (base pairs) by default is used in both peak calling and genome-wide library normalization for differential enrichment analysis.If desired, duplicate reads can be removed prior to normalization and sequences can be extended to a median fragment length that can be estimated either computationally or experimentally (e.g.average size of fragments generated by chromatin shearing) (Fig. 3b).The initial mapped read processing outputs a table with library sizes and median fragment sizes, a chromosome size list, and a multidimensional scaling plot representing the similarity of samples in the data set (Fig. 3c).Data are normalized based on library sizes and on the trimmed means of M values (TMM) approach [23], which is implemented in the edgeR package [13].
Completion of the initial processing of mapped reads releases options for downstream processing which are posted to the sidebar panel, namely export of files representing sequencing coverage, peak calling and differential enrichment analysis (Fig. 4).Coverage is expressed in log2-transformed counts per million (cpm) for each genomic bin and, for each sample ID indicated by the user, three files with bedGraph format are generated: treatment (ChIP sample), control (input DNA sample) and the treatment-to-control ratio.Each file can be loaded onto a genome browser for visualization (Fig. 5).
Peak calling is performed via the BayesPeak package by using a hidden Markov model and Bayesian statistical methodology [17].The user specifies a posterior probability (PP) threshold and genomic .CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (which was this version posted February 27, 2018.; https://doi.org/10.1101/220079doi: bioRxiv preprint regions with PP above such threshold are identified as peaks.If replicates are indicated, commonly enriched genomic regions representing replicated peaks can be derived, and a track definition line can be added, allowing each file to be loaded onto the UCSC genome browser for visualization.In addition, a consensus peak set representing all candidate enriched regions across the full data set may be obtained. The user may be interested in choosing this option for posterior differential enrichment analysis.Peak calling was performed for the H3K4me3 and H3K36me3 ChIP-seq data sets, yielding 4994 replicated peaks for H3K4me3 (Additional file 3) and 25855 replicated peaks for H3K36me3 (Additional file 4).Differential enrichment analysis resorts to functions implemented in the csaw [12] and edgeR [13] packages.Following library size and TMM normalization, bin-level coverage is computed and, if desired, bins in which coverage for input DNA samples exceeds that of the corresponding ChIP samples are filtered off.If the user is interested in differential enrichment analysis in a specific set of regions, the corresponding file in BED format has to be provided.To illustrate this feature of ChIPdig, differential enrichment analysis was performed for comparing H3K4me3 coverage with that of H3K36me3 using either H3K4me3 replicated peak coordinates (Fig. 6a and Additional file 5) or those corresponding to H3K36me3 peaks (Fig. 6b and Additional file 6), with a false discovery rate (FDR) threshold of 0.1.As expected, at H3K4me3 genomic peak coordinates, H3K4me3 coverage is greater than that of H3K36me3, and the opposite is observed for H3K36me3 peaks.

• Annotation of genomic regions
Annotation of genomic regions of interest is performed via the ChIPseeker package [18].The user supplies the file with regions in BED format and chooses the reference genome assembly, as well as the distance upstream and downstream of the annotated transcription start site to be considered for assignment of promoter regions (Fig. 7a).Replicated H3K4me3 and H3K36me3 peaks were annotated and, characteristically of these marks [22], most H3K4me3 peaks were assigned to promoters (Fig. 7b .CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (which was this version posted February 27, 2018.; https://doi.org/10.1101/220079doi: bioRxiv preprint and Additional file 7), whereas H3K36me3 peaks lie predominantly at gene bodies (Fig. 7c and Additional file 8).

• Comparative heatmaps and metaplots
The user can supply multiple coverage files in bedGraph format and generate heatmaps and metaplots to visualize coverage in a specific region set in a comparative manner.This module of ChIPdig relies on a custom algorithm which builds a coverage matrix based on bin size and reference coordinates (start, end or both) selected by the user.Such matrix is then plotted in the form of a comparative metaplot and a set of heatmaps.H3K4me3 and H3K36me3 coverage files were loaded onto ChIPdig, along with a BED file with C. elegans transcription units in the WS220/ce10 reference assembly.Coverage was computed in 25 bp bins and gene bodies were either expanded or compressed to 500 bp.Windows of 250 bp upstream and downstream of each region were selected (Fig. 8a).Typical of H3K4me3, coverage is higher in the vicinity of the transcription start site, whereas H3K36me3 is enriched at transcription unit bodies.

Conclusions
ChIPdig is a user-friendly application for handling multiple ChIP-seq data sets and has diverse useful capabilities spanning a comprehensive analysis pipeline, namely: read alignment to a reference genome, ChIP-seq data normalization, peak calling, differential enrichment analysis, annotation of genomic regions, and generation of comparative heatmaps and metaplots for visualization of normalized coverage.
. CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (

.
CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available The copyright holder for this preprint (which was this version posted February 27, 2018.; https://doi.org/10.1101/220079doi: bioRxiv preprint of histone modification patterns in C. elegans.Genome Res.2011;21:227-36.23.Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data.Genome Biol.2010;11. domains