Optimized functional annotation of ChIP-seq data

Motivation: Different ChIP-seq peak callers often produce different output results from the same input. Since different peak callers are known to produce differentially enriched peaks with a large variance in peak length distribution and total peak count, accurately annotating peak lists with their nearest genes can be an arduous process. Functional genomic annotation of histone modification ChIP-seq data can be a particularly challenging task, as chromatin marks that have inherently broad peaks with a diffuse range of signal enrichment (e.g., H3K9me1, H3K27me3) differ significantly from narrow peaks that exhibit a compact and localized enrichment pattern (e.g., H3K4me3, H3K9ac). In addition, varying degrees of tissue-dependent broadness of an epigenetic mark can make it difficult to accurately and reliably link sequencing data to biological function. Thus, there exists an unmet need to develop a software program that can precisely tailor the computational analysis of a ChIP-seq dataset to the specific peak coordinates of the data and its surrounding genomic features. Results: geneXtendeR optimizes the functional annotation of ChIP-seq peaks by exploring relative differences in annotating ChIP-seq peak sets to variable-length gene bodies. In contrast to prior techniques


Introduction
The field of epigenetic research studies the process by which heritable changes in gene expression occur without underlying alterations in the DNA sequence.Epigenetics plays a key role in human biology, and dysregulation in epigenetic processes is associated with the pathogenesis of cancer and many other diseases.Epigenetic mechanisms have been demonstrated to be necessary for biological programs important for a variety of health and disease outcomes.Understanding the impact of epigenetic architecture on the accessibility of gene promoters and its effect on gene expression patterns is therefore critical for linking chromatin biology to clinical indications.One way to measure such events involves investigating histone modifications, namely post-translational modifications to histones (referred to as chromatin marks) that regulate gene expression by organizing the genome into active regions of euchromatin, where DNA is accessible for transcription, or inactive heterochromatin regions, where DNA is more compact and less accessible for transcription [1].
Chromatin marks come in a variety of different shapes and sizes, ranging from the extremely broad to the extremely narrow [2][3][4][5][6].This spectrum depends on a number of biological factors ranging from qualitative characteristics such as tissue-type [7] to temporal aspects such as developmental stage [8].Depending on the peak caller used, computational factors such as the variance observed in peak coordinate positions (peak start, peak end) -both in terms of length distribution of peaks as well as the total number of peaks called -is an issue that persists even when samples are run at identical default parameter values [9,10].This variance becomes a factor when annotating peak lists genome-wide with their nearest genes as peaks can be shifted in genomic position (towards 5' or 3' end) or be of different lengths, depending on the peak caller employed.In total, the combined effect of these factors exerts a unique influence over the functional annotation and understanding of genomic variability, which ultimately complicates the study of epigenetic regulation of biological function.
Prior software in the ChIP-seq functional annotation arena (e.g., ANNOVAR [11], ChIPpeakAnno [12], ChIPseeker [13], HOMER [14], and BEDTools [15]) has focused exclusively on distance-minimizing algorithms between peaks and the transcriptional start site (TSS) regions of their nearest genes.In contrast, geneXtendeR significantly expands this definition to include n-dimensional annotation, whereby a user can investigate second-closest, third-closest, . . ., n th -closest genes to any given peak (or set of peaks), thereby focusing on and prioritizing the biology over simply the raw numbers (in base pairs).Detailed expositions of these new methods and their implications on the interpretation of results from data analyses are presented as case studies in the geneXtendeR package vignette.

Algorithms and implementation
The key algorithm in the geneXtendeR R/Bioconductor package is the extension algorithm, implemented in the C programming language for performance and efficiency.The process of "extending" refers to performing sequential iterative gene-feature overlaps after adding to the gene-span a user-specified region upstream of the start of the gene model and a fixed (500 bp) region downstream of the gene, resulting in assigning to a gene the features that do not physically overlap with it but are sufficiently close.This process is repeated multiple times across a range of extension parameters set by the user and a series of visualizations are returned as output to help users hone in on the optimal functional annotation.This is in contrast to most past and present epigenetic analyses, in both ChIP-seq [16] and ATAC-seq [17] studies, that ad-hoc assign gene body definitions (e.g., assigning a default 2 kbp as the cutoff for gene-proximal peaks) before mapping the peaks to genomic features.Fig. 2 shows why such a practice may be limiting.
From a performance standpoint, the extension algorithm is optimized to handle the computational complexity inherent to performing compute-intensive n-dimensional annotation.This ultimately aids in efficiently capturing cis-regulatory and proximal-promoter element relationships between ChIP-seq peaks and the genes they are (dys-)regulating, as described in further detail in the vignette.All of geneXtendeR's source code is implemented in the C and R programming languages and shipped within a standalone R/Bioconductor package release that is publicly available for download from either Bioconductor or Github.Within its codebase, geneXtendeR leverages the AnnotationDbi [18], BiocStyle [19], data.table[20], dplyr [21], GO.db [22], networkD3 [23], RColorBrewer [24], rtracklayer [25], SnowballC [26], testthat [27], tm [28], and wordcloud [29] libraries.It is not uncommon for significant peaks to be located thousands of base pairs away from their nearest genes, suggesting that sequences under these respective peaks may further be extracted and analyzed for the presence of known regulatory elements or repeats (e.g., using software programs like TRANSFAC, MEME/JASPAR, or RepeatMasker) or for investigating potential enhancer effects.

Results
First, we tested geneXtendeR on all publicly available transcription factor and histone modification ChIP-seq datasets in ENCODE.After downloading and analyzing data from the ENCODE ChIP-seq Experiment Matrix (hg19) [30], our large-scale analysis (Fig. 2) indicated that ChIP-seq peaks do not concentrate within any specific upstream extension (e.g., 2000 bp) of their nearest protein-coding genes.This observation that ChIP-seq peaks drop off gradually with genomic distance from the start of a gene (first exon) suggests that there is no good general guideline cutoff for capturing proximal histone modifications (e.g., prior studies [16,17] have used 2000 bp) or transcription factor binding peaks.There are still hundreds of peak clusters that reside in proximal-promoter regions that are 2000-3000 bp away from their nearest protein-coding genes and in distal regions beyond 3 kbp, making ad-hoc decisions like 2 kbp cutoffs too general to be of broad utility across specific use cases.When applying geneXtendeR to both proximal and distal transcription factor (TF) binding peaks for all cell types, we observed some cell type-dependent and TF-dependent peak aggregation dynamics in intervals ranging from 0 to 10 kbp (Fig 3 Fig 2 .ENCODE ChIP-seq datasets.Large-scale computational geneXtendeR analysis using hg19 reference genome of 198 histone modification and 547 transcription factor ChIP-seq datasets from ENCODE.To make data directly comparable to each other, the y-axis represents a normalized count of peak clusters (number of peak clusters in a specific interval divided by the total number of peak clusters across all 0-10 kbp intervals for a given chromatin mark or TF), where a peak cluster is defined as a genomic locus harboring at least 5 overlapping peaks.The x-axis, which is segmented into 20 discrete regions ("1" = 0-500 bp interval, "2" = 500-1000 bp interval, ..., "20" = 9500-10000 bp interval), represents a genomic distance (in bp) of the closest protein-coding gene to each respective peak cluster.A steady decline in peak cluster count at further upstream intervals is detected for all (broad and narrow) chromatin marks as well as transcription factors, i.e., peak clusters do not congregate proximally within any specific region of intervals (e.g., 0-2000 bp) of their respective protein-coding genes, as there is a large number of peak clusters that reside further upstream of their nearest gene.For instance, in the 9500-10000 bp interval alone, there are 1043 peak clusters for the H2AFZ chromatin mark, 569 peak clusters for the H3K4me1 chromatin mark, and 716 peak clusters across all transcription factor ChIP-seq datasets.However, there are certainly exceptions like the H3K9me1 chromatin mark, which has only 1 peak cluster in the 7000-7500 bp interval (see the big dip at x-axis=15 in the right-hand panel) and only 7 peak clusters in the 9500-10000 bp interval (see S1 Appendix and S2 Appendix for reproducible code and data).
We then focused our attention on using geneXtendeR to perform an end-to-end analysis of a published histone modification ChIP-seq dataset [31] deposited in the Gene Expression Omnibus under accession number GSE83979.At the peak-calling stage (Fig 1 ) we ran two different peak callers (SICER [32] and CisGenome [33]) producing two highly variable peak length profiles even at default run parameters S1 Fig.Despite the stark difference in peak profiles, geneXtendeR consistently identified the same top two gene candidates, highlighting its utility for robust functional annotation even in the face of extreme peak variability.Details are discussed in the package vignette.
We followed up this computational analysis by performing n-dimensional annotation of the GSE83979 dataset to provide an expanded view of the gene neighborhood around each individual peak -effectively annotating every peak n times (once for the closest gene, once for the second-closest gene, etc.) and grouping the results into a tabular summary format.We show in the vignette how the second-closest gene may be a preferable candidate for experimental follow-up/validation, especially if the first-closest gene is putative/predicted, while the second-closest gene is known to play a role in a similar biological process based on previously published literature.January 13, 2019 5/12 Running geneXtendeR on 547 human transcription factor (TF) ChIP-seq datasets obtained from ENCODE shows that many peaks tend to reside within 500 bp upstream of their respective protein-coding genes yet, depending on the identity of the transcription factor (e.g., EP300) and the specific cell type (e.g., K562), there may be more or less peaks located further upstream and, therefore, a generalized upstream cutoff is not applicable.

Discussion
The cell-type and TF/chromatin mark-specific complexity apparent in Fig. 3 and Fig. 4 motivated the design and implementation of user-friendly functions that can calculate ratios of statistically significant peaks to total peaks in various genomic intervals (see hotspotPlot() documentation in geneXtendeR vignette).Similarly, users can transform peaks into merged peaks (see peaksMerge()).geneXtendeR also allows users to explore gene ontology differences at various extensions (see diffGO()) as interactive network graphics (see makeNetwork()) or word clouds (see makeWordCloud()).Furthermore, users can investigate mean (average) peak lengths within any genomic interval (see meanPeakLengthPlot()), showing how average peak broadness can change at different upstream extensions, or examine the variance of peak lengths within a specific genomic interval (see peakLengthBoxplot()). It is also possible to examine unique genes and their associated ChIP-seq peaks between any two upstream extension levels (see distinct()).For example, Fig. 5 displays all unique genes (and their respective gene ontologies) that are associated with peaks located between 2-3 kbp across the genome.geneXtendeR also allows users to examine the distribution of peak lengths across the entire peak set (see allPeakLengths()), a function that is useful for visualizing the length distribution of all peaks from a peak caller.These functions (and more) are all explored in detail within the package vignette.After a user has explored the peak coordinates data using these functions to determine the optimal alignment of peaks to a GTF file, the peaks file can be functionally annotated with the annotate() function or one of its counterparts (gene annotate() or annotate n()) for n-dimensional annotation.
January 13, 2019 6/12 We have successfully applied geneXtendeR during the analysis of a histone modification ChIP-seq study investigating the neuroepigenetics of alcohol addiction [34], where geneXtendeR was used to determine an optimal upstream extension cutoff for H3K9me1 enrichment (a commonly studied broad peak) in rat brain tissue based on line plots of both significant peaks and total peaks.This analysis helped us to identify, functionally annotate, and experimentally validate synaptotagmin 1 (Syt1) as a key mediator in alcohol addiction and dependence [34].This analysis is explored in detail in the package vignette.Taken together, geneXtendeR's functions are designed to be used as an integral part of a broader biological workflow (Fig. 1).

Conclusion
We present an R/Bioconductor package, geneXtendeR, that goes beyond the typical nearest-to-gene analyses commonplace to most standard computational ChIP-seq workflows.geneXtendeR offers n-dimensional functional annotation and the ability to investigate the effect of variable-length gene bodies when mapping peaks to genomic features, thereby serving as a next-generation model of peak annotation to nearby features in modern bioinformatics workflows.geneXtendeR therefore represents a critical first step towards tailoring the functional annotation of a ChIP-seq peak dataset according to the details of the peak coordinates (chromosome number, peak start position, peak end position) and their surrounding genomic features.

Fig 1
Fig 1 summarizes the key steps of a sample workflow.For an end-to-end example of a comprehensive biological workflow and case-study, please refer to the vignette.

Fig 3 .
Fig 3. ENCODE TF analysis.Running geneXtendeR on 547 human transcription factor (TF) ChIP-seq datasets obtained from ENCODE shows that many peaks tend to reside within 500 bp upstream of their respective protein-coding genes yet, depending on the identity of the transcription factor (e.g., EP300) and the specific cell type (e.g., K562), there may be more or less peaks located further upstream and, therefore, a generalized upstream cutoff is not applicable.

Fig 4 .
Fig 4. ENCODE histone modification analysis.Running geneXtendeR on 198human histone modification ChIP-seq distal peak datasets obtained from ENCODE reveals that most distal peaks are not congregating within any specific upstream region of their respective protein-coding genes (here we define "distal" as only those peaks that are more than 2000 bp away from their nearest gene).Additional comprehensive analyses (see S1 Appendix and S2 Appendix) were run for proximal peaks (≤ 2000 bp) as well as the complete set of peaks (proximal + distal) from all 198 histone modification ChIP-seq datasets, and similar patterns were observed.

Fig 5 .
Fig 5.Genome-wide network analysis of peak subsets in promoter regions.All unique genes (and their respective gene ontologies (GO)) that are associated with peaks located in promoter-proximal regions between 2-3 kbp genome-wide.Put another way, these are all gene-GO pairs associated with peaks that are distinct between 2000 and 3000 bp upstream extensions across the genome.Orange color denotes gene names, purple color denotes GO terms.A user can hover the mouse cursor over any given node to display its respective label directly within RStudio.Likewise, users can dynamically drag and re-organize the spatial orientation of nodes, as well as zoom-in and out of them for visual clarity.
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under optimization Fig 1. Sample biological workflow.Sample biological workflow using geneXtendeR in combination with existing statistical software to evaluate the role of ChIP-seq peak significance during functional annotation tasks (see description of hotspotPlot() function in package vignette).