A step-by-step guide to analyzing CAGE data using R/Bioconductor

Cap Analysis of Gene Expression (CAGE) is one of the most popular 5'-end sequencing methods. In a single experiment, CAGE can be used to locate and quantify the expression of both Transcription Start Sites (TSSs) and enhancers. This is workflow is a case study on how to use the CAGEfightR package to orchestrate analysis of CAGE data within the Bioconductor project. This workflow starts from BigWig-files and covers both basic CAGE analyses such as identifying, quantifying and annotating TSSs and enhancers, advanced analysis such as finding interacting TSS-enhancer pairs and enhancer clusters, to differential expression analysis and alternative TSS usage. R-code, discussion and references are intertwined to help provide guidelines for future CAGE studies of the same kind.


Background
Transcriptional regulation is one of the most important aspects of gene expression. Transcription Start Sites (TSSs) are focal points of this process: The TSS act as an integration point for a wide range of molecular cues from surrounding genomic areas to determine transcription and ultimately expression levels. These include proximal factors such as chromatin accessibility, chromatin modification, DNA methylation and transcription factor binding, and distal factors such as enhancer activity and chromatin confirmation [1][2][3][4] .
Cap Analysis of Gene Expression (CAGE) has emerged as one of the dominant high-throughput assays for studying TSSs 5 . CAGE is based on "cap trapping": capturing capped full-length RNAs and sequencing only the first 20-30 nucleotides from the 5'-end, so-called CAGE tags 6 . When mapped to a reference genome, the 5'-ends of CAGE tag identify the actual TSS for respective RNA with basepair-level accuracy. Basepair-accurate TSSs identified this way are referred to as CAGE Transcription Start Sites (CTSSs). RNA polymerase rarely initiates from just a single nucleotide: this is manifested in CAGE data by the fact that CTSSs are mostly found in tightly spaced groups on the same strand. The majority of CAGE studies have merged such CTSSs into genomic blocks typically referred to as Tag Clusters (TCs), using a variety of clustering methods (see below). TCs are often interpreted as TSSs in the more general sense, given that most genes have many CTSSs, but only a few TCs that represent a few major transcripts with highly similar CTSSs 7,8 . Since the number of mapped CAGE tags in a given TC is indicative of the number of RNAs from that region, the number of CAGE tags falling in given TC can be seen as a measure of expression 9 .
As CAGE tags can be mapped to a reference genome without the need for transcript annotations, it can detect TSSs of known mRNAs, but also mRNA from novel alternative TSSs (that might be condition or tissue dependent) 7,10 . Since CAGE captures all capped RNAs, it can also identify long non-coding RNA (lincRNA) 11 and enhancers RNA (eRNA). It was previously shown that active enhancers are characterized by balanced bidirectional transcription, making it possible to predict enhancer regions and quantify their expression levels from CAGE data alone 12,13 . Thus, CAGE data can predict the locations and activity of mRNAs, lincRNAs and enhancers in a single assay, providing a comprehensive view of transcriptional regulation across both inter-and intragenic regions.
Bioconductor contains a vast collection of tools for analyzing transcriptomics datasets, in particular the widely used RNA-Seq and microarray assays 14 . Only a few packages are dedicated to analyzing 5'-end data in general and CAGE data in particular: TSRchitect 15 , icetea 16 , CAGEr 17 and CAGEfightR 18 , see Table 1.
CAGEr was the first package solely dedicated to the analysis of CAGE data and was recently updated to more closely adhere to Bioconductor S4-class standards. CAGEr takes as input aligned reads in the form of BAM-files and can identify, quantify, characterize and annotate TSSs. TSSs are found in individual samples using either simple clustering of CTSSs (greedy or distance-based clustering) or the more advanced density-based paraclu clustering method 19 , and can be aggregated across samples to a set of consensus clusters. Several specialized routines for CAGE data is available, such as power law normalization of CTSS counts and fine-grained TSS shifts. Finally, CAGEr offers easy interface to the large collection of CAGE data from the FANTOM consortium 10 . TSRchitect and icetea are two more recent additions to Bioconductor. While being less comprehensive, they aim to be more general and handle more types of 5'-end methods that are conceptually similar to CAGE (RAMPAGE, PEAT, PRO-Cap, etc. 5 ). Both packages can identify, quantify and annotate TSSs, with TSRchitect using an X-means algorithm and icetea using a sliding window approach. icetea offers the unique feature of mapping reads to a reference genome by interfacing with Rsubread. Both CAGEr, TSRchictet and icetea offers built-in capabilities for differential expression (DE) analysis via the popular DESeq2 or edgeR packages 20,21 .
CAGEfightR is a recent addition to Bioconductor focused on analyzing CAGE data, but applicable to most 5'-end data. It aims to be general and flexible to allow for easy interfacing with the wealth of other Bioconductor packages. CAGEfightR takes CTSSs stored in BigWig-files as input and uses only standard Bioconductor S4-classes (GenomicRanges, SummarizedExperiment, InteractionSet 22,23 ) making it easy for users to learn and combine CAGEfightR with functions from other Bioconductor packages (e.g. instead of providing custom wrappers around other packages such as differential expression analysis). In addition to TSS analysis, CAGEfightR is the only package on Bioconductor to also offer functions for enhancer analysis based on CAGE (and similarly scoped) data. This includes enhancer identification and quantification, linking enhancers to TSSs via correlation of expression and finding enhancer clusters, often referred to as stretch-or super enhancers.
In this workflow, we illustrate how the CAGEfightR package can be used to orchestrate an end-to-end analysis of CAGE data by making it easy to interface with a wide range of different Bioconductor packages. Highlights include TSS and enhancer candidate identification, differential expression, alternative TSS usage, enrichment of motifs, GO/KEGG terms and calculating TSS-enhancer correlations.

Dataset This workflow uses data from "Identification of Gene Transcription Start Sites and Enhancers Responding to
Pulmonary Carbon Nanotube Exposure in Vivo" by Bornholdt et al 24 . This study uses mouse as a model system to investigate how nanotubes affect lung tissue when inhaled. Inhaled nanotubes were previously found to produce a similar response to asbestos, potentially triggering an inflammatory response in the lung tissue leading to drastic changes in gene expression.
The dataset consists of CAGE data from mouse lung biopsies: 5 mice whose lungs were instilled with water (Ctrl) and 6 mice wholes lungs were instilled with nanotubes (Nano), with CTSSs for each sample stored in BigWig-files, shown in Table 2:

Workflow
The workflow is divided into 3 parts covering different parts of a typical CAGE data analysis: 1. Shows how to use CAGEfightR to import CTSSs and find and quantify TSS and enhancer candidates.

2.
Shows examples of how to perform genomic analyses of CAGE dusters using core Bioconductor packages such as GenomicRanges and Biostrings. This part covers typical analyses made from CAGE data, from summarizing cluster annotation, TSS shapes and core promoter sequence analysis to more advanced spatial analyses (finding TSS-enhancer correlation links and clusters of enhancers).
3. Shows how CAGEfightR can be used to prepare data for differential expression analysis with popular R packages, including DESeq2, limma and edgeR 20,21,25 . Borrowing from RNA-Seq terminology, differential expression can be assessed at multiple different levels: Tag cluster-and enhancer-level, gene-level and differential TSS usage 26 . Once differential expression results have been obtained, they can be combined with other sources of information such as motifs from JASPAR 27 and GO/KEGG terms 28,29,30 .
Part 1: Locating, quantifying and annotating TSSs and enhancers CAGEfightR starts analysis from CTSSs, which is the number of CAGE tag 5'-ends mapping to each basepair (bp) in the genome. CTSSs are normally stored as either BED-files or BigWig-files. CAGEfightR works on BigWig-files, since these can be efficiently imported and allow for random access.
Before starting the analysis, we recommend gathering all information (Filenames, groups, batches, preparation data, etc.) about the samples to be analyzed in a single data.frame, sometimes called the design matrix.
Starting with the TSS candidates, we can not only annotate a TSS with overlapping transcripts, but also in what part of a transcript a TSS lies by using a hierarchical annotation scheme. As some TSSs might be very wide, we only use the TSS peak for annotation purposes: Almost half of TSSs were found at annotated promoters, while the other half is novel compared to the UCSC known transcripts.
Transcript annotation is particularly useful for enhancer candidates, as bidirectional clustering might also detect bidirectional promoters. Therefore, a commonly used filtering approached is to only consider BCs in intergenic or intronic regions as enhancer candidates: # Keep only non-exonic BCs as enhancer candidates Enhancers <-subset(BCs, txType %in% c("intergenic", "intron")) This leaves almost 10000 enhancer candidates for analysis.
Merging into a single dataset. For many downstream analyses, in particular normalization and differential expression, it is useful to combine both TSS and enhancers candidates into a single dataset. This ensures that TSSs and enhancers do not overlap, so each CAGE tag is only counted once.
We finally calculate the total number of tags and TPM-scaled counts for the final merged dataset: # Genome track axis_track <-GenomeAxisTrack() # Annotation track tx_track <-GeneRegionTrack(txdb, name = "Gene Models", col = NA, fill = "bisque4", shape = "arrow", showId = TRUE) A good general strategy for quickly generating genome browser plots is to first define a region of interest, and then only plotting data within that region using subsetByOverlaps. The following code demonstrates this using the first TSS:  The top track shows the pooled CTSS signal and the middle track shows the identified TC with the thick bar indicating the TSS peak (the overall most used CTSSs within the TC). The bottom track shows the known transcript model at this genomic location. In this case, the CAGE-defined TSS corresponds well to the annotation.
We can also plot the first enhancer:  Here we see the bidirectional pattern characteristic of active enhancers. The bidirectional cluster is seen in the middle track, with the midpoint in thick marking the maximally balanced point within the bidirectional cluster.

Location and expression of TSSs and enhancers. In addition to looking at single examples of TSSs and enhancers,
we also want to get an overview of the number and expression of clusters in relation to transcript annotation. First we extract all of the necessary data from the RangedSummarizedExperiment into an ordinary data.frame: cluster_info <-RSE %>% rowData() %>% as.data.frame() Then we use ggplot2 to plot the number and expression levels of clusters in each annotation category: # Number of clusters ggplot(cluster_info, aes(x=txType, fill=clusterType)) + geom_bar(alpha=0.75, position="dodge", color="black") + scale_fill_colorblind("Cluster type") + labs(x="Cluster annotation", y="Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Analysing TSS shapes and sequences. A classic analysis of CAGE data is to divide TSSs into Sharp and
Broad classes, which show different core promoter regions and different expression patterns across tissues 7 .

Finding candidates for interacting TSSs and enhancers.
In addition to simply identifying enhancers, it is also interesting to try infer what genes they might be regulating. CAGE data can itself not provide direct evidence that an enhancer is physically interacting with a TSSs, which would requires specialized chromatin confirmation capture assays such as HiC, 4C, 5C, etc. However, previous studies have shown that TSSs and enhancers that are close to each other and have highly correlated expression are more likely to be interacting. We can therefore use distance and correlation of expression between TSSs and enhancers to identify TSSs-enhancer links as candidates for physical interactions 13 .
To do this with CAGEfightR, we first need to indicate the two types of clusters as a factor with two levels: rowData(RSE)$clusterType <-RSE %>% rowData() %>% use_series("clusterType") %>% as_factor() %>% fct_relevel("TSS") We can then calculate all pairwise correlations between TSSs and enhancer within a distance of 50 bp.  The top track shows the strength of correlations between 3 TSSs around the Atp1b1 gene. The highest correlation is seen between the upstream TSS and the most distal enhancer.
Finding stretches of enhancers. Several studies have found that groups or stretches of closely spaced enhancers tend to show different chromatin characteristics and functions compared to singleton enhancers. Such groups of are often referred to as "super enhancers" or "stretch enhancers" 34 .
Part 3: Differential Expression analysis of TSSs, enhancers and genes Normalization of expression and EDA. Before performing statistical tests for various measures of Differential Expression (DE), it is important to first conduct a thorough Exploratory Data Analysis (EDA) to identify what factor we need to include in the final model.
Here we will use DESeq2 20 for normalization and EDA since it offers easy to use functions for performing basic analyses. Other popular tools such as edgeR 21 and limma 25 offer similar functionality, as well as more specialized packages for EDA such as EDASeq.
DESeq2 offers sophisticated normalization and transformation of count data in the form of the variance stabilized transformation: this adds a dynamic pseudo-count to normalized expression values before log transforming to dampen the inherent mean-variance relationship of count data. This is particularly useful for CAGE data, as CAGE can detect even very lowly expressed TSSs and enhancers.
First, we fit a "blind" version of the variance-stabilizing transformation, since we do not yet know what design is appropriate for this particular study: # Create DESeq2 object with blank design dds_blind <-DESeqDataSet(RSE, design = ~ 1)
We can tabulate the total number of DE TSSs and enhancers: Correcting expression estimates for batch effects. In addition to looking at estimates and significance for each cluster, we might also want to look at individual expression values for some top hits. However, we then need to also correct the expression estimates themselves for batch effects, just like we did for log fold changes and p-values (using the same model of course).
Here we use ComBat 35 from the sva package which is suitable for removing simple batch effects from small experiments. For more advanced setups, removeBatchEffect from limma can remove arbitrarily complex batch effects. The RUVSeq package and fsva from sva can be used to remove unknown batch effects. We again use the variance-stabilizing transformation to prepare the data for ComBat (this makes count data resemble expression estimates obtained from microarrays, as ComBat was originally developed for microarrays).

Enrichment of DNA-binding motifs. A typical question following identification of differentially expressed TSSs and enhancers, is what TFs might be involved in their regulation. To shed light on this question we can annotate
TSSs and enhancers with DNA-binding motifs from the JASPAR database 27 .
First we extract the sequences around TSSs and enhancers. Here we simply define it as +/-500 bp around TSS peak or enhancer midpoint: "Arnt" "Ahr::Arnt" "Mecom" "Nkx2-5" "Pax2" "Sox17" Thirdly, we use the motifmatchr package 37 to find hits in the sequences: A significant odds ratio above 1 indicate that FOS::JUN is a candidate transcription factor (or, more technically correct, a candidate transcription factor dimer) in regulation of the nanotube response. This is not surprising given that FOS::JUN is part of the TNF-alpha inflammatory pathway (see more below).
Of course, this is a just a very quick and simple analysis of motif enrichment. One could easily have used different regions around TSSs and enhancers and/or split the enrichment analysis between TSSs and enhancers. Other Bioconductor packages like PWMEnrich, rGADEM and motifcounter implements more advanced statistical methods for calculating enrichment of known motifs. rGADEM, BCRANK and motifRG can also be used to calculate enrichment of novel motifs, sometimes referred to as motif discovery.

Gene-level differential expression.
While CAGE data is naturally analyzed at the level of clusters (TSSs and enhancers) it is in many cases interesting to also look at gene-level expression estimates. A prime example of this is looking at enrichment of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms 28,29,30 which are only defined at gene-level. CAGEfightR includes functions for annotating clusters with gene models and summarizing expression to gene-level.
We can annotate clusters with gene IDs in the same manner as Transcript IDs: The gene IDs in this case is Entrez ID (which is widely used by Bioconductor packages). We can translate these systematic IDs into more human-readable symbols using the org.Mm.eg.db annotation package: # Translate symbols rowData(GSE)$symbol <-mapIds(odb, keys=rownames(GSE), column="SYMBOL", keytype="ENTREZID") #> 'select()' returned 1:1 mapping between keys and columns Having obtained a gene-level count matrix we can now perform gene-level DE analysis. Here we use limma-voom, since limma makes it easy to perform a subsequent enrichment analysis. Other tools such as DESeq2 (above) or edgeR (see below) could also have been used.
Note: limma is a powerful tool for DE analysis of count-based data. However, since it depends on log transforming counts, it is not always suitable for analyzing datasets where features have very low counts. This is usually not a problem for gene-level analysis, but can be a problem for enhancers, which are generally very lowly expressed.

Enrichment of GO-and KEGG-terms.
In addition to looking at individual top genes, we can look at how the differentially expressed genes relate to known databases of gene function to gain insight in what biological processes might be affected in the experiment.
KEGG-terms represents well defined pathways. We can use the pathview package 38 to investigate in more detail the genes in a given enriched pathway. For example, we can look at regulation of gene in the TNF-signalling pathway: # Visualize a KEGG DE_genes <-Filter(function(x) x != 0, dt[, "ClassNano"]) # This will save a png file to a temporary directory pathview(DE_genes, species="mmu", pathway.id="mmu04668", kegg.dir = tempdir()) Differential TSS Usage. In the above two analyses we looked at whether an individual TSSs or an individual gene was changing expression between experimental groups. However, we might also want to look at whether a gene show differential TSS usage: whether a gene uses different TSSs under different conditions. This problem is similar to differential splicing in RNA-Seq, but looking at TSSs rather than isoforms 26 . Here we will use the edgeR diffSpliceDGE method to find differential TSS usage, although many other packages could have been used, for example diffSplice from limma, DEXSeq, DRIMSeq, etc..
Intuitively, diffSpliceDGE tests whether a given TSSs show the same change as other TSSs in the same gene, indicating that TSSs are differentially regulated across the gene. This does however not take into account the relative composition of a given TSSs, e.g. whether a TSS increases from 1%-2% of gene output or 25%-50%. A useful preprocessing step is therefore to filter out TSSs making only a small contribution to total gene expression before analyses.
We can look at closer at the TSS usage in on of the top hits: We can visualize the batch-corrected expression (See above) of each TSS in the Fblim1 gene via a heatmap: RSE_filtered %>% subset(geneID == "74202") %>% assay("ComBat") %>% t %>% pheatmap(color = magma(100), cluster_cols = FALSE, main="Fblim1") Fblim1 has 3 TSSs, with 2 of them being used in the Ctrl samples, while the Nano samples also uses the chr4:141154044-141154185;-TSS, as also seen in the TSS-level  The Fblim1 gene uses two annotated TSSs, but the Nano samples also uses a novel intronic TSS.

Discussion
This workflow is intended as providing an outline of the basic building blocks of CAGE data analysis, going from clustering, to spatial analyses to differential expression. More advanced analyses can be strung together from these basic elements: Finding enhancers linked to DE TSSs, enhancer stretches composed of DE enhancer, comparing DNA binding motif enrichments between DE enhancers and TSSs, etc.
One aspect not covered in this workflow is the utility of CAGE data (and 5'-end data in general) in providing accurate TSSs for studying other types of data. For example, having accurate TSSs is highly beneficial in chromatin research, since the location and nucleosome and TSSs are closely related 13,39,40 . CAGE can be combined with chromatin confirmation assays such as HiC to find new enhancers that are both co-expressed and physically interacting with TSSs. Many genome-wide association studies are finding that disease-related genetic variants are found in intergenic regions, that are often poorly annotated. The accurate enhancer locations provided by CAGE can greatly aid interpretation of such variants 41 . The adherence of CAGEfightR to standard Bioconductor classes facilitates these inter-assay analyses by making it easy to mix-and-match multiple packages developed for different experimental assays.

Software and data availability
The following software versions were used in this article: 1.

Nevena Cvetesic
Institute of Clinical Sciences (ICS), Faculty of Medicine, Imperial College London, London, UK Thodberg and Sandelin present CAGEfightR, a new Bioconductor package suited for analysis of 5'end oriented datasets, derived from CAGE-seq and similar techniques. Although similar Bioconductor packages exist (icetea, CAGEr, TSRchitect), the greatest strength of CAGEfightR is unique in its ability to call or predict putative enhancers based on the bidirectional transcription initiation signature, thus filling the gap in the current R-based toolkit for CAGE-like data analysis. In addition, using CAGEfightR, hypotheses of enhancer-promoter interactions based on co-expression levels can be easily set and visualised.
The package is well-documented, and the step-by-step protocol well written and easy to follow. I only have a few minor comments which could benefit the general reader: The introduction and abstract could state more clearly that CAGE allows TSS mapping of only RNA polymerase II transcripts. Though this is implied through usage of cap-trapping, I would keep in mind that this workflow might be used by general readers not so familiar with TSS-mapping techniques. Considering the advent of technologies that capture RNA polI-RNApolIII transcripts, expected to be much noisier, it would be better to make it as clear as possible. If the authors believe CAGEfightR could be of use on noisier data, from experimental methods based on negative selection (such as TSS-seq), it would be worth testing this and including a few sentences, as this would promote CAGEfightR usage on any TSS mapping technique.
I support comparison of existing packages in the form of Table 1; however, I would expand this comparison to include unique features that perhaps CAGEfightR does not have -e.g. icetea and TSRchitect support paired-end data, CAGEr has TSS-shifting discovery function and implemented G-correction function to remove mismatching G's from 5'ends of reads.
Common problem with CAGE and CAGE-like data which is obtained through reverse transcription, is the addition of a G, or so called G-bias upstream of the true transcription start site. It would be beneficial for general readers as this is a step-by-step protocol to discuss how to generate BigWig files from fastq files, and how to address the G-bias problem/i.e. remove the mismatching G's at the 5'end of the reads.
In this paper, the authors present a cookbook for analyzing CAGE data mainly through their own Bioconductor package, CAGEfightR, applied to sample data, which were analyzed and published by their group previously. By following the given source codes, readers will be able to learn how they can obtain various kinds of information rather easily. Thus, I feel that this publication will be useful for those who want to learn how to analyze CAGE data quickly. On the other hand, it will not tell us how to solve our deeper problems in research. I know that this is out of the scope of this tutorial but I can't help but feeling, for example, how the given procedure for reducing batch effects can be justified (see below). Thus, here are some of my suggestions for its further improvement: In our realistic situations, the EDA approach is quite important. In this sense, I appreciate their demonstration on how to remove batch effects from the expression data between various samples, using ComBat (Figs. 9 and 11). However, most researchers will not be satisfied with just seeing that the PC1 has become to separate positive and negative groups; it is natural that they would like to confirm if the correction was enough or not; they would also wish to see what the new PC2 as well as the old PC1 represent. Therefore, I recommend the authors to extend Table 2 for characterizing each sample from various features (e.g., experimental conditions and data size) and to use such features for the interpretation. More discussion and/or additional attempts to clarify the most probable main reason for the initial batch effects would be desirable. Similarly, since the first author does not seem to have been a member of the previous analysis, it is interesting to see the consistency between the two studies. For example, the observation that inflammation-related genes were activated seems to be the same in both analyses. Then, are the genes with differential TSSs likely to explain the phenomenon? How much are the detected enhancers contributing to the differential expression? Do these enhancers (or newly activated TSSs) share any over-represented motifs? From the same reason, I recommend the authors to avoid using (ugly) chromosomal coordinates to represent genes/promoters/enhancers, wherever possible. It would be great if the authors can show that they could perform deeper analyses this time.
For the convenience of wider readers, it might be useful to show the way how to obtain BigWig files from rawer data (BAM or fastq, if possible). Similarly, a summary table of used tools (except CAGEfightR), containing their input file information as well as their main purposes might be useful. Also, it might be useful if there is a summary on what CAGEfightR can do/cannot do. For example, is it possible to combine different sources of CAGE data with this workflow? One of my students tried to follow the workflow. At first, she failed to install some of the packages. It was because the version of R she used was R3.5. Thus, this point should be clearly noted. In addition, she reports that to run the code "Fit and shrink DE model", the installation of "statmod"

Aaron Lun
Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, UK Thodberg and Sandelin describe a comprehensive workflow for the basic analysis of CAGE data. The workflow is simple, well-presented and the code mostly runs without problems. I only have a few minor comments to improve its usefulness for the general reader: There should be a general overview on how the CTSS BigWig files are generated from raw FASTQ files.
Consider using ExperimentHub for the nanotubes package. This provides greater control over which files are downloaded, rather than forcing the user to obtain all files at installation. This is especially useful if you have multiple data sets -see, for example, the chipseqDBData package.
Explain what a BigWigFileList is, and why it needs to be used instead of having a simple character vector.
Consider not having string'ified intervals as the row names in the output of quickTSSs(). In some situations, generation of the strings can use more RAM than the actual data itself. It definitely slows down any attempt to 'show' the output. I would suggest that this be deferred as late as possible, e.g., until someone needs the strings as row names of a data frame to save to file.
More details are required on how quantification is performed. For example, I assume counts are summed directly from single strands for TSSs. For enhancers, are counts summed from both strands?
"As enhancers are the most complicated type, we keep only enhancers if a TSS and enhancer overlaps:" The complexity of the enhancers doesn't really provide a motivation for only keeping enhancers in cases of overlaps. The better reason is that all enhancers would be detected as two TSSs if the strandedness was ignored; if they do overlap, it would be more appropriate to interpret them as a single enhancer rather than as two distinct TSS events.
The Interquartile Range (IQR)... of what? I assume that the range refers to the length of the interval that contains 10 to 90% of a TSS's counts. Incidentally, the IQR is no longer an IQR if it's redefined from 10-90%.
The single quotes in the highTSSs call are malformed, which prevents copy-pasting.
One could consider using a 2-component mixture model to classify elements into sharp/broad in a more automated manner.
I presume that the pairwise correlations for the TSS-enhancer interactions are computed across samples for each TSS/enhancer pair. If so, is this done after blocking on the design? Otherwise it is possible to obtain strong positive correlations simply because a TSS and the enhancer happen to respond in the same direction to a given treatment. If there is a genuine physical interaction, it should manifest as correlations within each treatment condition, where stochastic differences in RNA polymerase activity affect both the TSS and its interacting enhancer.
There is no correction for multiple testing in the p-values from the links. While I recognise that this 1