Keywords
Genomic data integration, ChIP-seq, RNA-seq, gene set enrichment, genomic intervals
This article is included in the Bioinformatics gateway.
Genomic data integration, ChIP-seq, RNA-seq, gene set enrichment, genomic intervals
In this version, in response to reviewer’s feedback, we have made following changes:
We have updated our software package to a newer version (0.6.3) with improved functionality in terms of command line options for input and output files. We have updated the documentation accordingly. We have also uploaded updated versions of all three figures and improved the text throughout the manuscript.
See the authors' detailed response to the review by Aziz Khan
See the authors' detailed response to the review by Nicolae Radu Zabet
Gene expression control in higher organisms is achieved through a complex hierarchical process involving opening of chromatin, histone modifications, and binding of transcription factors (TFs). Experimental approaches to understand transcriptional regulatory mechanisms in a biological context involve large-scale measurement of gene expression. Depending on the design of the experiment, these analyses produce differentially expressed gene sets or clusters for further analysis. These studies are often complemented by assays which map, on a genome-wide scale, TF binding sites (ChIP-seq) or regions of chromatin accessibility (DNase-seq, ATAC-seq). Analyses of these data produce a collection of genomic intervals (peak sets). An important computational task is then to integrate these data to produce meaningful results; i.e. to relate gene sets to peak sets to aid functional interpretation. Bearing in mind distal regulation, an important consideration here is to be able to calculate gene set enrichment at multiple genomic distances from peak sets, and to be able to do this efficiently within the same analysis.
We present a new tool – PEGS (Peak set Enrichment in Gene Sets)1 – which calculates mutual enrichment of multiple gene sets associated with multiple peak sets, simultaneously and efficiently. This can be at user-defined peak-to-TSS (transcription start site) distances, as well as constraining to topologically associated domains (TADs). Thus, PEGS quickly produces an overall picture of gene set enrichment in relation to peaks, and shows at what genomic distances this is most pronounced. It is applicable to gene sets derived from any source, and peak sets derived from different epigenomic assays, as well as single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWAS).
In PEGS, input peaks are extended in both directions using user-provided genomic distances or constrained within known TAD boundaries, if provided (Figure 1). Subsequently, the enrichment of the input gene set is calculated among the genes whose TSSs overlap with the extended peaks, separately for each genomic distance and/or TADs. These tasks are performed in PEGS as follows:
1. Creating a gene interval file in BED (Browser Extensible Data) format for all TSSs in the given genome using refGene from UCSC Table Browser. This reference TSSs BED file only needs to be created once (human hg38 and mouse mm10 are provided with the tool; a utility is provided to create these for other genomes).
2. For a given peak set, peaks are extended to a specified genomic distance in both directions (and up to overlapping TAD boundaries, if provided). Intersection of these extended peaks with the gene intervals BED file from step 1 is calculated using BEDTools (RRID:SCR_006646)2. This leads to a gene set with TSSs overlapping with extended peaks.
3. Using the intersection of the input gene set, and unique genes from step 2 (thus removing genes with multiple TSSs), a Hypergeometric test is performed to calculate the p-value using Equation 1, similar to GREAT (RRID:SCR_00580)3. Here, M is the total number of genes in the genome, Nc is the number of genes in the input cluster/set c, Np is the number of unique genes overlapping the peaks for given distance and npc is the intersection of two gene sets.
Cartoon showing peak expansion and overlapping TSSs in PEGS, with a specified genomic distance λ from the centre of the peak in both directions (a) where TSS2 and TSS3 are included, and a TAD overlapping with the left peak in (b) where all four TSSs within the TAD are included
Step 2 and 3 are repeated for every combination of gene cluster, peak set and genomic distance and/or TADs. The final combined heatmap shows −log10 of the resulting p-values.
PEGS is implemented in Python 3, where we have reused functions from existing Python packages included with Python distributions, or available from the Python Package Index (PyPI). We also make use of BEDTools2 for working with genomic intervals. We provide online documentation (https://pegs.readthedocs.io/en/latest/), and an example analysis with input data at the PEGS GitHub repository.
PEGS works with Python >= 3.6 and, when installed through pip, automatically installs all the dependencies. These are listed in requirements.txt file in our PEGS GitHub repository. We provide extensive documentation online at https://pegs.readthedocs.io which includes easy-to-follow instructions about:
Here, we present three use cases where we apply PEGS to different publicly available data sets. The format of input files is the same for all use cases below. Gene clusters are provided as text files with one gene symbol on each line; genomic region coordinates are provided in standard BED format. These input files for Use Case 1, as well as example analysis reproducing Figure 2A, are provided in our GitHub repository (https://github.com/fls-bioinformatics-core/pegs).
PEGS applications: (A) gene expression, ChIP-seq and DNase I data on mouse liver; upper two panels correspond to GR ChIP-seq and DNase peaks expanded to different genomic distances while the bottom panel shows both GR ChIP-seq and DNase peaks expanded to overlapping TAD boundaries (B) gene clusters derived from scRNA-seq and intergenic putative enhancer clusters from bulk ATAC-seq from three matching early stem cell differentiation time-points. In both plots, numbers in the cells show common genes among the input genes (x-axis) and genes overlapping with expanded peaks (y-axis) and the colour shows −log10 of p-value (Hypergeometric test).
The first application (Figure 2A) uses the gene sets consisting of putative targets of glucocorticoid receptor (GR) in mouse liver. These are up- and down-regulated genes obtained by an RNA-seq study of liver samples from mice treated acutely with synthetic glucocorticoid dexamethasone or vehicle4. Corresponding GR ChIP-seq and chromatin accessibility data (DNase I hypersensitive (DHS) regions) were obtained from 5, and 6 respectively, whilst the mouse liver TAD boundaries were obtained from 7. Raw published datasets were downloaded from GEO Sequence Read Archive (RRID:SCR_005012) using sratoolkit v2.9.2 (http://ncbi.github.io/sra-tools/). Reads were aligned to the reference genome (mouse mm10), sorted and indexed using Bowtie2 (v2.3.4.3, RRID:SCR_005476,8) and SAMtools (v1.9, RRID:SCR_002105,9). MACS2 (v2.1.1.20160309, RRID:SCR_013291,10) was used to call peaks, using default settings. Using these GR ChIP-seq peaks and DHSs as peak sets, PEGS analysis shows strong association of dexamethasone up-regulated genes with dexamethasone-induced GR peaks at distances up to 500kbp from these peaks, but no enrichment of down-regulated genes (Figure 2A, top panel), indicating distinct mechanisms of gene activation and repression by glucocorticoids. At the same time, there is promoter proximal enrichment for both up- and down-regulated genes in the DHS regions (Figure 2A, middle panel). On the other hand, PEGS analysis using TADs boundaries (Figure 2A, bottom panel) shows significant enrichment only in the case of up-regulated genes in GR ChIP-seq. This is suggestive of a direct role for GR in gene activation rather than repression.
Next, using PEGS, we calculated enrichment of gene clusters derived from single-cell RNA-seq and open chromatin regions defined by bulk ATAC-seq at three matching time points (ESCs- embryonic stem cells, day1 EpiLCs - epiblast-like cells, day2 EpiLCs) during early embryonic stem cell differentiation11. Early embryonic development (naïve mouse ESCs to EpiLCs) involves large changes in the chromatin landscape through the action of many transcription factors and chromatin regulators leading to specific gene expression programs. Using our publicly available data11, we defined our open chromatin peak sets as the intergenic regions with differential accessibility between any two time points. These were clustered into four profiles based on z-score of tag densities, as described in 11. Similarly, differentially expressed genes were identified from pseudo-bulk gene expression data at each time point, and were similarly clustered into four patterns across three time points. These constituted our gene sets for PEGS analysis. As shown in Figure 2B, application of PEGS to these data shows strong association between the matching gene expression (x-axis) and chromatin opening profiles (y-axis) at intergenic enhancers, reflecting correspondence between differential accessibility and gene expression changes at corresponding time points. This application shows the utility of PEGS for integrating chromatin accessibility and gene expression data, leading to biologically meaningful association of enhancer and gene clusters.
Genome-wide association studies (GWAS) are commonly employed to study genotype-disease associations. Here, we present an extended application of PEGS to GWAS data and find associations of SNPs for different sleep phenotypes with sets of tissue-specific genes from the Genotype-Tissue Expression (GTEx) Portal, RRID:SCR_013042). For this purpose, we downloaded GWAS data from the Sleep Disorder Knowledge Portal (RRID:SCR_016611) which provides data, as well as analysis and visualisation resources for human genetic information regarding sleep and related traits. We extracted SNPs (single nucleotide polymorphisms) for certain sleep associated phenotypes (with genome wide p-value cutoff <=5 e − 8). These SNPs constitute our input peak sets to PEGS, while we defined corresponding gene sets as tissue-specific genes from GTEx portal. These were created as following; we obtained median transcripts per million (TPM) data for different tissues in GTEx, and a gene list for a specific tissue was defined as genes with greater than 5x median TPM compared to the average in the remaining tissues.
In Figure 3, using PEGS, we show enrichment of SNPs from three sleep related phenotypes, namely chronotype, daytime sleepiness adjusted for BMI, and sleep duration. These enrichments are calculated for tissue-specific genes lists created from GTEx for 22 tissues, the majority of them from the brain. Application of PEGS to these data reveals some strong associations, e.g. chronotype SNPs strongly enriched for genes expressed in liver and blood, while daytime sleepiness SNPs are enriched in gene sets for different brain tissues. Some of these associations are reported in the literature, e.g. daytime sleepiness SNPs in brain tissue12, others may warrant further investigation.
The x-axis shows different tissue-specific gene lists, and y-axis shows three sets of sleep related SNPs, expanded to multiple genomic distances. The colour of the cells show −log10 of p-value of enrichment of corresponding gene list (x-axis) in the genes identified through overlap with expanded SNP intervals, the numbers in the cells show the common genes among the two (used in the calculation of Hypergeometric p-value)
Through the three different applications above, we demonstrate that PEGS is a versatile and highly efficient tool to integrate different genomic data, and is able to generate hypotheses for further analysis. The implementation of PEGS is highly efficient and as an example of computational efficiency, with pre-created reference TSS files, it only took 7.6 seconds to produce the output for Figure 2A on a laptop with Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor with 16GB RAM.
Furthermore, the user can adjust the background population and control for bias. For example, depending on the scientific question at hand, the background population could be limited to include only those genes known to be expressed in the tissue of interest. The efficiency of PEGS allows multiple gene and peak input files (e.g. with varying false discovery rate or fold-change cut-offs) to be tested quickly.
PEGS analysis is limited to enhancer-genes associations based on genomic proximity. It builds on some aspects of, and is complementary to, GREAT3, an existing tool, which performs functional enrichment of regulatory regions using annotations of nearby genes. PEGS could also be used in conjunction with other tools to gain further mechanistic understanding (e.g. by finding enriched transcription factors with TFEA.ChIP13, ranking of their target genes with Cistrome-GO14 or BETA15, or predicting which TFs might regulate differentially expressed gene sets with Lisa16).
All data underlying the results are available as part of the article or available publicly.
Software is available from Zenodo: https://doi.org/10.5281/zenodo.55962241. It is easily installable through the Python Package Index (PyPI).
Source code available from Github: https://github.com/fls-bioinformatics-core/pegs.
Archived source code at the time of publication: https://doi.org/10.5281/zenodo.55962241
License: PEGS is distributed under BSD 3-Clause license.
Online manual: https://pegs.readthedocs.io
The authors thankfully acknowledge useful discussions with Magnus Rattray and Leo Zeef.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bioinformatics, gene regulation, regulatory genomics, epigenomics
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational biology, bioinformatics, chromatin and epigenetics
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bioinformatics, gene regulation, regulatory genomics, epigenomics
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational biology, bioinformatics, chromatin and epigenetics
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 2 (revision) 02 Nov 21 |
read | read |
Version 1 15 Jul 21 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)