ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Software Tool Article
Revised

PEGS: An efficient tool for gene set enrichment within defined sets of genomic intervals

[version 2; peer review: 2 approved]
PUBLISHED 02 Nov 2021
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Bioinformatics gateway.

Abstract

Many biological studies of transcriptional control mechanisms produce lists of genes and non-coding genomic intervals from corresponding gene expression and epigenomic assays. In higher organisms, such as eukaryotes, genes may be regulated by distal elements, with these elements lying 10s–100s of kilobases away from a gene transcription start site. To gain insight into these distal regulatory mechanisms, it is important to determine comparative enrichment of genes of interest in relation to genomic regions of interest, and to be able to do so at a range of distances. Existing bioinformatics tools can annotate genomic regions to nearest known genes, or look for transcription factor binding sites in relation to gene transcription start sites. Here, we present PEGS (Peak set Enrichment in Gene Sets). This tool efficiently provides an exploratory analysis by calculating enrichment of multiple gene sets, associated with multiple non-coding elements (peak sets), at multiple genomic distances, and within topologically associated domains. We apply PEGS to gene sets derived from gene expression studies, and genomic intervals from corresponding ChIP-seq and ATAC-seq experiments to derive biologically meaningful results. We also demonstrate an extended application to tissue-specific gene sets and publicly available GWAS data, to find enrichment of sleep trait associated SNPs in relation to tissue-specific gene expression profiles.

Keywords

Genomic data integration, ChIP-seq, RNA-seq, gene set enrichment, genomic intervals

Revised Amendments from Version 1

In this version, in response to reviewer’s feedback, we have made following changes:

We have updated our software package to a newer version (0.6.3) with improved functionality in terms of command line options for input and output files. We have updated the documentation accordingly. We have also uploaded updated versions of all three figures and improved the text throughout the manuscript.

See the authors' detailed response to the review by Aziz Khan
See the authors' detailed response to the review by Nicolae Radu Zabet

Introduction

Gene expression control in higher organisms is achieved through a complex hierarchical process involving opening of chromatin, histone modifications, and binding of transcription factors (TFs). Experimental approaches to understand transcriptional regulatory mechanisms in a biological context involve large-scale measurement of gene expression. Depending on the design of the experiment, these analyses produce differentially expressed gene sets or clusters for further analysis. These studies are often complemented by assays which map, on a genome-wide scale, TF binding sites (ChIP-seq) or regions of chromatin accessibility (DNase-seq, ATAC-seq). Analyses of these data produce a collection of genomic intervals (peak sets). An important computational task is then to integrate these data to produce meaningful results; i.e. to relate gene sets to peak sets to aid functional interpretation. Bearing in mind distal regulation, an important consideration here is to be able to calculate gene set enrichment at multiple genomic distances from peak sets, and to be able to do this efficiently within the same analysis.

We present a new tool – PEGS (Peak set Enrichment in Gene Sets)1 – which calculates mutual enrichment of multiple gene sets associated with multiple peak sets, simultaneously and efficiently. This can be at user-defined peak-to-TSS (transcription start site) distances, as well as constraining to topologically associated domains (TADs). Thus, PEGS quickly produces an overall picture of gene set enrichment in relation to peaks, and shows at what genomic distances this is most pronounced. It is applicable to gene sets derived from any source, and peak sets derived from different epigenomic assays, as well as single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWAS).

Methods

Architecture and implementation

In PEGS, input peaks are extended in both directions using user-provided genomic distances or constrained within known TAD boundaries, if provided (Figure 1). Subsequently, the enrichment of the input gene set is calculated among the genes whose TSSs overlap with the extended peaks, separately for each genomic distance and/or TADs. These tasks are performed in PEGS as follows:

  • 1. Creating a gene interval file in BED (Browser Extensible Data) format for all TSSs in the given genome using refGene from UCSC Table Browser. This reference TSSs BED file only needs to be created once (human hg38 and mouse mm10 are provided with the tool; a utility is provided to create these for other genomes).

  • 2. For a given peak set, peaks are extended to a specified genomic distance in both directions (and up to overlapping TAD boundaries, if provided). Intersection of these extended peaks with the gene intervals BED file from step 1 is calculated using BEDTools (RRID:SCR_006646)2. This leads to a gene set with TSSs overlapping with extended peaks.

  • 3. Using the intersection of the input gene set, and unique genes from step 2 (thus removing genes with multiple TSSs), a Hypergeometric test is performed to calculate the p-value using Equation 1, similar to GREAT (RRID:SCR_00580)3. Here, M is the total number of genes in the genome, Nc is the number of genes in the input cluster/set c, Np is the number of unique genes overlapping the peaks for given distance and npc is the intersection of two gene sets.

c733f08d-f6af-47ab-8bf6-fec209314076_figure1.gif

Figure 1.

Cartoon showing peak expansion and overlapping TSSs in PEGS, with a specified genomic distance λ from the centre of the peak in both directions (a) where TSS2 and TSS3 are included, and a TAD overlapping with the left peak in (b) where all four TSSs within the TAD are included

pvalue=x=npcmin(Np,Nc)(Ncx)((MNp)(Ncx))(MNc)(1)

Step 2 and 3 are repeated for every combination of gene cluster, peak set and genomic distance and/or TADs. The final combined heatmap shows −log10 of the resulting p-values.

PEGS is implemented in Python 3, where we have reused functions from existing Python packages included with Python distributions, or available from the Python Package Index (PyPI). We also make use of BEDTools2 for working with genomic intervals. We provide online documentation (https://pegs.readthedocs.io/en/latest/), and an example analysis with input data at the PEGS GitHub repository.

Operation

PEGS works with Python >= 3.6 and, when installed through pip, automatically installs all the dependencies. These are listed in requirements.txt file in our PEGS GitHub repository. We provide extensive documentation online at https://pegs.readthedocs.io which includes easy-to-follow instructions about:

  • Installation and system requirements

  • Format of input files, output files, and graphics

  • PEGS commands for standard operations, as well as running PEGS with additional input options, e.g. TAD definition files

  • Creating customised reference TSSs files for new genomes

Results

Use cases

Here, we present three use cases where we apply PEGS to different publicly available data sets. The format of input files is the same for all use cases below. Gene clusters are provided as text files with one gene symbol on each line; genomic region coordinates are provided in standard BED format. These input files for Use Case 1, as well as example analysis reproducing Figure 2A, are provided in our GitHub repository (https://github.com/fls-bioinformatics-core/pegs).

c733f08d-f6af-47ab-8bf6-fec209314076_figure2.gif

Figure 2.

PEGS applications: (A) gene expression, ChIP-seq and DNase I data on mouse liver; upper two panels correspond to GR ChIP-seq and DNase peaks expanded to different genomic distances while the bottom panel shows both GR ChIP-seq and DNase peaks expanded to overlapping TAD boundaries (B) gene clusters derived from scRNA-seq and intergenic putative enhancer clusters from bulk ATAC-seq from three matching early stem cell differentiation time-points. In both plots, numbers in the cells show common genes among the input genes (x-axis) and genes overlapping with expanded peaks (y-axis) and the colour shows −log10 of p-value (Hypergeometric test).

Use case 1: Application of PEGS provides insight into glucocorticoid-mediated gene regulation in mouse liver

The first application (Figure 2A) uses the gene sets consisting of putative targets of glucocorticoid receptor (GR) in mouse liver. These are up- and down-regulated genes obtained by an RNA-seq study of liver samples from mice treated acutely with synthetic glucocorticoid dexamethasone or vehicle4. Corresponding GR ChIP-seq and chromatin accessibility data (DNase I hypersensitive (DHS) regions) were obtained from 5, and 6 respectively, whilst the mouse liver TAD boundaries were obtained from 7. Raw published datasets were downloaded from GEO Sequence Read Archive (RRID:SCR_005012) using sratoolkit v2.9.2 (http://ncbi.github.io/sra-tools/). Reads were aligned to the reference genome (mouse mm10), sorted and indexed using Bowtie2 (v2.3.4.3, RRID:SCR_005476,8) and SAMtools (v1.9, RRID:SCR_002105,9). MACS2 (v2.1.1.20160309, RRID:SCR_013291,10) was used to call peaks, using default settings. Using these GR ChIP-seq peaks and DHSs as peak sets, PEGS analysis shows strong association of dexamethasone up-regulated genes with dexamethasone-induced GR peaks at distances up to 500kbp from these peaks, but no enrichment of down-regulated genes (Figure 2A, top panel), indicating distinct mechanisms of gene activation and repression by glucocorticoids. At the same time, there is promoter proximal enrichment for both up- and down-regulated genes in the DHS regions (Figure 2A, middle panel). On the other hand, PEGS analysis using TADs boundaries (Figure 2A, bottom panel) shows significant enrichment only in the case of up-regulated genes in GR ChIP-seq. This is suggestive of a direct role for GR in gene activation rather than repression.

Use case 2: PEGS demonstrates association of differential chromatin accessibility and gene expression during embryonic stem cell differentiation

Next, using PEGS, we calculated enrichment of gene clusters derived from single-cell RNA-seq and open chromatin regions defined by bulk ATAC-seq at three matching time points (ESCs- embryonic stem cells, day1 EpiLCs - epiblast-like cells, day2 EpiLCs) during early embryonic stem cell differentiation11. Early embryonic development (naïve mouse ESCs to EpiLCs) involves large changes in the chromatin landscape through the action of many transcription factors and chromatin regulators leading to specific gene expression programs. Using our publicly available data11, we defined our open chromatin peak sets as the intergenic regions with differential accessibility between any two time points. These were clustered into four profiles based on z-score of tag densities, as described in 11. Similarly, differentially expressed genes were identified from pseudo-bulk gene expression data at each time point, and were similarly clustered into four patterns across three time points. These constituted our gene sets for PEGS analysis. As shown in Figure 2B, application of PEGS to these data shows strong association between the matching gene expression (x-axis) and chromatin opening profiles (y-axis) at intergenic enhancers, reflecting correspondence between differential accessibility and gene expression changes at corresponding time points. This application shows the utility of PEGS for integrating chromatin accessibility and gene expression data, leading to biologically meaningful association of enhancer and gene clusters.

Use case 3: Extended application: PEGS detects enrichment of sleep trait SNPs in tissue-specific genes

Genome-wide association studies (GWAS) are commonly employed to study genotype-disease associations. Here, we present an extended application of PEGS to GWAS data and find associations of SNPs for different sleep phenotypes with sets of tissue-specific genes from the Genotype-Tissue Expression (GTEx) Portal, RRID:SCR_013042). For this purpose, we downloaded GWAS data from the Sleep Disorder Knowledge Portal (RRID:SCR_016611) which provides data, as well as analysis and visualisation resources for human genetic information regarding sleep and related traits. We extracted SNPs (single nucleotide polymorphisms) for certain sleep associated phenotypes (with genome wide p-value cutoff <=5 e − 8). These SNPs constitute our input peak sets to PEGS, while we defined corresponding gene sets as tissue-specific genes from GTEx portal. These were created as following; we obtained median transcripts per million (TPM) data for different tissues in GTEx, and a gene list for a specific tissue was defined as genes with greater than 5x median TPM compared to the average in the remaining tissues.

In Figure 3, using PEGS, we show enrichment of SNPs from three sleep related phenotypes, namely chronotype, daytime sleepiness adjusted for BMI, and sleep duration. These enrichments are calculated for tissue-specific genes lists created from GTEx for 22 tissues, the majority of them from the brain. Application of PEGS to these data reveals some strong associations, e.g. chronotype SNPs strongly enriched for genes expressed in liver and blood, while daytime sleepiness SNPs are enriched in gene sets for different brain tissues. Some of these associations are reported in the literature, e.g. daytime sleepiness SNPs in brain tissue12, others may warrant further investigation.

c733f08d-f6af-47ab-8bf6-fec209314076_figure3.gif

Figure 3. Enrichment of sleep traits SNPs in tissue-specific gene lists (GTEx).

The x-axis shows different tissue-specific gene lists, and y-axis shows three sets of sleep related SNPs, expanded to multiple genomic distances. The colour of the cells show −log10 of p-value of enrichment of corresponding gene list (x-axis) in the genes identified through overlap with expanded SNP intervals, the numbers in the cells show the common genes among the two (used in the calculation of Hypergeometric p-value)

Conclusions

Through the three different applications above, we demonstrate that PEGS is a versatile and highly efficient tool to integrate different genomic data, and is able to generate hypotheses for further analysis. The implementation of PEGS is highly efficient and as an example of computational efficiency, with pre-created reference TSS files, it only took 7.6 seconds to produce the output for Figure 2A on a laptop with Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor with 16GB RAM.

Furthermore, the user can adjust the background population and control for bias. For example, depending on the scientific question at hand, the background population could be limited to include only those genes known to be expressed in the tissue of interest. The efficiency of PEGS allows multiple gene and peak input files (e.g. with varying false discovery rate or fold-change cut-offs) to be tested quickly.

PEGS analysis is limited to enhancer-genes associations based on genomic proximity. It builds on some aspects of, and is complementary to, GREAT3, an existing tool, which performs functional enrichment of regulatory regions using annotations of nearby genes. PEGS could also be used in conjunction with other tools to gain further mechanistic understanding (e.g. by finding enriched transcription factors with TFEA.ChIP13, ranking of their target genes with Cistrome-GO14 or BETA15, or predicting which TFs might regulate differentially expressed gene sets with Lisa16).

Data availability

All data underlying the results are available as part of the article or available publicly.

Software availability

Software is available from Zenodo: https://doi.org/10.5281/zenodo.55962241. It is easily installable through the Python Package Index (PyPI).

Source code available from Github: https://github.com/fls-bioinformatics-core/pegs.

Archived source code at the time of publication: https://doi.org/10.5281/zenodo.55962241

License: PEGS is distributed under BSD 3-Clause license.

Online manual: https://pegs.readthedocs.io

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 15 Jul 2021
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Briggs P, Hunter AL, Yang Sh et al. PEGS: An efficient tool for gene set enrichment within defined sets of genomic intervals [version 2; peer review: 2 approved]. F1000Research 2021, 10:570 (https://doi.org/10.12688/f1000research.53926.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 2
VERSION 2
PUBLISHED 02 Nov 2021
Revised
Views
7
Cite
Reviewer Report 25 Feb 2022
Aziz Khan, Stanford Cancer Institute, Stanford University, Stanford, CA, USA 
Approved
VIEWS 7
The authors addressed all of ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Khan A. Reviewer Report For: PEGS: An efficient tool for gene set enrichment within defined sets of genomic intervals [version 2; peer review: 2 approved]. F1000Research 2021, 10:570 (https://doi.org/10.5256/f1000research.78711.r98741)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
8
Cite
Reviewer Report 05 Nov 2021
Nicolae Radu Zabet, Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, E1 2AT, UK 
Approved
VIEWS 8
The authors have ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Zabet NR. Reviewer Report For: PEGS: An efficient tool for gene set enrichment within defined sets of genomic intervals [version 2; peer review: 2 approved]. F1000Research 2021, 10:570 (https://doi.org/10.5256/f1000research.78711.r98740)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Version 1
VERSION 1
PUBLISHED 15 Jul 2021
Views
30
Cite
Reviewer Report 26 Aug 2021
Aziz Khan, Stanford Cancer Institute, Stanford University, Stanford, CA, USA 
Approved
VIEWS 30
In this paper, the authors presented a Python-based command-line tool, PEGS, for gene set enrichment in association with genomic regions. PEGS computes the enrichment of gene sets with proximity-based association with genomic region sets. These associations are further restricted within ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Khan A. Reviewer Report For: PEGS: An efficient tool for gene set enrichment within defined sets of genomic intervals [version 2; peer review: 2 approved]. F1000Research 2021, 10:570 (https://doi.org/10.5256/f1000research.57359.r89907)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 02 Nov 2021
    Mudassar Iqbal, Division of Informatics, Imaging & Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, M13 9PL, UK
    02 Nov 2021
    Author Response
    We thank the reviewer for constructive review and useful suggestions. We have updated the software taking into account the reviewer’s suggestions, improved the manuscript overall and added more text. Here ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 02 Nov 2021
    Mudassar Iqbal, Division of Informatics, Imaging & Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, M13 9PL, UK
    02 Nov 2021
    Author Response
    We thank the reviewer for constructive review and useful suggestions. We have updated the software taking into account the reviewer’s suggestions, improved the manuscript overall and added more text. Here ... Continue reading
Views
31
Cite
Reviewer Report 02 Aug 2021
Nicolae Radu Zabet, Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, E1 2AT, UK 
Approved
VIEWS 31
Briggs and co-authors present a new tool called PEGS to generate gene set enrichment for ChIP-seq and DNase-seq datasets. In fact, the tool can be applied to any set of genomic intervals, including SNPs datasets. Generation of gene set enrichment ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Zabet NR. Reviewer Report For: PEGS: An efficient tool for gene set enrichment within defined sets of genomic intervals [version 2; peer review: 2 approved]. F1000Research 2021, 10:570 (https://doi.org/10.5256/f1000research.57359.r89906)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 02 Nov 2021
    Mudassar Iqbal, Division of Informatics, Imaging & Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, M13 9PL, UK
    02 Nov 2021
    Author Response
    We thank the reviewer for constructive comments, here we will address their points one by one.

    1 - Do you consider alternative TSS? Would a gene with multiple TSSs ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 02 Nov 2021
    Mudassar Iqbal, Division of Informatics, Imaging & Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, M13 9PL, UK
    02 Nov 2021
    Author Response
    We thank the reviewer for constructive comments, here we will address their points one by one.

    1 - Do you consider alternative TSS? Would a gene with multiple TSSs ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 15 Jul 2021
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.