ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Software Tool Article

sSNAPPY: an R/Bioconductor package for single-sample directional pathway perturbation analysis

[version 1; peer review: 2 approved with reservations]
PUBLISHED 13 Jun 2024
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Bioconductor gateway.

Abstract

A common outcome of analysing RNA-Seq data is the detection of biological pathways with significantly altered activity between the conditions under investigation. Whilst many strategies test for over-representation of genes, showing changed expression within pre-defined gene-sets, these analyses typically do not account for gene-gene interactions encoded by pathway topologies, and are not able to directly predict the directional change of pathway activity. To address these issues we have developed sSNAPPY,now available as an R/Bioconductor package, which leverages pathway topology information to compute pathway perturbation scores and predict the direction of change across a set of pathways.

Here, we demonstrate the use of sSNAPPY by applying the method to public scRNA-seq data, derived from ovarian cancer patient tissues collected before and after chemotherapy. Not only were we able to predict the direction of pathway perturbations discussed in the original study, but sSNAPPY was also able to detect significant changes of other biological processes, yielding far greater insight into the response to treatment. sSNAPPY represents a novel pathway analysis strategy that takes into consideration pathway topology to predict impacted biology pathways, both within related samples and across treatment groups. In addition to not relying on differentially expressed genes, the method and associated R package offers important flexibility and provides powerful visualisation tools.

R version: R version 4.3.3 (2024-02-29)

Bioconductor version: 3.18

Package: 1.6.1

Keywords

RNA-Seq, pathway enrichment, R package, topology, KEGG, Reactome, scRNA-seq

Introduction

Using pathway enrichment analysis to gain biological insights from gene expression data is a pivotal step in the analysis and interpretation of RNA-seq data, for which numerous methods have been developed (reviewed in1,2). Many existing methods tend to view pathways simply as a collection of gene names, as seen in those relying on the detection of differentially expressed genes and applying over-representation analysis (ORA) strategies, with alternative methods being those scoring all genes using functional class scoring (FCS), such as in Gene Set Enrichment Analysis (GSEA),3 arguably the most widely-used approach. However, databases such as the Kyoto Encyclopaedia of Genes and Genomes (KEGG)4 and WikiPathways5 capture not only which genes are implicated in a certain biological process but also their interactions, activating or inhibitory roles, and their relative importance within the pathway, all of which are overlooked in ORA- and FCS-based approaches.

To fully utilise this additional information, the latest generation of pathway analysis approaches include many which are topology-based such as SPIA,6 DEGraph,7 NetGSA8 and PRS,9 as well as others which explicitly model inter-gene correlations.10 Despite differences in the null hypotheses tested across these approaches, overall, they have demonstrated enhanced sensitivity and specificity due to their abilities to take gene-gene interconnections into account.11,12 Nevertheless, most topology-based methods focus only on comparing the activity of pathways between two treatment groups and cannot be used to score individual samples (Figure 1). However, in heterogeneous data where more than one factor may be influencing observations,13 incorporating scoring within paired samples may be desirable and may be able to reveal more nuanced insights. To address this gap, we present a Single-Sample directioNAl Pathway Perturbation analYsis methodology called sSNAPPY, available as an R/Bioconductor package. This article defines how sSNAPPY computes changes in gene expression within paired samples and propagates this through gene-set topologies, to predict the perturbation in pathway activities within paired samples, before providing summarised results across an entire dataset (Figure 1). The practical usage of the sSNAPPY R/Bioconductor package is illustrated through the analysis of a public scRNA-seq dataset using a pseudo-bulk approach, demonstrating its applicability to both bulk RNA-Seq and scRNA-Seq datasets.

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure1.gif

Figure 1. Schematic illustration of the differences between conventional pathway analysis methods and sSNAPPY.

Instead of being limited to treatment-level analyses, sSNAPPY allows the detection of pathway perturbation within individual samples by using sample-specific estimates of fold-change instead of experiment-wide estimates. (Created with BioRender.com).

Methods

Implementation

sSNAPPY is an R package that has been reviewed and published on the open-source bioinformatics software platform Bioconductor with all source code available via GitHub. The methodology itself is topology-based, designed to compute directional, single-sample, pathway perturbation scores for gene expression datasets with matched-pair, or nested designs (e.g. samples collected before and after treatment). Common examples of such designs may include treated vs control samples within cell-line passages, or across multiple treatments applied to tissue samples from within a specific donor. This allows for the detection of pathway perturbations within all samples from a treatment group, but also within individual samples, with sSNAPPY providing results in the form of pathway perturbation scores 1) for each set of paired samples, and 2) across all paired samples within an experimental grouping. The only data required to run sSNAPPY, is a log-transformed expression matrix (e.g. logCPM) with matching sample metadata describing treatment groups and the nested structure. It is assumed that all pre-processing has been performed beforehand, such as the exclusion of low-signal genes or normalisation to minimise technical artefacts like GC-bias.

The first step performed by sSNAPPY, is to estimate sample-specific log fold-change (δghi=μghiμg0i) across all genes g, for each treatment h, and within each set of nested replicates i, by subtracting expression estimates for the baseline samples μg0i from those in treatment group h. It should also be noted that sSNAPPY is applicable to any number of treatment/condition levels and that sample numbers within each treatment group are not required to be balanced.

It is well known that in RNA-seq data, genes with lower expression tend to have greater variability in signal and more broadly spread estimates of change.14 As such, we utilise a gene-level weighting strategy to down-weight fold-change estimates for low-abundance genes prior to passing these estimates to sSNAPPY. Gene-level weights wg are obtained in a treatment-agnostic manner by fitting a loess curve through the relationship between observed gene-level variance (σg2) and average signal (μ¯g)(Figure 2), and taking the inverse of the loess-predicted variance as the weight wg=a/f(μ¯g), where f(μ¯g) is the predicted value from the loess curve and the constant a ensures wg=1. We then use these globally-weighted estimates of log fold-change (δghi=wgδghi) in the calculation of all subsequent pathway perturbation scores.

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure2.gif

Figure 2. Gene-wise standard deviations shown against mean logCPM.

The relationship between standard deviations and expression levels is modelled by a loess fit. Whilst standard deviations are shown for the purposes of visualisation, gene-level weights are calculated using variances at this stage of the sSNAPPY algorithm.

sSNAPPY extends the topology-based scoring algorithm initially proposed in SPIA6 which propagates fold-change estimates from genes considered as differentially expressed through pathway topologies, to compute a perturbation score for each pathway. In contrast to SPIA, which relies on a defined set of differentially expressed genes, sSNAPPY uses fold-change estimates from all detected genes. By modifying the algorithm to incorporate single-sample, weighted estimates of fold-change, we are able to numerically represent changes in a pathway within a given sample, and subsequently model these across all samples within a treatment group. Thus, we define the single-sample perturbation score (Ship) for a given pathway p and treatment h for a set of nested samples i:

Ship=gGp[Sghipδghi],whereSghip=δghi+gUgpβggpSghipNgp
and:
  • Gp represents the set of genes in pathway p, such that gGp

  • Sghip is the gene-, treatment- and sample-specific perturbation score for pathway p

  • δghi=wgδghi is the weighted log fold-change of gene g as described above

  • Ugp is the subset of Gp containing only the genes directly upstream of gene g, and not including gene g

  • βggp is the pair-wise gene-gene interactions6 encoded by the topology matrix for genes g and gUgp

  • Ngp is the number of downstream genes from any gene g

  • Ship is the accumulated pathway perturbation score for pathway p in treatment h within sample i across all genes in the pathway

To scale single-sample pathway perturbation scores (Ship) so they are comparable across pathways, and to test for significance of individual scores, null distributions of perturbation scores for each pathway are generated through a sample permutation strategy, which retains any existing correlation structures between genes within a pathway. During permutation, all sample labels are randomly shuffled and permuted pseudo-pairs formed from the re-shuffled labels. Single-sample fold-changes are then calculated for each pseudo-pair of permuted samples while the rest of the scoring algorithm remains unchanged. The median and median absolute deviation (MAD) are calculated from the set of permuted perturbation scores within each pathway and used to normalise the raw perturbation scores to robust Z-scores, i.e. Zhip.

All possible permuted pseudo-pairs are sampled unless otherwise specified, such that in an experiment with I total samples, the maximum number of unique permuted pairs is P=I!(I2)!=I×(I1). Permutation p-values are calculated for each Ship value, indicating the approximate significance of pathway perturbation at the single-sample level. Relying on symmetry around zero, these are derived by assessing the proportion of permuted scores with absolute values as extreme, or more extreme, than the absolute value of test perturbation score within each pathway.15 Since the smallest achievable permutation p-value is 1/P, accurate estimation of small p-value requires a large number of permutations that is only feasible in data with a large sample size. As a guideline, GSEA recommends a minimum of 7 samples in each treatment group for utilizing their phenotype permutation approach,16 and under sSNAPPY this would yield P = 182 unique permuted pseudo-pairs and minimal permutation p-values of p0.0055at each individual sample and pathway-level.

Apart from assessing whether a pathway’s activity is changed significantly within an individual sample, an important question may be the detection of changes in pathway activity across all samples within a treatment. This can be performed by modelling Zhip values using regression models and incorporating Smyth’s moderated t-statistic17 as implemented in limma.18 The single-sample nature of sSNAPPY’s pathway perturbation scores is particularly helpful for datasets with complex experimental designs or known confounding factors as these can also be incorporated into the final regression models.

The Bioconductor package graphite19 provides functions that can be used to retrieve pathway topologies from a database and convert topology information to adjacency matrices. To streamline this process, we have implemented a convenience function, where users only need to provide the name of the desired database and species to retrieve all topology information in the format required by the scoring algorithm with the correct type of gene identifiers (i.e. EntrezID). Importantly, we noted that graphite19 retrieves topology matrices in varied orientations from different databases; in KEGG pathways, columns represent downstream relationships, whereas in Reactome and WikiPathway, columns indicate upstream regulation. However, this discrepancy is not taken account of in the runSPIA() function of graphite. We resolve the issue by transposing the topology matrices retrieved from the Reactome and WikiPathway as part of our convenience function.

Operation

The package has been tested on all operating systems, requiring R > 4.3.0, and can be installed using BiocManager as follows.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("sSNAPPY")

Use cases

Required packages

The set of additional required packages for this workflow can be installed using

pkg <- c(
  "tidyverse", "patchwork", "kableExtra", "AnnotationHub", "edgeR", "here",
  "pander", "colorspace", "SPIA", "fgsea"
)
BiocManager::install(pkg, update = FALSE, force = FALSE, ask = FALSE)

Once installed, the complete set of packages can be loaded into the session.

library(sSNAPPY)
library(tidyverse)
library(magrittr)
library(patchwork)
library(kableExtra)
library(AnnotationHub)
library(edgeR)
library(pander)
library(colorspace)
library(SPIA)
library(fgsea)

Retrieval of pathway topology matrices

Retrieving pathway topology information from a chosen database is the next key step required during preparation for running sSNAPPY, and this is the only step requiring internet access. If running on an HPC cluster where internet access may be restricted, this step can be performed separately with topologies saved by the user as an RDS object. Using the Reactome database20 in this workflow, the retrieved topology information will be stored as a list where each element corresponds to a pathway and the numbers in the matrices encode gene-gene interactions.

gsTopology <- retrieve_topology(database = "reactome", species = "hsapiens")

In addition to downloading topology matrices for all pathways, it is also possible to provide a restricted set of keywords for a targeted analysis. For example, passing the argument keyword = c("metabolism", "estrogen") would only return the subset of pathways which match either of these keywords. Multiple databases are also able to be searched by passing a vector of database names to the database argument.

Data

To demonstrate the application of sSNAPPY, we used pre-processed counts from a publicly available scRNA-seq dataset, retrieved from Gene Expression Omnibus (GEO) with accession code GSE165897. This dataset consists of 11 high-grade serous ovarian cancer (HGSOC) patient samples taken before and after chemotherapy.21 sSNAPPY was used to re-analyse data from the epithelial cells as they were the primary focus of the original study. Since sSNAPPY was designed primarily for bulk RNA-seq data, counts from epithelial cells within the same samples were first summed into pseudo-bulk profiles, giving rise to a total of 22 samples. We considered a gene detectable if we observed >1.5 counts per million in 11 of the 22 samples, ideally representing all samples from a complete treatment group. 11,101 (33.8%) of the 32,847 annotated genes passed this selection criteria and were included for downstream analyses. Conditional quantile normalisation22 was then applied to mitigate potential biases introduced by gene length and GC content. The normalised logCPM matrix of the processed dataset and sample metadata can be downloaded from here.

To begin running the sSNAPPY workflow, we first load our expression matrix and define our sample-level metadata. Importantly, the row names of the expression matrix must be specified as EntrezGene IDs, for compatibility with pathway databases. Genes without EntrezGene IDs were excluded during pre-processing, leaving 10,098 genes in the example expression matrix. The treatment column within our metadata is expected to be a factor, with the reference level interpreted as the control treatment.

readr::local_edition(1)
logCPM <- read_tsv(here::here("data/logCPM.tsv")) %>%
  column_to_rownames("entrezid")
sample_meta <- read_tsv(here::here("data/sample_meta.tsv"), col_types = "cfccncnc")
head(sample_meta)

## # A tibble: 6 × 8
##   sample         treatment  patient_id anatomical_location Age Stage PFI CRS
##   <chr>           <fct>       <chr>      <chr>             <dbl> <chr> <dbl> <chr>
## 1 EOC372_treat … treatmen … EOC372     Peritoneum         68 IIIC    460 1
## 2 EOC372_post- … post-NACT  EOC372     Peritoneum         68 IIIC    460 1
## 3 EOC443_post- … post-NACT  EOC443     Omentum            54 IVA     177 3
## 4 EOC443_treat … treatmen … EOC443     Omentum            54 IVA     177 3
## 5 EOC540_treat … treatmen … EOC540     Omentum            62 IIIC    126 2
## 6 EOC540_post- … post-NACT  EOC540     Omentum            62 IIIC    126 2

Score single-sample pathway perturbation

To compute the single-sample fold-changes (i.e. logFC) required for the set of perturbation scores, samples must be ‘matched pairs’ or nested, as discussed earlier, and treatments performed within each patient represents the nesting structure in this dataset. The factor (patient_id) defining the nested structure is passed to weight_ss_fc() using the groupBy parameter.

weightedFC <- weight_ss_fc(
  as.matrix (logCPM), metadata = sample_meta,
  sampleColumn = "sample", groupBy = "patient_id", treatColumn = "treatment"
)
glimpse (weightedFC)

The output of weight_ss_fc() is a list where one element is the matrix of weighted single-sample fold-changes (δghi), with rows corresponding to genes and columns to samples, and the other element is the vector of gene-wise weights (wg) used to calculate the weighted log fold-change (δghi). By default, the string ENTREZID: is added to all row names of the δghi matrix to be compatible with the format Reactome pathway topologies are retrieved in.

The matrix of δghi values is then passed to pathway topologies to compute gene-wise perturbation scores for all genes within a pathway, before being summed into a single score for each pathway. The function raw_gene_pert() returns an initial list, with each element containing the gene-level perturbation scores for a given pathway. These matrices are also able to be used during downstream analysis to identify which genes play the most significant roles in each pathway, as demonstrated in later sections. Pathway-level perturbation scores (Ship) are then returned as a data.frame containing sample and gene-set names after calling pathway_pert(). Pathways with zero perturbation scores across all genes and samples are automatically dropped at this step. In addition, pathways with aggregated pathway-level scores of zero in all samples will also be dropped by default, unless otherwise specified through the drop parameter of the pathway_pert() function.

genePertScore <- raw_gene_pert(weightedFC$weighted_logFC, gsTopology)
ssPertScore <- pathway_pert(genePertScore, weightedFC$weighted_logFC)
ssPertScore %>%
    split(f = .$gs_name) %>%
    keep(~all(. != 0)) %>%
    length()
## [1] 1093
head (ssPertScore)
##             sample         score                           gs_name
## 1 EOC372_post-NACT -2.292688e-04 reactome. Interleukin-6 signaling
## 2 EOC443_post-NACT -2.447003e-04 reactome. Interleukin-6 signaling
## 3 EOC540_post-NACT -1.848758e-04 reactome. Interleukin-6 signaling
## 4   EOC3_post-NACT -1.229489e-04 reactome. Interleukin-6 signaling
## 5  EOC87_post-NACT  3.427132e-05 reactome. Interleukin-6 signaling
## 6 EOC136_post-NACT  2.822155e-04 reactome. Interleukin-6 signaling

Sample permutation for normalisation and significance testing

The range of Ship values obtained from the complete set of pathways will vary greatly, due to the variability in topology structures. To determine the significance of individual scores and transform scores to ensure they are comparable across pathways, sSNAPPY utilises a sample-permutation strategy to estimate the null distributions of perturbation scores and derive Zhip scores. Since sample labels will be permuted randomly to put samples into pseudo-pairs, sample metadata is not required by the generate_permuted_scores function. All possible random pairs between samples will be sampled unless otherwise specified. In this example dataset with a total of 22 samples, the full set of 462 (i.e. P=22×21) permuted scores will be computed for each pathway.

permutedScore <- generate_permuted_scores(
  as.matrix (logCPM),
  gsTopology = gsTopology[names(genePertScore)],
  weight = weightedFC$weight
)

Apart from pathways whose permuted perturbation scores are consistently zero, and which will be dropped by default, the empirical distributions of remaining pathways are expected to be approximately normally distributed with mean μ=0, but with the scale of distributions heavily impacted by both the number of genes within each pathway and the overall topology. To demonstrate this, we randomly selected 6 pathways to demonstrate their quantile-quantile (q-q) plot and visualised the distributions of their permuted perturbation scores as boxplots (Figure 3).

set.seed(123)
random_pathways <- permutedScore %>%
  keep(~all(. != 0)) %>%
  .[sample(seq_along(.), 6)] %>%
  as.data.frame() %>%
  pivot_longer(
    cols = everything(), names_to = "gs_name", values_to = "score"
  ) %>%
  mutate(
    gs_name = str_replace_all(gs_name, "\\.", " "),
    gs_name = str_remove_all(gs_name, "reactome ")
  )
p1 <- random_pathways %>%
  ggplot(aes (sample = score, colour = gs_name)) +
  stat_qq() +
  stat_qq_line(colour = "black") +
  facet_wrap(~str_wrap(gs_name, width = 25), scales= "free") +
  labs(y = "Permuted Perturbation Score", x = "Theoretical Quantiles") +
  theme_bw() +
  theme(
    legend.position = "none",
    text = element_text(size = 14),
    strip.text = element_text(size = 16))
p2 <- random_pathways %>%
  ggplot(aes (gs_name, score, fill = gs_name)) +
  geom_boxplot() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  scale_fill_discrete(name = "Gene-set Name") +
  labs(x = "Pathway", y = "Permuted Perturbation Score") +
  theme_bw() +
  theme(
    legend.position = "none",
    axis.title = element_text(size = 16),
    axis.text = element_text(size = 14)
  )
(p1/p2) +
  plot_annotation(tag_levels = "A") +
  plot_layout(heights = c(0.6, 0.4))

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure3.gif

Figure 3. (A) Q-Q plot and (B) distributions of permuted perturbation scores of six randomly selected pathways.

All sampled empirical distributions are approximately normally distributed with a mean of zero.

The distributions obtained from label permutations are then used to convert each pathway-level score into the robust Zhij-score using the function normalise_by_permu(). Two-sided p-values for individual scores are computed based on how extreme test scores are in comparison to permuted scores for each pathway and corrected for multiple testing using any of the available methods, returning the FDR-adjusted values by default. In our example data, no pathways would be considered as significantly perturbed at the single-sample level using an FDR adjustment with α = 0.05.

normalisedScores <- normalise_by_permu(permutedScore, ssPertScore,
                                       sortBy = "pvalue")
head (normalisedScores)

##               MAD MEDIAN
## 2306 0.0006067519      0
## 2525 0.0002909911      0
## 5869 0.0001652241      0
## 5871 0.0001652241      0
## 5872 0.0001652241      0
## 7721 0.0275863198      0
##                                                           gs_name
## 2306 reactome.Golgi Cisternae Pericentriolar Stack Reorganization
## 2525       reactome.DNA Damage/Telomere Stress Induced Senescence
## 5869                        reactome.Defective CHST6 causes MCDC1
## 5871           reactome.Defective ST3GAL3 causes MCT12 and EIEE15
## 5872       reactome.Defective B4GALT1 causes B4GALT1-CDG (CDG-2d)
## 7721                        reactome.Mitochondrial protein import
##                sample         score   robustZ      pvalue adjPvalue
## 2306 EOC153_post-NACT -0.0013632057 -2.246727 0.004329004 1.0000000
## 2525 EOC153_post-NACT  0.0006149637  2.113342 0.004329004 1.0000000
## 5869 EOC349_post-NACT -0.0003304944 -2.000279 0.004329004 0.9993248
## 5871 EOC349_post-NACT -0.0003304944 -2.000279 0.004329004 0.9993248
## 5872 EOC349_post-NACT -0.0003304944 -2.000279 0.004329004 0.9993248
## 7721 EOC443_post-NACT -0.0598366717 -2.169070 0.004329004 0.9471861

A key question of interest in our example dataset is to identify which biological processes were impacted by chemotherapy across the entire group of patients. Using the sample-level output obtained above, we can explore this by applying t-tests or regression models across all samples. In order to minimise spurious results, Smyth’s moderated t-statistics17 can be applied across the complete dataset, with a constant variance assumed across all pathways, given that we are using Z-scores. To perform this analysis, Zhip values were converted to a matrix and standard limma methodologies were used. For our use case here, where only one treatment group is present, no design matrix is required, and a simple t-test is appropriate.

z_matrix <- normalisedScores %>%
  dplyr::select (robustZ, gs_name, sample) %>%
  pivot_wider(names_from = "sample", values_from = "robustZ") %>%
  column_to_rownames("gs_name") %>%
  as.matrix()
z_fits <- z_matrix %>%
  lmFit(design = rep(1, ncol(.))) %>%
  eBayes()
top_table <- topTable(z_fits, number = Inf) %>%
  as_tibble(rownames = "gs_name")
sigPathway <- dplyr::filter (top_table, adj.P.Val < 0.05)

121 out of the 1094 tested Reactome pathways have an FDR < 0.05 in the moderated t-test, and were considered to be significantly perturbed at the group level. Table 1 presents the top 10 significantly inhibited and activated pathways, along with their predicted direction of change.

table 1 <- sigPathway %>%
  mutate(
    Direction = ifelse (logFC < 0, "Inhibited", "Activated"),
    gs_name = str_remove_all(gs_name, "reactome.")
  ) %>%
  split(f = .$Direction) %>%
  lapply (dplyr::slice, 1:10) %>%
  bind_rows() %>%
  dplyr::select(
    Pathway = gs_name, Change = logFC, P.Value, FDR = adj.P.Val, Direction
  )

Table 1. Significantly impacted Reactome pathways identified among post-chemotherapy samples using sSNAPPY. Only the 10 most significantly inhibited and 10 most significantly activated pathways are shown.

PathwayChangeP.ValueFDRDirection
Signaling by Retinoic Acid0.6014.45E-040.0152Activated
Regulation of pyruvate dehydrogenase (PDH) complex0.5984.59E-040.0152Activated
Pyruvate metabolism0.5984.59E-040.0152Activated
Pyruvate metabolism and Citric Acid (TCA) cycle0.5984.59E-040.0152Activated
Negative regulation of MAPK pathway0.6275.40E-040.0174Activated
Translocation of ZAP-70 to Immunological synapse0.6248.75E-040.0218Activated
Generation of second messenger molecules0.6248.75E-040.0218Activated
MHC class II antigen presentation0.6280.00120.0241Activated
NGF-stimulated transcription0.5820.00140.0247Activated
Downstream TCR signaling0.6330.00140.0247Activated
Interleukin-35 Signalling-0.9021.69E-050.0151Inhibited
Sphingolipid de novo biosynthesis-0.8962.75E-050.0151Inhibited
SUMOylation of DNA replication proteins-0.8196.86E-050.0152Inhibited
Condensation of Prometaphase Chromosomes-0.9049.91E-050.0152Inhibited
Epigenetic regulation of gene expression-0.7901.21E-040.0152Inhibited
Polo-like kinase mediated events-0.8791.43E-040.0152Inhibited
Chromatin modifying enzymes-0.7901.52E-040.0152Inhibited
Chromatin organization-0.7901.52E-040.0152Inhibited
SUMO E3 ligases SUMOylate target proteins-0.7671.83E-040.0152Inhibited
APC-Cdc20 mediated degradation of Nek2A-0.8382.13E-040.0152Inhibited

For enrichment analysis in the original study,21 unsupervised clustering was performed on all cells labelled as cancer cells. Clusters were then annotated manually by performing pathway enrichment testing on cluster marker genes. Two clusters, associated with proliferative DNA repair signatures and stress-related markers, contained significantly higher numbers of post-chemotherapy cells than pre-treatment ones.21 The representative pathways enriched in the stress-associated cluster were IL6-mediated signaling events, TNF signaling pathway, and cellular responses to stress, and the other post-chemotherapy cell dominated cluster in the original study was enriched for pathways associated with cell proliferation and DNA repair, such as the Cell cycle, DNA repair, Homology directed repair (HDR) through homologous recombination, and the Fanconi anaemia pathway.

Whilst the published enrichment analysis was performed using ConsensusPathDB,23 in order to use pathway topologies we chose the Reactome set of pathways.20 sSNAPPY not only detected many significant perturbed pathways that are highly concordant with the pathways reported to be enriched in the original study but also includes an expected direction of change in activity. For example, the DNT repair pathway SUMOylation of DNA damage response and repair proteins pathway was predicted to be significantly inhibited by chemotherapy. The single-sample nature of the sSNAPPY output also provides great flexibility: apart from considering all treated samples as biological replicates, users may elect to perform an analysis incorporating other phenotypic traits which may impact a patient’s responses to chemotherapy, such as disease stages or tumour grades. To perform this step using the moderated t-statistic strategy and extend the above analysis, an appropriate design matrix is the only additional requirement for model-fitting, or alternatively, samples may be subset as may be appropriate.

Visualising perturbed pathways as networks

A valuable feature of sSNAPPY is the provision of several visualisation functions to assist in the presentation and interpretation of results. Biological pathways are not independent of each other with many genes playing a role across multiple pathways, and as such, viewing pathway analysis results as a network can be a powerful way to intuitively summarise the results and facilitate interpretation of the underlying biology. The plot_gs_network() function allows users to easily convert a list of relevant biological pathways to a network where edges between pathway nodes represent overlapping genes. Defined by the colorBy parameter, pathway nodes can be coloured by either the predicted direction of change or by significance levels (Figure 4). The returned plot is a ggplot224 object, meaning that components of the plotting theme and other parameters can be customized as for any other ggplot2 objects.

In the following example, we’ll inspect the 10 most significantly inhibited and 10 most significantly activated pathways, which involved four steps to prepare the data: 1) rename the logFC column to reflect the true meaning of the value and, 2) create a categorical variable with the pathway status, 3) transform p-values for simpler visualisation and 4) obtain a subset of pathways to visualise.

sigPathway <- sigPathway %>%
  dplyr::rename(Z = logFC) %>%
  mutate(
    status = ifelse(Z > 0, "Activated", "Inhibited"),
    status = as.factor(status),
    '-log10(p)' = -log10(P.Value)
  ) %>%
  split(f = .$status) %>%
  lapply(dplyr::slice, 1:10) %>%
  bind_rows()
# Plot the network structure
set.seed(150)
p1 <- plot_gs_network(
  normalisedScores = sigPathway, gsTopology = gsTopology, colorBy = "status",
  gsNameSize = 3
) +
  scale_colour_manual(values = c("red", "blue", "grey30")) +
  theme_void() +
  theme(
    legend.text = element_text(size = 10),
    legend.position = "inside",
    legend.position.inside = c(0.95, 0.95)
  )
set.seed(150)
p2 <- plot_gs_network(
  normalisedScores = sigPathway,
  gsTopology = gsTopology,
  colorBy= "-log10(p)",
  gsNameSize = 3,
  gsLegTitle = expression(paste(-log[10], "p"))
) +
  scale_colour_viridis_c() +
  theme_void() +
  theme(
    legend.text = element_text(size = 8),
    legend.title = element_text(size = 10),
    legend.position = "inside",
    legend.position.inside = c(0.95, 0.95)
  )
p1/p2 + plot_annotation(tag_levels = "A")

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure4.gif

Figure 4. Significantly perturbed Reactome pathways identified among post-chemotherapy samples using sSNAPPY.

Each node represents a significantly perturbed Reactome pathway, with nodes coloured by (A) predicted direction of change and (B) -log10(p). The 10 most significantly inhibited and 10 most significantly activated pathways are shown.

An advantage of visualising pathway analysis results using network structures is that it allows the identification of highly connected pathways (Figure 4). To summarise related pathways and further enable interpretation, we can apply community detection25 to group related pathways into ‘communities’. sSNAPPY’s plot_community() function is a “one-stop shop” for applying a community detection algorithm of the user’s choice to the network structure and annotating identified communities by the most common pathway category, denoting the main biological processes perturbed in that community. The most recent categories for both KEGG and Reactome databases were curated from their respective website (KEGG website & Reactome website) and included as parts of sSNAPPY. Analyses involving other pathway databases may require user-provided pathway categories. When the information about pathway categorisations is available, annotation of pathway communities is automatically completed. In the current dataset, the Louvain method was applied to the network of biological pathways and revealed five primary communities: 1) Adaptive Immune System; 2) Cell Cycle, Mitotic; 3) Chromatin modifying enzymes & Epigenetic regulation of gene expression; 4) Post-translational protein modification and 5) The citric acid (TCA) cycle and respiratory electron transport (Figure 5). The largest community formed was the Adaptive Immune System pathway, indicating a clear immune-signalling aspect to these results.

load(system.file("extdata", "gsAnnotation_df_reactome.rda", package = "sSNAPPY"))
set.seed(199)
plot_community(
    normalisedScores = sigPathway,
    gsTopology = gsTopology,
    gsAnnotation = gsAnnotation_df_reactome,
    colorBy = "status",
    lb_size = 3
) +
  scale_colour_manual(values = c("red", "blue")) +
  scale_fill_viridis_d() +
  scale_x_continuous(expand = expansion(0.2)) +
  scale_y_continuous(expand = expansion(0.2)) +
  guides (fill = "none") +
  theme_void() +
  theme(
      legend.position = "inside",
      legend.position.inside = c(0.9, 0.1)
  )

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure5.gif

Figure 5. Community structures among significantly perturbed Reactome pathways that were identified among post-chemotherapy samples using sSNAPPY.

The Louvain algorithm was applied to detect community structures, revealing biological processes associated with highly ranked pathways. The top 20 which were the most perturbed by chemotherapy are shown.

A key advantage of sSNAPPY is that it does not require the prior identification of differentially expressed genes, as this is a common challenge faced within clinical datasets. However, knowing which genes are implicated in the perturbation of pathways, particularly those which influence multiple pathways, can provide valuable insights for hypothesis generation and the underlying biological mechanisms. Therefore, sSNAPPY provides another visualisation feature called plot_gs2gene, which enables the inclusion of select genes from each pathway using network structures. Users can provide a vector of fold-change estimates to visualise genes within pathways, showing their estimated change in expression. As pathways often include hundreds of genes, it is recommended to filter for genes most likely to be playing a significant role. In this example dataset, only genes within the top 500 when ranking by the magnitude of the mean ssFC were included (Figure 6). An alternative strategy will be to select genes based on test-statistics, however, this decision is up to the individual researcher.

meanFC <- rowMeans(weightedFC$weighted_logFC)/weightedFC$weight
top500 <- rank(1/abs (meanFC)) <= 500

Since Reactome pathway topologies were retrieved using EntrezIDs, users can provide a data.frame mapping EntrezIDs to their chosen identifiers, such as gene names, through the mapEntrezID parameter, in order to make the visualisations more informative. A data.frame converting EntrezIDs to Ensembl gene names was derived from the Ensembl Release 10126 and has been made available as part of the package and serves as a helpful template for future mapping operations by users.

load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY"))
head(entrez2name)

## # A tibble: 6 × 2
##   entrezid           mapTo
##   <chr>              <chr>
## 1 ENTREZID:84771     DDX11L1
## 2 ENTREZID:727856    DDX11L1/DDX11L9/DDX11L10
## 3 ENTREZID:100287102 DDX11L1
## 4 ENTREZID:100287596 DDX11L1/DDX11L9
## 5 ENTREZID:102725121 DDX11L1
## 6 ENTREZID:653635    WASH7P

set.seed(195)
plot_gs2gene(
  normalisedScores = sigPathway,
  gsTopology = gsTopology,
  colorGsBy = "status",
  mapEntrezID = entrez2name,
  geneFC = meanFC[top500],
  layout = "kk",
  edgeAlpha = 1,
  gsNodeSize = 5,
  geneNodeSize = 3,
  gsNameSize = 4,
  geneNameSize = 3.5
) +
  scale_colour_gradient2(name = "Mean ssFC", low = "orange", high = "green4") +
  scale_fill_manual(values = c("red", "blue")) +
  theme_void() +
  theme(
    legend.text = element_text(size = 12),
    legend.title = element_text(size = 12)
  )

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure6.gif

Figure 6. Genes associated with significantly perturbed Reactome pathways that were identified among post-chemotherapy samples using sSNAPPY.

Significantly perturbed Reactome pathways identified among post-chemotherapy samples using sSNAPPY, showing any genes in the top 500 ranked by magnitude of change in expression, and which pathways they are likely contributing to. Only the 10 most significantly inhibited and 10 most significantly activated pathways are shown.

Identifying key gene contributions

To further investigate a specific pathway and elucidate which are the key genes contributing to the final perturbation score, we can generate a heatmap via plot_gene_contribution() which shows the gene-level perturbation scores for the top-ranked members of a given pathway. This function takes advantage of the plotting capabilities of the pheatmap package,27 and as such, other annotations are also able to be easily included, such as patient response, or which general ranges the pathway-level normalised Z-Scores are in. Inclusion of the Z-Scores enabled the assessment of the level of perturbation predicted in each sample and key genes involved (Figure 7).

path_name <- "SUMOylation of DNA replication proteins"
z_breaks <- round(qnorm(c(0, 0.025, 0.05, 0.25, 0.75, 0.95, 0.975, 1)), 2)
annotation_df <- normalisedScores %>%
  dplyr::filter(str_detect(gs_name, path_name)) %>%
  left_join(dplyr::select(sample_meta, sample, CRS), by = "sample") %>%
  mutate(
    'Z Range' = cut(robustZ, breaks = z_breaks, include.lowest = TRUE),
    sample = str_remove_all(sample, "_post-NACT")
  ) %>%
  dplyr::select(sample, 'Z Range', CRS)
z_levels <- levels (annotation_df$'Z Range')
annotation_col <- list(
  CRS = c("3" = "#4B0055", "2" = "#009B95", "1" = "#FDE333"),
  'Z Range' = setNames(
    colorRampPalette(c("navyblue", "white", "darkred"))(length(z_levels)),
    z_levels
  )
)
mat <- genePertScore %>%
  .[[which (str_detect(names(.), path_name))]] %>%
  set_colnames(str_remove_all(colnames(.), "_post-NACT")) %>%
  .[rownames(.) %in% rownames (weightedFC$weighted_logFC),]
max_pert <- max (abs (range (mat))) * 1.01
plot_gene_contribution(
  genePertMatr = mat,
  color = rev (colorspace::divergex_hcl(100, palette = "RdBu")),
  breaks = seq(-max_pert, max_pert, length.out = 100),
  annotation_df =annotation_df,
  annotation_colors = annotation_col,
  filterBy = "mean",
  mapEntrezID = entrez2name,
  cutree_rows = 2,
  cutree_cols = 2,
  main = paste (path_name, "[REACTOME]")
)

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure7.gif

Figure 7. Gene-level perturbation scores for the top 10 genes in "SUMOylation of DNA replication proteins" pathway.

All pathway genes were ranked by average contribution to the perturbation score to select the top 10 genes. Samples were annotated by patient chemotherapy response score (CRS), along with the range for sample-level Z-scores as a guide to sample-specific pathway perturbation. The genes CDCA8, TOP2A, UBE2I, BIRC5 were identified as possible key drivers of inhibition for this pathway.

From this heatmap we can identify four candidate genes which are likely to be making a contribution to the inhibition of the SUMOylation of DNA replication proteins pathway upon chemotherapy, such as CDCA8, TOP2A, UBE2I, BIRC5 (Figure 7). These four genes are all associated with tumour progression and invasiveness and have been studied in the context of ovarian cancer. Both ubiquitin conjugating enzyme E2I (UBE2I) and cell division cycle associated 8 (CDCA8) genes have been identified as oncogenes in numerous cancer types, including ovarian cancer.28,29 Notably, in ovarian cancer, elevated UBE2I expression has been associated with poorer clinical outcomes.30 Similarly, BIRC5 encodes a protein which is also a predictor of inferior ovarian cancer patient outcome.31 Lastly, Topoisomerase IIα (TOP2A), which encodes DNA topoisomerase, has been identified as a gene that promotes the tumorigenesis of HGSOC tumours.32 Aligning with the report by Chekerov et al.33 that expression of TOP2A in ovarian tumour cells decreases as a response to chemotherapy,33 the median single-sample logFC of TOP2A was negative among the HGSOC post-chemotherapy samples included in this study (Figure 8). The other three selected potential driver genes (CDCA8, UBE2I, and BIRC5) also had negative median single-sample logFC in post-chemotherapy samples (Figure 8). Considering the implication of these four genes in ovarian cancer, decreases in their expression after chemotherapy treatment potentially indicate a favorable response to therapy.

By annotating the heatmap of gene-wise perturbation scores with patient chemotherapy response score (CRS), we also noted that the strongest inhibition of the SUMOylation of DNA replication proteins pathway was in the patient with the highest CRS score of 3 (i.e sample EOC443). CRS is an indicator of the relative length of progression-free survival after chemotherapy, where a score of 3 represents the longest survival. Hence inhibition of the SUMOylation of DNA replication proteins pathway might mediate favorable response to chemotherapy in ovarian cancer patients. We acknowledge that our analysis was limited to a small number of patients, which restricts the generalizability of the results. However, despite this limitation, these findings underscore the strength of sSNAPPY as a valuable tool for hypothesis generation not otherwise possible. Not only can sSNAPPY predict directional pathway perturbations, but it also enables the identification of potential driver genes which are strongly associated with these perturbations.

gene2plot <- entrez2name %>%
  dplyr::filter(mapTo %in% c("CDCA8", "TOP2A", "UBE2I", "BIRC5"))
(weightedFC$weighted_logFC/weightedFC$weight) %>%
  as.data.frame() %>%
  rownames_to_column("entrezid") %>%
  pivot_longer(
    cols = -all_of("entrezid"), names_to = "sample", values_to = "ssFC"
  ) %>%
  inner_join(gene2plot, by = "entrezid") %>%
  ggplot(aes(mapTo, ssFC, fill = mapTo)) +
  geom_boxplot() +
  labs(x = "", fill = "Gene") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  theme_bw()

7d855428-3c1f-4fa3-a6c5-ae5e6351b94b_figure8.gif

Figure 8. ssFC of potential key genes driving the inhibition of “SUMOylation of DNA replication proteins” pathway.

Single-sample logFC (ssFC) of potential key genes driving the inhibition of “SUMOylation of DNA replication proteins” pathway as a response to chemotherapy in HGSOC tumours.

Comparison to other pathway enrichment methods

We also performed pathway analysis on this example dataset using three other methods to compare their performance against sSNAPPY. Details on the implementation of those methods and the comparisons performed are available at 10.5281/zenodo.10127829. We initially performed an analysis using SPIA.6 However, SPIA relies on differentially expressed genes and given that only 49 genes were identified in our analysis using conventional differential expression analysis, no pathways were considered to be significant using SPIA. Additionally, we also performed pathway analysis on this example dataset using two non-topological-based approaches: 1) GSEA3 using ranking statistics derived from differential expression analysis and 2) the fast version of rotation gene set testing for linear models (roast)34 fry, neither of which rely on the presence of differentially expressed genes, Importantly, both methods test for signal within genes at either the up-regulated, or down-regulated extremes. Of the 219 gene sets considered associated with down-regulated genes by GSEA, 61 were considered as inhibited using sSNAPPY. Similarly, of the 21 pathways considered as associated with up-regulated genes by GSEA, 5 were considered as activated using sSNAPPY. GSEA produced a further 173 gene-sets not detected by sSNAPPY, whilst sSNAPPY produced an additional 54 pathways of interest.

Analysis using fry yielded 117 pathways associated with down-regulated genes, with sSNAPPY considering 36 of these as inhibited. However, 5 pathways classified as inhibited under fry were considered as activated under sSNAPPY, highlighting that down-regulation of some genes may lead to activation of a pathway, which is vital information not available under fry. Similarly, for the 13 pathways associated with up-regulation under fry, two were considered activated by sSNAPPY, with one considered inhibited. A further 88 pathways were detected as being of interest under fry, without being considered as activated or inhibited by sSNAPPY, with 77 pathways uniquely detected under sSNAPPY.

Discussion

In conclusion, we have presented and provided a demonstration for the R/Bioconductor package sSNAPPY which offers a novel single-sample pathway perturbation testing approach, tailored for heterogeneous tissue samples where experiments are performed using a matched-pair design. In contrast to many common enrichment methods, sSNAPPY uses pathway topology information to compute perturbation scores which indicate the likely impact on the activity of a pathway, by predicting direction of change and enabling deeper characterisation of biological responses. By applying sSNAPPY to a public scRNA-seq data collected before and after HGSOC patients were subjected to chemotherapy, we demonstrated its ability to detect significant pathway perturbations of various interesting biological processes consistent with, and far beyond what was shown in the original study. Whilst initially conceived for bulk-RNA studies, this demonstration has also provided clear applicability to scRNA datasets when using using pseudo-bulk approaches sSNAPPY addresses the limitations of alternative strategies which fail to account for gene-gene interactions encoded by pathway topologies and are unable to predict the direction of pathway activities, nor the cumulative effect of expression change across multiple genes. In addition, the single-sample nature of the method can be utilised to address the increasing demand for personalised medicine. Through identifying shared and divergent responses between individuals, sSNAPPY can provide valuable insights into the heterogeneous responses across clinical samples. Overall, we believe sSNAPPY represents a valuable addition to the existing body of pathway analysis methods.

Ethics and consent

Ethical approval and consent were not required.

Author contributions

WL’s contributions include Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Software, Validation, Visualisation, Writing - Original Draft Preparation, and Writing - Review & Editing. VM was involved with Conceptualization, Methodology and Writing - Review & Editing. WDT contributed to Writing - Review & Editing. SMP’s contributions include Conceptualization, Methodology, Project Administration, Software, Supervision, Writing - Original Draft Preparation, and Writing - Review & Editing.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 13 Jun 2024
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Liu W, Mäkinen VP, Tilley WD and Pederson SM. sSNAPPY: an R/Bioconductor package for single-sample directional pathway perturbation analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:628 (https://doi.org/10.12688/f1000research.149494.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 13 Jun 2024
Views
7
Cite
Reviewer Report 18 Jul 2024
Panagiotis Moulos, Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center ‘Alexander Fleming’, Vari, Greece 
Approved with Reservations
VIEWS 7
The authors present sSNAPPY, a Bioconductor package for the interesting research field of determining expression changes in whole pathways given the topology, instead of only identifying over-represented gene groups with multiple functionalities. They also extend the case by taking into ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Moulos P. Reviewer Report For: sSNAPPY: an R/Bioconductor package for single-sample directional pathway perturbation analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:628 (https://doi.org/10.5256/f1000research.163964.r297693)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
13
Cite
Reviewer Report 04 Jul 2024
Dario Strbenac, The University of Sydney, Sydney, New South Wales, Australia 
Approved with Reservations
VIEWS 13
sSNAPPY in a Bioconductor package for directional pathway analysis. Having been accepted into Bioconductor, the package code and documentation is already known to be of good quality. Mathematical formulas are clear and the variables are all defined. However, the first ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Strbenac D. Reviewer Report For: sSNAPPY: an R/Bioconductor package for single-sample directional pathway perturbation analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:628 (https://doi.org/10.5256/f1000research.163964.r292642)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 13 Jun 2024
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.