sSNAPPY: an R/Bioconductor package for single-sample directional pathway perturbation analysis

Wenjun Liu; Ville-Petteri Mäkinen; Wayne D Tilley; Stephen M Pederson

doi:10.12688/f1000research.149494.1

Software Tool Article

[version 1; peer review: 2 approved with reservations]

Wenjun Liu¹⁻³, Ville-Petteri Mäkinen⁴,⁵, Wayne D Tilley¹, Stephen M Pederson¹,⁶,⁷

PUBLISHED 13 Jun 2024

Author details

¹ Dame Roma Mitchell Cancer Research Laboratories, Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, South Australia, 5000, Australia
² Adelaide Centre for Epigenetics, School of Biomedicine, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, South Australia, 5000, Australia
³ The South Australian Immunogenomics Cancer Institute, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, South Australia, 5000, Australia
⁴ Computational Medicine, Faculty of Medicine, University of Oulu, Oulu, Northern Ostrobothnia, Finland
⁵ Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu, Northern Ostrobothnia, Finland
⁶ Black Ochre Data Labs, Indigenous Genomics, Telethon Kids Institute, Adelaide, South Australia, 5000, Australia
⁷ John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia

Wenjun Liu
Roles: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Ville-Petteri Mäkinen
Roles: Conceptualization, Methodology, Writing – Review & Editing

Wayne D Tilley
Roles: Funding Acquisition, Writing – Review & Editing

Stephen M Pederson
Roles: Conceptualization, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Software, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Bioconductor gateway.

Abstract

A common outcome of analysing RNA-Seq data is the detection of biological pathways with significantly altered activity between the conditions under investigation. Whilst many strategies test for over-representation of genes showing changed expression within pre-defined gene-sets, these analyses typically do not account for gene-gene interactions encoded by pathway topologies, and are not able to directly predict the directional change of pathway activity. To address these issues we have developed sSNAPPY, now available as an R/Bioconductor package, which leverages pathway topology information to compute pathway perturbation scores and predict the direction of change across a set of pathways.

Here, we demonstrate the use of sSNAPPY by applying the method to public scRNA-seq data, derived from ovarian cancer patient tissues collected before and after chemotherapy. Not only were we able to predict the direction of pathway perturbations discussed in the original study, but sSNAPPY was also able to detect significant changes in other biological processes, yielding far greater insight into the response to treatment. sSNAPPY represents a novel pathway analysis strategy that takes pathway topology into consideration to predict impacted biological pathways, both within related samples and across treatment groups. In addition to not relying on differentially expressed genes, the method and associated R package offer important flexibility and provide powerful visualisation tools.

R version: R version 4.3.3 (2024-02-29)

Bioconductor version: 3.18

Package: 1.6.1

Keywords

RNA-Seq, pathway enrichment, R package, topology, KEGG, Reactome, scRNA-seq

Corresponding authors: Wenjun Liu, Stephen M Pederson

Competing interests: No competing interests were disclosed.

Grant information: W.D. Tilley’s research is supported by the National Health and Medical Research Council of Australia (ID 1186647) and the National Breast Cancer Foundation Australia (ID IIRS-23-069). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2024 Liu W et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Liu W, Mäkinen VP, Tilley WD and Pederson SM. sSNAPPY: an R/Bioconductor package for single-sample directional pathway perturbation analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:628 (https://doi.org/10.12688/f1000research.149494.1) First published: 13 Jun 2024, 13:628 (https://doi.org/10.12688/f1000research.149494.1) Latest published: 13 Jun 2024, 13:628 (https://doi.org/10.12688/f1000research.149494.1)

Introduction

Using pathway enrichment analysis to gain biological insights from gene expression data is a pivotal step in the analysis and interpretation of RNA-seq data, for which numerous methods have been developed (reviewed in¹,²). Many existing methods view pathways simply as collections of gene names. These include methods that rely on the detection of differentially expressed genes followed by over-representation analysis (ORA), and methods that score all genes using functional class scoring (FCS), such as Gene Set Enrichment Analysis (GSEA),³ arguably the most widely-used approach. However, databases such as the Kyoto Encyclopaedia of Genes and Genomes (KEGG)⁴ and WikiPathways⁵ capture not only which genes are implicated in a certain biological process but also their interactions, activating or inhibitory roles, and their relative importance within the pathway, all of which are overlooked in ORA- and FCS-based approaches.

To fully utilise this additional information, the latest generation of pathway analysis approaches includes many which are topology-based, such as SPIA,⁶ DEGraph,⁷ NetGSA⁸ and PRS,⁹ as well as others which explicitly model inter-gene correlations.¹⁰ Despite differences in the null hypotheses tested across these approaches, overall they have demonstrated enhanced sensitivity and specificity due to their ability to take gene-gene interconnections into account.¹¹,¹² Nevertheless, most topology-based methods focus only on comparing the activity of pathways between two treatment groups and cannot be used to score individual samples (Figure 1). In heterogeneous data where more than one factor may be influencing observations,¹³ scoring within paired samples may be desirable and may reveal more nuanced insights. To address this gap, we present a Single-Sample directioNAl Pathway Perturbation analYsis methodology called sSNAPPY, available as an R/Bioconductor package. This article defines how sSNAPPY computes changes in gene expression within paired samples and propagates these through gene-set topologies to predict the perturbation in pathway activities within paired samples, before providing summarised results across an entire dataset (Figure 1). The practical usage of the sSNAPPY R/Bioconductor package is illustrated through the analysis of a public scRNA-seq dataset using a pseudo-bulk approach, demonstrating its applicability to both bulk RNA-Seq and scRNA-Seq datasets.

Figure 1. Schematic illustration of the differences between conventional pathway analysis methods and sSNAPPY.

Instead of being limited to treatment-level analyses, sSNAPPY allows the detection of pathway perturbation within individual samples by using sample-specific estimates of fold-change instead of experiment-wide estimates. (Created with BioRender.com).

Methods

Implementation

sSNAPPY is an R package that has been reviewed and published on the open-source bioinformatics software platform Bioconductor, with all source code available via GitHub. The methodology itself is topology-based, designed to compute directional, single-sample pathway perturbation scores for gene expression datasets with matched-pair or nested designs (e.g. samples collected before and after treatment). Common examples of such designs include treated vs control samples within cell-line passages, or multiple treatments applied to tissue samples from a specific donor. This allows for the detection of pathway perturbations within all samples from a treatment group, but also within individual samples, with sSNAPPY providing results in the form of pathway perturbation scores 1) for each set of paired samples, and 2) across all paired samples within an experimental grouping. The only data required to run sSNAPPY is a log-transformed expression matrix (e.g. logCPM) with matching sample metadata describing treatment groups and the nested structure. It is assumed that all pre-processing has been performed beforehand, such as the exclusion of low-signal genes or normalisation to minimise technical artefacts like GC-bias.

The first step performed by sSNAPPY is to estimate sample-specific log fold-changes ($δ_{ghi} = μ_{ghi} - μ_{g0i}$) across all genes $g$, for each treatment $h$, and within each set of nested replicates $i$, by subtracting expression estimates for the baseline samples $μ_{g0i}$ from those in treatment group $h$. It should also be noted that sSNAPPY is applicable to any number of treatment/condition levels and that sample numbers within each treatment group are not required to be balanced.
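As an illustration only, the within-pair subtraction described above can be sketched in base R. The toy matrix, sample names and metadata columns below are hypothetical, and this is not the package's internal code.

```r
# Toy logCPM matrix: 3 genes x 4 samples (2 patients, baseline + treated each).
# All names and values are made up for illustration.
logcpm <- matrix(
  c(5.0, 5.5, 4.0, 4.8,
    2.0, 1.0, 2.5, 2.5,
    7.0, 7.2, 6.9, 7.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"),
                  c("p1_ctrl", "p1_trt", "p2_ctrl", "p2_trt"))
)
meta <- data.frame(
  sample = colnames(logcpm),
  patient = c("p1", "p1", "p2", "p2"),
  treatment = factor(c("ctrl", "trt", "ctrl", "trt"), levels = c("ctrl", "trt"))
)

# The first factor level is treated as the baseline
ref <- levels(meta$treatment)[1]

# Within each patient, subtract the baseline column from every treated column
delta <- do.call(cbind, lapply(split(meta, meta$patient), function(df) {
  base <- logcpm[, df$sample[df$treatment == ref]]
  trt <- df$sample[df$treatment != ref]
  logcpm[, trt, drop = FALSE] - base
}))
delta
```

Each column of `delta` then holds the log fold-changes for one treated sample relative to the baseline sample from the same patient.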

It is well known that in RNA-seq data, genes with lower expression tend to have greater variability in signal and more broadly spread estimates of change.¹⁴ As such, we utilise a gene-level weighting strategy to down-weight fold-change estimates for low-abundance genes prior to passing these estimates to the scoring algorithm. Gene-level weights $w_{g}$ are obtained in a treatment-agnostic manner by fitting a loess curve through the relationship between observed gene-level variance ($σ_{g}^{2}$) and average signal (${\bar{μ}}_{g}$) (Figure 2), and taking the inverse of the loess-predicted variance as the weight $w_{g} = a / f({\bar{μ}}_{g})$, where $f({\bar{μ}}_{g})$ is the predicted value from the loess curve and the constant $a$ ensures $\sum_{g} w_{g} = 1$. We then use these globally-weighted estimates of log fold-change ($δ_{ghi}^{*} = w_{g} δ_{ghi}$) in the calculation of all subsequent pathway perturbation scores.
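A minimal sketch of this weighting step is given below, using `stats::loess()` with its default settings. The mean-variance values are synthetic, and the loess settings used inside the package are not stated here, so treat the defaults as an assumption.

```r
# Synthetic gene-level summaries: variance decreases as mean signal increases.
# These values are made up; in practice they would come from the logCPM matrix.
gene_mean <- seq(0, 10, length.out = 50)
gene_var <- 0.2 + 2 * exp(-gene_mean / 3) + 0.05 * sin(5 * gene_mean)

# Fit a loess curve through the mean-variance relationship
fit <- loess(gene_var ~ gene_mean)
pred_var <- predict(fit)   # f(mean_g): fitted variance at each observed mean

# Inverse of the predicted variance, rescaled so that the weights sum to 1
w <- (1 / pred_var) / sum(1 / pred_var)

# Low-abundance genes receive smaller weights than high-abundance genes
range(w)
```

Multiplying each gene's single-sample log fold-change by its weight then gives the weighted estimates used for scoring.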

Figure 2. Gene-wise standard deviations shown against mean logCPM.

The relationship between standard deviations and expression levels is modelled by a loess fit. Whilst standard deviations are shown for the purposes of visualisation, gene-level weights are calculated using variances at this stage of the sSNAPPY algorithm.

sSNAPPY extends the topology-based scoring algorithm initially proposed in SPIA⁶ which propagates fold-change estimates from genes considered as differentially expressed through pathway topologies, to compute a perturbation score for each pathway. In contrast to SPIA, which relies on a defined set of differentially expressed genes, sSNAPPY uses fold-change estimates from all detected genes. By modifying the algorithm to incorporate single-sample, weighted estimates of fold-change, we are able to numerically represent changes in a pathway within a given sample, and subsequently model these across all samples within a treatment group. Thus, we define the single-sample perturbation score ( $S_{hip}$ ) for a given pathway $p$ and treatment $h$ for a set of nested samples $i$ :

$$\begin{array}{l} S_{hip} = \sum_{g \in G_{p}} \left[ S_{ghip} - δ_{ghi}^{*} \right], \text{ where} \\ S_{ghip} = δ_{ghi}^{*} + \sum_{g' \in U_{gp}} β_{gg' p} \frac{S_{g' hip}}{N_{g' p}} \end{array}$$

and:

• $G_{p}$ represents the set of genes in pathway $p$ , such that $g \in G_{p}$
• $S_{ghip}$ is the gene-, treatment- and sample-specific perturbation score for pathway $p$
• $δ_{ghi}^{*} = w_{g} δ_{ghi}$ is the weighted log fold-change of gene $g$ as described above
• $U_{gp}$ is the subset of $G_{p}$ containing only the genes directly upstream of gene $g$ , and not including gene $g$
• $β_{gg' p}$ is the pair-wise gene-gene interaction⁶ encoded by the topology matrix for genes $g$ and $g' \in U_{gp}$
• $N_{gp}$ is the number of downstream genes from any gene $g$
• $S_{hip}$ is the accumulated pathway perturbation score for pathway $p$ in treatment $h$ within sample $i$ across all genes in the pathway
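One way to read the recursive definition above is as a linear system, $S = δ^{*} + BS$ with $B_{gg'} = β_{gg' p} / N_{g' p}$, which can be solved directly. The sketch below does this for a made-up three-gene chain. The matrix orientation (rows as targets, columns as upstream genes) and all values are assumptions of this example, and whether the package solves the system in exactly this way is not stated here.

```r
genes <- c("A", "B", "C")

# beta[g, g2]: signed effect of upstream gene g2 (column) on gene g (row)
beta <- matrix(0, 3, 3, dimnames = list(genes, genes))
beta["B", "A"] <- 1    # A activates B
beta["C", "B"] <- -1   # B inhibits C

# N: number of genes directly downstream of each gene (non-zero entries per column)
n_down <- colSums(beta != 0)

# Divide each column by its downstream count; genes without targets are left as zero
B_mat <- sweep(beta, 2, pmax(n_down, 1), "/")

# Weighted single-sample log fold-changes (made-up values)
delta <- c(A = 1, B = 0.5, C = 0.2)

# S = delta + B_mat %*% S  is equivalent to  (I - B_mat) %*% S = delta
S_gene <- solve(diag(3) - B_mat, delta)

# Pathway-level score: accumulated perturbation beyond each gene's own fold-change
S_pathway <- sum(S_gene - delta)
S_pathway
```

In this toy chain the gene-level scores are 1, 1.5 and -1.3, so the accumulated pathway score is -0.5, a net inhibition driven by the inhibitory edge from B to C.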

To scale single-sample pathway perturbation scores ( $S_{hip}$ ) so they are comparable across pathways, and to test for significance of individual scores, null distributions of perturbation scores for each pathway are generated through a sample permutation strategy, which retains any existing correlation structures between genes within a pathway. During permutation, all sample labels are randomly shuffled and permuted pseudo-pairs formed from the re-shuffled labels. Single-sample fold-changes are then calculated for each pseudo-pair of permuted samples while the rest of the scoring algorithm remains unchanged. The median and median absolute deviation (MAD) are calculated from the set of permuted perturbation scores within each pathway and used to normalise the raw perturbation scores to robust $Z$ -scores, i.e. $Z_{hip}$ .
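The permutation and normalisation steps can be sketched as follows. The sample names and scores are hypothetical, and `constant = 1` is used so that `mad()` returns the raw median absolute deviation. Whether the package applies a scaling constant is an assumption left open here.

```r
# Hypothetical sample names; every ordered pair of distinct samples is a pseudo-pair
samples <- c("s1", "s2", "s3", "s4")
pairs <- expand.grid(ref = samples, trt = samples, stringsAsFactors = FALSE)
pairs <- pairs[pairs$ref != pairs$trt, ]
nrow(pairs)   # I * (I - 1) pseudo-pairs

# Made-up permuted scores for one pathway, plus one observed score
perm_scores <- c(-0.8, -0.3, -0.1, 0, 0.1, 0.2, 0.4, 0.9)
observed <- 1.2

# Robust Z-score: centre on the median and scale by the MAD of the permuted scores
med <- median(perm_scores)
mad_raw <- mad(perm_scores, center = med, constant = 1)
robust_z <- (observed - med) / mad_raw
robust_z
```

Using the median and MAD rather than the mean and standard deviation keeps the scaling stable when a few permuted scores are extreme.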

All possible permuted pseudo-pairs are sampled unless otherwise specified, such that in an experiment with $I$ total samples, the maximum number of unique permuted pairs is $P = \frac{I!}{(I - 2)!} = I \times (I - 1)$. Permutation p-values are calculated for each $S_{hip}$ value, indicating the approximate significance of pathway perturbation at the single-sample level. Relying on symmetry around zero, these are derived as the proportion of permuted scores within each pathway whose absolute values are as extreme as, or more extreme than, the absolute value of the test perturbation score.¹⁵ Since the smallest achievable permutation p-value is $1/P$, accurate estimation of small p-values requires a large number of permutations, which is only feasible in data with a large sample size. As a guideline, GSEA recommends a minimum of 7 samples in each treatment group for utilizing their phenotype permutation approach.¹⁶ Under sSNAPPY, a two-group design of this size (14 samples in total) would yield $P = 182$ unique permuted pseudo-pairs, so that permutation p-values are bounded below by $1/182 \approx 0.0055$ for each individual sample and pathway.
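The arithmetic in this paragraph can be checked with a few lines of R. The helper names `n_pairs()` and `perm_p()` are hypothetical and written only for this illustration.

```r
# Maximum number of unique ordered pseudo-pairs for a given number of samples
n_pairs <- function(n_samples) n_samples * (n_samples - 1)

n_pairs(14)       # two groups of 7 samples
1 / n_pairs(14)   # smallest achievable permutation p-value for that design
n_pairs(22)       # a design with 22 samples in total

# Two-sided permutation p-value: proportion of permuted scores at least as
# extreme (in absolute value) as the observed score
perm_p <- function(observed, permuted) mean(abs(permuted) >= abs(observed))

perm_p(1.2, c(-1.5, -0.3, 0.2, 1.0))   # one of four permuted scores is as extreme
```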

Apart from assessing whether a pathway’s activity is changed significantly within an individual sample, an important question may be the detection of changes in pathway activity across all samples within a treatment. This can be performed by modelling $Z_{hip}$ values using regression models and incorporating Smyth’s moderated t-statistic¹⁷ as implemented in limma.¹⁸ The single-sample nature of sSNAPPY’s pathway perturbation scores is particularly helpful for datasets with complex experimental designs or known confounding factors as these can also be incorporated into the final regression models.

The Bioconductor package graphite¹⁹ provides functions that can be used to retrieve pathway topologies from a database and convert topology information to adjacency matrices. To streamline this process, we have implemented a convenience function, where users only need to provide the name of the desired database and species to retrieve all topology information in the format required by the scoring algorithm with the correct type of gene identifiers (i.e. EntrezID). Importantly, we noted that graphite¹⁹ retrieves topology matrices in varied orientations from different databases: in KEGG pathways, columns represent downstream relationships, whereas in Reactome and WikiPathways, columns indicate upstream regulation. However, this discrepancy is not accounted for by the runSPIA() function of graphite. We resolve the issue by transposing the topology matrices retrieved from Reactome and WikiPathways as part of our convenience function.
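To make the orientation issue concrete, the toy example below stores a single edge (gene A acting on gene B) in both orientations. The gene names and matrices are hypothetical and only illustrate why a transpose reconciles the two conventions.

```r
genes <- c("A", "B")

# Orientation 1: rows are upstream genes, columns are their downstream targets
down_in_cols <- matrix(
  c(0, 1,
    0, 0),
  nrow = 2, byrow = TRUE, dimnames = list(genes, genes)
)

# Orientation 2: the same edge stored with upstream genes in the columns
up_in_cols <- t(down_in_cols)

down_in_cols["A", "B"]   # the edge A -> B, read as row -> column
up_in_cols["B", "A"]     # the same edge, read as column -> row
```

Passing a matrix in the wrong orientation to a propagation algorithm would reverse the direction of every edge, which is why a consistent convention matters before scoring.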

Operation

The package has been tested on all operating systems, requiring R > 4.3.0, and can be installed using BiocManager as follows.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("sSNAPPY")

Use cases

Required packages

The set of additional required packages for this workflow can be installed using

pkg <- c(
  "tidyverse", "patchwork", "kableExtra", "AnnotationHub", "edgeR", "here",
  "pander", "colorspace", "SPIA", "fgsea"
)
BiocManager::install(pkg, update = FALSE, force = FALSE, ask = FALSE)

Once installed, the complete set of packages can be loaded into the session.

library(sSNAPPY)
library(tidyverse)
library(magrittr)
library(patchwork)
library(kableExtra)
library(AnnotationHub)
library(edgeR)
library(pander)
library(colorspace)
library(SPIA)
library(fgsea)

Retrieval of pathway topology matrices

Retrieving pathway topology information from a chosen database is the next key step required during preparation for running sSNAPPY, and this is the only step requiring internet access. If running on an HPC cluster where internet access may be restricted, this step can be performed separately with topologies saved by the user as an RDS object. Using the Reactome database²⁰ in this workflow, the retrieved topology information will be stored as a list where each element corresponds to a pathway and the numbers in the matrices encode gene-gene interactions.

gsTopology <- retrieve_topology(database = "reactome", species = "hsapiens")

In addition to downloading topology matrices for all pathways, it is also possible to provide a restricted set of keywords for a targeted analysis. For example, passing the argument keyword = c("metabolism", "estrogen") would only return the subset of pathways which match either of these keywords. Multiple databases can also be searched by passing a vector of database names to the database argument.
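For the offline scenario mentioned earlier in this section, the retrieved list can be cached with base R's `saveRDS()` and `readRDS()`. The sketch below uses a small stand-in list in place of the real topology object, and the file path is hypothetical.

```r
# Stand-in for the list returned by the retrieval step; contents are made up
ids <- c("ENTREZID:1", "ENTREZID:2")
gs_topology_toy <- list(
  "reactome.Toy pathway" = matrix(c(0, 1, 0, 0), 2, 2, dimnames = list(ids, ids))
)

# Save once on a machine with internet access ...
rds_path <- file.path(tempdir(), "gsTopology.rds")
saveRDS(gs_topology_toy, rds_path)

# ... and reload later on a node without internet access
restored <- readRDS(rds_path)
identical(restored, gs_topology_toy)
```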

Data

To demonstrate the application of sSNAPPY, we used pre-processed counts from a publicly available scRNA-seq dataset, retrieved from Gene Expression Omnibus (GEO) with accession code GSE165897. This dataset consists of 11 high-grade serous ovarian cancer (HGSOC) patient samples taken before and after chemotherapy.²¹ sSNAPPY was used to re-analyse data from the epithelial cells as they were the primary focus of the original study. Since sSNAPPY was designed primarily for bulk RNA-seq data, counts from epithelial cells within the same samples were first summed into pseudo-bulk profiles, giving rise to a total of 22 samples. We considered a gene detectable if we observed >1.5 counts per million in $\geq$ 11 of the 22 samples, ideally representing all samples from a complete treatment group. Of the 32,847 annotated genes, 11,101 (33.8%) passed these selection criteria and were included for downstream analyses. Conditional quantile normalisation²² was then applied to mitigate potential biases introduced by gene length and GC content. The normalised logCPM matrix of the processed dataset and sample metadata can be downloaded from here.
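The detection filter can be sketched in base R as below. The toy counts are made up, and the sample threshold is scaled down to 2 of 4 samples. In practice a function such as `edgeR::cpm()` could be used for the CPM calculation instead of the manual version shown here.

```r
# Toy pseudo-bulk counts: 3 genes x 4 samples, each library summing to 1e6
counts <- matrix(
  c(2, 2, 0, 0,
    1, 1, 1, 5,
    999997, 999997, 999999, 999995),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:4))
)

# Counts per million: divide each column by its library size
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Keep genes with > 1.5 CPM in at least 2 of the 4 samples
keep <- rowSums(cpm > 1.5) >= 2
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)
```

Here g2 exceeds the CPM threshold in only one sample and is removed, while g1 and g3 are retained.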

To begin running the sSNAPPY workflow, we first load our expression matrix and define our sample-level metadata. Importantly, the row names of the expression matrix must be specified as EntrezGene IDs, for compatibility with pathway databases. Genes without EntrezGene IDs were excluded during pre-processing, leaving 10,098 genes in the example expression matrix. The treatment column within our metadata is expected to be a factor, with the reference level interpreted as the control treatment.

readr::local_edition(1)
logCPM <- read_tsv(here::here("data/logCPM.tsv")) %>%
  column_to_rownames("entrezid")
sample_meta <- read_tsv(here::here("data/sample_meta.tsv"), col_types = "cfccncnc")
head(sample_meta)

## # A tibble: 6 × 8
##   sample         treatment  patient_id anatomical_location   Age Stage   PFI CRS
##   <chr>          <fct>      <chr>      <chr>               <dbl> <chr> <dbl> <chr>
## 1 EOC372_treat … treatmen … EOC372     Peritoneum             68 IIIC    460 1
## 2 EOC372_post- … post-NACT  EOC372     Peritoneum             68 IIIC    460 1
## 3 EOC443_post- … post-NACT  EOC443     Omentum                54 IVA     177 3
## 4 EOC443_treat … treatmen … EOC443     Omentum                54 IVA     177 3
## 5 EOC540_treat … treatmen … EOC540     Omentum                62 IIIC    126 2
## 6 EOC540_post- … post-NACT  EOC540     Omentum                62 IIIC    126 2
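Because the reference level of the treatment factor is interpreted as the control, it can be worth checking the level order before scoring. The sketch below uses hypothetical treatment labels and base R's `relevel()` to set the baseline explicitly.

```r
# Hypothetical metadata; the labels "baseline" and "after" are made up
meta <- data.frame(
  sample = c("a_base", "a_after", "b_base", "b_after"),
  treatment = factor(c("baseline", "after", "baseline", "after"))
)

# factor() orders levels alphabetically, so "after" would be the reference here
levels(meta$treatment)

# Set the control condition as the reference level explicitly
meta$treatment <- relevel(meta$treatment, ref = "baseline")
levels(meta$treatment)
```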

Score single-sample pathway perturbation

To compute the single-sample fold-changes (i.e. logFC) required for the set of perturbation scores, samples must be ‘matched pairs’ or nested, as discussed earlier. In this dataset, the nesting structure is defined by patient, with samples collected before and after treatment from each patient. The factor (patient_id) defining the nested structure is passed to weight_ss_fc() using the groupBy parameter.

weightedFC <- weight_ss_fc(
  as.matrix(logCPM), metadata = sample_meta,
  sampleColumn = "sample", groupBy = "patient_id", treatColumn = "treatment"
)
glimpse(weightedFC)

The output of weight_ss_fc() is a list where one element is the matrix of weighted single-sample fold-changes ( $δ_{ghi}^{*}$ ), with rows corresponding to genes and columns to samples, and the other element is the vector of gene-wise weights ( $w_{g}$ ) used to calculate the weighted log fold-change ( $δ_{ghi}^{*}$ ). By default, the string ENTREZID: is added to all row names of the $δ_{ghi}^{*}$ matrix to be compatible with the format Reactome pathway topologies are retrieved in.

The matrix of $δ_{ghi}^{*}$ values is then passed to pathway topologies to compute gene-wise perturbation scores for all genes within a pathway, before being summed into a single score for each pathway. The function raw_gene_pert() returns an initial list, with each element containing the gene-level perturbation scores for a given pathway. These matrices can also be used during downstream analysis to identify which genes play the most significant roles in each pathway, as demonstrated in later sections. Pathway-level perturbation scores ($S_{hip}$) are then returned as a data.frame containing sample and gene-set names after calling pathway_pert(). Pathways with zero perturbation scores across all genes and samples are automatically dropped at this step. In addition, pathways with aggregated pathway-level scores of zero in all samples will also be dropped by default, unless otherwise specified through the drop parameter of the pathway_pert() function.

genePertScore <- raw_gene_pert(weightedFC$weighted_logFC, gsTopology)
ssPertScore <- pathway_pert(genePertScore, weightedFC$weighted_logFC)
ssPertScore %>%
    split(f = .$gs_name) %>%
    keep(~all(. != 0)) %>%
    length()
## [1] 1093
head(ssPertScore)
##             sample         score                          gs_name
## 1 EOC372_post-NACT -2.292688e-04 reactome.Interleukin-6 signaling
## 2 EOC443_post-NACT -2.447003e-04 reactome.Interleukin-6 signaling
## 3 EOC540_post-NACT -1.848758e-04 reactome.Interleukin-6 signaling
## 4   EOC3_post-NACT -1.229489e-04 reactome.Interleukin-6 signaling
## 5  EOC87_post-NACT  3.427132e-05 reactome.Interleukin-6 signaling
## 6 EOC136_post-NACT  2.822155e-04 reactome.Interleukin-6 signaling

Sample permutation for normalisation and significance testing

The range of $S_{hip}$ values obtained from the complete set of pathways will vary greatly, due to the variability in topology structures. To determine the significance of individual scores and transform scores to ensure they are comparable across pathways, sSNAPPY utilises a sample-permutation strategy to estimate the null distributions of perturbation scores and derive $Z_{hip}$ scores. Since sample labels will be permuted randomly to put samples into pseudo-pairs, sample metadata is not required by the generate_permuted_scores function. All possible random pairs between samples will be sampled unless otherwise specified. In this example dataset with a total of 22 samples, the full set of 462 (i.e. $P = 22 \times 21$ ) permuted scores will be computed for each pathway.

permutedScore <- generate_permuted_scores(
  as.matrix (logCPM),
  gsTopology = gsTopology[names(genePertScore)],
  weight = weightedFC$weight
)

Apart from pathways whose permuted perturbation scores are consistently zero, which are dropped by default, the empirical distributions of the remaining pathways are expected to be approximately normal with mean $μ = 0$, but with scales heavily influenced by both the number of genes within each pathway and the overall topology. To illustrate this, we randomly selected six pathways and visualised their permuted perturbation scores as quantile-quantile (Q-Q) plots and boxplots (Figure 3).

set.seed(123)
random_pathways <- permutedScore %>%
  keep(~all(. != 0)) %>%
  .[sample(seq_along(.), 6)] %>%
  as.data.frame() %>%
  pivot_longer(
    cols = everything(), names_to = "gs_name", values_to = "score"
  ) %>%
  mutate(
    gs_name = str_replace_all(gs_name, "\\.", " "),
    gs_name = str_remove_all(gs_name, "reactome ")
  )
p1 <- random_pathways %>%
  ggplot(aes (sample = score, colour = gs_name)) +
  stat_qq() +
  stat_qq_line(colour = "black") +
  facet_wrap(~str_wrap(gs_name, width = 25), scales= "free") +
  labs(y = "Permuted Perturbation Score", x = "Theoretical Quantiles") +
  theme_bw() +
  theme(
    legend.position = "none",
    text = element_text(size = 14),
    strip.text = element_text(size = 16))
p2 <- random_pathways %>%
  ggplot(aes (gs_name, score, fill = gs_name)) +
  geom_boxplot() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  scale_fill_discrete(name = "Gene-set Name") +
  labs(x = "Pathway", y = "Permuted Perturbation Score") +
  theme_bw() +
  theme(
    legend.position = "none",
    axis.title = element_text(size = 16),
    axis.text = element_text(size = 14)
  )
(p1/p2) +
  plot_annotation(tag_levels = "A") +
  plot_layout(heights = c(0.6, 0.4))

Figure 3. (A) Q-Q plot and (B) distributions of permuted perturbation scores of six randomly selected pathways.

All sampled empirical distributions are approximately normally distributed with a mean of zero.

The distributions obtained from label permutations are then used to convert each pathway-level score into the robust $Z_{hip}$-score using the function normalise_by_permu(). Two-sided p-values for individual scores are computed based on how extreme test scores are in comparison to permuted scores for each pathway and corrected for multiple testing using any of the available methods, returning the FDR-adjusted values by default. In our example data, no pathways would be considered as significantly perturbed at the single-sample level using an FDR adjustment with $α$ = 0.05.

normalisedScores <- normalise_by_permu(permutedScore, ssPertScore,
                                       sortBy = "pvalue")
head (normalisedScores)

##               MAD MEDIAN
## 2306 0.0006067519      0
## 2525 0.0002909911      0
## 5869 0.0001652241      0
## 5871 0.0001652241      0
## 5872 0.0001652241      0
## 7721 0.0275863198      0
##                                                           gs_name
## 2306 reactome.Golgi Cisternae Pericentriolar Stack Reorganization
## 2525       reactome.DNA Damage/Telomere Stress Induced Senescence
## 5869                        reactome.Defective CHST6 causes MCDC1
## 5871           reactome.Defective ST3GAL3 causes MCT12 and EIEE15
## 5872       reactome.Defective B4GALT1 causes B4GALT1-CDG (CDG-2d)
## 7721                        reactome.Mitochondrial protein import
##                sample         score   robustZ      pvalue adjPvalue
## 2306 EOC153_post-NACT -0.0013632057 -2.246727 0.004329004 1.0000000
## 2525 EOC153_post-NACT  0.0006149637  2.113342 0.004329004 1.0000000
## 5869 EOC349_post-NACT -0.0003304944 -2.000279 0.004329004 0.9993248
## 5871 EOC349_post-NACT -0.0003304944 -2.000279 0.004329004 0.9993248
## 5872 EOC349_post-NACT -0.0003304944 -2.000279 0.004329004 0.9993248
## 7721 EOC443_post-NACT -0.0598366717 -2.169070 0.004329004 0.9471861
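As a quick consistency check on the printed output, the robustZ column can be reproduced from the score, MEDIAN and MAD columns of the first row. The values below are copied from the output above. The observation that the printed p-value equals 2/462 is purely numerical and is not a statement about how the package counts permuted scores.

```r
# Values copied from the first row of the printed output
score <- -0.0013632057
mad_val <- 0.0006067519
med_val <- 0

# Robust Z-score: (score - median) / MAD
z_check <- (score - med_val) / mad_val
z_check

# The printed p-value for these rows is numerically equal to 2 / (22 * 21)
p_check <- 2 / (22 * 21)
p_check
```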

A key question of interest in our example dataset is to identify which biological processes were impacted by chemotherapy across the entire group of patients. Using the sample-level output obtained above, we can explore this by applying t-tests or regression models across all samples. In order to minimise spurious results, Smyth’s moderated t-statistics¹⁷ can be applied across the complete dataset, with a constant variance assumed across all pathways, given that we are using Z-scores. To perform this analysis, $Z_{hip}$ values were converted to a matrix and standard limma methodologies were used. For our use case here, where only one treatment group is present, no design matrix is required, and a simple t-test is appropriate.

z_matrix <- normalisedScores %>%
  dplyr::select (robustZ, gs_name, sample) %>%
  pivot_wider(names_from = "sample", values_from = "robustZ") %>%
  column_to_rownames("gs_name") %>%
  as.matrix()
z_fits <- z_matrix %>%
  lmFit(design = rep(1, ncol(.))) %>%
  eBayes()
top_table <- topTable(z_fits, number = Inf) %>%
  as_tibble(rownames = "gs_name")
sigPathway <- dplyr::filter (top_table, adj.P.Val < 0.05)

121 out of the 1094 tested Reactome pathways have an FDR < 0.05 in the moderated t-test, and were considered to be significantly perturbed at the group level. Table 1 presents the top 10 significantly inhibited and activated pathways, along with their predicted direction of change.

table1 <- sigPathway %>%
  mutate(
    Direction = ifelse(logFC < 0, "Inhibited", "Activated"),
    # Escape the dot so that only the literal "reactome." prefix is removed
    gs_name = str_remove_all(gs_name, "^reactome\\.")
  ) %>%
  split(f = .$Direction) %>%
  # Keep the first 10 rows within each direction
  lapply(dplyr::slice, 1:10) %>%
  bind_rows() %>%
  dplyr::select(
    Pathway = gs_name, Change = logFC, P.Value, FDR = adj.P.Val, Direction
  )

Table 1. Significantly impacted Reactome pathways identified among post-chemotherapy samples using sSNAPPY. Only the 10 most significantly inhibited and 10 most significantly activated pathways are shown.

Pathway	Change	P.Value	FDR	Direction
Signaling by Retinoic Acid	0.601	4.45E-04	0.0152	Activated
Regulation of pyruvate dehydrogenase (PDH) complex	0.598	4.59E-04	0.0152	Activated
Pyruvate metabolism	0.598	4.59E-04	0.0152	Activated
Pyruvate metabolism and Citric Acid (TCA) cycle	0.598	4.59E-04	0.0152	Activated
Negative regulation of MAPK pathway	0.627	5.40E-04	0.0174	Activated
Translocation of ZAP-70 to Immunological synapse	0.624	8.75E-04	0.0218	Activated
Generation of second messenger molecules	0.624	8.75E-04	0.0218	Activated
MHC class II antigen presentation	0.628	0.0012	0.0241	Activated
NGF-stimulated transcription	0.582	0.0014	0.0247	Activated
Downstream TCR signaling	0.633	0.0014	0.0247	Activated
Interleukin-35 Signalling	-0.902	1.69E-05	0.0151	Inhibited
Sphingolipid de novo biosynthesis	-0.896	2.75E-05	0.0151	Inhibited
SUMOylation of DNA replication proteins	-0.819	6.86E-05	0.0152	Inhibited
Condensation of Prometaphase Chromosomes	-0.904	9.91E-05	0.0152	Inhibited
Epigenetic regulation of gene expression	-0.790	1.21E-04	0.0152	Inhibited
Polo-like kinase mediated events	-0.879	1.43E-04	0.0152	Inhibited
Chromatin modifying enzymes	-0.790	1.52E-04	0.0152	Inhibited
Chromatin organization	-0.790	1.52E-04	0.0152	Inhibited
SUMO E3 ligases SUMOylate target proteins	-0.767	1.83E-04	0.0152	Inhibited
APC-Cdc20 mediated degradation of Nek2A	-0.838	2.13E-04	0.0152	Inhibited

For enrichment analysis in the original study,²¹ unsupervised clustering was performed on all cells labelled as cancer cells. Clusters were then annotated manually by performing pathway enrichment testing on cluster marker genes. Two clusters, associated with proliferative DNA repair signatures and stress-related markers respectively, contained significantly higher numbers of post-chemotherapy cells than pre-treatment ones.²¹ The representative pathways enriched in the stress-associated cluster were IL6-mediated signaling events, TNF signaling pathway, and cellular responses to stress. The other cluster dominated by post-chemotherapy cells was enriched for pathways associated with cell proliferation and DNA repair, such as the Cell cycle, DNA repair, Homology directed repair (HDR) through homologous recombination, and the Fanconi anaemia pathway.

Whilst the published enrichment analysis was performed using ConsensusPathDB,²³ in order to use pathway topologies we chose the Reactome set of pathways.²⁰ sSNAPPY not only detected many significant perturbed pathways that are highly concordant with the pathways reported to be enriched in the original study but also includes an expected direction of change in activity. For example, the DNT repair pathway SUMOylation of DNA damage response and repair proteins pathway was predicted to be significantly inhibited by chemotherapy. The single-sample nature of the sSNAPPY output also provides great flexibility: apart from considering all treated samples as biological replicates, users may elect to perform an analysis incorporating other phenotypic traits which may impact a patient’s responses to chemotherapy, such as disease stages or tumour grades. To perform this step using the moderated t-statistic strategy and extend the above analysis, an appropriate design matrix is the only additional requirement for model-fitting, or alternatively, samples may be subset as may be appropriate.
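
The design-matrix extension described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis performed in this article: the score matrix is simulated, the `stage` factor is a hypothetical phenotype, and the only functions used are `model.matrix()` from base R and `lmFit()`, `eBayes()` and `topTable()` from limma.

```r
library(limma)

set.seed(42)
# Hypothetical normalised pathway scores: 50 pathways (rows) x 8 treated samples (columns)
scores <- matrix(
  rnorm(50 * 8), nrow = 50,
  dimnames = list(paste0("pathway_", 1:50), paste0("sample_", 1:8))
)
# Hypothetical phenotype: disease stage recorded for each sample
stage <- factor(rep(c("III", "IV"), each = 4))

# One coefficient per stage, so each estimates the mean perturbation within that stage
design <- model.matrix(~ 0 + stage)
colnames(design) <- levels(stage)

# Moderated t-statistics for each pathway and each stage
fit <- eBayes(lmFit(scores, design))
# Pathways ranked by evidence of perturbation among the stage IV samples
topTable(fit, coef = "IV", number = 5)
```

With this parameterisation each coefficient is tested against zero, so a pathway can be called perturbed within one stage only; a contrast between stages would be a further step not shown here.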

Visualising perturbed pathways as networks

A valuable feature of sSNAPPY is the provision of several visualisation functions to assist in the presentation and interpretation of results. Biological pathways are not independent of each other with many genes playing a role across multiple pathways, and as such, viewing pathway analysis results as a network can be a powerful way to intuitively summarise the results and facilitate interpretation of the underlying biology. The plot_gs_network() function allows users to easily convert a list of relevant biological pathways to a network where edges between pathway nodes represent overlapping genes. Defined by the colorBy parameter, pathway nodes can be coloured by either the predicted direction of change or by significance levels (Figure 4). The returned plot is a ggplot2²⁴ object, meaning that components of the plotting theme and other parameters can be customized as for any other ggplot2 objects.

In the following example, we inspect the 10 most significantly inhibited and 10 most significantly activated pathways. Preparing the data involves four steps: 1) rename the logFC column to reflect the true meaning of the value; 2) create a categorical variable with the pathway status; 3) transform p-values for simpler visualisation; and 4) obtain a subset of pathways to visualise.

sigPathway <- sigPathway %>%
  dplyr::rename(Z = logFC) %>%
  mutate(
    status = ifelse(Z > 0, "Activated", "Inhibited"),
    status = as.factor(status),
    '-log10(p)' = -log10(P.Value)
  ) %>%
  split(f = .$status) %>%
  lapply(dplyr::slice, 1:10) %>%
  bind_rows()
# Plot the network structure
set.seed(150)
p1 <- plot_gs_network(
  normalisedScores = sigPathway, gsTopology = gsTopology, colorBy = "status",
  gsNameSize = 3
) +
  scale_colour_manual(values = c("red", "blue", "grey30")) +
  theme_void() +
  theme(
    legend.text = element_text(size = 10),
    legend.position = "inside",
    legend.position.inside = c(0.95, 0.95)
  )
set.seed(150)
p2 <- plot_gs_network(
  normalisedScores = sigPathway,
  gsTopology = gsTopology,
  colorBy = "-log10(p)",
  gsNameSize = 3,
  gsLegTitle = expression(paste(-log[10], "p"))
) +
  scale_colour_viridis_c() +
  theme_void() +
  theme(
    legend.text = element_text(size = 8),
    legend.title = element_text(size = 10),
    legend.position = "inside",
    legend.position.inside = c(0.95, 0.95)
  )
p1/p2 + plot_annotation(tag_levels = "A")

Figure 4. Significantly perturbed Reactome pathways identified among post-chemotherapy samples using sSNAPPY.

Each node represents a significantly perturbed Reactome pathway, with nodes coloured by (A) predicted direction of change and (B) -log10(p). The 10 most significantly inhibited and 10 most significantly activated pathways are shown.

An advantage of visualising pathway analysis results using network structures is that it allows the identification of highly connected pathways (Figure 4). To summarise related pathways and further aid interpretation, we can apply community detection²⁵ to group related pathways into ‘communities’. sSNAPPY’s plot_community() function is a “one-stop shop” for applying a community detection algorithm of the user’s choice to the network structure and annotating identified communities by the most common pathway category, denoting the main biological processes perturbed in that community. The most recent categories for both the KEGG and Reactome databases were curated from their respective websites and are included as part of sSNAPPY. Analyses involving other pathway databases may require user-provided pathway categories. When pathway categorisations are available, annotation of pathway communities is completed automatically. In the current dataset, the Louvain method was applied to the network of biological pathways and revealed five primary communities: 1) Adaptive Immune System; 2) Cell Cycle, Mitotic; 3) Chromatin modifying enzymes & Epigenetic regulation of gene expression; 4) Post-translational protein modification; and 5) The citric acid (TCA) cycle and respiratory electron transport (Figure 5). The largest community formed was the Adaptive Immune System community, indicating a clear immune-signalling aspect to these results.

load(system.file("extdata", "gsAnnotation_df_reactome.rda", package = "sSNAPPY"))
set.seed(199)
plot_community(
    normalisedScores = sigPathway,
    gsTopology = gsTopology,
    gsAnnotation = gsAnnotation_df_reactome,
    colorBy = "status",
    lb_size = 3
) +
  scale_colour_manual(values = c("red", "blue")) +
  scale_fill_viridis_d() +
  scale_x_continuous(expand = expansion(0.2)) +
  scale_y_continuous(expand = expansion(0.2)) +
  guides(fill = "none") +
  theme_void() +
  theme(
      legend.position = "inside",
      legend.position.inside = c(0.9, 0.1)
  )

Figure 5. Community structures among significantly perturbed Reactome pathways that were identified among post-chemotherapy samples using sSNAPPY.

The Louvain algorithm was applied to detect community structures, revealing biological processes associated with highly ranked pathways. The 20 pathways shown are those most perturbed by chemotherapy.
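
The idea behind these network and community plots can be sketched with igraph on a toy example. This is an illustration only and not necessarily how sSNAPPY builds its networks internally: the gene sets are hypothetical, edges are weighted by the number of shared genes, and `cluster_louvain()` is igraph's implementation of the Louvain method.

```r
library(igraph)

# Hypothetical gene sets: P1/P2 share genes, as do P3/P4
gene_sets <- list(
  P1 = c("A", "B", "C"),
  P2 = c("B", "C", "D"),
  P3 = c("X", "Y"),
  P4 = c("Y", "Z")
)

# Count shared genes for every pair of pathways
pairs <- t(combn(names(gene_sets), 2))
n_shared <- apply(pairs, 1, function(p) {
  length(intersect(gene_sets[[p[1]]], gene_sets[[p[2]]]))
})

# Keep only pairs with at least one gene in common as weighted edges
edges <- data.frame(from = pairs[, 1], to = pairs[, 2], weight = n_shared)
edges <- edges[edges$weight > 0, ]

g <- graph_from_data_frame(
  edges, directed = FALSE, vertices = data.frame(name = names(gene_sets))
)

# Louvain community detection; a seed is set as in the plotting calls above
set.seed(1)
comm <- cluster_louvain(g)
community <- setNames(as.integer(membership(comm)), V(g)$name)
community
```

In this toy network the two connected pairs of pathways fall into separate communities, which is the grouping that plot_community() then labels using the pathway categories.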

A key advantage of sSNAPPY is that it does not require the prior identification of differentially expressed genes, which is a common challenge within clinical datasets. However, knowing which genes are implicated in the perturbation of pathways, particularly those which influence multiple pathways, can provide valuable insights into the underlying biological mechanisms and aid hypothesis generation. Therefore, sSNAPPY provides another visualisation function, plot_gs2gene(), which enables the inclusion of selected genes from each pathway in the network structure. Users can provide a vector of fold-change estimates to visualise genes within pathways, showing their estimated change in expression. As pathways often include hundreds of genes, it is recommended to filter for genes most likely to be playing a significant role. In this example dataset, only genes within the top 500 when ranking by the magnitude of the mean ssFC were included (Figure 6). An alternative strategy would be to select genes based on test statistics; however, this decision is up to the individual researcher.

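# Mean ssFC per gene: average the weighted logFCs, then divide out the gene-wise weights
meanFC <- rowMeans(weightedFC$weighted_logFC) / weightedFC$weight
# Flag the 500 genes with the largest absolute mean ssFC
top500 <- rank(1 / abs(meanFC)) <= 500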
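
The two operations used here, removing gene-wise weights and ranking by magnitude, can be checked on a small example. The matrix and weights below are hypothetical stand-ins for `weightedFC$weighted_logFC` and `weightedFC$weight`, assuming one weight per gene (row).

```r
# Hypothetical weighted logFCs: 4 genes (rows) x 3 samples (columns)
weighted_logFC <- matrix(
  c(2, 2, -6, 0.2,   # sample1
    1, 1, -3, 0.4,   # sample2
    3, 3, -9, 0.0),  # sample3
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)
# Hypothetical gene-wise weights, one per row
weight <- c(gene1 = 2, gene2 = 4, gene3 = 3, gene4 = 1)

# Dividing a matrix by a vector of length nrow() recycles down each column,
# so every gene (row) is divided by its own weight
ssFC <- weighted_logFC / weight

# Mean ssFC per gene, computed as in the code above
meanFC <- rowMeans(weighted_logFC) / weight

# rank() assigns rank 1 to the smallest value, so ranking 1 / |meanFC|
# puts the largest magnitudes first
top2 <- rank(1 / abs(meanFC)) <= 2
names(meanFC)[top2]
# "gene1" "gene3"
```

Ranking `-abs(meanFC)` would give the same ordering; either way, tied magnitudes receive averaged ranks under the default `ties.method`, so slightly more or fewer than the requested number of genes can be selected.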

Since Reactome pathway topologies were retrieved using EntrezIDs, users can provide a data.frame mapping EntrezIDs to their chosen identifiers, such as gene names, through the mapEntrezID parameter, in order to make the visualisations more informative. A data.frame converting EntrezIDs to Ensembl gene names was derived from Ensembl Release 101,²⁶ has been made available as part of the package, and serves as a helpful template for users' future mapping operations.

load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY"))
head(entrez2name)

## # A tibble: 6 × 2
##   entrezid           mapTo
##   <chr>              <chr>
## 1 ENTREZID:84771     DDX11L1
## 2 ENTREZID:727856    DDX11L1/DDX11L9/DDX11L10
## 3 ENTREZID:100287102 DDX11L1
## 4 ENTREZID:100287596 DDX11L1/DDX11L9
## 5 ENTREZID:102725121 DDX11L1
## 6 ENTREZID:653635    WASH7P

set.seed(195)
plot_gs2gene(
  normalisedScores = sigPathway,
  gsTopology = gsTopology,
  colorGsBy = "status",
  mapEntrezID = entrez2name,
  geneFC = meanFC[top500],
  layout = "kk",
  edgeAlpha = 1,
  gsNodeSize = 5,
  geneNodeSize = 3,
  gsNameSize = 4,
  geneNameSize = 3.5
) +
  scale_colour_gradient2(name = "Mean ssFC", low = "orange", high = "green4") +
  scale_fill_manual(values = c("red", "blue")) +
  theme_void() +
  theme(
    legend.text = element_text(size = 12),
    legend.title = element_text(size = 12)
  )

Figure 6. Genes associated with significantly perturbed Reactome pathways that were identified among post-chemotherapy samples using sSNAPPY.

Significantly perturbed Reactome pathways identified among post-chemotherapy samples using sSNAPPY, showing any genes in the top 500 ranked by magnitude of change in expression, and which pathways they are likely contributing to. Only the 10 most significantly inhibited and 10 most significantly activated pathways are shown.

Identifying key gene contributions

To further investigate a specific pathway and elucidate which genes are the key contributors to the final perturbation score, we can generate a heatmap via plot_gene_contribution(), which shows the gene-level perturbation scores for the top-ranked members of a given pathway. This function takes advantage of the plotting capabilities of the pheatmap package,²⁷ and as such, other annotations can also be easily included, such as patient response, or the general ranges into which the pathway-level normalised Z-Scores fall. Inclusion of the Z-Scores enabled assessment of the level of perturbation predicted in each sample and the key genes involved (Figure 7).
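
Before the full example, the binning of Z-Scores into general ranges can be sketched in isolation. The Z-Scores below are hypothetical; the bin edges are standard-normal quantiles, so the outermost bins hold values beyond roughly ±1.96.

```r
# Hypothetical robust Z-scores for one pathway across five samples
robustZ <- c(s1 = -2.3, s2 = -1.0, s3 = 0.1, s4 = 1.8, s5 = 2.5)

# Standard-normal quantiles used as bin edges; the outer edges are -Inf and Inf
z_breaks <- round(qnorm(c(0, 0.025, 0.05, 0.25, 0.75, 0.95, 0.975, 1)), 2)

# Assign every score to one of the seven ordered ranges
z_range <- cut(robustZ, breaks = z_breaks, include.lowest = TRUE)
data.frame(robustZ, z_range)
```

Because the ranges are an ordered factor, a diverging colour palette with one colour per level can be mapped onto them directly.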

path_name <- "SUMOylation of DNA replication proteins"
z_breaks <- round(qnorm(c(0, 0.025, 0.05, 0.25, 0.75, 0.95, 0.975, 1)), 2)
annotation_df <- normalisedScores %>%
  dplyr::filter(str_detect(gs_name, path_name)) %>%
  left_join(dplyr::select(sample_meta, sample, CRS), by = "sample") %>%
  mutate(
    'Z Range' = cut(robustZ, breaks = z_breaks, include.lowest = TRUE),
    sample = str_remove_all(sample, "_post-NACT")
  ) %>%
  dplyr::select(sample, 'Z Range', CRS)
z_levels <- levels(annotation_df$'Z Range')
annotation_col <- list(
  CRS = c("3" = "#4B0055", "2" = "#009B95", "1" = "#FDE333"),
  'Z Range' = setNames(
    colorRampPalette(c("navyblue", "white", "darkred"))(length(z_levels)),
    z_levels
  )
)
mat <- genePertScore %>%
  .[[which(str_detect(names(.), path_name))]] %>%
  set_colnames(str_remove_all(colnames(.), "_post-NACT")) %>%
  .[rownames(.) %in% rownames(weightedFC$weighted_logFC), ]
max_pert <- max(abs(range(mat))) * 1.01
plot_gene_contribution(
  genePertMatr = mat,
  color = rev(colorspace::divergex_hcl(100, palette = "RdBu")),
  breaks = seq(-max_pert, max_pert, length.out = 100),
  annotation_df = annotation_df,
  annotation_colors = annotation_col,
  filterBy = "mean",
  mapEntrezID = entrez2name,
  cutree_rows = 2,
  cutree_cols = 2,
  main = paste(path_name, "[REACTOME]")
)

Figure 7. Gene-level perturbation scores for the top 10 genes in the "SUMOylation of DNA replication proteins" pathway.

All pathway genes were ranked by average contribution to the perturbation score to select the top 10 genes. Samples were annotated by patient chemotherapy response score (CRS), along with the range for sample-level Z-scores as a guide to sample-specific pathway perturbation. The genes CDCA8, TOP2A, UBE2I and BIRC5 were identified as possible key drivers of inhibition for this pathway.

From this heatmap we can identify four candidate genes which are likely to be contributing to the inhibition of the SUMOylation of DNA replication proteins pathway upon chemotherapy: CDCA8, TOP2A, UBE2I and BIRC5 (Figure 7). These four genes are all associated with tumour progression and invasiveness and have been studied in the context of ovarian cancer. Both ubiquitin conjugating enzyme E2I (UBE2I) and cell division cycle associated 8 (CDCA8) have been identified as oncogenes in numerous cancer types, including ovarian cancer.²⁸,²⁹ Notably, in ovarian cancer, elevated UBE2I expression has been associated with poorer clinical outcomes.³⁰ Similarly, BIRC5 encodes a protein which is also a predictor of inferior ovarian cancer patient outcome.³¹ Lastly, Topoisomerase II α (TOP2A), which encodes DNA topoisomerase, has been identified as a gene that promotes the tumorigenesis of HGSOC tumours.³² In line with the report by Chekerov et al.³³ that expression of TOP2A in ovarian tumour cells decreases in response to chemotherapy, the median single-sample logFC of TOP2A was negative among the HGSOC post-chemotherapy samples included in this study (Figure 8). The other three potential driver genes (CDCA8, UBE2I and BIRC5) also had negative median single-sample logFC in post-chemotherapy samples (Figure 8). Considering the implication of these four genes in ovarian cancer, decreases in their expression after chemotherapy potentially indicate a favorable response to therapy.

By annotating the heatmap of gene-wise perturbation scores with patient chemotherapy response score (CRS), we also noted that the strongest inhibition of the SUMOylation of DNA replication proteins pathway was in the patient with the highest CRS of 3 (i.e. sample EOC443). CRS is an indicator of the relative length of progression-free survival after chemotherapy, where a score of 3 represents the longest survival. Hence, inhibition of the SUMOylation of DNA replication proteins pathway might mediate a favorable response to chemotherapy in ovarian cancer patients. We acknowledge that our analysis was limited to a small number of patients, which restricts the generalizability of the results. Despite this limitation, these findings underscore the strength of sSNAPPY as a valuable tool for generating hypotheses that would not otherwise be possible. Not only can sSNAPPY predict directional pathway perturbations, but it also enables the identification of potential driver genes which are strongly associated with these perturbations.

gene2plot <- entrez2name %>%
  dplyr::filter(mapTo %in% c("CDCA8", "TOP2A", "UBE2I", "BIRC5"))
(weightedFC$weighted_logFC/weightedFC$weight) %>%
  as.data.frame() %>%
  rownames_to_column("entrezid") %>%
  pivot_longer(
    cols = -all_of("entrezid"), names_to = "sample", values_to = "ssFC"
  ) %>%
  inner_join(gene2plot, by = "entrezid") %>%
  ggplot(aes(mapTo, ssFC, fill = mapTo)) +
  geom_boxplot() +
  labs(x = "", fill = "Gene") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  theme_bw()

Figure 8. ssFC of potential key genes driving the inhibition of the “SUMOylation of DNA replication proteins” pathway.

Single-sample logFC (ssFC) of potential key genes driving the inhibition of the “SUMOylation of DNA replication proteins” pathway as a response to chemotherapy in HGSOC tumours.

Comparison to other pathway enrichment methods

We also performed pathway analysis on this example dataset using three other methods to compare their performance against sSNAPPY. Details on the implementation of those methods and the comparisons performed are available at 10.5281/zenodo.10127829. We initially performed an analysis using SPIA.⁶ However, SPIA relies on differentially expressed genes, and given that only 49 genes were identified in our analysis using conventional differential expression analysis, no pathways were considered to be significant using SPIA. We also performed pathway analysis using two non-topology-based approaches: 1) GSEA³ using ranking statistics derived from differential expression analysis and 2) fry, the fast version of rotation gene set testing for linear models (roast),³⁴ neither of which relies on the presence of differentially expressed genes. Importantly, both methods test for signal within genes at either the up-regulated or down-regulated extremes. Of the 219 gene sets considered associated with down-regulated genes by GSEA, 61 were considered as inhibited using sSNAPPY. Similarly, of the 21 pathways considered as associated with up-regulated genes by GSEA, 5 were considered as activated using sSNAPPY. GSEA produced a further 173 gene sets not detected by sSNAPPY, whilst sSNAPPY produced an additional 54 pathways of interest.

Analysis using fry yielded 117 pathways associated with down-regulated genes, with sSNAPPY considering 36 of these as inhibited. However, 5 pathways classified as inhibited under fry were considered as activated under sSNAPPY, highlighting that down-regulation of some genes may lead to activation of a pathway, which is vital information not available under fry. Similarly, for the 13 pathways associated with up-regulation under fry, two were considered activated by sSNAPPY, with one considered inhibited. A further 88 pathways were detected as being of interest under fry, without being considered as activated or inhibited by sSNAPPY, with 77 pathways uniquely detected under sSNAPPY.
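
Comparisons of this kind amount to cross-tabulating the direction calls of two methods. The sketch below uses six hypothetical pathways with invented calls, purely to show the bookkeeping; it does not reproduce the counts reported above.

```r
# Hypothetical direction calls from two methods for six pathways
calls <- data.frame(
  pathway = paste0("pathway_", 1:6),
  fry = c("Down", "Down", "Down", "Up", "Up", "Not significant"),
  sSNAPPY = c("Inhibited", "Activated", "Not significant",
              "Activated", "Inhibited", "Activated")
)

# Cross-tabulate the two sets of calls
tab <- table(fry = calls$fry, sSNAPPY = calls$sSNAPPY)
tab

# Pathways where down-regulated genes coincide with predicted activation
discordant <- subset(calls, fry == "Down" & sSNAPPY == "Activated")
discordant$pathway
```

The off-diagonal cells are the informative ones here: they correspond to pathways whose gene-level direction and topology-aware predicted activity disagree.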

Discussion

In conclusion, we have presented and demonstrated the R/Bioconductor package sSNAPPY, which offers a novel single-sample pathway perturbation testing approach tailored for heterogeneous tissue samples where experiments are performed using a matched-pair design. In contrast to many common enrichment methods, sSNAPPY uses pathway topology information to compute perturbation scores which indicate the likely impact on the activity of a pathway, predicting the direction of change and enabling deeper characterisation of biological responses. By applying sSNAPPY to a public scRNA-seq dataset collected before and after HGSOC patients were subjected to chemotherapy, we demonstrated its ability to detect significant pathway perturbations of various interesting biological processes consistent with, and extending far beyond, what was shown in the original study. Whilst initially conceived for bulk RNA-seq studies, this demonstration has also shown clear applicability to scRNA-seq datasets when using pseudo-bulk approaches. sSNAPPY addresses the limitations of alternative strategies which fail to account for gene-gene interactions encoded by pathway topologies and are unable to predict the direction of pathway activities or the cumulative effect of expression changes across multiple genes. In addition, the single-sample nature of the method can be utilised to address the increasing demand for personalised medicine. Through identifying shared and divergent responses between individuals, sSNAPPY can provide valuable insights into the heterogeneous responses across clinical samples. Overall, we believe sSNAPPY represents a valuable addition to the existing body of pathway analysis methods.

Ethics and consent

Ethical approval and consent were not required.

Author contributions

WL’s contributions include Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Software, Validation, Visualisation, Writing - Original Draft Preparation, and Writing - Review & Editing. VM was involved with Conceptualization, Methodology and Writing - Review & Editing. WDT contributed to Writing - Review & Editing. SMP’s contributions include Conceptualization, Methodology, Project Administration, Software, Supervision, Writing - Original Draft Preparation, and Writing - Review & Editing.

Data availability

The processed dataset used in this manuscript, along with the code used for data preparation, is available at https://doi.org/10.5281/zenodo.10867706.

Raw data was obtained from Gene Expression Omnibus: Longitudinal single-cell RNA-seq data of metastatic ovarian cancer. Accession GSE165897; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE165897.²¹

Software availability

• Software available from: https://bioconductor.org/packages/release/bioc/html/sSNAPPY.html
• Source code available from: https://github.com/Wenjun-Liu/sSNAPPY
• Archived source code at time of publication: https://zenodo.org/doi/10.5281/zenodo.8185450
• License: GNU General Public License v3.0 (GPL-3)

References

1. Maleki F, Ovens K, Hogan DJ, et al.: Gene set analysis: Challenges, opportunities, and future research. Front. Genet. 2020; 11: 654.
2. Mubeen S, Tom Kodamullil A, Hofmann-Apitius M, et al.: On the influence of several factors on pathway enrichment analysis. Brief. Bioinform. 2022; 23(3): bbac143.
3. Subramanian A, Tamayo P, Mootha VK, et al.: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA. 2005 Oct; 102(43): 15545–15550.
4. Ogata H, Goto S, Sato K, et al.: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999 Jan; 27(1): 29–34.
5. Martens M, Ammar A, Riutta A, et al.: WikiPathways: Connecting communities. Nucleic Acids Res. 2021 Jan; 49(D1): D613–D621.
6. Tarca AL, Draghici S, Khatri P, et al.: A novel signaling pathway impact analysis. Bioinformatics. 2009 Jan; 25(1): 75–82.
7. Jacob L, Neuvial P, Dudoit S: More power via graph-structured tests for differential expression of gene networks. Ann. Appl. Stat. 2012 Jun; 6(2): 561–600.
8. Ma J, Shojaie A, Michailidis G: Network-based pathway enrichment analysis with incomplete network information. Bioinformatics. 2016 Jun; 32(20): 3165–3174.
9. Ibrahim MAH, Jassim S, Cawthorne MA, et al.: A topology-based score for pathway enrichment. J. Comput. Biol. 2012 May; 19(5): 563–573.
10. Wu D, Smyth GK: Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 2012 May; 40(17): e133–e133.
11. Nguyen TM, Shafi A, Nguyen T, et al.: Identifying significantly impacted pathways: A comprehensive review and assessment. Genome Biol. 2019 Oct; 20(1): 203.
12. Ma J, Shojaie A, Michailidis G: A comparative study of topology-based pathway enrichment analysis methods. BMC Bioinformatics. 2019 Dec; 20(1): 546.
13. Hänzelmann S, Castelo R, Guinney J: GSVA: Gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013 Dec; 14(1): 7.
14. Law CW, Chen Y, Shi W, et al.: Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15(2): R29.
15. Knijnenburg TA, Wessels LFA, Reinders MJT, et al.: Fewer permutations, more accurate P-values. Bioinformatics. 2009 May; 25(12): i161–i168.
16. Gene set enrichment analysis (GSEA) User Guide.
17. Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004 Feb; 3(1): 1–25.
18. Ritchie ME, Phipson B, Wu D, et al.: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015 Apr; 43(7): e47.
19. Sales G, Calura E, Cavalieri D, et al.: Graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics. 2012 Dec; 13(1): 20.
20. Gillespie M, Jassal B, Stephan R, et al.: The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2021 Nov; 50(D1): D687–D692.
21. Zhang K, Erkan EP, Jamalzadeh S, et al.: Longitudinal single-cell RNA-seq analysis reveals stress-promoted chemoresistance in metastatic ovarian cancer. Sci. Adv. 2022 Feb; 8(8): eabm1831.
22. Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012 Apr; 13(2): 204–216.
23. Kamburov A, Stelzl U, Lehrach H, et al.: The ConsensusPathDB interaction database: 2013 update. Nucleic Acids Res. 2012 Nov; 41(D1): D793–D800.
24. Wickham H: ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer New York; 2009.
25. Newman MEJ, Girvan M: Finding and evaluating community structure in networks. Phys. Rev. E. 2004 Feb; 69(2): 026113.
26. Cunningham F, Allen JE, Allen J, et al.: Ensembl 2022. Nucleic Acids Res. 2021 Nov; 50(D1): D988–D995.
27. Kolde R: Pheatmap: Pretty heatmaps. 2019.
28. Dong M, Pang X, Xu Y, et al.: Ubiquitin-Conjugating Enzyme 9 Promotes Epithelial Ovarian Cancer Cell Proliferation in Vitro. Int. J. Mol. Sci. 2013 May; 14(6): 11061–11071.
29. Qi G, Zhang C, Ma H, et al.: CDCA8, targeted by MYBL2, promotes malignant progression and olaparib insensitivity in ovarian cancer. Am. J. Cancer Res. 2021; 11(2): 389–415.
30. Zou R, Xu H, Li F, et al.: Increased expression of UBE2T predicting poor survival of ovarian cancer: Based on bioinformatics analysis of UBE2s, clinical samples and the GEO database. DNA Cell Biol. 2021; 40(1): 36–60.
31. Gąsowska-Bajger B, Gąsowska-Bodnar A, Knapp P, et al.: Prognostic Significance of Survivin Expression in Patients with Ovarian Carcinoma: A Meta-Analysis. J. Clin. Med. 2021 Feb; 10(4): 879.
32. Gao Y, Zhao H, Ren M, et al.: TOP2A Promotes Tumorigenesis of High-grade Serous Ovarian Cancer by Regulating the TGF-β/Smad Pathway. J. Cancer. 2020; 11(14): 4181–4192.
33. Chekerov R, Klaman I, Zafrakas M, et al.: Altered Expression Pattern of Topoisomerase II, in Ovarian Tumor Epithelial and Stromal Cells after Platinum-Based Chemotherapy. Neoplasia. 2006 Jan; 8(1): 38–45.
34. Wu D, Lim E, Vaillant F, et al.: ROAST: Rotation gene set tests for complex microarray experiments. Bioinformatics. 2010; 26: 2176–2182.


Competing interests

No competing interests were disclosed.

Grant information

W.D. Tilley’s research is supported by the National Health and Medical Research Council of Australia (ID 1186647) and the National Breast Cancer Foundation Australia (ID IIRS-23-069). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


Copyright

© 2024 Liu W et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Liu W, Mäkinen VP, Tilley WD and Pederson SM. sSNAPPY: an R/Bioconductor package for single-sample directional pathway perturbation analysis [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:628 (https://doi.org/10.12688/f1000research.149494.1)


Open Peer Review


Reviewer Report 18 Jul 2024

Panagiotis Moulos, Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center ‘Alexander Fleming’, Vari, Greece

Approved with Reservations

https://doi.org/10.5256/f1000research.163964.r297693


The authors present sSNAPPY, a Bioconductor package for the interesting research field of determining expression changes in whole pathways given the topology, instead of only identifying over-represented gene groups with multiple functionalities. They also extend the case by taking into account individual samples instead of summarized conditions, as done in previous work.

Overall, the article is well-written with well-described mathematical formulas and notation. The concept is well-presented and documented. However, as individual-sample based analysis does not take into account replication, it would be beneficial if the authors would elaborate more on the issue, i.e. elaborate more on the theoretical foundations of their permutation strategy.

Regarding the package itself, if accepted to Bioconductor there has already been a thorough review involving multiple code quality and consistency checks as well as testing, therefore there are no objections in this part. However, the authors should justify further their selection of a scRNA-Seq dataset instead of a bulk RNA-Seq one, as also stated in p. 6.

Code-related remarks:
- To make the example more easily reproducible, the code at the beginning of p. 7 should include downloading and, if required, renaming logCPM.tsv.
- retrieve_topology takes too much time. Maybe the authors could create a resource and submit it as a separate package to Bioconductor. Also, I got warnings while executing it, and the message is not very clear.
- genePertScore, ssPertScore: the numbers are not fully reproducible (Ubuntu 22.04, R 4.3.2). The authors should make clear that the examples in the paper could depend on the R version and on other package versions.
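
As an editorial illustration of the last remark, a minimal sketch of two habits that help readers trace permutation-based results back to an environment: fixing the random seed and saving the session information next to the results. The seed value, the file name and the `perm_scores` stand-in are assumptions made for this sketch; the report does not state whether a seed alone is sufficient for the package's permutation functions.

```r
# Fix the RNG seed before any permutation step.
# The value 123 is arbitrary and chosen for illustration only.
set.seed(123)

# Stand-in for a permutation-based step; this is NOT an sSNAPPY call.
perm_scores <- sample(seq_len(10))

# Record the R version and attached package versions alongside the results,
# so that differing numbers can be traced to a differing environment.
env_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), env_file)
```

Re-running the block with the same seed reproduces `perm_scores` within one R version; the saved session information makes cross-version differences easier to diagnose.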

Is the rationale for developing the new software tool clearly explained?
Yes

Is the description of the software tool technically sound?
Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: bioinformatics, clinical bioinformatics, transcriptomics, genetics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 04 Jul 2024

Dario Strbenac, The University of Sydney, Sydney, New South Wales, Australia

Approved with Reservations

https://doi.org/10.5256/f1000research.163964.r292642

sSNAPPY is a Bioconductor package for directional pathway analysis. Having been accepted into Bioconductor, the package code and documentation are already known to be of good quality. Mathematical formulas are clear and the variables are all defined. However, the first step has an unfavourable user experience.

I ran retrieve_topology(database = "reactome", species = "hsapiens") but nothing happened for about half an hour, and there were neither progress nor error messages. I tried both a Windows 11 personal computer and a university-managed Linux server. Eventually, the function appears to complete successfully, albeit with a list of seemingly ominous warnings of unknown consequence:

> head(warnings(), 3)
Warning messages:
1: In FUN(X[[i]], ...) :
the conversion lost all edges of pathway "Uncoating of the HIV Virion"
2: In FUN(X[[i]], ...) :
the conversion lost all edges of pathway "Plus-strand DNA synthesis"
3: In FUN(X[[i]], ...) :
the conversion lost all edges of pathway "Virus Assembly and Release"
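
As an editorial aside, warnings of this kind can be collected rather than left to scroll past, which makes it easier to see which pathways are affected. The sketch below uses only base R; `convert_pathways()` is a hypothetical stand-in that mimics the messages quoted above, and `collect_warnings()` is an illustrative helper, not part of sSNAPPY.

```r
# Hypothetical stand-in that warns once per pathway, mimicking the
# messages quoted above. It is NOT the package's own code.
convert_pathways <- function(pathways) {
  for (p in pathways) {
    warning(sprintf('the conversion lost all edges of pathway "%s"', p))
  }
  invisible(pathways)
}

# Illustrative helper: evaluate an expression, keep its value, and
# gather every warning message instead of printing it.
collect_warnings <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(value = value, warnings = msgs)
}

res <- collect_warnings(
  convert_pathways(c("Uncoating of the HIV Virion", "Plus-strand DNA synthesis"))
)

# Extract the pathway names from the collected messages.
lost <- sub('.*pathway "(.*)"$', "\\1", res$warnings)
```

The same wrapper could in principle be placed around any slow call, so that the affected pathway names are available as a character vector afterwards.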

If the user runs the command again, it once more takes a long time, rather than the result being cached to disk the first time and loaded near-instantly the second. This part of the analysis could benefit from BiocFileCache.
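
The caching idea can be sketched in a few lines. This is an editorial illustration using base R (`saveRDS()`/`readRDS()`) as a stand-in for BiocFileCache, which would additionally manage the cache location and metadata; `cached()`, `slow_retrieval()` and the file name are hypothetical names introduced for the sketch.

```r
# Hypothetical helper: run `compute()` once, store the result as an .rds
# file, and read it back from disk on later calls.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

# Stand-in for a slow retrieval; the counter records how often it really runs.
n_calls <- 0
slow_retrieval <- function() {
  n_calls <<- n_calls + 1
  list(pathway_a = diag(2))
}

cache_file <- file.path(tempdir(), "topology_cache.rds")
unlink(cache_file)  # start from an empty cache for the demonstration

first  <- cached(cache_file, slow_retrieval)  # computes and writes the file
second <- cached(cache_file, slow_retrieval)  # reads the file, no recomputation
```

With the package installed, the second argument could be `function() retrieve_topology(database = "reactome", species = "hsapiens")`, so that only the first call pays the retrieval cost. A persistent directory rather than `tempdir()` would be needed for the cache to survive across R sessions.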

Generation of Figure 3 involves a lot of ggplot2 code for quality control plots, which seem generally useful. Perhaps this should be a reusable function that package users could call concisely in their own R Markdown files. Similarly, Figure 4 involves four steps which could be encapsulated into a package function for better reusability.

In terms of the Introduction, perhaps Yang et al. (2014) [ref 1], "Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway" (Bioinformatics), should be discussed and its limitations noted.

Is the rationale for developing the new software tool clearly explained?
Partly

Is the description of the software tool technically sound?
Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes

References

1. Yang P, Patrick E, Tan SX, Fazakerley DJ, et al.: Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway. Bioinformatics. 2014; 30(6): 808-14.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Statistical Bioinformatics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
