Method Article

Hobotnica: exploring molecular signature quality

[version 1; peer review: 2 approved with reservations]
PUBLISHED 08 Dec 2021

This article is included in the Bioinformatics gateway.

Abstract

A Molecular Features Set (MFS) can result from any of a vast diversity of bioinformatics pipelines. The lack of a “gold standard” for most experimental data modalities makes it difficult to provide a valid estimate of a particular MFS's quality. Yet this goal can be partially achieved by analyzing inter-sample Distance Matrices (DM) and their power to distinguish between phenotypes.
The quality of a DM can be assessed by summarizing how well it separates within-phenotype distances from between-phenotype distances. This estimate of DM quality can be construed as a measure of the MFS's quality.
Here we propose Hobotnica, an approach that estimates the quality of MFSs by their ability to stratify data and assigns them significance scores, which allows various signatures to be collated and their quality compared for contrasting groups.

Keywords

Molecular signature, Distance Matrix, Differential Gene Expression, Gene Signature, Rank statistics

Introduction

A signature based on a predefined Molecular Features Set (MFS), designed to distinguish biological conditions or phenotypes from each other, is a crucial concept in bioinformatics and precision medicine. In this context, an MFS is typically derived by contrasting experimental data from two or more sample groups that differ phenotypically, so it incorporates information on the differences between the groups. The nature of the MFS depends on the modality of the original data. For instance, the MFS provided by the Differential Gene Expression (DGE) approach is a list of Differentially Expressed Genes (DEG); Differential Methylation analysis provides Differentially Methylated Cytosines or Regions (DMC and DMR) as the MFS.

A significant number of mutational, expression and methylation-based signatures have recently been published, and they are actively used in research and translational medicine. Examples of expression-based signatures include gene sets for clinical prognosis (e.g., PAM50 [1], MammaPrint [2]), for pathway and gene enrichment analysis (e.g., MSigDB collections [3]), and for drug re-purposing (e.g., LINCS project [4]).

Direct quality assessment of an MFS is currently hardly possible, since there are no ‘gold standard’ datasets in which the active Molecular Features are explicitly known. In this manuscript, we propose a novel approach, Hobotnica, that measures MFS quality by addressing the key property of a signature, namely its ability to stratify data.

Hobotnica leverages the quality of distance matrices obtained from any source, in order to assess the quality of the MFS from any data modality compared to a random MFS. In this study, we demonstrate its application to transcriptomic signatures.

Methods

Approach

The Hobotnica approach is as follows. For a given data set $D$ and a given MFS $S$, we derive the inter-sample distance matrix $DM_S(D)$. We then assess the quality of the DM (and thus of $S$) with a summarizing function $\alpha(DM_S, Y)$, written $\alpha(DM_S)$ by abuse of notation, where $Y$ represents the sample labels. In shorter notation,

$$H: S \xrightarrow{f(S, D)} DM \xrightarrow{g(DM, Y)} \alpha \tag{1}$$

We want the function α to gauge whether samples of the same class are closer to each other than to samples of other classes. If the classes do not differ, α must be close to zero, and α grows as the difference grows. In the ideal case of perfect separation, α reaches its maximum of 1:

  • α ∈ [0, 1]

  • α → 1: high group stratification quality

  • α → 0: low group stratification quality
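
The two steps above (f builds the distance matrix, g summarizes it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is assumed to be already restricted to the features in S, and the rescaling of the rank statistic to the [0, 1] range is one possible choice that matches the properties listed above.

```python
from itertools import combinations


def kendall_distance(x, y):
    """Normalized Kendall distance: share of feature pairs ordered differently in x and y."""
    pairs = list(combinations(range(len(x)), 2))
    discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return discordant / len(pairs)


def distance_matrix(samples):
    """Step f: symmetric inter-sample distance matrix for samples restricted to S."""
    n = len(samples)
    dm = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        dm[a][b] = dm[b][a] = kendall_distance(samples[a], samples[b])
    return dm


def alpha(dm, labels):
    """Step g: rank-based summary of between-class versus within-class distances."""
    within, between = [], []
    for a, b in combinations(range(len(labels)), 2):
        (within if labels[a] == labels[b] else between).append(dm[a][b])
    # Mann-Whitney U counted directly: a 'between' distance wins if it is larger,
    # and a tie counts as half a win.
    u = sum((b > w) + 0.5 * (b == w) for b in between for w in within)
    auc = u / (len(within) * len(between))
    # Assumed rescaling: 0 when the classes are not separated, 1 for perfect separation.
    return max(0.0, 2.0 * auc - 1.0)


# Toy data (hypothetical): two samples with rising profiles, two with falling profiles.
samples = [[1, 2, 3, 4], [2, 3, 5, 9], [9, 7, 4, 1], [8, 6, 5, 2]]
labels = ["A", "A", "B", "B"]
print(alpha(distance_matrix(samples), labels))  # 1.0
```

With labels that cut across the two profile shapes (for example A, B, A, B), the same data give α = 0, since between-class distances are then no larger than within-class ones.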

Under the null hypothesis of Hobotnica (H0), no significant difference exists between the α of S and the α of a random feature set of equal size. In contrast, the alternative hypothesis (HA) states that S generates a higher α than most random sets of the same size. To estimate a null distribution for Hobotnica’s α, we applied a permutation test. By default, we use the Kendall distance as the distance measure and the Mann-Whitney-Wilcoxon test as the summarizing function.
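
The test against random feature sets of equal size can be sketched as below. This is an illustrative outline only: `permutation_p_value`, `toy_score` and the feature indices are hypothetical names introduced here, and the toy scoring function merely stands in for an H-score computation.

```python
import random


def permutation_p_value(score_fn, signature, all_features, n_random=1000, seed=0):
    """Compare a signature's score with scores of random feature sets of the same size."""
    rng = random.Random(seed)
    observed = score_fn(signature)
    # Count random sets that score at least as high as the signature.
    hits = sum(
        score_fn(rng.sample(all_features, len(signature))) >= observed
        for _ in range(n_random)
    )
    # Adding 1 to numerator and denominator avoids reporting a p-value of exactly zero.
    return observed, (hits + 1) / (n_random + 1)


# Toy stand-in for the H-score (hypothetical): share of "informative" features in a set.
INFORMATIVE = {0, 1, 2, 3, 4}


def toy_score(features):
    return sum(f in INFORMATIVE for f in features) / len(features)


all_features = list(range(50))
score, p_value = permutation_p_value(toy_score, [0, 1, 2, 3, 4], all_features)
print(score, p_value)
```

Passing the scoring function as an argument keeps the resampling logic independent of the particular distance measure and summarizing function. The smallest p-value this estimate can report is 1 / (n_random + 1), so the number of random sets bounds the attainable significance.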

When, instead of a single MFS, a set of hypotheses $H_1: MFS_1, H_2: MFS_2, \ldots, H_n: MFS_n$ is in place, the corresponding Distance Matrix $DM_i$ can be generated for each Molecular Feature Set $MFS_i$, and then, in turn, a particular value of the measure $\alpha_i$:

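$$
\begin{aligned}
H_1 &: MFS_1 \xrightarrow{f(MFS_1, D)} DM_1 \xrightarrow{g(DM_1, Y)} \alpha_1 \\
H_2 &: MFS_2 \xrightarrow{f(MFS_2, D)} DM_2 \xrightarrow{g(DM_2, Y)} \alpha_2 \\
&\;\;\vdots \\
H_n &: MFS_n \xrightarrow{f(MFS_n, D)} DM_n \xrightarrow{g(DM_n, Y)} \alpha_n.
\end{aligned}
\tag{2}
$$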

Thus, for every $MFS_i$ from the set of hypotheses $H_1: MFS_1, H_2: MFS_2, \ldots, H_n: MFS_n$, an H-score $\alpha_i$ may be computed, resulting in a set $\alpha_1, \alpha_2, \ldots, \alpha_n$. Comparing the α values allows the corresponding Feature Sets to be ranked by quality and the most informative signatures to be selected for the data $D$.
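
Once an H-score is available for each candidate, the ranking step is a plain sort. The short sketch below uses a hypothetical helper name, `rank_signatures`, and takes three H-scores reported in Table 1 as its input.

```python
def rank_signatures(h_scores):
    """Return (name, H-score) pairs ordered from the highest to the lowest H-score."""
    return sorted(h_scores.items(), key=lambda item: item[1], reverse=True)


# H-scores of three gene sets as reported in Table 1.
h_scores = {
    "LIU_PROSTATE_CANCER": 0.735,
    "TOMLINS_PROSTATE_CANCER": 0.795,
    "WALLACE_PROSTATE_CANCER": 0.747,
}
ranking = rank_signatures(h_scores)
best_name, best_score = ranking[0]
print(best_name, best_score)  # TOMLINS_PROSTATE_CANCER 0.795
```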

Validation

To validate our approach, we conducted three case studies.

In the first case study, we extracted an RNA-seq expression dataset for prostate cancer from The Cancer Genome Atlas (TCGA) at the count level [5]. As MFSs, we used the C2 collection of molecular signatures from MSigDB [3,6], which contains a number of prostate-related gene sets. This way, every candidate MFS (gene set from the collection) produced its own H-score.

For the second case study, we used the PAM50 molecular signature [7], which was designed for classifying breast cancer subtypes, as the MFS. We then applied it to several datasets containing these breast cancer subtypes [5,8–11].

In the third case study, we explored H-scores delivered by various DGE approaches. We performed DGE analysis for two groups of mouse samples with different responses to MYC factor treatment (Mycfl/fl vs MycΔIE, ERT2 genotypes) [12] with DESeq2 [13] and edgeR [14].

The top 100 genes for each method were then retrieved. In addition, we extracted a list of genes with the highest variance in expression, as well as a number of random gene sets.
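
The two baseline gene lists mentioned above (highest-variance genes and random gene sets) could be produced along the following lines. This is a sketch under assumptions: the function names, gene identifiers and expression values are made up for illustration, and population variance is used as the variability measure.

```python
import random
from statistics import pvariance


def top_variance_genes(expression, k):
    """Return the k genes with the highest variance across samples.

    expression maps a gene name to its list of expression values, one per sample.
    """
    return sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)[:k]


def random_gene_sets(genes, k, n_sets, seed=0):
    """Draw n_sets random gene sets of size k, without repeats inside a set."""
    rng = random.Random(seed)
    return [rng.sample(genes, k) for _ in range(n_sets)]


# Hypothetical toy expression table: three genes measured in three samples.
expression = {"g1": [1, 1, 1], "g2": [0, 5, 10], "g3": [2, 3, 4]}
print(top_variance_genes(expression, 2))  # ['g2', 'g3']
background = random_gene_sets(list(expression), k=2, n_sets=5)
```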

In each case, the counts were normalised to counts per million (cpm). For every gene set, an H-score and its p-value with Benjamini-Hochberg (BH) [15] correction were computed.
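
For readers less familiar with these two steps, the sketch below shows a cpm scaling for one sample and a Benjamini-Hochberg adjustment for a list of p-values. The function names and the input numbers are illustrative assumptions, not values from the study.

```python
def counts_per_million(counts):
    """Scale one sample's raw counts so that they sum to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    # Walk from the largest p-value to the smallest, keeping a running minimum
    # so that the adjusted values never increase as the raw p-values decrease.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of p_values[i] in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(counts_per_million([10, 30, 60]))  # [100000.0, 300000.0, 600000.0]
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

In the second call, the two middle p-values receive the same adjusted value (about 0.053) because of the running minimum, which is the expected behaviour of the BH step-up procedure.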

Results

Prostate-related C2 gene sets showed the highest H-score values with statistically significant p-values (Table 1, Figure 1A), as well as clear data stratification (Figure 1B), which is expected for a prostate cancer versus control contrast. Gene sets not attributed to prostate cancer-related processes did not reach statistical significance (Table 1).

Table 1. Ten C2 chemical and genetic perturbations (CGP) gene signatures with the highest H-scores.

Signature                               H-score   p-value
TOMLINS_PROSTATE_CANCER                 0.795     0.025
WALLACE_PROSTATE_CANCER                 0.747     0.025
OUYANG_PROSTATE_CANCER_PROGRESSION      0.745     0.025
LIU_PROSTATE_CANCER                     0.735     0.025
PIEPOLI_LGI1_TARGETS                    0.724     0.059
SMID_BREAST_CANCER_RELAPSE_IN_LIVER     0.712     0.164
TIMOFEEVA_GROWTH_STRESS_VIA_STAT1       0.708     0.240
GENTILE_UV_LOW_DOSE                     0.705     0.308
JOHANSSON_BRAIN_CANCER_EARLY_VS_LATE    0.701     0.377
HOWLIN_CITED1_TARGETS_1                 0.700     0.377

Figure 1. A: Distribution of H-scores for random genesets (blue) on TCGA prostate cancer vs normal dataset (see Table 1) and Tomlins prostate geneset H-score (red). B: MDS for TCGA prostate demonstrates samples separation with Tomlins geneset. C: Distribution of H-scores for random genesets (blue) on GSE48216 breast cancer dataset (see Table 2) and PAM50 geneset H-score (red). D: MDS for GSE48216 breast cancer dataset samples separation with PAM50 geneset.

In the second case study, H-scores for the PAM50 signature were higher than those for random gene sets on every dataset (Figure 2, Figure 1C). This implies that the PAM50 signature exhibits high stratification quality for samples of various breast cancer subtypes. PAM50-delivered H-scores also had highly statistically significant p-values (Table 2).


Figure 2. Distribution of random gene sets-delivered (blue) and PAM50 gene set-delivered (green) H-scores for breast cancer datasets (see Table 2).

Table 2. PAM50 results.

GEO Accession   Sample size   Groups in dataset   H-score   p-value
GSE58135        168           6                   0.772     7e-4
GSE62944        1067          5                   0.8892    0.0003
GSE48216        46            3                   0.8567    0.0003
GSE80333        10            3                   0.9765    0.0003

In the third case study, various DGE approaches resulted in gene sets that delivered significantly different H-scores (Figure 3). For this dataset, edgeR provided a signature with the best quality score, while DESeq2 still demonstrated a higher separation quality than that of random signatures. Genes with the highest variance showed lower scores compared to random gene sets. This result stresses the importance of the Hobotnica procedure to evaluate the quality of a particular DGE analysis.


Figure 3. H-scores for the top 100 gene signatures derived from DESeq2 and edgeR, for genes with the highest variance, and for random gene sets, applied to GSE155460 data.

Discussion

Hobotnica was designed to quantitatively evaluate MFS quality through their ability for data stratification, based on their inter-sample distance matrices, and to assess the statistical significance of the results. We demonstrated that Hobotnica can efficiently estimate the quality of a molecular signature in the context of expression data.

The suggested method can be used to evaluate various sorts of MFSs: those retrieved from DGE or differential methylation analyses, mutation/single nucleotide variation calling or pathway analysis, as well as those from other data modalities that can be posed as differential problems.

A possible application of Hobotnica is evaluating a particular model’s performance (e.g., DGE model) for a particular dataset. This will allow researchers to choose a method that delivers a signature with the best data stratification from a number of proposed approaches.

The proposed method can also be used to assess H-score values for different lengths of the same set or signature, which will help to optimize MFS structure. Such procedures can be especially crucial in clinical applications.

Data availability

Underlying data

NCBI Gene Expression Omnibus: Alternatively processed and compiled RNA-Sequencing and clinical data for thousands of samples from The Cancer Genome Atlas, https://identifiers.org/ncbiprotein:GSE62944

NCBI Gene Expression Omnibus: Modeling precision treatment of breast cancer, https://identifiers.org/ncbiprotein:GSE48216

NCBI Gene Expression Omnibus: Spatial proximity to fibroblasts impacts molecular features and therapeutic sensitivity of breast cancer cells influencing clinical outcomes, https://identifiers.org/ncbiprotein:GSE80333

NCBI Gene Expression Omnibus: Next Generation Sequencing Analysis of Mycfl/fl and MycIE, ERT2 intestinal transcriptomes, https://identifiers.org/ncbiprotein:GSE155460

Extended data

Analysis code

Analysis code available from: https://github.com/lab-medvedeva/Hobotnica-main

Archived analysis code as at time of publication: https://doi.org/10.5281/zenodo.5656814

License: GNU General Public License v2.0

Competing interests

No competing interests were disclosed.

Grant information

This work was supported by Ministry of Science and Higher Education of the Russian Federation (agreement no. 075-15-2020-899) and by the NIH grants R01DE027809 and P30CA006973.

how to cite this article
Stupnikov A, Sizykh A, Favorov A et al. Hobotnica: exploring molecular signature quality [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10:1260 (https://doi.org/10.12688/f1000research.74846.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to reviewer statuses:
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 05 Jan 2022
Shailesh Tripathi, Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria;  FH Austria, Steyr, Austria 
Approved with Reservations
The authors present an approach called Hobotonica for quantitatively evaluating (by assigning H score) MFS quality for given sample labels. This approach could be useful for analyzing samples, for e.g., quality comparison, filtering out poor quality samples, and comparing different ...
HOW TO CITE THIS REPORT
Tripathi S. Reviewer Report For: Hobotnica: exploring molecular signature quality [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10:1260 (https://doi.org/10.5256/f1000research.78645.r102284)
  • Author Response 16 Aug 2022
    Alexey Stupnikov, National Medical Research Center for Endocrinology, Moscow, Russian Federation
    • The author should add simulation studies providing a realistic understanding and interpretation of the H score.
    The Reviewer raises an important problem of parametric and nonparametric statistics. ...
Reviewer Report 20 Dec 2021
Roberto Malinverni, Cancer and Leukemia Epigenetics and Biology Program, Josep Carreras Leukemia Research Institute (IJC), Badalona, Spain 
Approved with Reservations
In this short article the authors present an R package called Hobotnica, whose purpose is to evaluate the goodness with which different methodologies can stratify the results presented as Molecular Feature Sets (MFS). With MFS the authors point to all ...
HOW TO CITE THIS REPORT
Malinverni R. Reviewer Report For: Hobotnica: exploring molecular signature quality [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10:1260 (https://doi.org/10.5256/f1000research.78645.r102280)
  • Author Response 16 Aug 2022
    Alexey Stupnikov, National Medical Research Center for Endocrinology, Moscow, Russian Federation
    • The authors present three examples in which it is demonstrated how this approach is able to evaluate the effectiveness of MFS stratification, but the examples considered are all ...
