Hobotnica: exploring molecular signature quality

Alexey Stupnikov; Alexey Sizykh; Alexander Favorov; Bahman Afsari; Sarah Wheelan; Luigi Marchionni; Yulia Medvedeva

doi:10.12688/f1000research.74846.1


Method Article

[version 1; peer review: 2 approved with reservations]

Alexey Stupnikov¹,², Alexey Sizykh¹, Alexander Favorov³,⁴, Bahman Afsari³, Sarah Wheelan³, Luigi Marchionni⁵, Yulia Medvedeva¹,²,⁶

PUBLISHED 08 Dec 2021

Author details

¹ Moscow Institute of Physics and Technology, Moscow, Russian Federation
² National Medical Research Center for Endocrinology, Moscow, Russian Federation
³ Johns Hopkins University, Baltimore, USA
⁴ Vavilov Institute for General Genetics RAS, Moscow, Russian Federation
⁵ Weill Cornell Medicine, New York, USA
⁶ Center of Biotechnology RAS, Moscow, Russian Federation

Alexey Stupnikov
Roles: Conceptualization, Formal Analysis, Investigation, Methodology, Project Administration, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Alexey Sizykh
Roles: Formal Analysis, Investigation, Software, Visualization

Alexander Favorov
Roles: Conceptualization

Bahman Afsari
Roles: Conceptualization, Writing – Original Draft Preparation

Sarah Wheelan
Roles: Conceptualization, Funding Acquisition, Writing – Original Draft Preparation

Luigi Marchionni
Roles: Conceptualization, Funding Acquisition, Methodology

Yulia Medvedeva
Roles: Funding Acquisition, Project Administration, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Bioinformatics gateway.

Abstract

A Molecular Features Set (MFS) is the output of a wide variety of bioinformatics pipelines. The lack of a “gold standard” for most experimental data modalities makes it difficult to estimate the quality of a particular MFS reliably. Yet this goal can be partially achieved by analyzing inter-sample Distance Matrices (DM) and their power to distinguish between phenotypes.
The quality of a DM can be assessed by summarizing how well it separates within-phenotype distances from between-phenotype distances. This estimate of DM quality can be construed as a measure of the quality of the MFS.
Here we propose Hobotnica, an approach that estimates the quality of MFSs by their ability to stratify data and assigns them significance scores, which allows various signatures to be collated and their quality compared for contrasting groups.

Keywords

Molecular signature, Distance Matrix, Differential Gene Expression, Gene Signature, Rank statistics

Corresponding authors: Alexey Stupnikov, Yulia Medvedeva

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by Ministry of Science and Higher Education of the Russian Federation (agreement no. 075-15-2020-899) and by the NIH grants R01DE027809 and P30CA006973.

Copyright: © 2021 Stupnikov A et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Stupnikov A, Sizykh A, Favorov A et al. Hobotnica: exploring molecular signature quality [version 1; peer review: 2 approved with reservations]. F1000Research 2021, 10:1260 (https://doi.org/10.12688/f1000research.74846.1) First published: 08 Dec 2021, 10:1260 (https://doi.org/10.12688/f1000research.74846.1) Latest published: 16 Aug 2022, 10:1260 (https://doi.org/10.12688/f1000research.74846.2)

Introduction

A signature based on a predefined Molecular Features Set (MFS), designed to distinguish biological conditions or phenotypes from each other, is a crucial concept in bioinformatics and precision medicine. In this context, signatures typically originate from MFSs derived by contrasting experimental data from two or more phenotypically different sample groups. These MFSs capture information on the differences between the groups. The nature of an MFS depends on the modality of the original data. For instance, the MFS provided by Differential Gene Expression (DGE) analysis is a list of Differentially Expressed Genes (DEG); Differential Methylation analysis provides Differentially Methylated Cytosines or Regions (DMC and DMR) as MFS.

A significant number of mutational, expression and methylation-based signatures have recently been published, and they are actively used in research and translational medicine. Examples of expression-based signatures include gene sets for clinical prognosis (e.g., PAM50,¹ MammaPrint²), for pathway and gene enrichment analysis (e.g., MSigDB collections³), and for drug re-purposing (e.g., LINCS project⁴).

Direct quality assessment of an MFS is currently hardly possible, since there are no ‘gold standard’ datasets in which the active Molecular Features are explicitly known. In this manuscript, we propose a novel approach, Hobotnica, that measures MFS quality by addressing the key property of a signature, namely its ability to stratify data.

Hobotnica leverages the quality of distance matrices obtained from any source, in order to assess the quality of the MFS from any data modality compared to a random MFS. In this study, we demonstrate its application to transcriptomic signatures.

Methods

Approach

The Hobotnica approach is as follows. For a given dataset $D$ and a given MFS $S$, we derive the inter-sample distance matrix $DM (S, D)$. We then assess the quality of $DM$ (and thus of $S$) with a summarizing function $α (S) = α (DM (S, D), Y)$, written by abuse of notation as $α (DM (S))$, where $Y$ represents the sample labels. In shorter notation,

$$
\begin{array}{c} H : S \\ f (S \mid D) \to DM \\ g (DM \mid Y) \to α \end{array} \tag{1}
$$

We want the function $α$ to gauge whether samples are closer to other samples of their own class than to samples of other classes. If the classes do not differ, $α$ must be close to zero, and $α$ grows as the difference grows. In the ideal case of perfect separation, $α$ reaches its maximum of 1:

• $α \in [0, 1]$
• $α \to 1 \Leftrightarrow$ high group stratification quality
• $α \to 0 \Leftrightarrow$ low group stratification quality

Under the null hypothesis of Hobotnica ($H_{0}$), $α (S)$ does not differ significantly from the $α$ of a random feature set of the same size. The alternative hypothesis ($H_{A}$) states that $S$ generates a higher $α$ than most random sets $S^{'}$ of the same size. To estimate the null distribution of Hobotnica’s $α$, we apply a permutation test. By default, we use the Kendall distance as the distance measure and the Mann-Whitney-Wilcoxon test as the summarizing function.
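The procedure above can be illustrated with a minimal Python sketch. It is not the authors' implementation (their analysis code is linked under Extended data): the function names, the toy data, and in particular the normalisation of the Mann-Whitney statistic are assumptions made for illustration. Here the score is the probability that a between-class distance exceeds a within-class distance, so it equals 1 for perfect separation but sits near 0.5, not 0, when the classes do not differ; the published H-score may be scaled differently. The p-value is estimated by comparing the signature against random feature sets of the same size.

```python
import random
from itertools import combinations


def kendall_distance(x, y):
    """Fraction of feature pairs ordered differently in samples x and y."""
    pairs = list(combinations(range(len(x)), 2))  # needs at least 2 features
    discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return discordant / len(pairs)


def distance_matrix(data, features):
    """Inter-sample distances restricted to the chosen features.

    data maps a sample name to a dict {feature: value}.
    """
    names = sorted(data)
    vectors = {s: [data[s][f] for f in features] for s in names}
    return {(a, b): kendall_distance(vectors[a], vectors[b])
            for a, b in combinations(names, 2)}


def rank_score(dm, labels):
    """Normalised Mann-Whitney U (assumed scaling, see the text above)."""
    within = [d for (a, b), d in dm.items() if labels[a] == labels[b]]
    between = [d for (a, b), d in dm.items() if labels[a] != labels[b]]
    u = sum(1.0 if b > w else 0.5 if b == w else 0.0
            for w in within for b in between)
    return u / (len(within) * len(between))


def empirical_p_value(data, labels, signature, n_random=200, seed=0):
    """Score the signature and compare it with random same-size feature sets."""
    rng = random.Random(seed)
    all_features = sorted(next(iter(data.values())))
    observed = rank_score(distance_matrix(data, signature), labels)
    hits = 0
    for _ in range(n_random):
        random_set = rng.sample(all_features, len(signature))
        if rank_score(distance_matrix(data, random_set), labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return observed, (hits + 1) / (n_random + 1)


# Hypothetical toy data: two samples per class, five genes.
data = {
    "a1": {"g1": 1, "g2": 2, "g3": 3, "g4": 5, "g5": 4},
    "a2": {"g1": 2, "g2": 3, "g3": 4, "g4": 1, "g5": 9},
    "b1": {"g1": 3, "g2": 2, "g3": 1, "g4": 6, "g5": 2},
    "b2": {"g1": 9, "g2": 5, "g3": 4, "g4": 2, "g5": 7},
}
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
score, p_value = empirical_p_value(data, labels, ["g1", "g2", "g3"], n_random=50)
```

On this toy data the three-gene signature orders the genes in opposite directions in the two classes, so every between-class Kendall distance is 1, every within-class distance is 0, and the score is 1.0. With only five genes the p-value is not meaningful; real use would draw random sets from the full feature space.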

When, instead of a single MFS, a set of hypotheses $\{H_{1} : {MFS}_{1}, H_{2} : {MFS}_{2}, \dots, H_{n} : {MFS}_{n}\}$ is considered, a Distance Matrix ${DM}_{i}$ can be generated for each Molecular Features Set ${MFS}_{i}$, and from it, in turn, the corresponding value of the measure $α_{i}$:

$$
\begin{cases} H_{1} : {MFS}_{1} \\ H_{2} : {MFS}_{2} \\ \dots \\ H_{n} : {MFS}_{n} \end{cases} \to \begin{cases} f ({MFS}_{1} \mid D) \to {DM}_{1} \\ f ({MFS}_{2} \mid D) \to {DM}_{2} \\ \dots \\ f ({MFS}_{n} \mid D) \to {DM}_{n} \end{cases} \to \begin{cases} g ({DM}_{1} \mid Y) \to α_{1} \\ g ({DM}_{2} \mid Y) \to α_{2} \\ \dots \\ g ({DM}_{n} \mid Y) \to α_{n} \end{cases} \tag{2}
$$

Thus, for every ${MFS}_{i}$ from the set of hypotheses $\{H_{1} : {MFS}_{1}, H_{2} : {MFS}_{2}, \dots, H_{n} : {MFS}_{n}\}$, an H-score $α_{i}$ can be computed, resulting in a set $⟨α_{1}, α_{2}, \dots, α_{n}⟩$. Comparing the $α$ values allows the corresponding feature sets to be ranked by quality and the most informative signatures to be selected for the data $D$.

Validation

To validate our approach, we conducted three case studies.

In the first case study, we extracted a count-level RNA-seq expression dataset for prostate cancer from The Cancer Genome Atlas (TCGA).⁵ As MFSs, we used the C2 collection of molecular signatures from MSigDB,³,⁶ which contains a number of prostate-related gene sets. Every candidate MFS (gene set from the collection) thus produced its own H-score.

For the second case study, we used the PAM50 molecular signature,⁷ which was designed to classify breast cancer subtypes, as the MFS. We then applied it to several datasets containing these breast cancer subtypes.⁵,⁸–¹¹

In the third case study, we explored H-scores delivered by various DGE approaches. We performed DGE analysis for two groups of mice samples with different response to MYC factor treatment (Mycfl/fl vs MycΔIE,ERT2 genotypes)¹² with DESeq2¹³ and edgeR.¹⁴

The top 100 genes for each method were then retrieved. In addition, we extracted a list of the genes with the highest variance in expression, as well as a number of random gene sets.
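The two baseline gene lists described above could be produced along these lines. The helper names, the use of population variance, and the fixed random seed are assumptions for illustration; the article does not specify how variance was computed or how many random sets were drawn.

```python
import random
from statistics import pvariance


def top_variance_genes(expression, k):
    """Return the k genes whose expression varies most across samples.

    expression maps a gene name to its list of values, one per sample.
    """
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]),
                    reverse=True)
    return ranked[:k]


def random_gene_sets(genes, size, n_sets, seed=0):
    """Draw n_sets random gene sets of the given size, without replacement."""
    rng = random.Random(seed)
    pool = sorted(genes)  # fixed order so the seed gives reproducible sets
    return [rng.sample(pool, size) for _ in range(n_sets)]


# Hypothetical toy expression table: three genes, four samples.
expression = {
    "g1": [1, 1, 1, 1],
    "g2": [0, 10, 0, 10],
    "g3": [4, 5, 4, 5],
}
high_variance = top_variance_genes(expression, 2)
baselines = random_gene_sets(expression, size=2, n_sets=5)
```

Scoring such baselines alongside the method-derived signatures is what makes the comparison in the Results possible.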

In each case, the counts were normalised to counts per million (cpm). For every gene set, an H-score and its p-value with Benjamini-Hochberg (BH) correction¹⁵ were computed.
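The normalisation and correction steps named above are standard and can be sketched as follows. This is a generic illustration rather than the code used in the study: the function names are hypothetical, the cpm step divides by the plain library size with no additional normalisation factors, and the BH step is the textbook step-up adjustment.

```python
def counts_per_million(counts):
    """Scale one sample's raw counts so that they sum to one million."""
    total = sum(counts.values())
    return {gene: value * 1e6 / total for gene, value in counts.items()}


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical inputs.
cpm = counts_per_million({"g1": 10, "g2": 30, "g3": 60})
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because the adjustment depends on how many gene sets are tested together, p-values from separate runs with different numbers of candidate sets are not directly comparable.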

Results

Prostate-related C2 gene sets showed the highest H-score values with statistically significant p-values (Table 1, Figure 1A), as well as clear data stratification (Figure 1B), as expected for a prostate cancer versus control contrast. Gene sets not attributed to prostate cancer-related processes did not reach statistical significance (Table 1).

Table 1. Ten C2 chemical and genetic perturbations (CGP) gene signatures with the highest H-scores.

Signature	H-score	p-value
TOMLINS_PROSTATE_CANCER	0.795	0.025
WALLACE_PROSTATE_CANCER	0.747	0.025
OUYANG_PROSTATE_CANCER_PROGRESSION	0.745	0.025
LIU_PROSTATE_CANCER	0.735	0.025
PIEPOLI_LGI1_TARGETS	0.724	0.059
SMID_BREAST_CANCER_RELAPSE_IN_LIVER	0.712	0.164
TIMOFEEVA_GROWTH_STRESS_VIA_STAT1	0.708	0.240
GENTILE_UV_LOW_DOSE	0.705	0.308
JOHANSSON_BRAIN_CANCER_EARLY_VS_LATE	0.701	0.377
HOWLIN_CITED1_TARGETS_1	0.700	0.377

Figure 1. A: Distribution of H-scores for random genesets (blue) on TCGA prostate cancer vs normal dataset (see Table 1) and Tomlins prostate geneset H-score (red). B: MDS for TCGA prostate demonstrates samples separation with Tomlins geneset. C: Distribution of H-scores for random genesets (blue) on GSE48216 breast cancer dataset (see Table 2) and PAM50 geneset H-score (red). D: MDS for GSE48216 breast cancer dataset samples separation with PAM50 geneset.

H-scores for the PAM50 signature were higher than those for random gene sets on every dataset in the second case study (Figure 2, Figure 1C). This implies that the PAM50 signature stratifies samples of different breast cancer subtypes well. The PAM50 H-scores also had highly significant p-values (Table 2).

Figure 2. Distribution of random gene sets-delivered (blue) and PAM50 gene set-delivered (green) H-scores for breast cancer datasets (see Table 2).

Table 2. PAM50 results.

GEO Accession	Sample size	Groups in dataset	H-score	p-value
GSE58135	168	6	0.772	0.0007
GSE62944	1067	5	0.8892	0.0003
GSE48216	46	3	0.8567	0.0003
GSE80333	10	3	0.9765	0.0003

In the third case study, various DGE approaches resulted in gene sets that delivered significantly different H-scores (Figure 3). For this dataset, edgeR provided a signature with the best quality score, while DESeq2 still demonstrated a higher separation quality than that of random signatures. Genes with the highest variance showed lower scores compared to random gene sets. This result stresses the importance of the Hobotnica procedure to evaluate the quality of a particular DGE analysis.

Figure 3. H-scores for the top 100 gene signatures delivered by DESeq2 and edgeR, for genes with the highest variance, and for random gene sets, applied to GSE155460 data.

Discussion

Hobotnica was designed to quantitatively evaluate MFS quality through their ability for data stratification, based on their inter-sample distance matrices, and to assess the statistical significance of the results. We demonstrated that Hobotnica can efficiently estimate the quality of a molecular signature in the context of expression data.

The suggested method can be used to evaluate various sorts of MFSs: those retrieved from differential gene expression or differential methylation analyses, mutation/single nucleotide variation calling, or pathway analysis, as well as other data modalities that can be posed as differential problems.

A possible application of Hobotnica is evaluating a particular model’s performance (e.g., DGE model) for a particular dataset. This will allow researchers to choose a method that delivers a signature with the best data stratification from a number of proposed approaches.

The proposed method can also be used to compare H-scores for different lengths of the same set or signature, which will help to optimize MFS structure. Such procedures can be especially crucial in clinical applications.

Data availability

Underlying data

NCBI Gene Expression Omnibus: Alternatively processed and compiled RNA-Sequencing and clinical data for thousands of samples from The Cancer Genome Atlas, https://identifiers.org/ncbiprotein:GSE62944

NCBI Gene Expression Omnibus: Modeling precision treatment of breast cancer, https://identifiers.org/ncbiprotein:GSE48216

NCBI Gene Expression Omnibus: Spatial proximity to fibroblasts impacts molecular features and therapeutic sensitivity of breast cancer cells influencing clinical outcomes, https://identifiers.org/ncbiprotein:GSE80333

NCBI Gene Expression Omnibus: Next Generation Sequencing Analysis of Mycfl/fl and MycIE, ERT2 intestinal transcriptomes, https://identifiers.org/ncbiprotein:GSE155460

Extended data

Analysis code

Analysis code available from: https://github.com/lab-medvedeva/Hobotnica-main

Archived analysis code as at time of publication: https://doi.org/10.5281/zenodo.5656814

License: GNU General Public License v2.0

Competing interests

No competing interests were disclosed.

Grant information

This work was supported by Ministry of Science and Higher Education of the Russian Federation (agreement no. 075-15-2020-899) and by the NIH grants R01DE027809 and P30CA006973.

Acknowledgements

We thank Frank Emmert-Streib, Leslie Cope and Elana Fertig for fruitful discussions.

References

1. Parker JS, Mullins M, Cheang MCU, et al.: Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 2009; 27(8): 1160–1167.
2. Cardoso F, van’t Veer LJ, Bogaerts J, et al.: 70-gene signature as an aid to treatment decisions in early-stage breast cancer. N. Engl. J. Med. 2016; 375(8): 717–729.
3. Subramanian A, Tamayo P, Mootha VK, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 2005; 102(43): 15545–15550.
4. Liu C, Jing S, Yang F, et al.: Compound signature detection on LINCS L1000 big data. Mol. BioSyst. 2015; 11(3): 714–722.
5. Rahman M, Jackson LK, Evan Johnson W, et al.: Alternative preprocessing of RNA-sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics. 2015; 31(22): 3666–3672.
6. Liberzon A, Subramanian A, Pinchback R, et al.: Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011; 27(12): 1739–1740.
7. Parker JS, Mullins M, Cheang MCU, et al.: Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 2009; 27(8): 1160–1167.
8. Varley KE, Gertz J, Roberts BS, et al.: Recurrent read-through fusion transcripts in breast cancer. Breast Cancer Res. Treat. 2014; 146(2): 287–297.
9. Marusyk A, Tabassum DP, Janiszewska M, et al.: Spatial proximity to fibroblasts impacts molecular features and therapeutic sensitivity of breast cancer cells influencing clinical outcomes. Cancer Res. 2016; 76(22): 6495–6506.
10. Daemen A, Griffith OL, Heiser LM, et al.: Modeling precision treatment of breast cancer. Genome Biol. 2013; 14(10): R110–R114.
11. Costello JC, Heiser LM, Georgii E, et al.: A community effort to assess and improve drug sensitivity prediction algorithms. Nat. Biotechnol. 2014; 32(12): 1202–1212.
12. Luo Y, Yang S, Wu X, et al.: Intestinal MYC modulates obesity-related metabolic dysfunction. Nat. Metab. 2021; 3(7): 923–939.
13. Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12): 1–21.
14. Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1): 139–140.
15. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995; 57(1): 289–300.


Open Peer Review

Version 1

Reviewer Report 05 Jan 2022

Shailesh Tripathi, Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria; FH Austria, Steyr, Austria

Approved with Reservations

https://doi.org/10.5256/f1000research.78645.r102284

The authors present an approach called Hobotnica for quantitatively evaluating MFS quality (by assigning an H-score) for given sample labels. This approach could be useful for analyzing samples, e.g., for quality comparison, filtering out poor quality samples, and comparing different phenotypical conditions and experiments. It is important that the authors discuss a reasonable H-score interpretation in terms of various implications of data quality/outcome related to experimental conditions, sample size, data preprocessing, and the complexity of biological systems reflecting the non-trivial correlation structure.

I highlight some of the recommendations to be discussed in the paper:

• The author should add simulation studies providing a realistic understanding and interpretation of the H score.
• How is the current approach different from the clustering-based approach where the optimized number of clusters are compared to sample labels using rand-index (where a high rand score means the clustering solution and the sample labels are in agreement) or other measures?
• The analysis should consider experimental conditions (data derived from multiple experiments representing the same phenotype), data preprocessing methods, sample size, and gene expression data covariance structure.
• How does H-score vary with relation to the number of phenotype conditions and number of MFS? The authors should add analysis and interpretation of results:
- when MFS is differentially expressed genes.
- when MFS is randomly selected.
- when MFS is a predefined set (e.g., GO pathway).
• The author should add accurate descriptions of all the notations used.
• Add a definition of H-score.

Is the rationale for developing the new method (or application) clearly explained?
Partly

Is the description of the method technically sound?
Partly

Are sufficient details provided to allow replication of the method development and its use by others?
Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Data science, Machine learning, network analysis, computational biology, gene expression data analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


Author Response 16 Aug 2022

Alexey Stupnikov, National Medical Research Center for Endocrinology, Moscow, Russian Federation

The author should add simulation studies providing a realistic understanding and interpretation of the H score.

The Reviewer raises an important problem of parametric and nonparametric statistics. The nature of H-score distribution indeed was not discussed in detail. Yet, its distribution was not the focus of this study, for distribution of H-scores for randomly selected Molecular Feature Sets is only employed to compute empirical p-values. This procedure is nonparametric and, therefore, is not affected by the nature of H-score distribution. We agree the nonparametric nature of the statistic needs to be mentioned more explicitly in the manuscript, and add a paragraph to the Discussion section:

“The non-parametric statistic used in the approach not only allows for MFS of various types (differential, predefined, etc.) and data modalities (expression, methylation, etc.), but also for different structure of contrasted samples groups (sample size, preprocessing methods, etc.)”

How is the current approach different from the clustering-based approach where the optimized number of clusters are compared to sample labels using rand-index (where a high rand score means the clustering solution and the sample labels are in agreement) or other measures.

We agree the difference of our approach from clustering-based methods was not mentioned explicitly. We have added a paragraph to the Methods section discussing the difference of our approach.

“The proposed approach is different from existing metrics, such as the often used Rand index and other clustering-based measures, as it allows one to avoid a clustering procedure, which itself may be carried out with various approaches and parameters. In contrast to clustering-based methods, H-scores directly reflect the sample stratification quality.”

The analysis should consider experimental conditions (data derived from multiple experiments representing the same phenotype), data preprocessing methods, sample size, and gene expression data covariance structure.

The dataset details the reviewer mentions are crucial for understanding a particular dataset’s structure, and it is correct that they may affect some types of analysis. However, since the nature of the statistic is nonparametric, the details do not affect our approach. We agree this point needs to be stressed explicitly in the manuscript, and add a paragraph in the Discussion section:

“The non-parametric statistic used in the approach not only allows for MFS of various types (differential, predefined, etc.) and data modalities (expression, methylation, etc.), but also for different structure of contrasted samples groups (sample size, preprocessing methods, etc.)”

The exploration of the H-score distribution in relation to the experimental conditions mentioned above is certainly an interesting fundamental question. However, it does not affect the practical implementation we introduce. Therefore, we believe that such a study is beyond the scope of the current paper.

How does H-score vary with relation to the number of phenotype conditions and number of MFS. The authors should add analysis and interpretation of results:
- when MFS is differentially expressed genes.
- when MFS is randomly selected.
- When MFS is a predefined set (e.g., GO pathway).

The nature of MFS indeed may be quite different. For this reason we have performed and discussed the analysis of the following MFS types:
- differentially expressed genes
- randomly selected genes
- MSigDB gene sets (predefined gene sets)

In the revised version of the manuscript, differentially methylated MFS are also considered. In this way, we feel the analysis and the interpretation the Reviewer suggested can be found in the manuscript.

The author should add accurate descriptions of all the notations used.

To address Reviewer’s comment we carefully checked the manuscript to ensure all introduced notations are defined and explained, and added a subsection to the Methods section for more detailed description of H-score.

Add a definition of H-score.

We agree the definition of the H-score statistic and notation we introduce should be expanded. To address Reviewer’s comment, we have added a subsection to the Methods section where we define H-score with more details.
Competing Interests: No competing interests were disclosed.


Reviewer Report 20 Dec 2021

Roberto Malinverni, Cancer and Leukemia Epigenetics and Biology Program, Josep Carreras Leukemia Research Institute (IJC), Badalona, Spain

Approved with Reservations

https://doi.org/10.5256/f1000research.78645.r102280

In this short article the authors present an R package called Hobotnica, whose purpose is to evaluate how well different methodologies stratify results presented as Molecular Feature Sets (MFS). By MFS the authors mean all types of data resulting from different -omics techniques (such as expression, methylation, mutation / single nucleotide variation calling, or pathway analysis).

Major comments

The authors present three examples in which it is demonstrated how this approach is able to evaluate the effectiveness of MFS stratification, but the examples considered are all based on expression data. To verify the statements presented in the article, it would be useful to test the methodology on different data (for example methylation arrays). The approach chosen for this evaluation is based on the calculation and comparison of Distance Matrices (DM).
The example in Figure 3 evaluates two different standard approaches for the analysis of RNA-seq data, using Hobotnica and the H0 value as the discriminant. The figure shows that the stratification quality of DESeq2 is decidedly better than that of both random genes and top variant genes. Surprisingly, however, the H0 value calculated using the top 100 genes selected with edgeR is very similar to that calculated using random genes, which confused me. The authors should explain this similarity in more depth.

The data presented in this article are not entirely convincing. The quality evaluation power obtained by applying Hobotnica does not seem to match the stated premises. While this is not meant as a harsh criticism of the methodology, my advice is to review the examples and to improve the benchmarking by adding different types of data.

Is the rationale for developing the new method (or application) clearly explained?
Yes

Is the description of the method technically sound?
Partly

Are sufficient details provided to allow replication of the method development and its use by others?
Partly

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Epigenetics. R developer

Author Response 16 Aug 2022

Alexey Stupnikov, National Medical Research Center for Endocrinology, Moscow, Russian Federation

The authors present three examples in which it is demonstrated how this approach is able to evaluate the effectiveness of MFS stratification, but the examples considered are all based on expression data. To verify the statements presented in the article, it would be useful to test the methodology on different data (for example methylation arrays). The approach chosen for this evaluation is based on the calculation and comparison of Distance Matrices (DM).

We thank the Reviewer for this suggestion. Indeed, the method we propose is applicable to the evaluation of Molecular Feature Sets of different nature, yet in the manuscript we demonstrated it only on expression-based data. To improve the manuscript in the way the Reviewer suggested, we have performed additional analysis on methylation-based data for several datasets. We added this case study to the Validation and Results sections and comments to the Discussion section.

The example in Figure 3 evaluates two different standard approaches for the analysis of RNA-seq data, using Hobotnica and the H0 value as the discriminant. The figure shows that the stratification quality of DESeq2 is decidedly better than that of both random genes and top variant genes. Surprisingly, however, the H0 value calculated using the top 100 genes selected with edgeR is very similar to that calculated using random genes, which confused me. The authors should explain this similarity in more depth.

The point the Reviewer mentions indeed raises concern. After thorough examination, we identified a data-analysis problem that caused the wrong gene subset to be processed and resulted in an incorrect depiction in the presented barplot. We have corrected this issue. The rest of our findings did not change during our re-evaluation, and all the results hold. The corrected results are depicted in Fig. 5 of the current manuscript version.

The data presented in this article are not entirely convincing. The quality evaluation power obtained by applying Hobotnica does not seem to match the stated premises. While this is not meant as a harsh criticism of the methodology, my advice is to review the examples and to improve the benchmarking by adding different types of data.

To address the Reviewer’s comments on the evaluation made in the manuscript, we conducted additional analysis and validation for the methylation data modality on several datasets, and improved and fixed the validation for differential expression analysis. We now feel that the arguments and claims we make regarding the H-score are compelling and plausible.
Competing Interests: No competing interests were disclosed.

Version 2

VERSION 2 PUBLISHED 08 Dec 2021

Open Peer Review

Reviewer Status

Reviewer Reports

Invited Reviewers: 1, 2
Version 2 (revision), 16 Aug 22: reports from Reviewers 1 and 2
Version 1, 08 Dec 21: reports from Reviewers 1 and 2

Roberto Malinverni, Josep Carreras Leukemia Research Institute (IJC), Badalona, Spain
Shailesh Tripathi, University of Applied Sciences Upper Austria, Linz, Austria; FH Austria, Steyr, Austria

Reviewer Report

26 Sep 2022 | for Version 2

Roberto Malinverni, Cancer and Leukemia Epigenetics and Biology Program, Josep Carreras Leukemia Research Institute (IJC), Badalona, Spain

Approved

In my opinion, the authors have answered all my criticisms. In this version of the article, they correct some imprecisions in the graphs and add figures that help to understand the method. I think the article is now ready for indexing.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Epigenetics. R developer

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

20 Sep 2022 | for Version 2

Shailesh Tripathi, Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria; FH Austria, Steyr, Austria

Approved

I have gone through the responses provided by the authors and the updates to the paper. The authors have addressed the main questions and revised the manuscript. I have no further questions and approve the article for final acceptance.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Data science, Machine learning, network analysis, computational biology, gene expression data analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

05 Jan 2022 | for Version 1

Shailesh Tripathi, Production and Operations Management, University of Applied Sciences Upper Austria, Linz, Austria; FH Austria, Steyr, Austria

Approved With Reservations

The author should add simulation studies providing a realistic understanding and interpretation of the H score.
How is the current approach different from the clustering-based approach where the optimized number of clusters is compared to sample labels using the Rand index (where a high Rand score means the clustering solution and the sample labels are in agreement) or other measures?
The analysis should consider experimental conditions (data derived from multiple experiments representing the same phenotype), data preprocessing methods, sample size, and gene expression data covariance structure.
How does H-score vary with relation to the number of phenotype conditions and number of MFS? The authors should add analysis and interpretation of results:
- when MFS is differentially expressed genes.
- when MFS is randomly selected.
- when MFS is a predefined set (e.g., GO pathway).
The author should add accurate descriptions of all the notations used.
Add a definition of H-score.

Is the rationale for developing the new method (or application) clearly explained?
Partly

Is the description of the method technically sound?
Partly

Are sufficient details provided to allow replication of the method development and its use by others?
Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Data science, Machine learning, network analysis, computational biology, gene expression data analysis

Author Response

16 Aug 2022

Alexey Stupnikov, National Medical Research Center for Endocrinology, Moscow, Russian Federation

The author should add simulation studies providing a realistic understanding and interpretation of the H score.

The Reviewer raises an important problem of parametric and nonparametric statistics. The nature of the H-score distribution was indeed not discussed in detail. Yet its distribution was not the focus of this study, since the distribution of H-scores for randomly selected Molecular Feature Sets is only employed to compute empirical p-values. This procedure is nonparametric and, therefore, is not affected by the nature of the H-score distribution. We agree the nonparametric nature of the statistic needs to be mentioned more explicitly in the manuscript, and have added a paragraph to the Discussion section:

“The non-parametric statistic used in the approach not only allows for MFS of various types (differential, predefined, etc.) and data modalities (expression, methylation, etc.), but also for different structure of contrasted samples groups (sample size, preprocessing methods, etc.)”
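
The empirical p-value mentioned in the response above can be illustrated with a minimal Python sketch. It is not taken from the Hobotnica package: the function name `empirical_p_value`, the toy null scores, and the add-one correction are assumptions made for illustration, and the sketch assumes that larger scores indicate better stratification.

```python
from typing import Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator is a common convention (an assumption
    here) that keeps the estimate strictly above zero.
    """
    if len(null_scores) == 0:
        raise ValueError("null_scores must not be empty")
    # Count null scores that are at least as extreme as the observed one.
    n_as_extreme = sum(1 for score in null_scores if score >= observed)
    return (n_as_extreme + 1) / (len(null_scores) + 1)


# Hypothetical scores obtained for randomly selected feature sets.
toy_null_scores = [0.1, 0.5, 0.95, 0.3]
toy_p = empirical_p_value(0.9, toy_null_scores)
```

No assumption is made about the shape of the null distribution; only the rank of the observed score among the random scores matters, which is the sense in which the procedure is nonparametric.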

How is the current approach different from the clustering-based approach where the optimized number of clusters is compared to sample labels using the Rand index (where a high Rand score means the clustering solution and the sample labels are in agreement) or other measures?

We agree the difference of our approach from clustering-based methods was not mentioned explicitly. We have added a paragraph to the Methods section discussing the difference of our approach.

“The proposed approach is different from existing metrics, such as the often used Rand index and other clustering-based measures, as it allows one to avoid the clustering procedure, which itself may be carried out with various approaches and parameters. In contrast to clustering-based methods, the H-score directly reflects the sample stratification quality.”
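
For comparison, the clustering-based measure the Reviewer refers to can be sketched as follows. This is a plain pairwise Rand index in Python, written for illustration only; `rand_index` and the toy label vectors are hypothetical and not part of Hobotnica.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Share of sample pairs on which two labelings agree.

    A pair agrees when both labelings put the two samples in the same group,
    or both put them in different groups.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two samples are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: known phenotype labels versus two hypothetical clusterings.
phenotype = [0, 0, 1, 1]
perfect = rand_index(phenotype, [1, 1, 0, 0])   # same partition, renamed clusters
poor = rand_index(phenotype, [0, 1, 0, 1])      # partition cuts across phenotypes
```

The second argument has to come from some clustering procedure, so the resulting score depends on the algorithm and parameters chosen for that step. This is the dependence the quoted paragraph says the H-score avoids.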

The analysis should consider experimental conditions (data derived from multiple experiments representing the same phenotype), data preprocessing methods, sample size, and gene expression data covariance structure.

The dataset details the reviewer mentions are crucial for understanding a particular dataset’s structure, and it is correct that they may affect some types of analysis. However, since the nature of the statistic is nonparametric, the details do not affect our approach. We agree this point needs to be stressed explicitly in the manuscript, and add a paragraph in the Discussion section:

“The non-parametric statistic used in the approach not only allows for MFS of various types (differential, predefined, etc.) and data modalities (expression, methylation, etc.), but also for different structure of contrasted samples groups (sample size, preprocessing methods, etc.)”

The exploration of the H-score distribution in relation to the experimental conditions mentioned above is certainly an interesting fundamental question. However, it does not affect the practical implementation we introduce. Therefore, we believe that such a study is beyond the scope of the current paper.

How does H-score vary with relation to the number of phenotype conditions and number of MFS? The authors should add analysis and interpretation of results:
- when MFS is differentially expressed genes.
- when MFS is randomly selected.
- when MFS is a predefined set (e.g., GO pathway).

The nature of MFS may indeed be quite different. For this reason we have performed and discussed the analysis of the following MFS types:
- differentially expressed genes
- randomly selected genes
- MSigDB gene sets (predefined gene sets)

In the revised version of the manuscript, differentially methylated MFS are also considered. In this way, we feel the analysis and the interpretation the Reviewer suggested can be found in the manuscript.

The author should add accurate descriptions of all the notations used.

To address the Reviewer’s comment, we carefully checked the manuscript to ensure all introduced notations are defined and explained, and added a subsection to the Methods section with a more detailed description of the H-score.

Add a definition of H-score.

We agree the definition of the H-score statistic and the notation we introduce should be expanded. To address the Reviewer’s comment, we have added a subsection to the Methods section where we define the H-score in more detail.

Competing Interests

No competing interests were disclosed.

Reviewer Report

20 Dec 2021 | for Version 1

Roberto Malinverni, Cancer and Leukemia Epigenetics and Biology Program, Josep Carreras Leukemia Research Institute (IJC), Badalona, Spain

Approved With Reservations

The authors present three examples demonstrating how this approach can evaluate the effectiveness of MFS stratification, but all the examples considered are based on expression data. To verify the statements presented in the article, it would be useful to test the methodology on different data (for example, methylation arrays). The approach chosen for this evaluation is based on the calculation and comparison of Distance Matrices (DM).
The example in Figure 3 evaluates two different standard approaches for RNA-seq analysis, using Hobotnica and the H0 value as the discriminant. The figure shows that the stratification quality of DESeq2 is decidedly better than that of both random genes and top-variance genes. Surprisingly, however, the H0 value calculated using the top 100 genes selected with edgeR is very similar to that calculated using random genes, which I found confusing. The authors should explain this similarity in more depth.

Is the rationale for developing the new method (or application) clearly explained?

Yes
Is the description of the method technically sound?

Partly
Are sufficient details provided to allow replication of the method development and its use by others?

Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Epigenetics, R developer


Responses (1)

Author Response

16 Aug 2022

Alexey Stupnikov, National Medical Research Center for Endocrinology, Moscow, Russian Federation

The authors present three examples demonstrating how this approach can evaluate the effectiveness of MFS stratification, but all the examples considered are based on expression data. To verify the statements presented in the article, it would be useful to test the methodology on different data (for example, methylation arrays). The approach chosen for this evaluation is based on the calculation and comparison of Distance Matrices (DM).

We thank the Reviewer for this suggestion. Indeed, the method we propose is applicable to the evaluation of Molecular Feature Sets of different natures, yet in the manuscript we demonstrated it only on expression-based data. To improve the manuscript as the Reviewer suggested, we have performed an additional analysis on methylation-based data for several datasets. We added this case study to the Validation and Results sections and added comments to the Discussion section.

The example in Figure 3 evaluates two different standard approaches for RNA-seq analysis, using Hobotnica and the H0 value as the discriminant. The figure shows that the stratification quality of DESeq2 is decidedly better than that of both random genes and top-variance genes. Surprisingly, however, the H0 value calculated using the top 100 genes selected with edgeR is very similar to that calculated using random genes, which I found confusing. The authors should explain this similarity in more depth.

The point the Reviewer raises is indeed a cause for concern. After a thorough examination, we identified a data-analysis problem that caused the wrong gene subset to be processed, which resulted in an incorrect depiction in the presented barplot. We have corrected this issue. The rest of our findings did not change during our reevaluation, and all the results hold. The corrected results are depicted in Fig. 5 of the current manuscript version.

The data presented in this article are not fully convincing. The quality-evaluation power obtained by applying Hobotnica does not seem to correspond to the premises made. While this is not a criticism of the methodology itself, my advice is to review the examples and to improve the benchmarking by adding different types of data.

To address the Reviewer’s comments on the evaluation presented in the manuscript, we conducted additional analysis and validation for the methylation data modality on several datasets, and we improved and corrected the validation for differential expression analysis. We now feel that the arguments and claims we make regarding the H-score are compelling and plausible.


Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, and sometimes more significant revisions, are required to address specific details and improve the paper's academic merit

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Parker JS, Mullins M, Cheang MCU, et al.: Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 2009; 27(8): 1160–1167.

[2] Cardoso F, van’t Veer LJ, Bogaerts J, et al.: 70-gene signature as an aid to treatment decisions in early-stage breast cancer. N. Engl. J. Med. 2016; 375(8): 717–729.

[3] Subramanian A, Tamayo P, Mootha VK, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 2005; 102(43): 15545–15550.

[4] Liu C, Jing S, Yang F, et al.: Compound signature detection on LINCS L1000 big data. Mol. BioSyst. 2015; 11(3): 714–722.

[5] Rahman M, Jackson LK, Evan Johnson W, et al.: Alternative preprocessing of RNA-sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics. 2015; 31(22): 3666–3672.

[6] Liberzon A, Subramanian A, Pinchback R, et al.: Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011; 27(12): 1739–1740.

[7] Parker JS, Mullins M, Cheang MCU, et al.: Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 2009; 27(8): 1160–1167.

[8] Varley KE, Gertz J, Roberts BS, et al.: Recurrent read-through fusion transcripts in breast cancer. Breast Cancer Res. Treat. 2014; 146(2): 287–297.

[9] Marusyk A, Tabassum DP, Janiszewska M, et al.: Spatial proximity to fibroblasts impacts molecular features and therapeutic sensitivity of breast cancer cells influencing clinical outcomes. Cancer Res. 2016; 76(22): 6495–6506.

[10] Daemen A, Griffith OL, Heiser LM, et al.: Modeling precision treatment of breast cancer. Genome Biol. 2013; 14(10): R110–R114.

[11] Costello JC, Heiser LM, Georgii E, et al.: A community effort to assess and improve drug sensitivity prediction algorithms. Nat. Biotechnol. 2014; 32(12): 1202–1212.

[12] Luo Y, Yang S, Wu X, et al.: Intestinal MYC modulates obesity-related metabolic dysfunction. Nat. Metab. 2021; 3(7): 923–939.

[13] Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12): 1–21.

[14] Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1): 139–140.

[15] Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995; 57(1): 289–300.


Table 1. Ten C2-chemical and genetic perturbations (CGP) Gene Signatures with the highest H-scores.

Figure 2. Distribution of random gene sets-delivered (blue) and PAM50 gene set-delivered (green) H-scores for breast cancer datasets (see Table 2).

Table 2. PAM50 results.

Figure 3. H-scores for the top 100 Gene Signatures delivered from DESeq2, edgeR, genes with the highest variance, and random gene sets applied to GSE155460 data.
