Hobotnica: exploring molecular signature quality

A Molecular Features Set (MFS), is a result of a vast diversity of bioinformatics pipelines. The lack of a “gold standard” for most experimental data modalities makes it difficult to provide valid estimation for a particular MFS's quality. Yet, this goal can partially be achieved by analyzing inner-sample Distance Matrices (DM) and their power to distinguish between phenotypes. The quality of a DM can be assessed by summarizing its power to quantify the differences of inner-phenotype and outer-phenotype distances. This estimation of the DM quality can be construed as a measure of the MFS's quality. Here we propose Hobotnica, an approach to estimate MFSs quality by their ability to stratify data, and assign them significance scores, that allow for collating various signatures and comparing their quality for contrasting groups.


Introduction
A signature based on a predefined Molecular Features Set (MFS), which is designed to distinguish biological conditions or phenotypes from each other, is a crucial concept in bioinformatics and precision medicine. In this context, signatures typically originate from MFS from contrasting experimental data from two or more sample groups, which differ phenotypically. These MFS incorporate information on the differences between the groups. The nature of the MFS depends on the modality of the original data. For instance, the MFS provided by the Differential Gene Expression approach is a list of Differentially Expressed genes (DEG); Differential Methylation analysis provides Differentially Methylated Cytosines or regions (DMC and DMR) as MFS.
A significant number of mutational, expression and methylation-based signatures have recently been published and they are actively used in research and translational medicine. Examples of expression-based signatures involve gene sets for clinical prognosis (e.g., PAM50, 1 MammaPrint 2 ), for pathways and gene enrichment analysis (e.g., MsigDB collections 3 ), and for drug re-purposing (e.g., LINCS project 4 ).
Direct quality assessment for MFS is currently hardly possible, since there are no 'gold standard' datasets where active Molecular Features are explicitly known. In this manuscript, we propose a novel approach -Hobotnica, that allows for measurement of MFS quality by addressing the key property of the signature, namely, its quality for data stratification.
Hobotnica leverages the quality of distance matrices obtained from any source, in order to assess the quality of the MFS from any data modality compared to a random MFS. In this study, we demonstrate its application to transcriptomic and methylation signatures.

Approach
The Hobotnica approach is as follows: For a given data set W and a given MFS (S) we derive the inter-sample distance matrix (DM S,W ð Þ). Then we assess the quality of DM (and, thus, of S) with a summarizing function (α DM S ð Þ ð Þ¼ α DM S ð Þ, Y ð Þor by abuse of notation α DM S ð Þ ð Þ ) where (Y) represents the labels of samples. In shorter notation, We desire the function α to gauge if the inner-class samples are closer to each other than to outer-class samples. If no difference exists from one class to another, α must be close to zero and as the difference grows, α grows. In the ideal case of a perfect separation, α reaches its maximum at 1: Under the null hypothesis of Hobotnica ((H 0 )), no significant difference exists between α S ð Þ and the α of an equal-sized general random set. On the contrary, the alternative (H A ) hypothesizes that S generates higher α than most random S 0 of the same size. To estimate a null distribution for Hobotnica's α, we applied a permutation test. As our default options, we use Kendall distance as the distance measure and Mann-Whitney-Wilcoxon test as the summarizing function.

REVISED Amendments from Version 1
In the new version we have expanded the definition and introduction for our approach. We have added additional analysis for Methylation data type to illustrate and validate our approach. New figures were added to demonstrate this. Several minor editions to improve the manuscript were added.
Any further responses from the reviewers can be found at the end of the article When instead of a single MFS a set of hypotheses H 1 : MFS 1 ,H 2 : MFS 2 , …,H n : MFS n f g is in place, for each Molecular Feature Set MFS i corresponding Distance Matrix DM i can be generated, and than, in turn, particular value of the measure α i : Thus, for every MFS MFS i from set of hypotheses H 1 : MFS 1 ,H 2 : MFS 2 , …,H n : MFS n f g H-score α i may be computed, resulting in a set ⟨α 1 ,α 2 ,…α n ,⟩. Comparing α values allows for corresponding Feature Sets qualities ranking and selecting the most informative Signatures for the Data D.

Model
For two groups of samples, R and G, taking into account the nature of Distance Matrix (0-diagonal and symmetry), the corresponding lower triangle matrix may be considered ( Figure 1A). The distance values of this matrix d pq can be ranked ( Figure 1B), producing new matrix with natural values r pq ( Figure 1C).
Subseting the matrix to values corresponding to in-class distances ( Figure 1D), we can compute sum of these ranks Σ. Given the numbers of samples is n in group 1 and in m group 2, total number of ranks (and therefore max value of a rank) in un-subsetted triangle matrix will be Likewise, total number of ranks is subsetted to in-class values will be Σ reaches its min value A if the subset procedure selects minimal values of ranks, or, in other words, ranks in the selected part of rank matrix are values from 1 to M: The minimal value is reached when the lowest ranks reside in the diagonal squares of the distance matrix that correspond to in-group distances, and, therefore, best groups separation. Similarly, max value B is delivered when maximal values of Ranks Matrix are selected -from (N À M) to N: This procedure finally allows to introduce measure α with necessary listed properties. We refer to this measure as Hobotnica, or H-score.
Thus, for every Gene Signature GS i from set of hypotheses {H 1 : GS 1 , H 2 : GS 2 , ..., H n : GS n } H-score α i may be computed, resulting in a set hα 1 , α 2 , ... α n i. Comparing and ranking α values allows for corresponding Gene Signatures qualities ranking and comparison.
The proposed approach is different from existing metrics, such as often used Rand index 5 and other clusteringbased measures, as it allows one to avoid clustering procedure, that itself may be carried out with various approaches and parameters. [6][7][8] In contract to clustering-based methods, H-scores directly reflects the sample stratification quality.

Validation
To validate our approach, we conducted four case studies.
In the first case study we extracted RNA-seq expression dataset for prostate cancer from the Cancer Genome Atlas (TCGA) on counts level. 9 As MFSs, we recruited the C2 collection of molecular signatures from MSigDB, 3,10 that contains a number of prostate-related gene sets. This way, every candidate MFS (gene set from the collection) produced its specific H-score.
For the second case study, we recruited the PAM50 molecular signature, 11 which was designed for classifying various breast cancer subtypes, as MFS. Then, we applied it to several datasets containing these breast cancer subtypes. 9,[12][13][14][15] In the third case study, we explored H-scores delivered by various DGE approaches. We performed DGE analysis for two groups of mice samples with different response to MYC factor treatment (Mycfl/fl vs MycΔIE, ERT2 genotypes) 16 with DESeq2 17 and edgeR. 18 The top 100 genes for each method were then retrieved. In addition, we extracted a list of genes genes with the highest variance in expression, as well as a number of random gene sets.
In each case, the counts were normalised to counts per million (cpm). For every geneset an H-score and its p-value with BH 19 correction were computed.
In the last case the Hobotnica application for differential methylation signatures assessment was demonstrated. Hobotnica was applied to the signature from study 20 that distinguish B-cell subpopulations with mutated and unmutated IGHV from patients with chronic lymphocytic leukemia (M-CLL and U-CLL). The signature was derived from the data obtained on 450k Human Methylation Array, the length of the signature is 3265 sites.
The signature was validated on datasets from other experiments containing the same comparison groups (M-CLL and U-CLL): GSE136724 from, 21 GSE143411 from 22 and GSE144894 from. 23 Datasets GSE136724 and GSE143411 contain samples from patients after chemo(immune) therapy and untreated samples, only untreated groups were used for validation. H-score was calculated for each dataset using matrices of beta values with beta-mixture quantile normalization. The signature was reduced to 3089 sites for GSE136724, 3091 sites for GSE143411 and 3254 for GSE144894. The p-value was calculated based on 100000 random signatures of the same length using pseudocount.

Results
Prostate-related C2 gene sets clearly demonstrated highest H-score values and sufficient statistical significance (Table 1, Figure 2A), as well as data stratification ( Figure 2B), which is expected for prostate cancer as opposed to control contrast. Gene sets not attributed to prostate cancer-related processes did not achieve statistically significant p-values (Table 1).
H-scores for the PAM50 signature were evidently higher for all datasets in the second case study than those for random gene sets for the same datasets ( Figure 3, Figure 2C). This implies that the PAM50 signature exhibits a high stratification quality for various breast cancer subtypes samples. PAM50-delivered H-scores also demonstrated highly statistically significant p-values (Table 2).
In the third case study, various DGE approaches resulted in gene sets that delivered significantly different H-scores ( Figure 4). For this dataset, edgeR provided a signature with the best quality score, while DESeq2 still demonstrated a higher separation quality than that of random signatures. Genes with the highest variance showed lower scores compared to random gene sets. This result stresses the importance of the Hobotnica procedure to evaluate the quality of a particular DGE analysis.    Table 1) and Tomlins prostate geneset H-score (red). B: MDS for TCGA prostate demonstrates samples separation with Tomlins geneset. C: Distribution of H-scores for random genesets (blue) on GSE48216 breast cancer dataset (see Table 2) and PAM50 geneset H-score (red). D: MDS for GSE48216 breast cancer dataset samples separation with PAM50 geneset. Figure 3. Distribution of random gene sets-delivered (blue) and PAM50 gene set-delivered (green) H-scores for breast cancer datasets (see Table 2). obtained from 100000 random signatures of the same length that were used for p-value calculation are shown in Figure 5. Figure 5 displays MDS plots based on submatrices that include only differential methylation signature sites.

Discussion
Hobotnica was designed to quantitatively evaluate MFS quality through their ability for data stratification, based on their inter-sample distance matrices, and to assess the statistical significance of the results. We demonstrated that Hobotnica can efficiently estimate the quality of a molecular signature in the context of expression data.
The suggested method can be used to evaluate various sorts of MFSs: those retrieved from DGE or DM analyses, Mutation/single nucleotide variation calling or pathways analysis, as well as data modalities of other types, that are suitable as differential problems.
The non-parametric statistic used in the approach not only allows for MFS of various types (differential, predefined, etc.) and data modalities (expression, methylation, etc.), but also for different structure of contrasted samples groups (sample size, preprocessing methods, etc.).
A possible application of Hobotnica is evaluating a particular model's performance (e.g., DGE model) for a particular dataset. This will allow researchers to choose a method that delivers a signature with the best data stratification from a number of proposed approaches.
Assessing H-score values for various lengths of the same set or signature can be explored with the proposed method, which will help to optimize MFS structure. Such procedures can be especially crucial in clinical applications.

4.
The author should add accurate descriptions of all the notations used. 5.
Add a definition of H-score. 6.

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly nature of H-score distribution indeed was not discussed in detail. Yet, its distribution was not the focus of this study, for distribution of H-scores for randomly selected Molecular Feature Sets is only employed to compute empirical p-values. This procedure is nonparametric and, therefore, is not affected by the nature of H-score distribution. We agree the nonparametric nature of the statistic needs to be mentioned more explicitly in the manuscript, and add a paragraph to the Discussion section: "The non-parametric statistic used in the approach not only allows for MFS of various types (differential, predefined, etc.) and data modalities (expression, methylation, etc.), but also for different structure of contrasted samples groups (sample size, preprocessing methods, etc.)" How is the current approach different from the clustering-based approach where the optimized number of clusters are compared to sample labels using rand-index (where a high rand score means the clustering solution and the sample labels are in agreement) or other measures. ○ We agree the difference of our approach from clustering-based methods was not mentioned explicitly. We have added a paragraph to the Methods section discussing the difference of our approach.

"The proposed approach is different from existing metrics, such as often used Rand index and other clustering-based measures, as it allows one to avoid clustering procedure, that itself may be carried out with various approaches and parameters. In contract to clustering-based methods, H-scores directly reflects the sample stratification quality."
The analysis should consider experimental conditions (data derived from multiple experiments representing the same phenotype), data preprocessing methods, sample size, and gene expression data covariance structure.

○
The dataset details the reviewer mentions are crucial for understanding a particular dataset's structure, and it is correct that they may affect some types of analysis. However, since the nature of the statistic is nonparametric, the details do not affect our approach. We agree this point needs to be stressed explicitly in the manuscript, and add a paragraph in the Discussion section: "The non-parametric statistic used in the approach not only allows for MFS of various types (differential, predefined, etc.) and data modalities (expression, methylation, etc.), but also for different structure of contrasted samples groups (sample size, preprocessing methods, etc.)" The exploration of H-score distribution in relation to experimental conditions mentioned above certainly is an interesting fundamental question. However, it does not affect the practical implementation we introduce. Therefore, we believe that such study is beyond the scope of a current paper. In the revised version of manuscript differentially methylated MFS are also considered.
In this way, we feel the analysis and the interpretation the Reviewer suggested can be found in the manuscript.
The author should add accurate descriptions of all the notations used.

○
To address Reviewer's comment we carefully checked the manuscript to ensure all introduced notations are defined and explained, and added a subsection to the Methods section for more detailed description of H-score.
The example of figure 3 evaluates two different standard approaches for the analysis of RNAseq using Hobotnica and the H0 value as discriminant. It can be appreciated in this figure how the stratification quality of Deseq2 is decidedly more efficient than both random genes and top variant genes. Surprisingly, however, the H0 value calculated using the top 100 genes collected with edgeR is very similar to that calculated using random genes, this confused me. Authors should explain this similarity more in depth.

2.
The data presented in this article do not seem to convince satisfactorily. The quality evaluation power obtained by applying Hobotnica does not seem to correspond to the premises made. While not in fact a slate on the methodology, my advice is to review the examples and try to improve in benchmarking, adding different types of data.

Is the description of the method technically sound? Partly
Are sufficient details provided to allow replication of the method development and its use by others? Partly We thank the Reviewer for this suggestion. Indeed, the method we propose is applicable for Molecular Feature Sets evaluation of different nature, yet in the manuscript we demonstrated its work only for Expression based data.To improve the manuscript in a way the Reviewer suggested , we have performed additional analysis on Methylation based data for several datasets. We added this case study to the Validation and Results sections and comments to the Discussion section.
The example of figure 3 evaluates two different standard approaches for the analysis of RNAseq using Hobotnica and the H0 value as discriminant. It can be appreciated in this figure how the stratification quality of Deseq2 is decidedly more efficient than both random genes and top variant genes. Surprisingly, however, the H0 value calculated using the top 100 genes collected with edgeR is very similar to that calculated using random genes, this confused me. Authors should explain this similarity more in depth.

○
The point the Reviewer mentions indeed raises concern. After thorough examination we identified a data analysis related problem that caused the wrong genes subset process and resulted in the incorrect depiction in the presented barplot. We have corrected this issue. The rest of our findings were not changed during our reevaluation and all the results hold. The results are depicted and Fig.5 in the current manuscript version.
The data presented in this article do not seem to convince satisfactorily. The quality evaluation power obtained by applying Hobotnica does not seem to correspond to the premises made. While not in fact a slate on the methodology, my advice is to review the examples and try to improve in benchmarking, adding different types of data.

○
To address the Reviewer's comments on the evaluation made in the manuscript we conducted additional analysis and validation carried out for methylation data modality on several datasets, and improved and fixed validation for differential expression analysis. Now we feel that the argument and claims we make regarding the H-score are compelling and plausible.
Competing Interests: No competing interests were disclosed.