Bayesian Multi-View Clustering given complex inter-view structure

Multi-view datasets are becoming increasingly prevalent. These datasets consist of different modalities that provide complementary characterizations of the same underlying system. They can include heterogeneous types of information with complex relationships, varying degrees of missingness, and assorted sample sizes, as is often the case in multi-omic biological studies. Clustering multi-view data allows us to leverage different modalities to infer underlying systematic structure, but most existing approaches are limited to contexts in which entities are the same across views or have clear one-to-one relationships across data types with a common sample size. Many methods also make strong assumptions about the similarities of clusterings across views. We propose a Bayesian multi-view clustering approach (BMVC) which can handle the realities of multi-view datasets that often have complex relationships and diverse structure. BMVC incorporates known and complex many-to-many relationships between entities via a probabilistic graphical model that enables the joint inference of clusterings specific to each view, but where each view informs the others. Additionally, BMVC estimates the strength of the relationships between each pair of views, thus moderating the degree to which it imposes dependence constraints. We benchmarked BMVC on simulated data to show that it accurately estimates varying degrees of inter-view dependence when inter-view relationships are not limited to one-to-one correspondence. Next, we demonstrated its ability to capture visually interpretable inter-view structure in a public health survey of individuals and households in Puerto Rico following Hurricane Maria. Finally, we showed that BMVC clusters integrate the complex relationships between multi-omic profiles of breast cancer patient data, improving the biological homogeneity of clusters and elucidating hypotheses for functional biological mechanisms. We found that BMVC leverages complex inter-view structure to produce higher quality clusters than those generated by standard approaches. We also showed that BMVC is a valuable tool for real-world discovery and hypothesis generation.


Introduction
In biomedicine, public health, and a number of other fields, an unprecedented availability of increasingly diverse large-scale data related through complex mechanisms, both known and unknown (e.g.multi-omic patient information in combination with clinical and demographic data), has sparked the need for unsupervised machine learning methods that leverage both quantity and diversity of information.For many years, the field of machine learning has primarily emphasized questions of data abundance (e.g.high dimensionality).However, approaches tailored to integrative analysis and multi-view learning have recently come into focus.By leveraging the structure shared between multiple views of the same system, we expect to improve the quality of data clustering in any given view.We are thus able to construct a more informed representation of the latent structure and any properties unique to a particular modality.Discovery of this kind can improve our understanding of both the high-and low-level functional mechanisms that describe the system.As the heterogeneity and complexity of data sources steadily increases, it is critical to scientific advancement that we conceive of more flexible and interpretable machine learning approaches to understanding diverse, multi-modal collections of related data.
Several useful frameworks for multi-view clustering exist to-date, but these share the assumption that the cross-domain relationships between entities are one-to-one, i.e. that each view measures the same set of units.One of the most utilized models in this class, iCluster 1 and its extensions [2][3][4] use a joint latent variable model to integrate all data types and produce a single integrated clustering.These consensus-focused approaches employ a variety of strategies to learn an integrated representation rather than individual clusterings of the views.In contrast, the Multiple Dataset Integration (MDI) approach 5 aims to learn individual clusterings at a view-per-view granularity while incorporating known dependencies.Bayesian Consensus Clustering 6 and Clusternomics 7 also produce individual clusterings and jointly infer explicit global structure.These approaches provide more insight into the underlying mechanisms and unique properties governing the structure in each view than might be available when using consensus-focused approaches but are limited by their one-to-one assumption.
Critically, the one-to-one assumption quickly breaks down in many applications where sets of entities differ across views.In these cases there may be a many-to-one or many-to-many correspondence between views, and certain entities may not have corresponding related entities measured across all domains.For example, we may want to cluster epigenetic data across the genome (view 1) alongside related gene expression (view 2) and protein level (view 3) measurements.Methylation profiles in particular typically consist of many more entities (probes) than gene expression profiles (genes) with multiple probes mapping to, or near, a single gene; in some cases, a probe may not correspond to any of the measured genes.A gene may likewise correspond to more than one protein product due to alternative splicing effects 8 or to no protein if the gene is not protein-coding.Across domains, one common solution to this problem is to condense each dataset in order to force a strict one-to-one correspondence between views by combining or discarding measurements via some heuristic.The first shortcoming of this approach is that it eliminates information across modalities thereby reducing power.Furthermore, it can lead to downstream models which fail to account for important inter-view relationships.In addition, the extent to which relationships between views should bias estimates of intra-view structure is not always obvious.As such, another critical task is to learn the degree to which clusterings should inform one another.These challenges motivate methods of multi-view clustering that can intelligently account for more complex relationships between views.
Most recent efforts to address these challenges rely on extensions of non-negative matrix factorization 9 or its multi-view variant multiNMF. 10Co-regularized Graph Clustering (CGC) 11 poses the problem as a multi-view graph clustering and, as we do, assumes that the inter-view relationships are known a priori.Given a primary view of interest (the view "in focus"), CGC estimates the relative contributions of all other views to the overall clustering.Complex Mapping Multi-View Clustering (CMMVC) 12 takes a similar approach and additionally proposes a method for estimating unknown inter-view relationships as a pre-processing step.These approaches lack some of the flexibility and interpretability of probabilistic approaches like the model we propose here, and also require some additional assumptions such as choosing a primary view.Similar efforts have been to made to perform multi-view clustering in contexts where data is incomplete.Incomplete multi-view clustering algorithms (IMC) [13][14][15][16] often rely on matrix factorization, graph learning, kernel and/or deep approaches to construct a representation of the data from which a consensus clustering can be learned from the incomplete data.These solutions relate to part of the problem we pose here where relationships may be missing but not

REVISED Amendments from Version 1
In response to the reviewers' comments we have added two paragraphs to the 'Simulation Results' section of the manuscript detailing additional experiments and results on the convergence properties of the BMVC algorithm.We have additionally clarified our initialization approach in each of the applications, where applicable.
Any further responses from the reviewers can be found at the end of the article where relationships may extend beyond a one-to-one mapping.There may also be cases where an entity in one view may not be related to an entity in another view, but not due to missing data; in these cases, the assumption of incompleteness made by IMC is not always appropriate.Finally, it is again important to note that little to none of the existing work in this area has taken a fully Bayesian approach.
Here we propose Bayesian Multi-View Clustering (BMVC), an approach that uses a novel graphical model to jointly learn the strength of every pairwise relationship between views, along with a separate clustering within each view that takes into account user-supplied inter-view relationships.A key contribution of BMVC is that it is a Bayesian approach based on mixture models, similar to that used in MDI. 5 This flexible probabilistic framework allows users to model a variety of data types; quantify uncertainty; and incorporate a variety of assumptions by specifying priors on the mixture component distributions, the sparsity of component weights, and the weight of the inter-view dependency.The cluster base distribution used in BMVC may be chosen to fit key assumptions regarding the latent distribution of the data (e.g.allowing Gaussian process component priors to model clusters of time series data).Most importantly, to the best of our knowledge, BMVC is the first Bayesian clustering approach to rigorously leverage complex inter-view relationships between entitiessuch as one-to-many, many-to-many or one-to-zero relationshipsto improve clustering quality.

Model
Let v ∈ 1…V index the corresponding view in a V-view clustering application, and let k ∈ 1…K index a cluster.For ease of notation, assume the same total number of clusters K across all views, although the full model permits specification of K v on a per-view basis.Note that even with fixed K, the true numbers of clusters learned in each view are not required to be the same or exactly K; they may be arbitrarily smaller if elements are assigned to only a subset of clusters and others are left empty in the most likely clustering found via inference.The BMVC model likelihood can be viewed as a modification of the joint likelihood of V independent mixture models with an additional term encouraging related entities to cluster similarly across views.The probabilistic graphical model is shown in plate notation in Figure 1.
Let x be the data, G v be the probability density function of the component distribution parameterized by θ, z be the component assignments and π be the K-dimensional vector of mixture weights.Further, we specify priors on the θ using a base distribution G 0 parameterized by β, and Dirichlet priors on the mixture weights (π v ) parameterized by α.Let v and i respectively index the view and entities of the data such that x vi refers to entity i in view v and z vi refers to the corresponding cluster assignment.Assume the following prior distributions: For simplicity, we denote the number of entities for view v as N v .Building up from a standard mixture model with no ties between views, the conditional joint probability of all the component assignments Z of V independent Gaussian mixture models is merely a product of their priors dictated by π: In addition to the standard mixture model priors described above, we specify three additional variables R, C and ϕ: • Let R vu be a known, user-specified indicator matrix that encodes the graph of related entities between two views indexed v and u.Note that R may also have continuous positive values that describe known relationship weights.Specifically, assume that R vu [i, j] is non-zero if a relationship exists between entity x vi and x uj .For example, if the views respectively measure gene expression and methylation, then R should be non-zero where gene i is regulated by methylation probe j.
• Let C vu be a latent indicator matrix that encodes the graph of related matrices between two views indexed v and u, where every row and column has exactly one indicator.This reflects the assumption that every cluster in a view has a single corresponding cluster in another.Specifically, assume that C vu [k, l] is non-zero if cluster k in view v relates to cluster i in view u.
• Let ϕ vu ∈ℝ þ be a latent variable representing the overall strength of the relationship between views v and u.

BMVC cluster assignments
We place a conditional joint prior on the traditional mixture model cluster assignments as follows: Comparing the assignment strategy that results from this prior with that of the traditional mixture model, we see that BMVC increases the probability of a clustering assignment for a given entity if any related entity across views is assigned to a cluster with the same label.Note that the expression requires a normalization constant to define a distribution (as in a Markov random field).The normalization constant Z is computed by summing this quantity over all possible cluster assignments of all the entities across the views.Assume each view v ∈ 1…V is associated with a number of clusters K v , and a number of entities N v : ::: X KV zV1¼1 :::

Model estimation
We perform approximate inference to jointly learn the optimal values for the cluster weights π, cluster distribution hyperparameters θ, cluster assignments z, and the latent ϕ.To learn the model we conduct a modified hard expectation maximization procedure.In the E-step we compute the expected cluster assignments given the other model parameters.
Note that R is comprised of connected components of related entities across views.All assignments are dependent within a connected component of R; thus, an exact estimate of the expected assignment of any entity with relationships requires considering all possible combinations of assignments to all entities in the corresponding component of the relationship graph.Given complex relationships across multiple views with many related entities, this computation can quickly become intractable.To address this complexity, we perform the E-step in an iterative, alternating fashion by greedily estimating the cluster assignments within each view conditioned on the assignments made in the others, effectively performing coordinate ascent.
The M-step is comprised of three sub-steps.We compute the analytical maximum likelihood estimates of the cluster distribution parameters θ, which are conditionally independent of the other parameters given the assignments chosen in the E-step.Next, we attempt to maximize the likelihood with respect to the cluster relationship matrices C by iteratively evaluating re-assignment of relationships between possible pairs of clusters across views.The inclusion of C in our model thus allows us to switch cluster dependencies during our inference procedure as we converge to clusterings where new cluster relationships form.Finally, we are able to use black box mean field variational inference (BBVI) 17 or a direct optimization via ADAM 18 to update the cluster weights and parameters ϕ.To reduce the variance of the BBVI gradient estimate and increase the efficiency of the method, we choose Gaussian variational distributions and utilize the reparameterization trick. 19 the M step, it is necessary to estimate the normalization constant of the conditional prior on the cluster assignments.
Unfortunately, as the number of clusters, views, and/or the size of the related groups of entities go up; this computation eventually becomes intractable.In these cases, we utilize the pseudolikelihood 20 to approximate the partition function tractably.The pseudolikelihood performs comparably to the full likelihood in all of our simulations.
The coordinate ascent algorithm for performing the E-step in conjunction with the expectation-maximization algorithm and the BBVI algorithm each guarantee local optima.As such, we employed random restarts to explore local optima and ultimately improve the quality of the results.

Simulations
To test the efficacy of BMVC, we simulated a fully dependent two-view dataset where samples are drawn in each view from the same two independent clusters (A and B) defined by two-dimensional Gaussians.In this framework, related samples are generated by the same Gaussian distribution in each view.However, half the samples from A in the second view are shifted independently along the axis between the two cluster means, thus simulating data which are generated by cluster A but confounded by other effects in the second view.Such an effect might occur in a related modality with predictable underlying mechanistic differences.Let x A and x B be samples generated from A and B and let s be the shift percentage.We sample respectively: To shift a sample x Ai closer to μ B we apply the following transformation: Thus, sample membership inference is confounded in the second view due to the proximity of some of the A samples to B (see Figure 2, left).We varied the shift degree s and tested our ability to recover the correct clustering in both views given three structural cases: a semi-missing one-to-one simulation in which half the samples of view 2 have one-to-one relationships to samples in view 1; a complete one-to-one case without missingness; and finally a two-to-one simulation in which view 2 has twice as many samples, which map in pairs to individual view 1 samples.We compared Gaussian BMVC to a baseline model with ϕ set to zero; this produced two independent mixture models optimized with respect to hard cluster assignments (H-GMM).
In order to test our ability to estimate ϕ given varying levels of dependence between views, we adapted a previous approach to introduce randomness into the dependence of the cluster assignments. 7This was accomplished by re-assigning samples which would normally belong to cluster A some of the time per a binomial distribution parameterized by p chosen between 0 and 0.5.Within this framework, a baseline cluster A sample is generated by cluster A with probability p and is otherwise generated by cluster B with probability 1 À p; likewise, a cluster B sample is generated by cluster B with probability 1 À p. Thus, a p closer to 0.5 will reduce the dependence between views.We constrained the degree of relatedness of the datasets by introducing this modification into the data generated in the second view and observed our estimation of ϕ as we reduced the relatedness of the datasets.

Naive consensus clustering
In order to explore the performance of BMVC in comparison to a baseline, multi-view approach, we generated a more complex four-cluster simulation with similar elements of confounding.In addition to original simulation clusters A and B, we generated data from two additional clusters C and D. In view 2, we confounded these randomly such that half of the cluster C samples in view 2 were related to cluster C samples in view 1, and half were related to cluster D samples in view 1.Similarly, half of the cluster D samples in view 2 were related to cluster C samples in view 1 rather than cluster D samples. Figure 2 (right) demonstrates the ground truth clusters of view 1 and their confounded relatives in view 2. Additionally, we introduced the same shift-based confounding as used previously between clusters A and B in view 2. We compared BMVC clusterings of these data to the naive consensus clustering produced by concatenating these two datasets along one dimension and applying a single Gaussian mixture model.

Simulation results
To measure the quality of our simulations, we computed the adjusted Rand index (ARI) of the clustering generated by the model in comparison to the ground truth of the simulation and computed the average of this statistic over both views.
The adjusted Rand index measures the similarity between two clusterings while accounting for chance groupings of the elements.We ran each simulation ten times, where each run was initialized with a K-Means clustering.K-Means clusterings were initialized randomly using the k-means++ algorithm.Figure 3 shows the results of all the trials.As expected, BMVC and the baseline model (H-GMM) performed comparably at either end of the spectrum of shift degrees.When the shift was negligible, the true clustering was easily recoverable in view 2 without leveraging view 1.When the shift was too great, the shifted samples became so confounded with cluster B that we were no longer able to reasonably separate them.When shifts were substantial but did not significantly overlap cluster B, BMVC was able to leverage the known relationships between the two views to determine the correct cluster memberships, whereas H-GMM was not able to make such a distinction.Notably, the margin in improvement increased considerably as we added relationships in the complete one-to-one simulation (Figure 3, middle) and then doubled the structure again in the two-to-one case (Figure 3, bottom).
Figure 3 also shows the strength parameter learned by BMVC over combinations of the binomial parameter p and shift distance.There is a clear effect of the artificial randomness injected by p.As expected, BMVC estimated a higher dependence strength for simulations in which p was lower, and, thus, the inter-view dependence was more complete.This effect was stronger when there were more inter-view relationships.
In our 4-cluster simulations with two types of confounding, we found that both baselines (individual clusterings [H-GMM] and the naive consensus clustering [M-GMM]) produce insufficient interpretations of the data.The individual clusterings were unable to effectively deal with shift-based confounding while the naive consensus clustering approach was unable to capture the different structures present in the two views.In contrast, BMVC leveraged the known relationships to separate the shift-based confounding while also allowing for separate interpretations of the remaining data in the two views.Figure 4 demonstrates the improvement of BMVC over the two other approaches in terms of ARI.
In order to better understand the convergence properties of the BMVC algorithm, we also examined the distribution of log likelihoods across runs of the 4-cluster simulations with both entirely random and k-means initializations across seven simulations with different shifts (as above).For the purpose of these experiments, we ran ten trials at each shift.To ensure that convergence differences did not occur due to randomness in the data, we fixed the random seed for the simulation data generation process at each shift across all ten trials.We initialized the k-means clusterings randomly using the k-means++ algorithm.
In the majority of the simulations, the randomly initialized runs converged more often to the clustering with the highest log likelihood than to any other clustering, although convergence to some local optima occurred frequently in each simulation.In the two cases where the shift caused the most significant confounding (shifts 0.5 and 0.6), the algorithm converged more often to a nearby suboptimal maximum, one that was close to the presumed global optimum.Over all the experiments, the randomly initialized runs converged to the highest shift-specific log likelihood 43% of the time.Notably, a similar pattern of convergence to local optima emerged when clustering with the hard-labeled GMM (H-GMM) (with convergence to the highest shift-specific log likelihood occurring 46% of the time over all simulations).This suggests that convergence to local optima is not a sole consequence of the complexity of BMVC but is due in large part to the complexity of the data and the associated challenge of discovering a global optimum when initializing randomly.In contrast, our runs with k-means initializations show that the k-means initialized BMVC runs always converge (100% of the time) to the highest log likelihoods achieved by random initializations in each shift.As a result, we recommend this approach to initialization as it both reduces the time to convergence and also stabilizes the variability in convergence outcomes.We initialized all further applications of BMVC detailed in this work using the k-means approach.

Application to public health outcomes in Puerto Rico following Hurricane Maria
To examine BMVC's ability to leverage inter-view relationships and its utility for real-world problems, we first turned to an easily interpretable application in public health.We applied BMVC to survey data collected from 3,299 households and 9,522 individuals, 21 which details access to various resources along with individual demographics and statuses of residents of Puerto Rico following Hurricane Maria.For the purpose of clustering, features in both views (individuallevel and household-level) were filtered to feature sets common to all samples.The data includes general information such as household size and barrio, as well as information regarding the impact of the hurricane on households such as loss of access to medical facilities/equipment, loss of electricity, or loss of water.Individual data includes age, gender, whether the individual moved into their home that year, and their current status.Table 1 provides a heuristic guide to categorical feature values visualized in Figure 5. BMVC is ideally suited to fully leverage the multi-view structure present in the data because household size, and thus the degree of many-to-one relationships, varies from one individual to as many as 13.

Individual and household cluster analyses
To increase the interpretability of the experimental results, we ran a sample BMVC clustering on the data using a prior on the ϕ parameter which favors highly related clusters across the views (i.e.prefers large ϕ; all exact parameters of the run can be found in the provided Github repository).We found that the high degree of many-to-one relationships in the data made optimization of the full model likelihood intractable; as such, we utilized the pseudo-likelihood during inference.
To assist in our analysis of the resulting clustering, we computed the proportionfor every pair (A and B) of inter-view clustersof the relatives of cluster A entities that are in cluster B and of the relatives of cluster B entities that are in cluster A. The average of these statistics (hereafter referred to as'structural affinity') represents the degree to which the learned clusters adhere to the structural assumptions coded by R.
Figure 5 shows the cluster means (top) produced by BMVC as well as the structural affinity scores (bottom) of the BMVC clustering of the data and the hard-GMM clustering of the data for comparison.Structural affinity tends to be maximal along the diagonals much more often for the BMVC clustering than for the H-GMM, which demonstrates that the model successfully biases related entities to cluster more often under the paired labels encoded by C in contrast to two standard independent clusterings.Notably, clusters in the independent H-GMM household clustering do not tend to have uniquely high affinity scores with any particular individual clusters apart from the largest clusters (individual clusters 3 and 13), which are inherently more likely to contain individuals related to a given household.As such, the independent clustering of the data did not demonstrate many observable trends across views.In contrast, inspection of the cluster pairs with the highest structural affinity along the diagonal in the BMVC affinity map demonstrates that the model accurately captures and integrates some fundamental relationships between household conditions and individual demographic trends along with some additional, more complex causal relationships between household conditions and individual status (Table 2).
Cluster labels for which we estimated the lowest structural affinity across views in this clustering demonstrate the unique properties of the individual views in cases where the latent structure is not related across the views; thus, these represent a form of negative control for inter-view structure.For example, individual cluster 19 does not demonstrate structural affinity across the views and corresponds to missing individuals.Household cluster 18 does not demonstrate structural affinity across the views and corresponds to households where members reported significantly fewer neighbor deaths than the average.Household cluster 7 describes households which lost access to dialysis for an average of 2.25 days and  does not demonstrate structural affinity across the views in this clustering, which suggests that such cases may not have significantly contributed to mortality in this study.Previous findings suggest that end-stage renal failure typically results in death after an average of seven days in hospice patients without dialysis. 22This could explain why the clustering does not reflect a significant relationship between this medical issue and a mortality event or relocation.
Intuitively, we noted a correlation between rarer negative outcomes and lower structural affinity, which might reflect smaller sample sizes and the less deterministic relationships between worsening household conditions and individual outcomes.For example, household cluster 2 describes households that were more likely to require assistance at a temporary aid center and demonstrates only moderate structural affinity to a cluster containing slightly older individuals and some deaths.Household cluster 9 describes households where access to medical care was limited by affordability and has even less structural affinity to an individual trend given by this clustering.
Finally, we turn to an examination of the clusters with the highest structural affinity across views.Because half the features in the individual view are purely demographic, households with relatively few adverse events cluster alongside various demographic groups of individuals with uneventful status (i.e. that did not move or die) such as in clusters 1, 3, 5, 8, 12; further, households with widespread issues with electricity, water, cell availability, and that reported known deaths of neighbors were more likely to cluster alongside younger individuals who left despite having recently moved in (cluster 6).However, we are also able to make more specific observations about the relationships between very particular household factors and the individuals in those households; for example, households that reported a loss of access to respiratory machinery due to power failure (CPAP, BiPAP or nebulizers) clustered strongly alongside men averaging middle aged (cluster 17), coinciding with previous findings that sleep apnea and COPD are more likely to be diagnosed and to occur in older men. 23,24This is an example of the way in which BMVC can be used for mechanistic hypothesis generation.

Application to methylation accompanying prognostic gene expression in breast cancer patients
We now turn to a two-view genomic data application in which we clustered gene expression and methylation.This gave us the opportunity to more rigorously test BMVC's ability to both improve clustering interpretability and increase clustering quality as measured by biological homogeneity.
Using patient data from The Cancer Genome Atlas (TCGA), 25 we examined a set of 1,819 genes identified as intrinsically related to breast cancer subtype in four different studies of breast cancer patients and further studied by TCGA.We selected 1,363 methylation probes which fell within the gene bodies of the intrinsic genes according to available Infinium HumanMethylation27 BeadChip annotations 26 and which also fell into the top 50% of probes with the highest variance over the 466 patients for which data was available across all'omics platforms.Before filtering to the selected genes and probes, we mean-centered and scaled each subject in each view over all loci.A total of 366 genes in our target set mapped to more than one probe in the high variance set, and 1,019 of them were unmapped.Of the probes, 23 were related to more than one gene.As such, this biological application includes both many-to-one relationships and missing relationships that are crucial to fully understanding the available structure.Before clustering, we performed principal component analysis on the full dataset and removed the first principal component in each view of the data as a conservative measure to account for potential technical artefacts.We then projected the data in each view onto the top 10 principal componentsaccounting for the highest 41% and 30% of the remaining variances in the expression and methylation views respectively to avoid any cluster quality issues related to dimensionality.Consequently, we clustered the genes and methylation probes each in 10-dimensional space.
To test the capability of BMVC to produce biologically homogeneous clusters in the presence of confounding signal, we designed a preliminary experiment wherein we produced two dependent pseudo-simulated views from the gene expression data by independently adding standard Gaussian noise in each view.We then duplicated 25% of the entities at random in one of the views in order to simulate many-to-one relationships.Finally, we removed 25% of the edges from the resulting true relationship graph to simulate cases where some entities have no known relationships across views.We ran BMVC twenty times with a Gaussian base distribution on this data alongside H-GMM and compared the Adjusted Rand Index (ARI) of the H-GMM and BMVC clusterings relative to a single-view H-GMM clustering of the original expression data without noise (henceforth referred to as the 'ground truth' GMM or GT-GMM) in order to determine whether BMVC is able to leverage the artificial inter-view structure to recover signal closer to the latent ground truth despite the inclusion of the random signal.Each clustering run was initialized with a K-Means clustering, which was randomly initialized using the k-means++ algorithm.We evaluated the biological homogeneity of the clusterings in terms of the similarity of the genes in each cluster using a similarity metric derived from the pairwise Gene Ontology Term Overlap (GOTO) 27 statistic, which is similar to that used by. 5 The GOTO score of a pair of genes is computed by retrieving Gene Ontology (GO) terms and their ancestors for each gene and estimating the pairwise intersection; thus, higher scores indicate a stronger relationship between gene products.We average these scores over pairs within a cluster.Then, we take a weighted average of the cluster scores over the clusters (where cluster sizes are the weights).We computed GOTO scores in each of three GO categories (Biological Process, Molecular Function, Cellular Component) as well as with respect to all GO terms regardless of category.
As in our preliminary experiment, we ran BMVC on the true TCGA data using a Gaussian base distribution.We evaluated the quality of the clusterings in contrast to H-GMM cluster labels for twenty runs of each algorithm.As before, each run was initialized with a K-Means clustering, which was randomly initialized using the k-means++ algorithm.For methylation probes with multiple related genes, we estimated a GOTO score by averaging over all possible gene pairs.This process is further detailed in an included GitHub repository (see Extended data).
Finally, we ran Gene Set Enrichment Analysis (GSEA) 28 on the resulting gene expression clustering.GSEA is known to be biased when applied to methylation data, 29 but we were able to interpret much of the learned structure due to the natural cluster alignment produced by BMVC's inter-view dependence.

Analysis of biological results
In our preliminary pseudo-simulated experiment, we found that BMVC clusterings had a higher average ARI than H-GMM clusterings (Figure 6), where we took ground truth to be the clustering of the true expression data.Correspondingly, BMVC GOTO scores (Figure 7) were higher on average than those produced by H-GMM in both views and in all GO conditions, and, thus, they were closer in the first view to the GOTO scores produced by the reference GT-GMM clustering.Note that the duplication of entities in the second view of our pseudo-simulated data changed the baseline distribution of the data, and as such there was no ground truth; thus, we were not able to generate GOTO scores for GT-GMM in the second view (Figure 7), nor did we compute ARI in the second view (Figure 6).The correlation between an improvement in the ARI and an improvement in GOTO score suggests that GOTO is a reasonable, if approximate, measure of true biological homogeneity.Given this assumption, an improvement in GOTO by BMVC over H-GMM when applied to the true biological data would suggest that BMVC is able to successfully leverage the known inter-view relationships to produce an improved multi-view clustering.We ran BMVC and the independent GMMs on the processed expression and methylation datasets together, each with 60 random but identical initializations and no added noise (in contrast to our pseudo-simulations). Figure 8 compares the performance of BMVC and H-GMM in terms of average GOTO scores across the views and demonstrates that BMVC improves the overall biological homogeneity of the joint clustering.
Notably, the improvement in the methylation view when using BVMC is larger than in the expression view (Table 3), which is consistent with the observation that methylation data is generally noisier 25 and so may benefit more from transfer learning.GOTO scores of methylation clusterings may also increase more when informed by gene expression clusterings because GOTO is defined with respect to gene products rather than epigenetic processes.This could explain why biasing gene expression clusterings to more closely resemble the structure found in the epigenetic data does not increase the GOTO score in the expression view as much.
We performed GSEA for biological processes of the BMVC gene expression clusters over all trials alongside the corresponding H-GMM clusters using KEGG and Reactome Canonical Pathway genesets from MSigDB, 28,30 using the Benjamini-Hochberg correction at a FDR threshold of 0.1.We compiled a list of the enrichments found by BMVC but not the GMMs in each trial.We then compiled the list of enrichments found by the GMMs but not by BMVC in each trial; we counted the occurrences of the enrichments in each list and subtracted the counts of the second list from the first in order to control for random effects due to initialization.Finally, we examined the top scoring enrichments that were found in three or more remaining BMVC trials in order to determine what biological effects were most often recovered by BMVC clusterings but not by the independent Gaussian mixture models.
We found that by incorporating the dependency between the views, we were more likely to recover enrichments consistent with previous findings of the role of methylation in alternative splicing in breast cancers as well as and in regulation of pre-mRNA splicing (KEGG_SPLICEOSOME, REACTOME_MRNA_SPLICING, REACTOME_PROCESSING_OF_ CAPPED_INTRON_CON-TAINING_PRE_MRNA). 31,32We were also more likely to recover enrichments of the S phase (REACTOME_S_PHASE), which is consistent with previous work suggesting the existence of a methylation 'loss clock' which links late S-phase genome replication with a reduction of tumor methylation and an accumulation of methylation errors in breast cancer tumors. 33The top scoring enrichments also included processes related to G-protein coupled receptor signaling, which have been repeatedly implicated in tumorigenesis 34 and in particular secretin receptors (REACTOME_SIGNALING_BY_GPCR, REACTOME_CLASS_B_2_SECRETIN_FAMILY_RECEPTORS) which have previously been shown to be hypermethylated and downregulated in breast cancer tissues 35 and to be regulators of tumor cell proliferation.7][38] However, we found limited or no clear evidence for known mechanisms by which these processes are regulated by methylation.For example, previous findings 38 demonstrate that patients with frequent basal cell carcinomas may be more susceptible to the development of breast cancers due to germ-line defects in DNA repair genes.Other work stipulates that epigenetic silencing of said DNA repair genes could be a factor in tumorigenesis in cancers, 39 but further investigation is required to determine whether there is a specific epigenetic link between basal cell carcinomas and breast cancer development.Separately, recent work suggests that epigenetic mechanisms may underlie lysosomal storage disorders, 40 but that such a claim warrants further investigation.A different possible interpretation of these enrichments is that some of them represent gene expression mechanisms that are not directly affected by cis-regulatory methylation, but rather are much more identifiable in our clusterings when we account for the effects of methylation regulatory processes.We further examined the clustering run that achieved the highest average biological process GOTO score through the lens of the structural affinity metric used for the previous public health analysis (Figure 9).Among the enriched clusters with high structural affinity scores, we found enrichment for immune system processes, interferon signaling and immune cytokine signaling (cluster 19, e.g.: REACTOME_IMMUNE_SYSTEM, REACTOME_INTERFERON_ALPHA_ BETA_SIGNALING, REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM), which previous work shows are processes significantly correlated with genomic demethylation across tumour types. 41Notably, cluster 19 was also enriched for KEGG_LYSOSOME, lending further credence to the potential for the regulation of lysosomal processes via methylation in breast cancer and potentially in lysosomal storage disorders in general as previously suggested. 40We found further evidence for lipid-related processes in cluster 6 including enrichment for the peroxisome proliferator-activated receptor pathway (KEGG_PPAR_SIGNALING_PATHWAY), which is critical to metabolic and lipid-related processes and has also been found to regulate and be regulated by epigenetic processes including by DNA methyltransferases. 42spection of low-affinity clusters was also informative.Cluster 2 was enriched for KEGG_PRIMARY_ IMMUNODEFICIENCY and KEGG_HEMATOPOIETIC_CELL_LINEAGE.A class of rare disorders, primary immunodeficiency diseases (PIDs) are often treated with hematopoietic cell transplants and increase the risk of a cancer diagnosis. 43While secondary immunodeficiencies may be caused by environmental or epigenetic factors, PIDs are caused by genetic mutations, and thus genes implicated in PIDs are much less likely to be regulated by methylation in this context. 44The low structural affinity of cluster 2 in the BMVC clustering supports this claim and is another example of the way in which BMVC is able to provide a highly interpretable result with inferred relationships that are useful for hypothesis generation.Gene expression cluster 11 was not enriched for any terms, which might suggest that the corresponding methylation cluster represents an effect that does not functionally cis-regulate expression at the loci measured in this study; as such, this could warrant further investigation.

Discussion
An assumption of one-to-one relationships between entities often does not sufficiently represent structure across domains in multi-view applications.We show that BMVC can be used to effectively capture latent structure in multi-view data without imposing a one-to-one assumption.BMVC is the first fully Bayesian approach to accomplish this goal, and as such it provides a flexible framework that allows many other possible configurations of the model that may be appropriate for specific applications, including user's choice of priors on latent variables and base distributions for a given data type.Furthermore, our applications in public health and genomics highlight the ability of BMVC to generate natural interpretations of latent multi-view structure and to support mechanistic multi-view hypothesis generation.This shows promise for multi-view applications that commonly require formulation of novel hypotheses such as drug discovery and patient diagnosis.
The experiments presented here also demonstrate that our alternating inference procedure, in combination with gradientbased methods applied to the pseudo-likelihood, yields a relatively efficient inference approach capable of estimation with large numbers of entities and many clusters and relationships.Scalability will become increasingly important as the diversity and quantity of multi-view data increases across domains.Some limitations of BMVC are worth noting.One is its reliance on a priori known relationships encoded by the userspecified matrix R. Care must be taken to ensure that the relationships encoded by R reflect real and high-quality relationships, e.g. by filtering to relationships or entities that are correlated with particular outcomes or that have higher variance as we did in our biological analysis.Additionally, extra work may be needed to account for different kinds of real-world relationships.While using known relationships is likely to be a more robust approach than an attempt to automatically infer such relationships as a pre-processing step (such as in CMMVC 12 ), future work might include estimation or refinement of entity relationships as part of the clustering process itself.Another limitation of BMVC is that it uses hard cluster assignments, thus imposing discrete assumptions of cluster membership.The hard assignments directly determine the impact of ϕ on the overall clustering, and also contribute to the efficiency of the inference procedure proposed here.An effort to incorporate uncertainty into the estimation of cluster assignments by using soft expectation maximization could further improve the accuracy and robustness of the multi-view clustering.
There are also significant opportunities for extensions and further work in complex multi-view structure learning.It will continue to be important to ensure that these models are interpretable, especially in critical settings such as the clinic.
Here, we presented post-hoc interpretations of the inter-view cluster relationships by visualizing structural affinity scores.Future work could focus on Bayesian approaches that explicitly consider more fine-grained representations of the relationships in the data, e.g. by including a different latent variable for every pair of clusters between two views.Further, future approaches could include efforts to account for continuous structure beyond clusters, such as through factorization, and further complex dependencies between these structures, such as via hierarchical or tree-based models.We expect that flexible approaches like BMVC that can incorporate increasingly complex assumptions will continue to improve the application of multi-view models in real-world contexts.
BMVC approach is able to utilise between-views relationships when recovering clustering within each individual view.The authors have tested the method with simulated data and based on two different empirical datasets.The authors have also provided Python codes for the method implementation on the dedicate github.
The method is practically attractive and can be used in various data-rich fields.While reading the paper, I had some remaining questions regarding the method.Firstly, model estimation requires maximisation of a complex function containing multiple local optima.What is the variation of resulted clustering outcomes due to alternative attained local optima, and especially with respect to the global optimum?
Secondly, the authors recognise that their approach relies on hard cluster assignments (i.e. each node can belong to a single cluster only) and list this as a method limitation.How hard would it be to implement fuzzy cluster assignments in the suggested method?Would it be an extension of the proposed method, or perhaps a new method with outcomes greatly different for hard and fuzzy cluster assignments even if applied to the same dataset?
Finally, how sensitive clustering outcomes are to misspecification of R matrix (the matrix that codes between-view relationships)?
To summarise, the authors have presented a practically relevant and implementation promising method for assigning clusters within multi-view systems.The authors have provided Python codes for the method implementation, which would hopefully be replaced by more formal and better documented Python and R packages.

Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound?Yes

Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article?Yes Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Data science

Figure 1 .
Figure 1.A plate-notated visualization of Bayesian Multi-View Clustering (BMVC) for an application with two views (shown in solid lines) as well as an extension to the three-view model (shown in dashed lines).Further views may be incorporated in a similar fashion.Gray nodes denote variables known a priori.

Figure 2 .
Figure 2. Visualizations of simulations with different types of confounding.A: The first view of an example simulation with 100 samples from each cluster.B: Depicts the shift in view 2 dictated by variable s i for sample x i (shown as a green star in the first view).Means of the Gaussians used to generate the data are depicted in black.C: Shows the first view of an example 4-cluster simulation.D: Demonstrates two types of confounding added in the second view (a shift and a binomial randomization effect) for use in comparing BMVC to the naive consensus clustering.

Figure 3 .
Figure 3. Results of applying Bayesian Multi-View Clustering to 2-cluster simulations.On the left, we show the adjusted Rand index of simulations with one-to-one half-missing relationships (A), one-to-one complete relationships (B), and two-to-one relationships (C) over varying degrees of shift.Solid lines represent the median score and demonstrate the improvement of BMVC over H-GMM.To the right, we plot the corresponding BMVC estimates of the dependence strength parameter ϕ over varying degrees of shift and view independence p.

Figure 4 .
Figure 4. ARI of Bayesian Multi-View Clustering compared to a naive consensus clustering in the 4-cluster simulation.

Figure 5 .
Figure 5. Results of applying Bayesian Multi-View Clustering to the households and individuals in the Puerto Rico dataset.Scaled means of the BMVC clusters of households and individuals in the Puerto Rico dataset are shown at the top.Features are scaled across clusters in each view to ease the interpretation of the feature importances of each cluster relative to the others.Structural affinity maps (bottom) are shown alongside score box plots that depict affinity scores along each map row with blue stars denoting the score of the matched cluster.

Figure 6 .
Figure 6.Adjusted Rand index of 20 Bayesian Multi-View Clustering (BMVC) clusterings alongside hardassignment GMM (H-GMM) clusterings of pseudo-simulated data in the first view.The H-GMM clustering of the true expression data was considered to be ground truth for the purpose of ARI computation.BMVC captured true underlying biological structure more accurately than H-GMM on average.ARI was only computed in the first view due to the lack of ground truth in view 2 caused by duplication of the data.

Figure 7 .
Figure 7. Bayesian Multi-View Clustering analyses of the pseudo-simulated data better capture the underlying biological structure than Gaussian mixture models, as demonstrated here using the Gene Ontology Term Overlap statistic as a measure of biological homogeneity.Notably, ground truth clusterings of the noiseless data (in green, GT-GMM) can only be observed in the first view of the pseudo-simulated data, which contains no duplication.These clusterings achieve the highest scores, which supports the use of GOTO as a measure of biological homogeneity when measuring the performance of BMVC.

Figure 8 .
Figure 8. GOTO scores in each Gene Ontology category show that the average biological homogeneity of BMVC clusterings across views is higher than in clusterings produced by the Gaussian mixture models.

Figure 9 .
Figure 9. Structural affinity box plot showing affinity scores of the Bayesian Multi-view Clustering methylation clusters (y-axis) with respect to each expression cluster (x-axis).Blue stars denote the structural affinity scores of the methylation cluster respectively paired with each expression cluster.

Table 2 .
Household and individual clusters demonstrating structural affinity.

Table 3 .
Average Gene Ontology Term Overlap scores, measured in four Gene Ontology categories: Biological Process (BP), Cellular Component (CC), Molecular Function (MF), and all GO terms combined (All).