Method Article

Ensemble method for cluster number determination and algorithm selection in unsupervised learning

[version 1; peer review: 2 approved with reservations, 1 not approved]
PUBLISHED 25 May 2022


Abstract

Unsupervised learning, and more specifically clustering, suffers from the need for expertise in the field to be of use. Researchers must make careful and informed decisions on which algorithm to use with which set of hyperparameters for a given dataset. Additionally, researchers may need to determine the number of clusters in the dataset, which is unfortunately itself an input to most clustering algorithms - all of this before embarking on their actual subject matter work. After quantifying the impact of algorithm and hyperparameter selection, we propose an ensemble clustering framework which can be leveraged with minimal input. It can be used to determine both the number of clusters and a suitable choice of algorithm for a given dataset. A code library is referenced in the Conclusions for ease of integration.

Keywords

Clustering, Consensus Clustering, Ensemble, Gaussian Mixture, Hierarchical Clustering, K-means, Number of Clusters, Spectral Clustering, Unsupervised Learning

Introduction

Unsupervised learning - one of the main branches of machine learning - is the study of previously unlabelled datasets. With the hope of gaining new insights into their data and its structure, researchers can attempt to group or segment their data based on similarities between the data points in an exercise called clustering.

Clustering problems have been studied in depth, with applications in the fields of computational biology, operations research and the social sciences,1-3 to name a few. Quite a few algorithms are available for use, including in popular open-source libraries.4,5 However, these problems have often required human intervention on the part of the researcher. Naturally, this severely limits automation and is prone to human error.

The need for researcher input is due mainly to two problems: determining the number of clusters in the dataset, and choosing an algorithm to cluster with. Both can produce highly inaccurate results if poorly selected.

The landscape

The canonical way to determine the number of clusters in a dataset is the “elbow method”, which we detail in the Problem section. In short, it clusters the dataset with different numbers of clusters and compares the outputs, looking for an elbow in the curve of results. However, this has inherent shortcomings: a researcher must still choose which algorithm and which set of hyperparameters to use.

While hierarchical clustering algorithms (HCA) can be used to automatically find the number of clusters with reasonable accuracy in some cases,6 using this method means we miss out on using the many other algorithms that have been developed, even if they would be better suited to our dataset.

A large number of sophisticated methods have been explored as well.7 One of the more famous approaches, developed by Monti et al.,8 is consensus clustering. Many of these methods (including Monti's) suffer from a high level of complexity and abstraction, often based on the idea of partitioning the data.9,10 Essentially, they attempt to cluster many different subsets of the data under different cluster numbers and then select the most stable ones.

Apart from usability issues stemming from their complexity, many of these methods can become computationally intensive (including in their memory requirements). Lastly, as noted in Ref. 11, they do not generally perform well when it comes to estimating the number of clusters. These approaches also suffer from the need for researcher input as to the choice of algorithm and parameters. We consider it prudent, therefore, to explore the topic further.

In this paper, we propose an ensemble approach to answering the following questions: How many clusters are in the dataset, and which algorithm-hyperparameter choice is best for this data? Our approach outputs the number of clusters, as well as both a model choice and a set of hyperparameters to use with the model. Note that we do not define ensemble in the sense of a collection of partitions, but rather separate algorithms, as seen in Ref. 12.

Luckily, several of the graph-based methods mentioned above - including consensus clustering and its improvements - are actually complementary to our proposed method, and we see no reason why they could not be combined, albeit with some work. Consensus clustering takes as input a model and its hyperparameters, but has no framework for choosing a suitable model. Likewise, the method we present here does not account for data partitions.

While we leave the task of truly combining our approach with existing methods to future work, we present evidence of the benefits of accounting for model and hyperparameter choices in Monti’s consensus clustering, and we invite the reader to consider the possibilities as they read through our work. In the following section, we look at the problem in more detail, exploring the three main areas that must be addressed. In the Methods we define our algorithm or workflow. We then discuss results, present typical usage in the Discussion, and finally summarize our approach and findings in the Conclusions.

The problem

As mentioned, the central issue we face is finding the number of clusters in our dataset. While we have stated that the choice of algorithm and hyperparameter set is important, we found it was often overlooked in the literature, being taken as a given or only covered at a high level.13 So, we set out to compute some baseline measurements of the effect that choosing the algorithm and hyperparameter set can have. We found that, in fact, this choice had a very large impact on predicting the number of clusters in a dataset, and were ourselves quite startled by the magnitude of these variations.

Perhaps the most common approach to determining the number of clusters is to use the elbow method.14,15 This involves making many attempts at clustering, and then picking the one that seems to fit best. More directly, the workflow is:

  • 1. Choosing a clustering algorithm and parameter set.

  • 2. Clustering the data into $n$ clusters for $n \in \{2, \dots, N\}$.

  • 3. For each of the $N-1$ attempts, computing a metric (e.g., BIC).

  • 4. Finding the elbow in the curve of metric values; the x-axis value at the elbow is our number of clusters.

Note that automatically finding the elbow can be done in several ways, such as the minimum absolute second derivative, the point of best linear fit, or as we will use here, the triangle method.
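To make this concrete, below is a minimal sketch of a triangle-method elbow finder. It assumes the common formulation in which the elbow is the point of the metric curve farthest from the chord joining its first and last points; the function name and the normalization step are illustrative choices rather than the exact implementation used here.

import numpy as np

def find_elbow_triangle(ks, scores):
    # Normalize both axes so the result is not dominated by the metric's scale.
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (scores - scores.min()) / (scores.max() - scores.min())
    # Chord from the first to the last point of the curve.
    p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    # Perpendicular distance of every point to that chord; the elbow maximizes it.
    vecs = np.column_stack([x, y]) - p0
    proj = np.outer(vecs @ chord, chord)
    dists = np.linalg.norm(vecs - proj, axis=1)
    return int(ks[np.argmax(dists)])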

Intuitively, we can view these methods through the lens of Information Theory. Namely, the curve represents the amount of information explained as we increase the number of clusters. As we begin getting diminishing returns, we say that we are no longer explaining the data, and have too many clusters (hence the cutoff at the elbow).

Our approach will rest on this method and principle, but will tackle its three weaknesses - the choices that were implicitly made: algorithm, hyperparameters and metric. Many of the difficulties are discussed at a high-level in Ref. 13. We are essentially tackling Step 2 (and partially Step 1) in Figure 1 in their Conclusion.


Figure 1. Triangle method.

For now, let’s quantify exactly how important these choices are. Using the elbow method as a baseline, we computed some accuracy statistics on 100 randomly-generated three-cluster datasets, detailed in the Performance section. We defined accuracy as correctly determining the number of clusters we have (in this case 3). For each algorithm, we looked at performance across a broad range of hyperparameters and metrics, detailed in the Full dataset section.

Methods

Simulated data

Our data consisted of 100 two-dimensional, three-cluster collections of 30,000 points each. The data was drawn from a standard normal distribution, with cluster centers randomly placed between -5 and 5. Specifically, the data was constructed using the scikit-learn function make_blobs (Figure 2):

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30000, centers=3, n_features=2, center_box=(-5, 5), random_state=seed)
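For illustration, the full collection of datasets could be generated by varying the random seed; the seed range below (0-99) is an assumption, as the text only specifies seed 42 for Figure 2.

from sklearn.datasets import make_blobs

# One (X, y) pair per simulated dataset; the seed range is illustrative.
datasets = [
    make_blobs(n_samples=30000, centers=3, n_features=2,
               center_box=(-5, 5), random_state=seed)
    for seed in range(100)
]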


Figure 2. Simulated data with random seed 42.

Choice of algorithm

The first step many researchers will take to successfully cluster their data is to choose a clustering algorithm (step 1 in the traditional workflow above). A variety of inherently distinct algorithms exist, from Spectral methods to HCA and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The issue at this step is that different algorithms can be better suited to different datasets,16 and this can be very difficult to determine ahead of time. While there are some generally accepted behaviors - e.g., Spectral clustering works well on non-convex datasets4 - we can also see this experimentally.

For each algorithm, hyperparameter and metric choice, we computed the accuracy of our predicted number of clusters over the 100 datasets, giving us M accuracy readings per algorithm. For example, for one algorithm with two possible metrics and three hyperparameter values, we would have six accuracy readings. If the choice of algorithm did not matter, we would expect the same statistics across different algorithms. Table 1 shows the statistics for those M readings for K-means, Gaussian mixture model (GMM), HCA and Spectral:

As we can see, different algorithms obtain rather different results (whether the mean or max performance). The performance of the spectral and HCA algorithms, which are clearly ill-suited to our datasets, is particularly interesting. On the other hand, we can see that there is some combination of metrics and hyperparameters for which GMM does quite well with a 91% accuracy. Its standard deviation is quite high though, suggesting that different metrics and hyperparameters can lead to quite different results.

Choice of hyperparameters

The second choice traditionally faced by researchers is that of hyperparameters (implicitly contained in step 2 of the traditional workflow). Generally referred to as hyperparameter tuning, this choice can greatly affect the performance of a model. Let's examine the effect of hyperparameters in our experimental setup from the Simulated data section, holding the choice of metric constant (selecting inertia, I - more on this in the Choice of metric section). This gives us Table 2:

Table 2 shows us that the choice of hyperparameters, independent of other choices, can lead to large differences in performance. For instance, a judicious choice of hyperparameters in our K-means algorithm can lead to an 18% increase in performance. Similarly, a poor choice for Spectral leads to a staggering 86% drop. Unfortunately, we have no way of knowing ahead of time which parameter selection will yield the best results in a clustering problem (given an absence of ground truth).

Further, hyperparameters explain some of the variation we saw in Table 1, but not all of it. A quick look at the various statistics shows we are missing another piece: with the metric held fixed, the minimum performance seen by the K-means algorithm is now 71% instead of the rather shocking 8%. This points to one more component of the problem.

Table 1. Stats for accuracy readings per algorithm.

std: standard deviation; min: minimum; max: maximum.

Stat    K-means    GMM      HCA      Spectral
mean    61.78      66.47    28.75    33.05
std     27.52      33.29    16.09    24.00
min      8.00       7.00    10.00     6.00
max     85.00      91.00    78.00    72.00

Table 2. Statistics for accuracy readings per algorithm-metric, across hyperparameters.

std: standard deviation; min: minimum; max: maximum.

Stat    K-means (I)    GMM (I)    HCA (I)    Spectral (I)
Mean    77.63          85.39      41.25      51.33
Std      4.00           1.14      28.09      20.96
Min     71.00          83.00      10.00      10.00
Max     84.00          87.00      78.00      72.00

Choice of metric

Finally, we arrive at the last choice we have to make: which metric to use (step 3 in the aforementioned workflow). Previous work by the author showed that, in the case of HCA, different metrics performed differently,6 but there has not been much work on this topic in general. However, we can once again examine the question experimentally. In the same experimental setup as above, let's examine the performance of the algorithms for a fixed selection of hyperparameters across different metrics.

Every algorithm used the Inertia and Silhouette Score metrics.4 HCA also used the Maximum Difference and Elbow metrics from Ref. 6. K-means and GMM also used the Akaike information criterion (AIC) and Bayesian information criterion (BIC) metrics.4 For the table of results (Table 3), we took the first set of hyperparameters for each algorithm, denoted by the superscript 0.
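As one concrete example (a sketch only, with an illustrative function name), the AIC or BIC curve for GMM can be obtained directly from scikit-learn's GaussianMixture, which exposes both criteria:

from sklearn.mixture import GaussianMixture

def gmm_metric_curve(X, metric="aic", k_range=range(2, 11), **gmm_kwargs):
    # Metric value for each candidate cluster count; aic()/bic() are built into GaussianMixture.
    scores = []
    for k in k_range:
        model = GaussianMixture(n_components=k, **gmm_kwargs).fit(X)
        scores.append(model.aic(X) if metric == "aic" else model.bic(X))
    return list(k_range), scores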

Once again, we can see variations in performance within each algorithm-hyperparameter choice, here based solely on the choice of metrics. It turns out that much like we saw in Table 1 with HCA and Spectral, there is a weak element: the Silhouette Score.

Table 3. Statistics for accuracy readings per algorithm-hyperparameter, across metrics.

std: standard deviation; min: minimum; max: maximum.

Stat    K-means (0)    GMM (0)    HCA (0)    Spectral (0)
Mean    61.75          64.50      24.75      16.50
Std     33.17          37.70       7.59       9.19
Min     12.00           8.00      17.00      10.00
Max     79.00          85.00      35.00      23.00

Though not obvious from this table, further investigation shows that the Silhouette Score has an average accuracy across all algorithms and hyperparameters of only 12%. By cutting down the number of algorithm-hyperparameter combinations we have amplified its effect, and we see it drag performance down across the board - something we could not know ahead of time.

The problem is now apparent: we need to find a way to filter out the many possibly bad choices in algorithms, hyperparameters and metrics if we are to find the number of clusters in a dataset with reasonable accuracy.

Ensemble approach

As we saw in the previous section, we know there are some winning combinations of algorithm-hyperparameter-metric, but we must now figure out how to find them ahead of time. At a high level, our approach uses a fairly simple ensemble method. We clustered the dataset using all the combinations we could think of and selected our predicted number of clusters from all the results combined.

This section’s structure will follow the workflow of our algorithm, split into: set construction, building the ensemble, and voting. While we have found a preferred approach on our test dataset, we present several alternatives to each step of the workflow. First, however, the reader must endure some exposition of notation to be used throughout.

Suppose we work with an ensemble of algorithms, denoted by the set

(1) $\mathcal{A} = \{A_0, \dots, A_N\}$.

These would be the clustering algorithms such as K-means, GMM, etc.

Each algorithm can have a set of hyperparameters associated with it. Let that be written as

(2) $\mathcal{H}_j = \{h_0^j, \dots, h_{H_j}^j\}, \quad j \in \{0, \dots, N\}$,

where $h_i^j$ denotes the $i$th hyperparameter selection for the $j$th algorithm. Note that different algorithms can have different numbers of hyperparameter selections (denoted by $H_j$).

Further, each $h_i^j$ can be comprised of several elements. For instance, for GMM, $h_0^1$ could be the covariance type and regularization parameter, $(\text{DIAGONAL}, 10^{-6})$, and $h_1^1 = (\text{DIAGONAL}, 10^{-5})$. If it helps, think of each $h_i^j$ as a set of kwargs passed into a model object.

Lastly, let's write the set of metrics (inertia, AIC, etc.) used as

(3) $\mathcal{M}_j = \{m_0^j, \dots, m_{M_j}^j\}, \quad j \in \{0, \dots, N\}$,

where $m_i^j$ is the $i$th metric used to evaluate the $j$th algorithm.

Now, given $\mathcal{A}$, $\mathcal{H}_j$ and $\mathcal{M}_j$, we have, for all $j \in \{0, \dots, N\}$,

(4) $P_j = \mathcal{H}_j \times \mathcal{M}_j$

(5) $= \{(h_0^j, m_0^j), \dots, (h_{H_j}^j, m_{M_j}^j)\}$

and

(6) $C_j = \{A_j(p) \mid p \in P_j\}$.

We admit the notation is somewhat opaque, but constructing our actual test sets below should illustrate it nicely.

Workflow

Now that we have defined our $\mathcal{A}$, $\mathcal{H}$ and $\mathcal{M}$ sets - and obtained our $P$ sets - we must compute the clusterings.

This is, quite simply, an exhaustive loop over all the elements of the respective $P_j$ set, where we apply the elbow method as described above. For each $j \in \{0, \dots, N\}$ and each $p \in P_j$:

  • 1. We take an element $p \in P_j$.

  • 2. We use the hyperparameter values from $p$ to compute clusterings for a range of cluster numbers.

  • 3. We then use the metric found in $p$ to get an elbow curve.

  • 4. We find the elbow in the curve.

For each $p$, we have now found a suitable number of clusters for our data. More formally, we have just computed

(7) $C_j = \{A_j(p) \mid p \in P_j\}$.

Note that $C_j$ is simply a set of integers that map back to specific elements in $P_j$. For K-means clustering a 3-cluster dataset, we might get

(8) $C_0 = (3, 3, 2, 4, 3, \dots, 3)$.
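A minimal sketch of this loop for a single algorithm is given below; model_factory, score_fn and find_elbow are illustrative callables standing in for the clustering, scoring and elbow-detection steps, not part of the actual library.

import itertools

def compute_C_j(X, model_factory, hyperparams, metrics, score_fn, find_elbow, k_range=range(2, 11)):
    # C_j = { A_j(p) : p in P_j } for one algorithm.
    # model_factory(k, params) -> fitted model; score_fn(model, X, metric) -> float;
    # find_elbow(ks, scores) -> elbow k. All three are illustrative callables.
    C_j = []
    for params, metric in itertools.product(hyperparams, metrics):  # loop over P_j
        ks, scores = [], []
        for k in k_range:
            model = model_factory(k, params)            # step 2: cluster with p's hyperparameters
            ks.append(k)
            scores.append(score_fn(model, X, metric))   # step 3: evaluate with p's metric
        C_j.append(find_elbow(ks, scores))              # step 4: elbow gives the guessed cluster count
    return C_j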

We can then combine the results into a collection and find the number of clusters and best algorithm-hyperparameter selection from there:

  • 1. Construct the ensemble set (we developed two approaches detailed in the Building the ensemble section).

  • 2. Vote on the number of clusters (we developed three approaches detailed in the Voting section).

Set construction

The first step in the workflow is to construct our $\mathcal{A}$, $\mathcal{H}$ and $\mathcal{M}$ sets. Let's build some actual test sets to illustrate the structure. These were used for the computational results in the Results section, and might help clarify the notation in the meantime. These models were built from the scikit-learn and fastcluster libraries.

First, let’s define the algorithms to look at:

(9) $\mathcal{A} = \{\text{K-means}, \text{GMM}, \text{HCA}, \text{Spectral}\}$.

These were selected based on their diverse natures. Ideally, one wants to choose a collection of algorithms that work well on different types of data. In this case, K-means is a fast and reasonably accurate algorithm for convex datasets, HCA is fundamentally different and does not take cluster numbers as inputs, GMM is well-suited to data with a roughly Gaussian distribution and Spectral has been known to do well with non-convex datasets.

Now, the set of hyperparameters can get rather cumbersome to write out, but let’s explicitly list those ranges for K-means:

(10) init: [k-means++, random]

and

(11) reassignment_ratio: np.geomspace(1e-4, 0.5, 8),

which is

[1.000e-04 3.376e-04 1.139e-03 3.848e-03 1.299e-02 4.386e-02 1.480e-01 5.000e-01].

This gives us the following set of hyperparameters for K-means:

(12) $\mathcal{H}_0 = \{h_0^0, \dots, h_{16}^0\} = \{(\text{k-means++}, 1.000\mathrm{e}{-}04), (\text{k-means++}, 3.376\mathrm{e}{-}04), \dots,$

(13) $(\text{random}, 5.000\mathrm{e}{-}01)\}$.

For brevity, we present the remaining sets based on their base ranges:

GMM: covariance_type: [diag, tied, spherical], reg_covar: np.geomspace(1e-8, 1e-2, 6)
HCA: method: [centroid, median, single, ward], metric: [euclidean]
Spectral: affinity: [laplacian, precomputed, rbf, sigmoid], metric: [cosine, l2, l1], n_neighbors: [5, 20, 100], gamma: [0.1, 1.0, 10.0]

(where, for Spectral, metric and n_neighbors are only used for precomputed, and gamma is ignored for precomputed).

In general, hyperparameters should be selected based on available information. If a researcher can somehow narrow the hyperparameter space through other knowledge, they should do so. In the absence of such information, as is our case here, we tried to choose hyperparameter ranges that span the space.

Obviously, some of these parameters can take on an infinite number of values (and we have limited computing resources), but we find it judicious to choose a smaller number of values across orders of magnitude to obtain a representative sample of reasonable values.

Now that we have our $\mathcal{A}$ and $\mathcal{H}$ sets, we need $\mathcal{M}$. Per algorithm, we have:

K-means: [aic, bic, inertia, silhouette_score]
GMM: [aic, bic, inertia, silhouette_score]
HCA: [elbow, inertia, silhouette_score, max_diff]
Spectral: [inertia, silhouette_score]

In other words, for K-means, we get:

(14) $P_0 = \{(\text{k-means++}, 1.000\mathrm{e}{-}04, \text{aic}), \dots, (\text{random}, 5.000\mathrm{e}{-}01, \text{silhouette})\}$.
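In code, the same construction might look like the following sketch (variable names are illustrative):

import itertools
import numpy as np

# H_0: 2 init strategies x 8 reassignment ratios = 16 hyperparameter pairs.
H_0 = list(itertools.product(["k-means++", "random"], np.geomspace(1e-4, 0.5, 8)))
# M_0: the metrics evaluated for K-means.
M_0 = ["aic", "bic", "inertia", "silhouette_score"]
# P_0: every hyperparameter-metric pairing, 16 x 4 = 64 elements.
P_0 = list(itertools.product(H_0, M_0))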

What remains now is to construct our ensemble collection $\mathcal{E}$.

Building the ensemble

We present two approaches for building the ensemble set in the following sections, and detail the rest of the workflow thereafter.

Raw

Given our $C$ sets, the most natural way to construct our ensemble is simply to check every possible combination, essentially a cross product of our sets. This gives us our ensemble of values

(15) $\mathcal{E} = C_0 \times \dots \times C_N$

(16) $= \{A_0(p) \mid p \in P_0\} \times \dots \times \{A_N(p) \mid p \in P_N\}$

(17) $= \{(A_j(p_i^j))_{j \in \{0, \dots, N\}} \mid i \in [0, H_j M_j]\}$

(18) $= \{(A_0(p_0^0), \dots, A_N(p_0^N)), \dots, (A_0(p_{H_0 M_0}^0), \dots, A_N(p_{H_N M_N}^N))\}$.

Following the example from the previous section, $\mathcal{E}$ would be comprised of four-tuples spanning all possible combinations. Each entry of a tuple is a “guessed” cluster number produced by a specific algorithm-hyperparameter-metric choice.

If we structured $\mathcal{E}$ as a matrix, the first few rows might look like

E = [3 3 2 3]
    [3 3 2 4]
    [3 3 2 2]
    [3 3 2 5]

where the first column corresponds to guesses from K-means, the second from GMM, then the HCA and Spectral algorithms.
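A sketch of the Raw construction (function name and toy inputs are illustrative):

import itertools
import numpy as np

def build_raw_ensemble(C_sets):
    # Cross product of the per-algorithm result sets:
    # one row per combination, one column per algorithm.
    return np.array(list(itertools.product(*C_sets)))

# Toy usage with made-up guesses for four algorithms:
E = build_raw_ensemble([[3, 3, 2], [3, 4], [2, 3], [3, 4, 2, 5]])
# E.shape == (48, 4): 3 * 2 * 2 * 4 rows, one column per algorithm.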

Mode

While the Raw approach detailed above is the simplest, it ignores a potentially important point. The choice of metric, while critical to the workflow, is intrinsically an "elbow-method" parameter. This sets it apart from the choice of algorithm and hyperparameters, which would be necessary regardless of approach.

With this in mind, we consider another formulation which first takes the mode of the results across metrics. That is, for a given algorithm and hyperparameter configuration, we take as a result the most commonly guessed number of clusters across all metric choices. We define

(19) $\bar{C}_j = \{\mathrm{Mode}(\{A_j((h_0^j, m)) \mid m \in \mathcal{M}_j\}), \dots, \mathrm{Mode}(\{A_j((h_{H_j}^j, m)) \mid m \in \mathcal{M}_j\})\}$

(20) $= \{\mathrm{Mode}(\{A_j((h, m)) \mid m \in \mathcal{M}_j\}) \mid h \in \mathcal{H}_j\}$.

From this we can define our ensemble $\bar{\mathcal{E}}$ in the same way as $\mathcal{E}$,

(21) $\bar{\mathcal{E}} = \bar{C}_0 \times \dots \times \bar{C}_N$,

which might give us an example matrix of

E = [3 3 3 3]
    [3 2 2 4]
    [3 3 2 2]
    [3 3 2 3]

Here, our matrix will be smaller than in the Raw approach. Each column still corresponds to an algorithm, but each entry is now the mode, across metrics, of the guesses produced by one hyperparameter configuration.
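A sketch of the Mode construction (the nested-list layout and function name are illustrative):

from statistics import mode
import itertools
import numpy as np

def build_mode_ensemble(results_per_algorithm):
    # results_per_algorithm[j][h] holds algorithm j's guesses for hyperparameter
    # configuration h, one guess per metric. Take the mode across metrics first,
    # then cross the per-algorithm sets exactly as in the Raw construction.
    C_bar = [[mode(guesses) for guesses in per_hyperparam]
             for per_hyperparam in results_per_algorithm]
    return np.array(list(itertools.product(*C_bar)))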

Voting

Now, given our matrix E, there are a few ways to combine the results and vote on them. As a toy example, consider this result for a three-cluster dataset:

E = [2 2 2 2]
    [2 2 2 2]
    [3 2 3 3]
    [3 3 2 3]
    [3 3 3 2]

Full

The simplest approach is to take the most common cluster number found in our ensemble. While straightforward, it doesn't allow us to capture any additional information, nor filter out errors or biases in any way. Our toy example contains eleven 2s and nine 3s; we would get an incorrect final result of R = 2.

Column-First

Another naive approach would be to vote along algorithms (i.e., the columns), giving us four results, and then to vote for the most common answer among those four. One possible issue with this approach is the case where we have a few particularly ill-suited algorithms. Looking at the same example, we get

[3 2 2 2]

and an incorrect final result of R = 2.

Row-First

Finally, we can look to capture what we are calling the cohesion between the results - essentially, favoring their agreement. By first looking at the individual rows of our example we would have

[2 2 3 3 3]

and therefore R = 3.

Given that our $\mathcal{E}$ set (or $\bar{\mathcal{E}}$) covers all combinations of results, we are choosing to prioritize those cases where our different algorithm-parameter-metric results are cohesive (row-wise), before looking at their actual value (column-wise).
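The three voting schemes can be written compactly on this toy matrix (a sketch only; variable names are illustrative):

from statistics import mode
import numpy as np

E = np.array([
    [2, 2, 2, 2],
    [2, 2, 2, 2],
    [3, 2, 3, 3],
    [3, 3, 2, 3],
    [3, 3, 3, 2],
])

# Full: most common value in the whole matrix -> 2.
full_vote = mode(E.ravel().tolist())
# Column-first: vote within each algorithm (column), then across those votes -> [3, 2, 2, 2] -> 2.
col_first = mode([mode(col.tolist()) for col in E.T])
# Row-first: vote within each row (cohesion), then across those votes -> [2, 2, 3, 3, 3] -> 3.
row_first = mode([mode(row.tolist()) for row in E])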

We present results for the three approaches in the next section.

Results

Now, we arrive at our results. In the following sections, we present the results of the six different approaches (Raw/Mode construction with Full/Row/Column voting) on the 100 simulated datasets from the Simulated data section. We compared our results to two benchmarks.

The first is the expected value from randomly sampling our result set 100 times; essentially the accuracy we could expect from choosing an algorithm, hyperparameters and metrics beforehand.

The second is the consensus clustering approach put forward by Monti et al.8 In this case, we used K-means and GMM and attempted to pass in both default hyperparameters (D) and the best performing hyperparameters (B) as determined by our method.

Performance

Here, we define accuracy by comparing the predicted number of clusters for each of the 100 datasets - based on voting on the $\mathcal{E}$ or $\bar{\mathcal{E}}$ set - to the actual number of clusters (three in every case).

Overall results are shown in Table 4.

Table 4. Accuracy on 100 simulated datasets.

        Raw      Mode
Full    86.00    93.00
Row     89.00    92.00
Col     88.00    90.00

While there were differences in performance between the voting methods, the signal was somewhat muddled. In the case of the Raw construction the Row-first approach was best, while for a Mode construction a Full vote was preferable. Overall, the differences in voting performance were also small, providing at most a 3% increase. We don’t consider it prudent to declare one voting approach more beneficial than another.

On the other hand, the choice between using a Raw ensemble construction or a mode-based $\bar{\mathcal{E}}$ set seems to be clearer. Using a mode construction improved performance across the board, by 2 to 7 percentage points depending on the voting scheme.

Given a mode-based $\bar{\mathcal{E}}$ set, the naive Full vote took the lead: it was overall the best performer, with 93% accuracy. Additionally, we find it important to note that this actually outperformed even the maximum performance we saw in Table 1, which was 91%.

Given that the latter would require a perfect guess as to which algorithm-hyperparameter-metric combination to use, we find it even more satisfying.

Benchmarks

Table 5 details our benchmark performance as defined in this section: consensus clustering and expected value from random sampling.

Table 5. Benchmark accuracy on 100 simulated datasets.

Expected value: 57.29

Consensus:
KMeans (D)    82.00
KMeans (B)    87.00
GMM (D)       66.00
GMM (B)       46.00

Not unexpectedly, randomly guessing at possible solutions yielded unsatisfactory results. We note that it still scored above 50%, likely because even poor algorithm configurations can pick up some signal.

Consensus clustering gave very unpredictable results, with a very large variance in performance - though it did peak at a respectable 87%. We reiterate our previous point, however, that it requires a choice of algorithm and hyperparameters as inputs, greatly reducing its effectiveness in practice.

Perhaps the most interesting result to come out of benchmarking was the performance of consensus clustering when given the hyperparameter selections that did very well in our method. In the case of GMM this led to a drastic drop in performance (30%), while for K-means we saw a 6% increase. This indicates that, while it is likely a non-trivial exercise to combine the approaches in a reasonable way, doing so could be worth further investigation.

Discussion

Usage

Now that we have examined this problem and our proposed solution, we would like to discuss some other elements: namely, typical usage setups and some simple approaches to selecting the best algorithm-hyperparameter combination in each case. As our colleagues in industry would ask: how do we use this in production?

We would like to note that the following algorithm selection methods in particular are merely simple approaches to get things off the ground. There are undoubtedly other ways to solve this and we encourage further work in this area.

We believe there to be two general use-cases depending mainly on computational constraints: the case where all our data can be processed at once, and the case where we must slice our data into subsets first.

Data subsets

If we look at the case where we must partition our data due to computational limitations, we find ourselves in essentially the same framework that we did throughout the paper. While we will leave it to the reader to determine the best way to sample subsets of their data while capturing all clusters, let’s examine how this ensemble framework would work.

Throughout, we have looked at 100 simulated datasets as a means of getting accuracy metrics. Suppose now we take our large dataset and split it into 100 smaller, more manageable, datasets. We are now in the same situation as we were earlier in the paper: we would expect the number of clusters to be the same for each of the smaller datasets, and we would aggregate the results of 100 ensemble methods (albeit not with “accuracy”).

Estimating the number of clusters

Now that we have our 100 subsets, we can construct our E matrix for each one of them and vote on the number of clusters. This gives us 100 answers, one cluster number estimate per subset.

From there, we could select the most common answer (i.e., the mode) as our global estimated number of clusters (instead of computing accuracy as we did in this paper). Having determined how many clusters our dataset has, we arrive at the question of choosing an algorithm.

Algorithm-hyperparameter selection

Still within our 100 subset context, let’s examine how we could choose a best algorithm-hyperparameter combination. We reiterate that this is only one of what is likely many possible approaches.

Given our estimated number of clusters, all the answers estimated by each algorithm-hyperparameter-metric combination found in the subset E matrices can be stored. From there, the accuracy of each algorithm-hyperparameter-metric combination (relative to our global estimated number of clusters) can be computed. The best-performing combination can be taken as a reasonable way of clustering future data from the same dataset.
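A sketch of both steps - the global vote across subsets and the per-combination tally - is given below; the data structures and function names are illustrative:

from collections import Counter

def global_cluster_estimate(subset_votes):
    # Most common voted cluster number across the subsets.
    return Counter(subset_votes).most_common(1)[0][0]

def select_best_combination(guesses_per_subset, global_k):
    # guesses_per_subset: one dict per subset, mapping an
    # (algorithm, hyperparameters, metric) key to that combination's guessed k.
    hits = Counter()
    for guesses in guesses_per_subset:
        for combo, k in guesses.items():
            hits[combo] += int(k == global_k)
    # The combination that most often matched the global estimate.
    return hits.most_common(1)[0][0]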

Indeed, in our simulated case, this approach correctly identified that GMM is the best choice, and more specifically that the combination

algorithm: GMM
hyperparameters: covariance_type: diag, reg_covar: 1e-8
metric: AIC

achieved the best results, with 91% accuracy. Readers will note this is indeed the top performance we can achieve as per Table 1.

It seems reasonable to assume that given more data drawn from the same dataset, this algorithm with these hyperparameters would do well at clustering it in a sensible way. We do note that it is possible to apply the logic presented in the Algorithm-hyperparameter selection section under Methods; in this case as well, such logic is included in the code library by default.

Full dataset

Now let’s look at the simpler case where all of our data can be processed together as a single dataset.

Estimating the number of clusters

In this happy scenario, estimating the number of clusters is relatively straightforward: we run a single workflow. We begin by constructing our E matrix. Then, we compute the results of voting, which directly gives us the predicted number of clusters in our data.

In this case there is no need for any aggregation as we have a single outcome from the vote.

Algorithm-hyperparameter selection

When it comes to identifying the right choice of combination, however, we can't proceed as we did in the subset case. Given that we have no way of computing accuracy, we will instead look for the "most stable" choice. Once more, this is simply a first approach, and future work could likely result in improvements.

For each algorithm-hyperparameter combination, we look at its predictions across metrics (which are not needed for future clustering). For example, suppose that on the first dataset from our simulated data the GMM combination

algorithm: GMM
hyperparameters: covariance_type: diag, reg_covar: 1e-8

obtained results of [3, 3, 3, 3] across its four metrics, whereas

algorithm: GMM
hyperparameters: covariance_type: spherical, reg_covar: 1e-2

obtained [3, 4, 2, 3].

In this example, we would favor the combination whose guesses most often matched the predicted number of clusters: [3, 3, 3, 3]. Note that we are essentially looking for stability and insensitivity to metric choice.

If we extend this comparison to every algorithm-hyperparameter combination, we arrive at a suitable combination choice to use for future clustering. We note, however, that given the small number of metric choices, this approach is less likely to yield a unique best choice.
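A sketch of this stability-based selection (the dictionary layout and function name are illustrative):

def most_stable_combinations(guesses_per_combo, predicted_k):
    # guesses_per_combo maps an (algorithm, hyperparameters) key to the list of
    # cluster-number guesses it produced across its metrics. Return every
    # combination whose guesses most often equal the voted cluster number.
    scores = {combo: sum(g == predicted_k for g in guesses)
              for combo, guesses in guesses_per_combo.items()}
    best = max(scores.values())
    return [combo for combo, s in scores.items() if s == best]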

If that is the case, any of the top-ranked combinations may be selected, as they are equally likely to achieve desirable results - as per this stability framework.

Conclusion

We have developed a workflow with six possible configurations for determining the number of clusters in an unlabeled dataset, while offering a flexible basis for determining the optimal choice of algorithm and associated hyperparameters. While certain methods already exist, such as for agglomerative hierarchical clustering and consensus clustering, they each present difficulties in the field that our approach addresses.

Firstly, we no longer require researchers - who may not be subject matter experts in unsupervised learning, but rather in their own domains - to provide such specific inputs as particular algorithms and hyperparameter-metric configurations. Instead, given reasonably spanning ranges of hyperparameters and a diverse selection of algorithms, we can reliably predict the number of clusters present with more than 90% accuracy.

We also obtained - at very little cost - a reasonably suitable choice of which algorithm and which hyperparameters to use to further cluster the data. While not every situation will require clustering additional incoming data from the same distribution, these combination performance findings are sure to be beneficial to researchers.

Lastly, it is our hope that the simple algorithmic structure of this approach can lead to reliably simple integration into commonly used software libraries - thereby removing another barrier to entry. The code used for this framework can be found on GitHub.

In the future we hope to explore several avenues of work related to this approach. These include studying other means of measuring cohesion (beyond a row-first approach), as well as a process to automatically track the ensemble set as it is populated and adjust the hyperparameter space dynamically; this could allow for faster computations and less noise in the resulting ensemble. Finally, we hope to pursue a careful and constructive combination of consensus clustering and our ensemble method.

Data availability

Zenodo: antoinezambelli/ensemble-clustering: v1.0.0, https://github.com/antoinezambelli/ensemble-clustering/tree/v1.0.0

Analysis code

Analysis code available from: https://github.com/antoinezambelli/ensemble-clustering

Archived analysis code as at time of publication: https://github.com/antoinezambelli/ensemble-clustering/tree/v1.0.0

License: MIT
