Method Article

Ensemble method for cluster number determination and algorithm selection in unsupervised learning

[version 1; peer review: 2 approved with reservations, 1 not approved]
PUBLISHED 25 May 2022


Abstract

Unsupervised learning, and more specifically clustering, suffers from the need for expertise in the field to be of use. Researchers must make careful and informed decisions on which algorithm to use with which set of hyperparameters for a given dataset. Additionally, researchers may need to determine the number of clusters in the dataset, which is unfortunately itself an input to most clustering algorithms - all of this before embarking on their actual subject matter work. After quantifying the impact of algorithm and hyperparameter selection, we propose an ensemble clustering framework which can be leveraged with minimal input. It can be used to determine both the number of clusters and a suitable choice of algorithm for a given dataset. A code library is referenced in the Conclusions for ease of integration.

Keywords

Clustering, Consensus Clustering, Ensemble, Gaussian Mixture, Hierarchical Clustering, K-means, Number of Clusters, Spectral Clustering, Unsupervised Learning

Introduction

Unsupervised learning - one of the main branches of machine learning - is the study of previously unlabelled datasets. With the hope of gaining new insights into their data and its structure, researchers can attempt to group or segment their data based on similarities between the data points in an exercise called clustering.

Clustering problems have been studied in depth, with applications in the fields of computational biology, operations research and the social sciences,1-3 to name a few. Quite a few algorithms are available for use, including in popular open-source libraries.4,5 However, these problems have often required human intervention on the part of the researcher. Naturally, this severely limits automation and is prone to human error.

The need for researcher input is due mainly to two problems: determining the number of clusters in the dataset, and choosing an algorithm to cluster with. Both can produce highly inaccurate results if poorly selected.

The landscape

The canonical way to determine the number of clusters in a dataset is the “elbow method”, which we detail in the Problem section. In short, it clusters the dataset with different numbers of clusters and compares the outputs, looking for an elbow in the curve of results. However, this has inherent shortcomings: a researcher must still choose which algorithm and which set of hyperparameters to use.

While hierarchical clustering algorithms (HCA) can be used to automatically find the number of clusters with reasonable accuracy in some cases,6 using this method means we miss out on using the many other algorithms that have been developed, even if they would be better suited to our dataset.

A large number of sophisticated methods have been explored as well.7 One of the more famous approaches, developed by Monti et al.,8 is consensus clustering. Many of these methods (including Monti's) suffer from a high level of complexity and abstraction, often based on the idea of partitioning the data.9,10 Essentially, they attempt to cluster many different subsets of the data under different cluster numbers and then select the most stable ones.

Apart from usability issues stemming from their complexity, many of these methods can become computationally intensive (including in their memory requirements). Lastly, as noted in Ref. 11, they do not generally perform well when it comes to estimating the number of clusters. These approaches also suffer from the need for researcher input as to the choice of algorithm and parameters. We consider it prudent, therefore, to explore the topic further.

In this paper, we propose an ensemble approach to answering the following questions: How many clusters are in the dataset, and which algorithm-hyperparameter choice is best for this data? Our approach outputs the number of clusters, as well as both a model choice and a set of hyperparameters to use with the model. Note that we do not define ensemble in the sense of a collection of partitions, but rather separate algorithms, as seen in Ref. 12.

Luckily, several of the graph-based methods mentioned above - including consensus clustering and its improvements - are actually complementary to our proposed method, and we see no reason why they could not be combined, albeit with some work. Consensus clustering takes as input a model and its hyperparameters, but has no framework for choosing a suitable model. Likewise, the method we present here does not account for data partitions.

While we leave the task of truly combining our approach with existing methods to future work, we present evidence of the benefits of accounting for model and hyperparameter choices in Monti’s consensus clustering, and we invite the reader to consider the possibilities as they read through our work. In the following section, we look at the problem in more detail, exploring the three main areas that must be addressed. In the Methods we define our algorithm or workflow. We then discuss results, present typical usage in the Discussion, and finally summarize our approach and findings in the Conclusions.

The problem

As mentioned, the central issue we face is finding the number of clusters in our dataset. While we have stated that the choice of algorithm and hyperparameter set is important, we found it was often overlooked in the literature, being taken as a given or only covered at a high level.13 So, we set out to compute some baseline measurements of the effect that choosing the algorithm and hyperparameter set can have. We found that, in fact, this choice had a very large impact on predicting the number of clusters in a dataset, and were ourselves quite startled by the magnitude of these variations.

Perhaps the most common approach to determining the number of clusters is to use the elbow method.14,15 This involves making many attempts at clustering, and then picking the one that seems to fit best. More directly, the workflow is:

  • 1. Choosing a clustering algorithm and parameter set.

  • 2. Clustering the data into $n$ clusters for $n \in \{2, \dots, N\}$.

  • 3. For each of the $N-1$ attempts, computing a metric (e.g., BIC).

  • 4. Finding the elbow in the curve of metric values; the x-axis value at the elbow is our number of clusters.

Note that automatically finding the elbow can be done in several ways, such as the minimum absolute second derivative, the point of best linear fit, or as we will use here, the triangle method.
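To make this concrete, below is a minimal sketch of a triangle-method elbow finder. It assumes the common formulation in which the elbow is the point of the metric curve farthest from the chord joining its first and last points; the function name and the normalization step are illustrative choices rather than the exact implementation used here.

import numpy as np

def find_elbow_triangle(ks, scores):
    # Normalize both axes so the result is not dominated by the metric's scale.
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (scores - scores.min()) / (scores.max() - scores.min())
    # Chord from the first to the last point of the curve.
    p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    # Perpendicular distance of every point to that chord; the elbow maximizes it.
    vecs = np.column_stack([x, y]) - p0
    proj = np.outer(vecs @ chord, chord)
    dists = np.linalg.norm(vecs - proj, axis=1)
    return int(ks[np.argmax(dists)])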

Intuitively, we can view these methods through the lens of Information Theory. Namely, the curve represents the amount of information explained as we increase the number of clusters. As we begin getting diminishing returns, we say that we are no longer explaining the data, and have too many clusters (hence the cutoff at the elbow).

Our approach will rest on this method and principle, but will tackle its three weaknesses - the choices that were implicitly made: algorithm, hyperparameters and metric. Many of the difficulties are discussed at a high-level in Ref. 13. We are essentially tackling Step 2 (and partially Step 1) in Figure 1 in their Conclusion.


Figure 1. Triangle method.

For now, let’s quantify exactly how important these choices are. Using the elbow method as a baseline, we computed some accuracy statistics on 100 randomly-generated three-cluster datasets, detailed in the Performance section. We defined accuracy as correctly determining the number of clusters we have (in this case 3). For each algorithm, we looked at performance across a broad range of hyperparameters and metrics, detailed in the Full dataset section.

Methods

Simulated data

Our data consisted of 100 two-dimensional, three-cluster collections of 30,000 points each. The data was drawn from a standard normal distribution, with cluster centers randomly placed between -5 and 5. Specifically, the data was constructed using the scikit-learn function make_blobs (Figure 2):

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30000, centers=3, n_features=2, center_box=(-5, 5), random_state=seed)
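For illustration, the full collection of datasets could be generated by varying the random seed; the seed range below (0-99) is an assumption, as the text only specifies seed 42 for Figure 2.

from sklearn.datasets import make_blobs

# One (X, y) pair per simulated dataset; the seed range is illustrative.
datasets = [
    make_blobs(n_samples=30000, centers=3, n_features=2,
               center_box=(-5, 5), random_state=seed)
    for seed in range(100)
]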


Figure 2. Simulated data with random seed 42.

Choice of algorithm

The first step many researchers will take to successfully cluster their data is to choose a clustering algorithm (step 1 in the traditional workflow above). A variety of inherently distinct algorithms exist, from Spectral methods to HCA and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The issue at this step is that different algorithms can be better suited to different datasets,16 and this can be very difficult to determine ahead of time. While there are some generally accepted behaviors - e.g., Spectral clustering works well on non-convex datasets4 - we can also see this experimentally.

For each algorithm, hyperparameter and metric choice, we computed the accuracy of our predicted number of clusters over the 100 datasets, giving us M accuracy readings per algorithm. For example, for one algorithm with two possible metrics and three hyperparameter values, we would have six accuracy readings. If the choice of algorithm did not matter, we would expect the same statistics across different algorithms. Table 1 shows the statistics for those M readings for K-means, Gaussian mixture model (GMM), HCA and Spectral:

As we can see, different algorithms obtain rather different results (whether the mean or max performance). The performance of the spectral and HCA algorithms, which are clearly ill-suited to our datasets, is particularly interesting. On the other hand, we can see that there is some combination of metrics and hyperparameters for which GMM does quite well with a 91% accuracy. Its standard deviation is quite high though, suggesting that different metrics and hyperparameters can lead to quite different results.

Choice of hyperparameters

The second choice traditionally faced by researchers is that of hyperparameters (implicitly contained in step 2 of the traditional workflow). Generally referred to as hyperparameter tuning, this choice can greatly affect the performance of a model. Let's examine the effect of hyperparameters in our experimental setup from the Simulated data section, holding the choice of metric constant (selecting inertia, I - more on this in the Choice of metric section). This gives us Table 2:

Table 2 shows us that the choice of hyperparameters, independent of other choices, can lead to large differences in performance. For instance, a judicious choice of hyperparameters in our K-means algorithm can lead to an 18% increase in performance. Similarly, a poor choice for Spectral leads to a staggering 86% drop. Unfortunately, we have no way of knowing ahead of time which parameter selection will yield the best results in a clustering problem (given an absence of ground truth).

Further, hyperparameters explain some of the variation we saw in Table 1, but not all of it. A quick look at the various statistics shows we are missing another piece: with the metric held fixed, the minimum performance seen by the K-means algorithm is now 71% instead of the rather shocking 8%. This points to one more component of the problem.

Table 1. Stats for accuracy readings per algorithm.

std: standard deviation; min: minimum; max: maximum.

Stat    K-means    GMM      HCA      Spectral
mean    61.78      66.47    28.75    33.05
std     27.52      33.29    16.09    24.00
min      8.00       7.00    10.00     6.00
max     85.00      91.00    78.00    72.00

Table 2. Statistics for accuracy readings per algorithm-metric, across hyperparameters.

std: standard deviation; min: minimum; max: maximum.

Stat    K-means (I)    GMM (I)    HCA (I)    Spectral (I)
Mean    77.63          85.39      41.25      51.33
Std      4.00           1.14      28.09      20.96
Min     71.00          83.00      10.00      10.00
Max     84.00          87.00      78.00      72.00

Choice of metric

Finally, we arrive at the last choice we have to make: which metric to use (step 3 in the aforementioned workflow). Previous work by the author showed that, in the case of HCA, different metrics performed differently,6 but there has not been much work on this topic in general. However, we can once again examine the question experimentally. In the same experimental setup as above, let's examine the performance of the algorithms for a fixed selection of hyperparameters across different metrics.

Every algorithm used the Inertia and Silhouette Score metrics.4 HCA also used the Maximum Difference and Elbow metrics from Ref. 6. K-means and GMM also used the Akaike information criterion (AIC) and Bayesian information criterion (BIC) metrics.4 For the table of results (Table 3), we took the first set of hyperparameters for each algorithm, denoted by the superscript 0.
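As one concrete example (a sketch only, with an illustrative function name), the AIC or BIC curve for GMM can be obtained directly from scikit-learn's GaussianMixture, which exposes both criteria:

from sklearn.mixture import GaussianMixture

def gmm_metric_curve(X, metric="aic", k_range=range(2, 11), **gmm_kwargs):
    # Metric value for each candidate cluster count; aic()/bic() are built into GaussianMixture.
    scores = []
    for k in k_range:
        model = GaussianMixture(n_components=k, **gmm_kwargs).fit(X)
        scores.append(model.aic(X) if metric == "aic" else model.bic(X))
    return list(k_range), scores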

Once again, we can see variations in performance within each algorithm-hyperparameter choice, here based solely on the choice of metrics. It turns out that much like we saw in Table 1 with HCA and Spectral, there is a weak element: the Silhouette Score.

Table 3. Statistics for accuracy readings per algorithm-hyperparameter, across metrics.

std: standard deviation; min: minimum; max: maximum.

Stat    K-means (0)    GMM (0)    HCA (0)    Spectral (0)
Mean    61.75          64.50      24.75      16.50
Std     33.17          37.70       7.59       9.19
Min     12.00           8.00      17.00      10.00
Max     79.00          85.00      35.00      23.00

Though not obvious from this table, further investigation shows that the Silhouette Score has an average accuracy across all algorithms and hyperparameters of only 12%. By cutting down the number of algorithm-hyperparameter combinations we have amplified its effect, and we see it drag performance down across the board - something we could not know ahead of time.

The problem is now apparent: we need to find a way to filter out the many possibly bad choices in algorithms, hyperparameters and metrics if we are to find the number of clusters in a dataset with reasonable accuracy.

Ensemble approach

As we saw in the previous section, we know there are some winning combinations of algorithm-hyperparameter-metric, but we must now figure out how to find them ahead of time. At a high level, our approach uses a fairly simple ensemble method. We clustered the dataset using all the combinations we could think of and selected our predicted number of clusters from all the results combined.

This section’s structure will follow the workflow of our algorithm, split into: set construction, building the ensemble, and voting. While we have found a preferred approach on our test dataset, we present several alternatives to each step of the workflow. First, however, the reader must endure some exposition of notation to be used throughout.

Suppose we work with an ensemble of algorithms, denoted by the set

(1) $\mathcal{A} = \{A_0, \dots, A_N\}$.

These would be the clustering algorithms such as K-means, GMM, etc.

Each algorithm can have a set of hyperparameters associated with it. Let that be written as

(2) $\mathcal{H}_j = \{h_0^j, \dots, h_{H_j}^j\}, \quad j \in \{0, \dots, N\}$,

where $h_i^j$ denotes the $i$th hyperparameter selection for the $j$th algorithm. Note that different algorithms can have different numbers of hyperparameter selections (denoted by $H_j$).

Further, each $h_i^j$ can be comprised of several elements. For instance, for GMM, $h_0^1$ could be the covariance type and regularization parameter, $(\text{DIAGONAL}, 10^{-6})$, and $h_1^1 = (\text{DIAGONAL}, 10^{-5})$. If it helps, think of each $h_i^j$ as a set of kwargs passed into a model object.

Lastly, let's write the set of metrics (inertia, AIC, etc.) used as

(3) $\mathcal{M}_j = \{m_0^j, \dots, m_{M_j}^j\}, \quad j \in \{0, \dots, N\}$,

where $m_i^j$ is the $i$th metric used to evaluate the $j$th algorithm.

Now, given $\mathcal{A}$, $\mathcal{H}_j$ and $\mathcal{M}_j$, we have, for all $j \in \{0, \dots, N\}$,

(4) $P_j = \mathcal{H}_j \times \mathcal{M}_j$

(5) $= \{(h_0^j, m_0^j), \dots, (h_{H_j}^j, m_{M_j}^j)\}$

and

(6) $C_j = \{A_j(p) \mid p \in P_j\}$.

We admit the notation is somewhat opaque, but constructing our actual test sets below should illustrate it nicely.

Workflow

Now that we have defined our $\mathcal{A}$, $\mathcal{H}$ and $\mathcal{M}$ sets - and obtained our $P$ sets - we must compute the clusterings.

This is, quite simply, an exhaustive loop over all the elements of the respective $P_j$ set, where we apply the elbow method as described above. For each $j \in \{0, \dots, N\}$ and each $p \in P_j$:

  • 1. We take an element $p \in P_j$.

  • 2. We use the hyperparameter values from $p$ to compute clusterings for a range of cluster numbers.

  • 3. We then use the metric found in $p$ to get an elbow curve.

  • 4. We find the elbow in the curve.

For each $p$, we have now found a suitable number of clusters for our data. More formally, we have just computed

(7) $C_j = \{A_j(p) \mid p \in P_j\}$.

Note that $C_j$ is simply a set of integers that map back to specific elements in $P_j$. For K-means clustering a 3-cluster dataset, we might get

(8) $C_0 = (3, 3, 2, 4, 3, \dots, 3)$.
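A minimal sketch of this loop for a single algorithm is given below; model_factory, score_fn and find_elbow are illustrative callables standing in for the clustering, scoring and elbow-detection steps, not part of the actual library.

import itertools

def compute_C_j(X, model_factory, hyperparams, metrics, score_fn, find_elbow, k_range=range(2, 11)):
    # C_j = { A_j(p) : p in P_j } for one algorithm.
    # model_factory(k, params) -> fitted model; score_fn(model, X, metric) -> float;
    # find_elbow(ks, scores) -> elbow k. All three are illustrative callables.
    C_j = []
    for params, metric in itertools.product(hyperparams, metrics):  # loop over P_j
        ks, scores = [], []
        for k in k_range:
            model = model_factory(k, params)            # step 2: cluster with p's hyperparameters
            ks.append(k)
            scores.append(score_fn(model, X, metric))   # step 3: evaluate with p's metric
        C_j.append(find_elbow(ks, scores))              # step 4: elbow gives the guessed cluster count
    return C_j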

We can then combine the results into a collection and find the number of clusters and best algorithm-hyperparameter selection from there:

  • 1. Construct the ensemble set (we developed two approaches detailed in the Building the ensemble section).

  • 2. Vote on the number of clusters (we developed three approaches detailed in the Voting section).

Set construction

The first step in the workflow is to construct our $\mathcal{A}$, $\mathcal{H}$ and $\mathcal{M}$ sets. Let's build some actual test sets to illustrate the structure. These were used for the computational results in the Results section, and might help clarify the notation in the meantime. These models were built from the scikit-learn and fastcluster libraries.

First, let’s define the algorithms to look at:

(9) $\mathcal{A} = \{\text{K-means}, \text{GMM}, \text{HCA}, \text{Spectral}\}$.

These were selected based on their diverse natures. Ideally, one wants to choose a collection of algorithms that work well on different types of data. In this case, K-means is a fast and reasonably accurate algorithm for convex datasets, HCA is fundamentally different and does not take cluster numbers as inputs, GMM is well-suited to data with a roughly Gaussian distribution and Spectral has been known to do well with non-convex datasets.

Now, the set of hyperparameters can get rather cumbersome to write out, but let’s explicitly list those ranges for K-means:

(10) init: [k-means++, random]

and

(11) reassignment_ratio: np.geomspace(1e-4, 0.5, 8),

which is

[1.000e-04 3.376e-04 1.139e-03 3.848e-03 1.299e-02 4.386e-02 1.480e-01 5.000e-01].

This gives us the following set of hyperparameters for K-means:

(12) $\mathcal{H}_0 = \{h_0^0, \dots, h_{16}^0\} = \{(\text{k-means++}, 1.000\mathrm{e}{-}04), (\text{k-means++}, 3.376\mathrm{e}{-}04), \dots,$

(13) $(\text{random}, 5.000\mathrm{e}{-}01)\}$.

For brevity, we present the remaining sets based on their base ranges:

GMM: covariance_type: [diag, tied, spherical], reg_covar: np.geomspace(1e-8, 1e-2, 6)
HCA: method: [centroid, median, single, ward], metric: [euclidean]
Spectral: affinity: [laplacian, precomputed, rbf, sigmoid], metric: [cosine, l2, l1], n_neighbors: [5, 20, 100], gamma: [0.1, 1.0, 10.0]

(where, for Spectral, metric and n_neighbors are only used for precomputed, and gamma is ignored for precomputed).

In general, hyperparameters should be selected based on available information. If a researcher can somehow narrow the hyperparameter space through other knowledge, they should do so. In the absence of such information, as is our case here, we tried to choose hyperparameter ranges that span the space.

Obviously, some of these parameters can take on an infinite number of values (and we have limited computing resources), but we find it judicious to choose a smaller number of values across orders of magnitude to obtain a representative sample of reasonable values.

Now that we have our $\mathcal{A}$ and $\mathcal{H}$ sets, we need $\mathcal{M}$. Per algorithm, we have:

K-means: [aic, bic, inertia, silhouette_score]
GMM: [aic, bic, inertia, silhouette_score]
HCA: [elbow, inertia, silhouette_score, max_diff]
Spectral: [inertia, silhouette_score]

In other words, for K-means, we get:

(14) $P_0 = \{(\text{k-means++}, 1.000\mathrm{e}{-}04, \text{aic}), \dots, (\text{random}, 5.000\mathrm{e}{-}01, \text{silhouette})\}$.
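In code, the same construction might look like the following sketch (variable names are illustrative):

import itertools
import numpy as np

# H_0: 2 init strategies x 8 reassignment ratios = 16 hyperparameter pairs.
H_0 = list(itertools.product(["k-means++", "random"], np.geomspace(1e-4, 0.5, 8)))
# M_0: the metrics evaluated for K-means.
M_0 = ["aic", "bic", "inertia", "silhouette_score"]
# P_0: every hyperparameter-metric pairing, 16 x 4 = 64 elements.
P_0 = list(itertools.product(H_0, M_0))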

What remains now is to construct our ensemble collection $\mathcal{E}$.

Building the ensemble

We present two approaches for building the ensemble set in the following sections, and detail the rest of the workflow thereafter.

Raw

Given our $C$ sets, the most natural way to construct our ensemble is simply to check every possible combination, essentially a cross product of our sets. This gives us our ensemble of values

(15) $\mathcal{E} = C_0 \times \dots \times C_N$

(16) $= \{A_0(p) \mid p \in P_0\} \times \dots \times \{A_N(p) \mid p \in P_N\}$

(17) $= \{(A_j(p_i^j))_{j \in \{0, \dots, N\}} \mid i \in [0, H_j M_j]\}$

(18) $= \{(A_0(p_0^0), \dots, A_N(p_0^N)), \dots, (A_0(p_{H_0 M_0}^0), \dots, A_N(p_{H_N M_N}^N))\}$.

Following the example from the previous section, $\mathcal{E}$ would be comprised of four-tuples spanning all possible combinations. Each entry of a tuple is a “guessed” cluster number produced by a specific algorithm-hyperparameter-metric choice.

If we structured $\mathcal{E}$ as a matrix, the first few rows might look like

E = [3 3 2 3]
    [3 3 2 4]
    [3 3 2 2]
    [3 3 2 5]

where the first column corresponds to guesses from K-means, the second from GMM, then the HCA and Spectral algorithms.
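A sketch of the Raw construction (function name and toy inputs are illustrative):

import itertools
import numpy as np

def build_raw_ensemble(C_sets):
    # Cross product of the per-algorithm result sets:
    # one row per combination, one column per algorithm.
    return np.array(list(itertools.product(*C_sets)))

# Toy usage with made-up guesses for four algorithms:
E = build_raw_ensemble([[3, 3, 2], [3, 4], [2, 3], [3, 4, 2, 5]])
# E.shape == (48, 4): 3 * 2 * 2 * 4 rows, one column per algorithm.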

Mode

While the Raw approach detailed above is the simplest, it ignores a potentially important point. The choice of metric, while critical to the workflow, is intrinsically an "elbow-method" parameter. This sets it apart from the choice of algorithm and hyperparameters, which would be necessary regardless of approach.

With this in mind, we consider another formulation which first takes the mode of the results across metrics. That is, for a given algorithm and hyperparameter configuration, we take as a result the most commonly guessed number of clusters across all metric choices. We define

(19) $\bar{C}_j = \{\mathrm{Mode}(\{A_j((h_0^j, m)) \mid m \in \mathcal{M}_j\}), \dots, \mathrm{Mode}(\{A_j((h_{H_j}^j, m)) \mid m \in \mathcal{M}_j\})\}$

(20) $= \{\mathrm{Mode}(\{A_j((h, m)) \mid m \in \mathcal{M}_j\}) \mid h \in \mathcal{H}_j\}$.

From this we can define our ensemble $\bar{\mathcal{E}}$ in the same way as $\mathcal{E}$,

(21) $\bar{\mathcal{E}} = \bar{C}_0 \times \dots \times \bar{C}_N$,

which might give us an example matrix of

E = [3 3 3 3]
    [3 2 2 4]
    [3 3 2 2]
    [3 3 2 3]

Here, our matrix will be smaller than in the Raw approach. Each column still corresponds to an algorithm, but each entry is now the mode, across metrics, of the guesses produced by one hyperparameter configuration.
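A sketch of the Mode construction (the nested-list layout and function name are illustrative):

from statistics import mode
import itertools
import numpy as np

def build_mode_ensemble(results_per_algorithm):
    # results_per_algorithm[j][h] holds algorithm j's guesses for hyperparameter
    # configuration h, one guess per metric. Take the mode across metrics first,
    # then cross the per-algorithm sets exactly as in the Raw construction.
    C_bar = [[mode(guesses) for guesses in per_hyperparam]
             for per_hyperparam in results_per_algorithm]
    return np.array(list(itertools.product(*C_bar)))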

Voting

Now, given our matrix E, there are a few ways to combine the results and vote on them. As a toy example, consider this result for a three-cluster dataset:

E = [2 2 2 2]
    [2 2 2 2]
    [3 2 3 3]
    [3 3 2 3]
    [3 3 3 2]

Full

The simplest approach is to take the most common cluster number found in our ensemble. While straightforward, it doesn't allow us to capture any additional information, nor filter out errors or biases in any way. Our toy example contains eleven 2s and nine 3s; we would get an incorrect final result of R = 2.

Column-First

Another naive approach would be to vote along algorithms (i.e., the columns), giving us four results, and then to vote for the most common answer among those four. One possible issue with this approach is the case where we have a few particularly ill-suited algorithms. Looking at the same example, we get

[3 2 2 2]

and an incorrect final result of R = 2.

Row-First

Finally, we can look to capture what we are calling the cohesion between the results - essentially, favoring their agreement. By first looking at the individual rows of our example we would have

[2 2 3 3 3]

and therefore R = 3.

Given that our $\mathcal{E}$ set (or $\bar{\mathcal{E}}$) covers all combinations of results, we are choosing to prioritize those cases where our different algorithm-parameter-metric results are cohesive (row-wise), before looking at their actual value (column-wise).
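The three voting schemes can be written compactly on this toy matrix (a sketch only; variable names are illustrative):

from statistics import mode
import numpy as np

E = np.array([
    [2, 2, 2, 2],
    [2, 2, 2, 2],
    [3, 2, 3, 3],
    [3, 3, 2, 3],
    [3, 3, 3, 2],
])

# Full: most common value in the whole matrix -> 2.
full_vote = mode(E.ravel().tolist())
# Column-first: vote within each algorithm (column), then across those votes -> [3, 2, 2, 2] -> 2.
col_first = mode([mode(col.tolist()) for col in E.T])
# Row-first: vote within each row (cohesion), then across those votes -> [2, 2, 3, 3, 3] -> 3.
row_first = mode([mode(row.tolist()) for row in E])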

We present results for the three approaches in the next section.

Results

Now, we arrive at our results. In the following sections, we present the results of the six different approaches (Raw/Mode construction with Full/Row/Column voting) on the 100 simulated datasets from the Simulated data section. We compared our results to two benchmarks.

The first is the expected value from randomly sampling our result set 100 times; essentially the accuracy we could expect from choosing an algorithm, hyperparameters and metrics beforehand.

The second is the consensus clustering approach put forward by Monti et al.8 In this case, we used K-means and GMM and attempted to pass in both default hyperparameters (D) and the best performing hyperparameters (B) as determined by our method.

Performance

Here, we define accuracy by comparing the predicted number of clusters for each of the 100 datasets - based on voting on the $\mathcal{E}$ or $\bar{\mathcal{E}}$ set - to the actual number of clusters (three in every case).

Overall results are shown in Table 4.

Table 4. Accuracy on 100 simulated datasets.

        Raw      Mode
Full    86.00    93.00
Row     89.00    92.00
Col     88.00    90.00

While there were differences in performance between the voting methods, the signal was somewhat muddled. In the case of the Raw construction the Row-first approach was best, while for a Mode construction a Full vote was preferable. Overall, the differences in voting performance were also small, providing at most a 3% increase. We don’t consider it prudent to declare one voting approach more beneficial than another.

On the other hand, the choice between using a Raw ensemble construction or a mode-based $\bar{\mathcal{E}}$ set seems to be clearer. Using a mode construction improved performance across the board, by 2 to 7 percentage points depending on the voting scheme.

Given a mode-based $\bar{\mathcal{E}}$ set, the naive Full vote took the lead: it was overall the best performer, with 93% accuracy. Additionally, we find it important to note that this actually outperformed even the maximum performance we saw in Table 1, which was 91%.

Given that the latter would require a perfect guess as to which algorithm-hyperparameter-metric combination to use, we find it even more satisfying.

Benchmarks

Table 5 details our benchmark performance as defined in this section: consensus clustering and expected value from random sampling.

Table 5. Benchmark accuracy on 100 simulated datasets.

Expected value: 57.29

Consensus:
KMeans (D)    82.00
KMeans (B)    87.00
GMM (D)       66.00
GMM (B)       46.00

Not unexpectedly, randomly guessing at possible solutions yielded unsatisfactory results. We note that it still scored above 50%, likely because even poor algorithm configurations can pick up some signal.

Consensus clustering gave very unpredictable results, with a very large variance in performance - though it did peak at a respectable 87%. We reiterate our previous point, however, that it requires a choice of algorithm and hyperparameters as inputs, greatly reducing its effectiveness in practice.

Perhaps the most interesting result to come out of benchmarking was the performance of consensus clustering when given the hyperparameter selections that did very well in our method. In the case of GMM this led to a drastic drop in performance (30%), while for K-means we saw a 6% increase. This indicates that, while it is likely a non-trivial exercise to combine the approaches in a reasonable way, doing so could be worth further investigation.

Discussion

Usage

Now that we have examined this problem and our proposed solution, we would like to discuss some other elements: namely, typical usage setups and some simple approaches to selecting the best algorithm-hyperparameter combination in each case. As our colleagues in industry would ask: how do we use this in production?

We would like to note that the following algorithm selection methods in particular are merely simple approaches to get things off the ground. There are undoubtedly other ways to solve this and we encourage further work in this area.

We believe there to be two general use-cases depending mainly on computational constraints: the case where all our data can be processed at once, and the case where we must slice our data into subsets first.

Data subsets

If we look at the case where we must partition our data due to computational limitations, we find ourselves in essentially the same framework that we did throughout the paper. While we will leave it to the reader to determine the best way to sample subsets of their data while capturing all clusters, let’s examine how this ensemble framework would work.

Throughout, we have looked at 100 simulated datasets as a means of getting accuracy metrics. Suppose now we take our large dataset and split it into 100 smaller, more manageable, datasets. We are now in the same situation as we were earlier in the paper: we would expect the number of clusters to be the same for each of the smaller datasets, and we would aggregate the results of 100 ensemble methods (albeit not with “accuracy”).

Estimating the number of clusters

Now that we have our 100 subsets, we can construct our E matrix for each one of them and vote on the number of clusters. This gives us 100 answers, one cluster number estimate per subset.

From there, we could select the most common answer (i.e., the mode) as our global estimated number of clusters (instead of computing accuracy as we did in this paper). Having determined how many clusters our dataset has, we arrive at the question of choosing an algorithm.

Algorithm-hyperparameter selection

Still within our 100 subset context, let’s examine how we could choose a best algorithm-hyperparameter combination. We reiterate that this is only one of what is likely many possible approaches.

Given our estimated number of clusters, all the answers estimated by each algorithm-hyperparameter-metric combination found in the subset E matrices can be stored. From there, the accuracy of each algorithm-hyperparameter-metric combination (relative to our global estimated number of clusters) can be computed. The best-performing combination can be taken as a reasonable way of clustering future data from the same dataset.
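A sketch of both steps - the global vote across subsets and the per-combination tally - is given below; the data structures and function names are illustrative:

from collections import Counter

def global_cluster_estimate(subset_votes):
    # Most common voted cluster number across the subsets.
    return Counter(subset_votes).most_common(1)[0][0]

def select_best_combination(guesses_per_subset, global_k):
    # guesses_per_subset: one dict per subset, mapping an
    # (algorithm, hyperparameters, metric) key to that combination's guessed k.
    hits = Counter()
    for guesses in guesses_per_subset:
        for combo, k in guesses.items():
            hits[combo] += int(k == global_k)
    # The combination that most often matched the global estimate.
    return hits.most_common(1)[0][0]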

Indeed, in our simulated case, this approach correctly identified that GMM is the best choice, and more specifically that the combination

algorithm: GMM
hyperparameters: covariance_type: diag, reg_covar: 1e-8
metric: AIC

achieved the best results, with 91% accuracy. Readers will note this is indeed the top performance we can achieve as per Table 1.

It seems reasonable to assume that given more data drawn from the same dataset, this algorithm with these hyperparameters would do well at clustering it in a sensible way. We do note that it is possible to apply the logic presented in the Algorithm-hyperparameter selection section under Methods; in this case as well, such logic is included in the code library by default.

Full dataset

Now let’s look at the simpler case where all of our data can be processed together as a single dataset.

Estimating the number of clusters

In this happy scenario, estimating the number of clusters is relatively straightforward: we run a single workflow. We begin by constructing our E matrix. Then, we compute the results of voting, which directly gives us the predicted number of clusters in our data.

In this case there is no need for any aggregation as we have a single outcome from the vote.

Algorithm-hyperparameter selection

When it comes to identifying the right choice of combination, however, we can't proceed as we did in the subset case. Given that we have no way of computing accuracy, we will instead look for the "most stable" choice. Once more, this is simply a first approach, and future work could likely result in improvements.

For each algorithm-hyperparameter combination, we look at its predictions across metrics (which are not needed for future clustering). For example, suppose that on the first dataset from our simulated data the GMM combination

algorithm: GMM
hyperparameters: covariance_type: diag, reg_covar: 1e-8

obtained results of [3, 3, 3, 3] across its four metrics, whereas

algorithm: GMM
hyperparameters: covariance_type: spherical, reg_covar: 1e-2

obtained [3, 4, 2, 3].

In this example, we would favor the combination whose guesses most often matched the predicted number of clusters: [3, 3, 3, 3]. Note that we are essentially looking for stability and insensitivity to metric choice.

If we extend this comparison to every algorithm-hyperparameter combination, we arrive at a suitable combination choice to use for future clustering. We note, however, that given the small number of metric choices, this approach is less likely to yield a unique best choice.
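A sketch of this stability-based selection (the dictionary layout and function name are illustrative):

def most_stable_combinations(guesses_per_combo, predicted_k):
    # guesses_per_combo maps an (algorithm, hyperparameters) key to the list of
    # cluster-number guesses it produced across its metrics. Return every
    # combination whose guesses most often equal the voted cluster number.
    scores = {combo: sum(g == predicted_k for g in guesses)
              for combo, guesses in guesses_per_combo.items()}
    best = max(scores.values())
    return [combo for combo, s in scores.items() if s == best]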

If that is the case, any of the top-ranked combinations may be selected, as they are equally likely to achieve desirable results - as per this stability framework.

Conclusion

We have developed a workflow with six possible configurations for determining the number of clusters in an unlabeled dataset, while offering a flexible basis for determining the optimal choice of algorithm and associated hyperparameters. While certain methods already exist, such as for agglomerative hierarchical clustering and consensus clustering, they each present difficulties in the field that our approach addresses.

Firstly, we no longer require researchers - who may not be subject matter experts in unsupervised learning, but rather in their own domains - to provide such specific inputs as particular algorithms and hyperparameter-metric configurations. Instead, given reasonably spanning ranges of hyperparameters and a diverse selection of algorithms, we can reliably predict the number of clusters present with more than 90% accuracy.

We also obtained - at very little cost - a reasonably suitable choice of which algorithm and which hyperparameters to use to further cluster the data. While not every situation will require clustering additional incoming data from the same distribution, these combination performance findings are sure to be beneficial to researchers.

Lastly, it is our hope that the simple algorithmic structure of this approach can lead to reliably simple integration into commonly used software libraries - thereby removing another barrier to entry. The code used for this framework can be found on GitHub.

In the future we hope to explore several avenues of work related to this approach. These include studying other means of measuring cohesion (beyond a row-first approach), as well as a process to automatically track the ensemble set as it is populated and adjust the hyperparameter space dynamically; this could allow for faster computations and less noise in the resulting ensemble. Finally, we hope to pursue a careful and constructive combination of consensus clustering and our ensemble method.

Data availability

Zenodo: antoinezambelli/ensemble-clustering: v1.0.0, https://github.com/antoinezambelli/ensemble-clustering/tree/v1.0.0

Analysis code

Analysis code available from: https://github.com/antoinezambelli/ensemble-clustering

Archived analysis code as at time of publication: https://github.com/antoinezambelli/ensemble-clustering/tree/v1.0.0

License: MIT
