Ensemble Method for Cluster Number Determination and Algorithm Selection in Unsupervised Learning

Unsupervised learning, and more specifically clustering, suffers from the need for expertise in the field to be of use. Researchers must make careful and informed decisions on which algorithm to use with which set of hyperparameters for a given dataset. Additionally, researchers may need to determine the number of clusters in the dataset, which is unfortunately itself an input to most clustering algorithms. All of this must happen before they embark on their actual subject matter work. After quantifying the impact of algorithm and hyperparameter selection, we propose an ensemble clustering framework which can be leveraged with minimal input. It can be used to determine both the number of clusters in the dataset and a suitable choice of algorithm to use for a given dataset. A code library is included in the Conclusion for ease of integration.


I. INTRODUCTION
Unsupervised learning, one of the main branches of machine learning, is the study of previously unlabeled datasets. With the hope of gaining new insights into their data and its structure, researchers can attempt to group or segment their data based on similarities between the data points in an exercise called clustering.
Clustering problems have been studied in depth, with applications in the fields of Computational Biology, Operations Research and Social Sciences [1], [8], [13], to name a few. Quite a few algorithms are available for use, including in popular open-source libraries [5], [12]. However, these problems have often required human intervention on the part of the researcher. Naturally, this severely limits automation and is prone to human error.
The need for researcher input is due mainly to two problems: determining the number of clusters in the dataset, and choosing an algorithm to cluster with. Both can produce highly inaccurate results if poorly selected.

A. The Landscape
The canonical approach to determining the number of clusters in a dataset is the "elbow method", which we detail in Section II. In short, one clusters the dataset with different numbers of clusters and compares the outputs, looking for an elbow in the curve of results. However, this has inherent shortcomings: a researcher must still choose which algorithm and which set of hyperparameters to use.
*This is a preprint.
While Hierarchical Clustering Algorithms (HCA) can be used to automatically find the number of clusters with reasonable accuracy in some cases [16], using this method means we miss out on the many other algorithms that have been developed, even if they would be better suited to our dataset.
A large number of sophisticated methods have been explored as well [15]. One of the more famous approaches, called Consensus Clustering, was developed by Monti et al. [9]. Many of these methods (including Monti's) suffer from a high level of complexity and abstraction, and are often based on the idea of partitioning the data [3], [4]. Essentially, they attempt to cluster many different subsets of the data under different cluster numbers and then select the most stable.
Apart from usability issues stemming from their complexity, many of these methods can get computationally intensive (including in their memory requirements). Moreover, as noted in [7], they do not generally perform well when it comes to estimating the number of clusters. As a final note, these approaches also suffer from the need for researcher input as to the choice of algorithm and parameters. We consider it prudent, therefore, to explore the topic further.
In this paper, we propose an ensemble approach to answering the following questions: How many clusters are in the dataset, and which algorithm-hyperparameter choice is best for this data? Our approach outputs the number of clusters, as well as both a model choice and a set of hyperparameters to use with the model. Note that we do not define ensemble in the sense of a collection of partitions, but rather of separate algorithms, as seen in [11].
Luckily, several of the graph-based methods mentioned above, including Consensus Clustering and its improvements, are actually complementary to our proposed method, and we see no reason why they could not be combined, albeit with a bit of work. Consensus Clustering takes as input a model and its hyperparameters, but has no framework for choosing a suitable model. Likewise, the approach we present here does not account for any data partitions.
While we leave the task of truly combining our approach with existing methods to future work, we present evidence of the benefits of accounting for model and hyperparameter choices in Monti's Consensus Clustering, and we invite the reader to consider the possibilities as they read through our work. In Section II, we will look at the problem in more detail, exploring the three main areas that must be addressed. In Section III, we will define our algorithm or workflow. We will then discuss results in Section IV, present typical usage in Section V, and finally summarize our approach and findings in Section VI.

II. THE PROBLEM
As mentioned, the central issue we are facing is that of finding the number of clusters in our dataset. While we have previously stated that the choice of algorithm and hyperparameter set is important, we found it was often overlooked in the literature, being taken as a given or covered at a high level [10]. So, we set out to compute some baseline measurements of the performance changes that the choice of algorithm and hyperparameter set can cause. We found that, in fact, this choice had a very large impact on predicting the number of clusters in a dataset, and were ourselves quite startled by the magnitude of these variations.
Perhaps the most common approach to determining the number of clusters is to use the elbow method [2], [14]. This involves making many attempts at clustering, and then picking the one that seems to fit best. More directly, the workflow is:
1) Choose a clustering algorithm and parameter set.
2) Cluster the dataset once for each candidate number of clusters, giving N − 1 attempts.
3) For each of the N − 1 attempts, compute a metric (e.g., BIC).
4) Find the elbow in the curve of metric values; its x-axis value is our number of clusters.
Note that automatically finding the elbow can be done in several ways, such as the minimum absolute second derivative, the point of best linear fit, or, as we will use here, the triangle method.
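As a rough illustration of the last step, the sketch below locates the elbow with a chord-distance rule, which is one common reading of the triangle method: it picks the point farthest from the straight line joining the first and last points of the curve. The function name `find_elbow`, the axis normalization, and the inertia values are assumptions made for illustration; they are not taken from the accompanying code library.

```python
def find_elbow(ks, values):
    """Return the k whose point lies farthest from the chord joining the
    first and last points of the (k, metric) curve."""
    # Rescale both axes to [0, 1] so neither dominates the distance.
    k_span = (ks[-1] - ks[0]) or 1.0
    v_span = (max(values) - min(values)) or 1.0
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - min(values)) / v_span for v in values]

    # Chord from the first to the last point of the normalized curve.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    length = (dx * dx + dy * dy) ** 0.5 or 1.0

    best_k, best_dist = ks[0], -1.0
    for k, x, y in zip(ks, xs, ys):
        # Perpendicular distance from (x, y) to the chord.
        dist = abs(dy * (x - xs[0]) - dx * (y - ys[0])) / length
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Hypothetical inertia values for k = 2..8 (made up for illustration).
ks = [2, 3, 4, 5, 6, 7, 8]
inertia = [1000.0, 300.0, 250.0, 210.0, 180.0, 160.0, 150.0]
elbow = find_elbow(ks, inertia)
```

With these made-up values the curve drops sharply between 2 and 3 clusters and flattens afterwards, so the rule selects 3. Other elbow rules may disagree on noisier curves.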
Intuitively, we can view this method through the lens of Information Theory. Namely, the curve represents the amount of information explained as we increase the number of clusters. As we begin getting diminishing returns, we say that we are no longer explaining the data, and have too many clusters (hence the cutoff at the elbow).
Our approach will rest on this method and principle, but will tackle its three weaknesses, namely the choices that were implicitly made: algorithm, hyperparameters and metric. Many of the difficulties are discussed at a high level in [10]. We are essentially tackling Step 2 (and partially Step 1) in Fig. 1. For now, let's quantify exactly how important these choices are. Using the elbow method as a baseline, we'll compute some accuracy statistics on 100 randomly generated 3-cluster datasets, detailed in Section II-A. We define accuracy as correctly determining the number of clusters we have (in this case 3). For each algorithm, we look at performance across a broad range of hyperparameters and metrics, detailed in Section III-B.

B. Choice of Algorithm
The first step many researchers will take to successfully cluster their data is to choose a clustering algorithm (step 1 in the traditional workflow above). A variety of inherently distinct algorithms exist, from Spectral methods to HCA and DBSCAN. The issue at this step is that different algorithms can be better suited to different datasets [6], and this can be very difficult to determine ahead of time. While there are some generally accepted behaviors (e.g., Spectral clustering works well on non-convex datasets [12]), we can also see this experimentally.
For each algorithm, hyperparameter and metric choice, we compute the accuracy of our predicted number of clusters on the 100 datasets, giving us M accuracy readings. For example, for 1 algorithm with 2 possible metrics and 3 hyperparameter values, we would have 6 accuracy readings for the clustering algorithm. If the choice of algorithm did not matter, then we would get the same statistics across different algorithms. Table 1 shows the statistics for those M readings for K-means, Gaussian Mixture Model (GMM), HCA and Spectral. As we can see, different algorithms obtain rather different results (whether in mean or max performance). Particularly interesting is the performance of the Spectral and HCA algorithms, which are clearly ill-suited to our datasets. On the other hand, we can see that there is some combination of metrics and hyperparameters for which GMM does quite well, with a 91% accuracy. Its standard deviation is quite high though, suggesting that different metrics and hyperparameters can lead to quite different results.

C. Choice of Hyperparameters
The second choice traditionally faced by researchers is hyperparameters (implicitly contained in step 2 of the traditional workflow). Generally referred to as hyperparameter tuning, this can greatly improve the performance of a model. Let's examine the effect of hyperparameters in our experimental setup from Section II-A, holding the choice of metric constant (selecting inertia; more on this in Section II-D). This gives us Table 2, which shows that the choice of hyperparameters, independent of other choices, can lead to large differences in performance. For instance, a judicious choice of hyperparameters in our K-means algorithm can lead to an 18% increase in performance. Similarly, a poor choice for Spectral leads to a staggering 86% drop. Unfortunately, we have no way of knowing ahead of time which parameter selection will yield the best results in a clustering problem (given an absence of ground truth).
Further, hyperparameters explain some of the variation we saw in Table 1 but not all.A quick look at the various statistics shows us we are missing another piece: the minimum performance seen by the K-means algorithm is now 71% instead of the rather shocking 8%.This indicates one more component to the problem.

D. Choice of Metric
Finally, we arrive at the last choice we have to make: which metric to use (step 3 in the aforementioned workflow). Previous work by the author showed that in the case of HCA different metrics performed differently [16], but there hasn't been much work on this topic in general. However, we can once again examine this experimentally. In the same experimental setup as we used above, let's examine the performance of algorithms for a fixed selection of hyperparameters across different metrics.
Every algorithm used the Inertia and Silhouette Score metrics [12]. HCA also used the Maximum Difference and Elbow metrics from [16]. K-means and GMM also used the AIC and BIC metrics [12]. For the table of results, we take the first set of hyperparameters for that algorithm, denoted by the superscript 0. Once again, we can see variations in performance within each algorithm-hyperparameter choice, here based solely on the choice of metrics. It turns out that, much like we saw in Table 1 with HCA and Spectral, there is a weak element: the Silhouette Score.
Though not obvious from this table, further investigation shows that it has an average accuracy across all algorithms and hyperparameters of only 12%. By cutting the number of algorithm-hyperparameter combinations we've amplified its effect, and are seeing it drag performance down across the board, which is something we couldn't know ahead of time.
The problem is now apparent: we need to find a way to filter out the many possibly bad choices in algorithms, hyperparameters and metrics if we are to find the number of clusters in a dataset with reasonable accuracy.

III. ENSEMBLE APPROACH
As we saw in the previous section, there are some winning combinations of algorithm, hyperparameters and metric, but we must now figure out how to find them ahead of time. At a high level, our approach will use a fairly simple ensemble method. We will cluster the dataset using all the combinations we can think of and select our predicted number of clusters from all the results combined. This section's structure will follow the workflow of our algorithm, split into Sections III-B through III-D. While we have found a preferred approach on our test dataset, we present several alternatives to each step of the workflow. First, however, the reader must endure some exposition of notation to be used throughout.
Suppose we work with an ensemble of algorithms, denoted by the set
$$A = \{a_0, a_1, \dots, a_N\}.$$
These would be the clustering algorithms such as K-means, GMM, etc. Each algorithm can have a set of hyperparameters associated with it. Let that be written as
$$H_j = \{h_j^0, h_j^1, \dots\},$$
where $h_j^i$ denotes the $i$th hyperparameter selection for the $j$th algorithm. Note that different algorithms can have different numbers of hyperparameter selections, so the sets $H_j$ can differ in size.
Further, each $h_j^i$ can be composed of several elements. For instance, for GMM, $h_1^0$ could be the covariance type and regularization parameter, $h_1^0 = (\text{DIAGONAL}, 10^{-6})$, and $h_1^1 = (\text{DIAGONAL}, 10^{-5})$. If it helps, think of each $h_j^i$ as a set of KWARGS passed into a model object.
Lastly, let's write the set of metrics (inertia, AIC, etc.) used as
$$M_j = \{m_j^0, m_j^1, \dots\},$$
where $m_j^i$ is the $i$th metric used to evaluate the $j$th algorithm. Now, given $A$, $H_j$ and $M_j$, we have, $\forall j \in [0, N]$,
$$P_j = H_j \times M_j.$$
We admit the notation is somewhat opaque, but constructing our actual test sets should illustrate it nicely.
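To make the notation more concrete, here is a minimal sketch of how the hyperparameter and metric sets of a single algorithm could be paired up. It assumes that each element of the P set is one (hyperparameter selection, metric) pair, which is how the workflow in the next subsection uses it. The variable names are hypothetical, the GMM values come from the example above, and the metric names are placeholders.

```python
from itertools import product

# Hyperparameter selections for one algorithm (here GMM, j = 1); each
# entry plays the role of one h_j^i and is stored as a kwargs dict.
H_gmm = [
    {"covariance_type": "diag", "reg_covar": 1e-6},
    {"covariance_type": "diag", "reg_covar": 1e-5},
]

# Metrics used to evaluate that algorithm (the m_j^i); names only.
M_gmm = ["inertia", "aic"]

# P_j pairs every hyperparameter selection with every metric.
P_gmm = list(product(H_gmm, M_gmm))
```

Here two hyperparameter selections and two metrics give four pairs, each of which will later produce one guess at the number of clusters.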

A. Workflow
Now that we have defined our $A$, $H$ and $M$ sets, and obtained our $P$ sets, we must compute the clusterings. This is, quite simply, an exhaustive loop over all the elements of the respective $P$ set, where we apply the elbow method as described in Section II. $\forall j \in [0, N]$ and $\forall p \in P_j$:
1) We take an element $p \in P_j$.
2) We use the hyperparameter values from $p$ to compute clusterings for a range of cluster numbers.
3) We then use the metric found in $p$ to get an elbow curve.
4) We find the elbow in the curve.
For each $p$, we have now found a suitable number of clusters for our data. More formally, we have just computed the set $C_j$. Note that $C_j$ is simply a set of integers that map back to specific elements in $P_j$. For K-means clustering a 3-cluster dataset, for instance, we would hope for most of these integers to equal 3. We can then combine the results into a collection and find the number of clusters and best algorithm-hyperparameter selection from there:
1) Construct the ensemble set $E$ (we develop two approaches detailed in Section III-C).
2) Vote on the number of clusters (we develop three approaches detailed in Section III-D).
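The exhaustive loop over one P set could be sketched as follows. This is only an outline under stated assumptions: the callables `score_curve` and `find_elbow` are hypothetical placeholders for the clustering-plus-metric step and the elbow search, and the toy stand-ins below exist only so the sketch runs; they perform no real clustering.

```python
def estimate_cluster_numbers(P_j, score_curve, find_elbow, ks):
    """Apply the elbow method once per p in P_j and collect the guesses.

    score_curve(kwargs, metric, ks) must return one metric value per k;
    find_elbow(ks, values) must return the chosen k. Both are supplied by
    the caller and are placeholders here.
    """
    C_j = []
    for kwargs, metric in P_j:
        values = score_curve(kwargs, metric, ks)  # cluster and score each k
        C_j.append(find_elbow(ks, values))        # locate the elbow
    return C_j


# Toy stand-ins so the sketch runs end to end (not real clustering).
def fake_score_curve(kwargs, metric, ks):
    # Pretend the curve flattens after k = 3 for every configuration.
    return [100.0 if k < 3 else 10.0 - 0.1 * k for k in ks]


def largest_drop_elbow(ks, values):
    # Pick the k reached by the largest single-step decrease.
    drops = [values[i - 1] - values[i] for i in range(1, len(values))]
    return ks[drops.index(max(drops)) + 1]


# Hypothetical P set: two kwargs dicts paired with one metric name.
P_j = [({"n_init": 1}, "inertia"), ({"n_init": 3}, "inertia")]
C_j = estimate_cluster_numbers(P_j, fake_score_curve, largest_drop_elbow, [2, 3, 4, 5])
```

Because the guesses are appended in the order of the P set, each integer can be mapped back to the pair that produced it.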

B. Set Construction
The first step in the workflow is to construct our $A$, $H$, and $M$ sets. Let's build some actual test sets to illustrate the structure. These will be used for computational results in Section IV, and might help clarify the notation in the meantime. These models are built from the scikit-learn and fastcluster libraries.
First, let's define the algorithms to look at:
$$A = \{\text{K-means}, \text{GMM}, \text{HCA}, \text{Spectral}\}.$$
These were selected based on their diverse natures. Ideally, one wants to choose a collection of algorithms that work well on different types of data. In this case, K-means is a fast and reasonably accurate algorithm for convex datasets, HCA is fundamentally different and does not take cluster numbers as inputs, GMM is well-suited to data with a roughly Gaussian distribution, and Spectral has been known to do well with non-convex datasets.
Now, the set of hyperparameters can get rather cumbersome to write out, but let's explicitly list the ranges for K-means. Among them is reassignment_ratio: np.geomspace(1e-4, 0.5, 8), that is, eight values spaced geometrically between 1e-4 and 0.5. Combining these ranges gives us the set of hyperparameters for K-means. For brevity, we present the remaining sets based on their base ranges (where, for Spectral, metric and n_neighbors are only used for precomputed, and gamma is ignored for precomputed).
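To show what such a range looks like in practice, the sketch below reproduces the geometric spacing of np.geomspace(1e-4, 0.5, 8) in plain Python and wraps each value in a kwargs dict. The helper `geomspace` and the name `H_kmeans` are illustrative assumptions, and only the reassignment_ratio range is included since it is the one spelled out above.

```python
def geomspace(start, stop, num):
    """Pure-Python equivalent of np.geomspace for positive endpoints."""
    ratio = (stop / start) ** (1.0 / (num - 1))
    return [start * ratio ** i for i in range(num)]


reassignment_ratios = geomspace(1e-4, 0.5, 8)

# One kwargs dict per value; further K-means ranges would be combined
# with this one (e.g. via itertools.product) to form the full set.
H_kmeans = [{"reassignment_ratio": r} for r in reassignment_ratios]
```

Each value is a constant multiple of the previous one, so the eight values cover several orders of magnitude with few evaluations.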
In general, hyperparameters should be selected based on available information.If a researcher can somehow narrow the hyperparameter space through other knowledge, they should do so.In the absence of such information, as is our case here, we try to choose hyperparameter ranges that span the space.
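Obviously, some of these parameters can take on an infinite number of values (and we have limited computing resources), but we find it judicious to choose a smaller number of values across orders of magnitude to obtain a representative sample of reasonable values. Now that we have our $A$ and $H$ sets, we need $M$. Per algorithm, we have:
$$M_0 = \{\text{Inertia}, \text{Silhouette}, \text{AIC}, \text{BIC}\} \quad \text{(K-means)},$$
$$M_1 = \{\text{Inertia}, \text{Silhouette}, \text{AIC}, \text{BIC}\} \quad \text{(GMM)},$$
$$M_2 = \{\text{Inertia}, \text{Silhouette}, \text{Maximum Difference}, \text{Elbow}\} \quad \text{(HCA)},$$
$$M_3 = \{\text{Inertia}, \text{Silhouette}\} \quad \text{(Spectral)}.$$
In other words, for K-means, $P_0$ pairs every hyperparameter selection in $H_0$ with each of the four metrics in $M_0$. What remains now is to construct our ensemble collection $E$.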

C. Building the Ensemble
We present two approaches for building the ensemble set E in the following sections, and detail the rest of the workflow thereafter.
1) Raw: Given our $C$ sets, the most natural way to construct our ensemble is simply to check every possible combination, essentially a cross product of our sets. This gives us our ensemble of values
$$E = C_0 \times C_1 \times \dots \times C_N.$$
Following the example from the previous section, $E$ would be composed of 4-tuples spanning all possible combinations. Each tuple would contain a "guessed" cluster number given a specific algorithm-hyperparameter-metric choice.
If we structured $E$ as a matrix, each row would be one such tuple, where the first column corresponds to guesses from K-means, the second from GMM, then HCA and Spectral algorithms.
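A minimal sketch of the Raw construction follows, assuming the ensemble is the Cartesian product of the per-algorithm guess sets. The guesses themselves are made up for illustration and are far fewer than a real run would produce.

```python
from itertools import product

# Hypothetical guesses per algorithm, one integer per element of its P set.
C = {
    "kmeans":   [3, 3, 2],
    "gmm":      [3, 2],
    "hca":      [2, 3],
    "spectral": [2],
}

# Raw ensemble: every way of picking one guess from each algorithm.
# Each row is a 4-tuple ordered as (K-means, GMM, HCA, Spectral).
E = list(product(C["kmeans"], C["gmm"], C["hca"], C["spectral"]))
```

The number of rows is the product of the set sizes (here 3 x 2 x 2 x 1 = 12), which is why the Raw ensemble grows quickly as hyperparameter and metric choices are added.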
2) Mode: While the Raw approach detailed above is the simplest, it ignores a potentially important point. The choice of metric, while critical to the workflow, is intrinsically an "elbow-method" parameter. This sets it apart from the choice of algorithm and hyperparameters, which would be necessary regardless of approach.
With this in mind, we consider another formulation which first takes the mode of the results across metrics. That is, for a given algorithm and hyperparameter configuration, we take as a result the most commonly guessed number of clusters across all metric choices. Define $\bar{C}_j$ as the set obtained from $C_j$ by replacing, for each hyperparameter selection $h_j^i$, the guesses across all metrics in $M_j$ with their mode. From this we can define our ensemble in the same way as before,
$$\bar{E} = \bar{C}_0 \times \bar{C}_1 \times \dots \times \bar{C}_N.$$
Here, our matrix will be smaller than in the Raw approach. Each column still corresponds to an algorithm, but each entry is now the mode of the guesses produced by a set of hyperparameters.
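The Mode construction could be sketched as below: the metric dimension is collapsed first, then the cross product is taken as in the Raw case. The data layout, the helper `mode`, and its tie-breaking rule (first value seen wins) are assumptions made for this sketch; the text above does not specify how ties are resolved.

```python
from collections import Counter
from itertools import product


def mode(values):
    """Most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical guesses: guesses[algorithm][hyperparameter index] is the
# list of cluster numbers found across that algorithm's metrics.
guesses = {
    "kmeans": [[3, 3, 2, 3], [2, 2, 3, 2]],
    "gmm":    [[3, 3, 3, 2]],
    "hca":    [[2, 3, 2, 2], [3, 3, 2, 3]],
}

# Collapse the metric dimension: one mode per hyperparameter selection.
C_mode = {alg: [mode(g) for g in per_h] for alg, per_h in guesses.items()}

# Mode ensemble: cross product of the collapsed sets, as in the Raw case.
E_mode = list(product(*C_mode.values()))
```

In this toy case the Raw ensemble would have 8 x 4 x 8 = 256 rows, while the Mode ensemble has only 2 x 1 x 2 = 4.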

D. Voting
Now, given our matrix $E$, there are a few ways to combine the results and vote on them. As a toy example, consider a result matrix for a 3-cluster dataset that contains 11 2s and 9 3s.
1) Full: The simplest approach would be to take the most common cluster number found in our ensemble. While straightforward, it doesn't allow us to capture any additional information, nor filter out any errors or biases in any way. Since our toy example contains 11 2s and 9 3s, we would get an incorrect final result of R = 2.
2) Column-First: Another naive approach would be to vote along algorithms, giving us 4 results, and then to vote for the most common answer within those 4. One possible issue with this approach is the case where we have a few particularly ill-suited algorithms. Looking at the same example, the column-wise votes again give an incorrect final result of R = 2.
3) Row-First: Finally, we can look to capture what we are calling the cohesion between the results, essentially favoring their agreement, by first voting within the individual rows of our example. Given that our set $E$ (or its mode-based counterpart $\bar{E}$) covers all combinations of results, we are choosing to prioritize those cases where our different algorithm-parameter-metric results are cohesive (row-wise), before looking at their actual value (column-wise).
We present results for the three approaches in Section IV.
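The three voting schemes could be sketched as follows. The matrix below is hypothetical: it was constructed to contain 11 2s and 9 3s like the toy example, but it is not the matrix from the text, so the row-first outcome it produces should be read only as an illustration of how cohesion can change the vote. The function names are likewise assumptions.

```python
from collections import Counter


def mode(values):
    """Most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


def vote_full(E):
    # Pool every entry of the ensemble and take the most common one.
    return mode([guess for row in E for guess in row])


def vote_column_first(E):
    # One vote per algorithm (column), then a vote among those results.
    return mode([mode(col) for col in zip(*E)])


def vote_row_first(E):
    # One vote per combination (row), then a vote among those results.
    return mode([mode(row) for row in E])


# Hypothetical 5 x 4 ensemble with 11 2s and 9 3s (true answer: 3).
E = [
    (2, 3, 3, 3),
    (3, 2, 3, 3),
    (3, 3, 2, 3),
    (2, 2, 2, 2),
    (2, 2, 2, 2),
]

results = {
    "full": vote_full(E),
    "column_first": vote_column_first(E),
    "row_first": vote_row_first(E),
}
```

On this matrix the Full and Column-First votes both return 2, while the Row-First vote returns 3 because three of the five rows agree internally on 3.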

IV. RESULTS
Now, we arrive at our results. In the following sections, we present the results of the 6 different approaches (Raw/Mode with Row/Column/Full) on the 100 simulated datasets from Section II-A. We compare our results to 2 benchmarks.
The first is the expected value from randomly sampling our result set 100 times, which is essentially the accuracy we could expect from choosing an algorithm, hyperparameters and metrics beforehand.
The second is the Consensus Clustering approach put forward by Monti et al. [9]. In this case, we used K-means and GMM and attempted to pass in both default hyperparameters (D) and the best performing hyperparameters (B) as determined by our method.

A. Performance
Here, we define accuracy by comparing the predicted number of clusters for each of the 100 datasets, based on voting on the Raw set $E$ or the mode-based set $\bar{E}$, to the actual number of clusters (3 in every case).
Overall results are shown in Table 4. While there are differences in performance between the voting methods, the signal is somewhat muddled. In the case of the Raw construction the Row-first approach is best, while for a Mode construction a Full vote is preferable. Overall, the differences in voting performance are also small, providing at most a 3% increase. We don't consider it prudent to declare one voting approach more beneficial than another.
On the other hand, the choice between using a Raw ensemble construction $E$ or a mode-based set $\bar{E}$ seems to be clearer. Using a mode construction improved performance across the board, yielding a 2-8% increase in accuracy.
Given a mode-based set $\bar{E}$, the naive Full vote takes the lead: it is overall the best performer with 93% accuracy. Additionally, we find it important to note that we have actually outperformed even the maximum performance we saw in Table 1, which was 91%.
Given that the latter could only occur given a perfect guess as to which algorithm-hyperparameter-metric combination to use, we find it even more satisfying.

B. Benchmarks
Table 5 details our benchmark performance as defined in this section: Consensus Clustering and the expected value from random sampling.
Not unexpectedly, randomly guessing at possible solutions yields unsatisfactory results. We note that it consistently scores above 50%, likely due to the fact that even poor algorithm configurations can still pick up some signal.
Consensus Clustering gave very unpredictable results, with a very large variance in performance, though it did peak at a respectable 87%. We reiterate our previous point, however, that it requires a choice of algorithm and hyperparameters as inputs, which greatly reduces its effectiveness in practice.
Perhaps the most interesting result to come out of benchmarking was the performance of Consensus Clustering with respect to hyperparameter selections that did very well in our method. In the case of GMM this led to a drastic drop in performance (30%), while for K-means we saw a 6% increase. This would indicate that while it is likely a non-trivial exercise to combine the approaches in a reasonable way, it could be worth further investigation.

V. USAGE
Now that we have examined this problem and our proposed solution, we'd like to discuss some other elements, namely typical usage setups and some simple approaches to selecting the best algorithm-hyperparameter combination in each case. As our colleagues in industry would say: how do we use this in production?
We would like to note that the following algorithm selection methods in particular are merely simple approaches to get things off the ground.There are undoubtedly other ways to solve this and we encourage further work in this area.
We believe there to be two general use-cases depending mainly on computational constraints: the case where all our data can be processed at once, and the case where we must slice our data into subsets first.

A. Data Subsets
If we look at the case where we must partition our data due to computational limitations, we find ourselves in essentially the same framework that we had throughout the paper.While we will leave it to the reader to determine the best way to sample subsets of their data while capturing all clusters, let's examine how this ensemble framework would work.
Throughout, we have looked at 100 simulated datasets as a means of getting accuracy metrics.Suppose now we take our large dataset and split it into 100 smaller, more manageable, datasets.We are now in the same situation as we were earlier in the paper: we would expect the number of clusters to be the same for each of the smaller datasets, and we would aggregate the results of 100 ensemble methods (albeit not with "accuracy").
1) Estimating the Number of Clusters: Now that we have our 100 subsets, we can construct our E matrix for each one of them and vote on the number of clusters.This will give us 100 answers, one cluster number estimate per subset.
From there, we could select the most common answer (i.e., the mode) as our global estimated number of clusters (instead of computing accuracy as we did in this paper). Having determined how many clusters our dataset has, we arrive at the question of choosing an algorithm.
2) Algorithm-Hyperparameter Selection: Still within our 100 subset context, let's examine how we could choose a best algorithm-hyperparameter combination. We reiterate that this is only one of what are likely many possible approaches.
Given our estimated number of clusters, store all the answers estimated by each algorithm-hyperparameter-metric combination found in the subset E matrices.From there, compute the accuracy of each algorithm-hyperparameter-metric combination (relative to our global estimated number of clusters).The best-performing combination can be taken as a reasonable way of clustering future data from the same dataset.
Indeed, in our simulated case, this approach correctly identifies that GMM is the best choice, and more specifically that the combination
algorithm: GMM
hyperparameters: covariance_type: diag, reg_covar: 1e-8
metric: AIC
achieves the best results, with 91% accuracy. Readers will note this is indeed the top performance we can achieve as per Table 1.
It seems reasonable to assume that given more data drawn from the same dataset, this algorithm with these hyperparameters would do well at clustering it in a sensible way. We do note that it is possible to apply the logic presented in Section V-B2 in this case as well; such logic is included in the code library by default.
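The subset-based procedure of this subsection could be sketched as follows: take the mode of the per-subset votes as the global estimate, then score each combination by how often its own guess matched that estimate. All values, names and the data layout below are hypothetical (five subsets instead of 100, two combinations instead of the full ensemble) and are not results from our experiments.

```python
from collections import Counter


def mode(values):
    """Most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical inputs: one voted estimate per data subset ...
subset_votes = [3, 3, 2, 3, 3]

# ... and, per combination, its own guess on each of those subsets.
guesses_by_combo = {
    ("gmm", "diag, reg_covar=1e-8", "aic"):   [3, 3, 3, 3, 2],
    ("kmeans", "hypothetical-h0", "inertia"): [3, 2, 2, 3, 2],
}

# Global estimate: the most common answer across subsets.
global_k = mode(subset_votes)

# "Accuracy" of each combination relative to the global estimate.
accuracy = {
    combo: sum(g == global_k for g in guesses) / len(guesses)
    for combo, guesses in guesses_by_combo.items()
}
best_combo = max(accuracy, key=accuracy.get)
```

Note that this accuracy is measured against the ensemble's own estimate rather than a ground truth, so it rewards agreement with the vote and nothing more.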

B. Full Dataset
Now let's look at the simpler case where all of our data can be processed together as a single dataset.

1) Estimating the Number of Clusters:
In this happy scenario, estimating the number of clusters is relatively straightforward: we run a single workflow. We begin by constructing our $E$ matrix. Then, we compute the results of voting, which directly gives us the predicted number of clusters in our data.
In this case there is no need for any aggregation as we have a single outcome from the vote.
2) Algorithm-Hyperparameter Selection: When it comes to identifying the right choice of combination, however, we can't proceed as we did in the subset case. Given that we have no way of computing accuracy, we will instead look for the "most stable" choice. Once more, this is simply a first approach, and future work could likely result in improvements.
For each algorithm-hyperparameter combination, look at its predictions across metrics (which are not needed for future clustering). For example, on the first dataset from our simulated data, we compare the metric-wise predictions of the candidate GMM combinations and favor the combination that was most often correct: [3,3,3,3]. Note that we are essentially looking for stability and insensitivity to metric choice.
If we extend this comparison to every algorithm-hyperparameter combination, we arrive at a suitable combination choice to use for future clustering. We note, however, that given the small number of metric choices, this approach is less likely to yield a unique best choice.
If that is the case, any of the top-ranked combinations may be selected, as they are equally likely to achieve desirable results -as per this stability framework.
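One possible reading of this stability criterion is sketched below: score each algorithm-hyperparameter combination by the fraction of its metric-wise guesses that match the voted number of clusters, and keep every combination tied for the top score. The scoring rule, the names, and the guesses are assumptions made for illustration; other notions of stability would fit the description equally well.

```python
def stability(guesses, voted_k):
    """Fraction of metrics whose guess matches the voted cluster number."""
    return sum(g == voted_k for g in guesses) / len(guesses)


# Hypothetical guesses across four metrics for two GMM combinations.
guesses_by_combo = {
    ("gmm", "h0"): [3, 3, 3, 3],
    ("gmm", "h1"): [3, 2, 3, 4],
}
voted_k = 3  # outcome of the vote on the full dataset

scores = {c: stability(g, voted_k) for c, g in guesses_by_combo.items()}

# Keep all top-ranked combinations, since ties are likely with few metrics.
top = max(scores.values())
best_combos = [c for c, s in scores.items() if s == top]
```

Returning a list rather than a single winner reflects the point above that several combinations may be equally stable.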

VI. CONCLUSION
We have developed a workflow with 6 possible configurations for determining the number of clusters in an unlabeled dataset, while offering a flexible basis for determining the optimal choice of algorithm and associated hyperparameters. While certain methods already exist, such as for agglomerative hierarchical clustering and Consensus Clustering, they each present difficulties in the field that our approach addresses.
Firstly, we no longer require researchers (who may be subject matter experts in their own domains rather than in unsupervised learning) to provide such specific inputs as particular algorithms and hyperparameter-metric configurations. Instead, given reasonably spanning ranges of hyperparameters, and a diverse selection of algorithms, we can reliably predict the number of clusters present with more than 90% accuracy.
We also obtain, at very little cost, a reasonably suitable choice for which algorithm and which hyperparameters to use to further cluster the data. While not every situation will require clustering of additional incoming data from the same distribution, combination performance findings are sure to be beneficial to researchers.
Lastly, it is our hope that the simple algorithmic structure of this approach can lead to reliably simple integration into commonly used software libraries, thereby removing another barrier to entry. The code used for this framework can be found at HTTPS://GITHUB.COM/ANTOINEZAMBELLI/ENSEMBLE-CLUSTERING.
In the future we hope to explore several avenues of work related to this approach. This includes studying other means of measuring cohesion (other than a row-first approach). We would also like a process to automatically track the ensemble set as it is populated and to adjust the hyperparameter space dynamically. This could allow for faster computations and less noise in the resulting ensemble. Finally, we would like to pursue a careful and constructive combination of Consensus Clustering and our ensemble method.

TABLE V
BENCHMARK ACCURACY ON 100 SIMULATED DATASETS.