K-Medoids clustering in BetaML.jl

I am trying to use the BetaML.jl package to cluster a set of data with the K-Medoids algorithm; however, it does not seem to assign the data to all classes. For example:

using Random, BetaML

X = rand(MersenneTwister(1234), 10, 5)

mod = KMedoidsClusterer(n_classes = 10)
classes = fit!(mod, X)

# returns
# 8
# 5
# 5
# 6
# 6
# 3
# 3
# 4
# 3
# 7

Since I am trying to classify it into as many classes as the data points, there should be one data point in each class; however, we see that nothing gets assigned to classes 1, 2, 9, and 10. What’s the issue here?

PS: The above example is just to illustrate the issue. I am obviously trying to classify the data into far fewer classes, but I see the same issue every time.
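
For reference, the empty classes can be listed with plain Julia, using the classes vector returned by fit! above:

# classes that received no data points in the run shown (here: 1, 2, 9 and 10)
setdiff(1:10, classes)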

Presumably whatever optimisation algorithm it is using gets stuck in a local optimum. Note that very few ML algorithms can guarantee an optimal output, but they nevertheless tend to work very well on real-world data.

You should try the algorithm on the data you actually care about and see if there is still a problem.

The issue here is that the number of clusters and the number of data points are close. Depending on how the medoids are initialised and how they move with each iteration, you may terminate with a medoid in a position where no point in the dataset is closest to it, in which case you have an empty cluster. If you increase the data size (say, by a factor of 10), then all clusters are nonempty:

julia> X = rand(MersenneTwister(1234), 100, 5)
100×5 Matrix{Float64}:
 0.590845   0.291069   0.149005   0.0819584  0.46522
 0.766797   0.612628   0.538435   0.426769   0.0502662
 0.566237   0.705766   0.369212   0.211718   0.775001
 0.460085   0.508363   0.500337   0.707441   0.105312
 0.794026   0.47255    0.344757   0.324088   0.113709
 0.854147   0.612554   0.929902   0.23267    0.321296
 0.200586   0.192634   0.520697   0.98406    0.59285
 0.298614   0.85107    0.613      0.600819   0.708954
 0.246837   0.952068   0.144613   0.73273    0.64771
 0.579672   0.795041   0.465207   0.653116   0.221446
 0.648882   0.468079   0.0273653  0.461496   0.212507
 0.0109059  0.0951815  0.0569291  0.0408179  0.448741
 0.066423   0.727763   0.840961   0.537369   0.84971
 0.956753   0.982502   0.969741   0.276042   0.958682
 0.646691   0.427034   0.274452   0.0136942  0.600033
 0.112486   0.467939   0.0843292  0.672485   0.59149
 0.276021   0.848927   0.162723   0.54876    0.194972
 ⋮
 0.255054   0.170987   0.0586237  0.18915    0.483002
 0.498734   0.481834   0.112467   0.292921   0.407101
 0.0940369  0.864876   0.94572    0.900891   0.621091
 0.52509    0.264359   0.447162   0.849681   0.524288
 0.265511   0.861325   0.117027   0.817988   0.705856
 0.110096   0.843625   0.508327   0.0779083  0.912949
 0.834362   0.493895   0.999031   0.204182   0.354703
 0.633427   0.324266   0.849218   0.12479    0.694663
 0.337865   0.746908   0.192945   0.410859   0.351558
 0.112987   0.509678   0.704772   0.0967083  0.197417
 0.78299    0.919232   0.284333   0.698761   0.124391
 0.838042   0.168084   0.701562   0.338277   0.840508
 0.0878598  0.605675   0.290866   0.206142   0.53664
 0.386568   0.778229   0.487151   0.0786884  0.888034
 0.330579   0.150454   0.390493   0.470638   0.0462201
 0.748041   0.344408   0.330187   0.540366   0.887934
 0.265595   0.377464   0.0441519  0.828733   0.431977
julia> classes = fit!(mod, X); Set(classes)
Set{Int64} with 10 elements:
  5
  6
  4
  7
  2
  10
  9
  8
  1
  3

TL;DR: There is no bug in the code, this just happens sometimes with clustering when the ratio of clusters to data points is too high.

1 Like

Actually, upon further inspection, something funky is going on here. Reducing the problem to 2D for plotting and running the code

using BetaML, Random, Plots

X = rand(MersenneTwister(1234), 10, 2)

model = KMedoidsClusterer(n_classes = 10)
fit!(model, X)

scatter(Tuple.(eachrow(X)), alpha = 0.5, label = "Data")
scatter!(Tuple.(eachrow(parameters(model).representatives)), alpha = 0.5, label = "Medoids")

produces a scatter plot of the data points and the fitted medoids.

The raw data here is

julia> parameters(model).representatives
10×2 Matrix{Float64}:
 0.246837  0.0566425
 0.200586  0.276021
 0.766797  0.0109059
 0.854147  0.112486
 0.494688  0.436537
 0.298614  0.651664
 0.590845  0.648882
 0.794026  0.646691
 0.460085  0.956753
 0.821469  0.909461

julia> X
10×2 Matrix{Float64}:
 0.590845  0.648882
 0.766797  0.0109059
 0.566237  0.066423
 0.460085  0.956753
 0.794026  0.646691
 0.854147  0.112486
 0.200586  0.276021
 0.298614  0.651664
 0.246837  0.0566425
 0.579672  0.842714

For K-medoids, the medoids should all be elements of the original data, and that is not happening here… something is amiss, I think. Maybe @sylvaticus can offer some insight?
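
A quick way to check this directly, using only the objects from the snippet above, is to test whether every fitted medoid equals some row of X:

# sketch: check that each medoid coincides with at least one row of the training data
# (parameters(model).representatives is the medoids matrix printed above)
reps = parameters(model).representatives
all(any(med == row for row in eachrow(X)) for med in eachrow(reps))
# returns false here, confirming that some medoids are not actual data points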

I see. Also, here are some additional questions:

  • Does BetaML.jl use the PAM algorithm for K-medoids? I don’t see it mentioned in the docs. I was previously using Clustering.jl, whose docs state that it does not use the PAM algorithm and may therefore give a worse loss (which is why I am more inclined to use BetaML.jl).
  • Also, is there a built-in function that returns the final loss? Something like model.loss or model.totalcost? I don’t see anything mentioned in the docs, or perhaps I missed it.
  • I never see this issue of some classes not getting assigned when I use the kmedoids function from Clustering.jl (a sketch of that call is given after this list). Any pointers on what could be the reason for this?

Thanks!!
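
For context, the Clustering.jl call referred to in the last point looks roughly like this. This is a sketch from memory: kmedoids there works on a precomputed distance matrix, and the result field names may differ slightly from what is shown.

using Clustering, Distances, Random

X = rand(MersenneTwister(1234), 10, 5)
D = pairwise(Euclidean(), X, dims = 1)   # 10×10 matrix of distances between rows
res = kmedoids(D, 4)                     # 4 classes, just for illustration

res.assignments   # cluster index of each data point
res.medoids       # indices of the rows used as medoids
res.totalcost     # final loss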

Hello, I am the author of BetaML.
I will investigate this issue tomorrow, but I seem to remember that there is no guarantee that all clusters are nonempty. In any case, I’ll look into it in depth tomorrow…

2 Likes

Try with initialisation_strategy = "shuffle"… (untested, I am on a mobile)…

1 Like

I can confirm that adding that keyword fixes the issue on my machine. I read through the source code and expected that this was the default setting though…

1 Like

This fixes the issue on my machine as well. Also, for the original example I used as MWE, all clusters are non-empty now.
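
For reference, this is the adjusted MWE, a sketch where only the suggested keyword is added to the original call:

using Random, BetaML

X = rand(MersenneTwister(1234), 10, 5)

mod = KMedoidsClusterer(n_classes = 10, initialisation_strategy = "shuffle")
classes = fit!(mod, X)
Set(classes)   # now contains all classes 1:10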

2 Likes

We also use k-means-style iterations: on each iteration we first find the membership of each cluster, and then, independently for each cluster, we compare the pairwise distances to find the “next” medoid of that cluster as the record with the shortest total intra-cluster distance. Each iteration has complexity at best O(n²/k), when the clusters all have the same number of records.
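
As a rough illustration of that update step (not BetaML’s actual code), the per-cluster medoid search looks like this:

# toy sketch of the medoid-update step described above: within a single cluster,
# the next medoid is the member with the smallest summed distance to all other members
function cluster_medoid(members)           # members: matrix with one record per row
    n = size(members, 1)
    cost(i) = sum(sqrt(sum(abs2, members[i, :] .- members[j, :])) for j in 1:n)
    return argmin(cost.(1:n))              # row index (within the cluster) of the new medoid
end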
If you need the PAM algorithm, or caching of the distances, I could implement it, let’s say in January… or you could try to add it yourself… the nice thing about Julia is that it is relatively easy to code the algorithms… In the first case, please open an issue so I can track it; in the second case, you could open a pull request directly…

On master I have added an av_distance_last_fit record to the info dictionary that you can retrieve with info(model).
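
Assuming the dictionary key matches the record name above, retrieving it would look like:

# sketch (requires BetaML master as described above; the exact key name is assumed)
info(model)["av_distance_last_fit"]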

Still on master, I have decoupled the hyperparameters object between k-means and k-medoids, and I have now set the default initialisation for k-medoids to "shuffle", adding a warning that in some circumstances other initialisations may lead to some representatives not being actual training records.
This, by the way, also has the advantage that we could add a further hyper-parameter “algorithm” to allow PAM and other variants.

1 Like

I did some work on the PAM algorithm a couple of years ago. See:

1 Like