Deterministic Clustering in Julia?

Are there any deterministic clustering methods implemented in Julia? For example:

using Clustering
using DataFrames

df = DataFrame(subject=1:10, height=rand(10), weight=rand(10))

julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
 1
 1
 2
 1
 1
 1
 2
 1
 1
 1

julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
 1
 1
 1
 2
 2
 2
 1
 1
 1
 1

Is there a clustering algorithm in Julia that will result in the same clusters for this problem each time? Or is there a way to set the seed via Random.seed!() that would result in the same clusters with k-means?

Any recommendations as to how to approach a problem like this where a deterministic outcome is very important would be much appreciated!

So I looked at the Clustering.jl docs… and found that you can control the Seeding algorithm:
https://juliastats.org/Clustering.jl/stable/kmeans.html
https://juliastats.org/Clustering.jl/stable/init.html#Seeding-1

You can provide your own vector of k indices of points to use for the seeds. So if you have some way to assign cluster seeds that way, it should be deterministic after that.
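A minimal sketch of that idea, assuming the `init` keyword of Clustering.jl's `kmeans` accepts an integer vector of point indices as described in the docs linked above. The toy matrix `X` and the chosen indices `seeds` are made up for illustration:

```julia
using Clustering

# Toy data: 2 features × 6 points (columns are observations).
X = [1.0 1.2 0.9 5.0 5.1 4.9;
     1.0 0.8 1.1 5.0 4.8 5.2]

# Hypothetical choice: use points 1 and 4 as the initial cluster centers.
seeds = [1, 4]

r1 = kmeans(X, 2; init = seeds)
r2 = kmeans(X, 2; init = seeds)

# With fixed seeds no random initialization is involved,
# so repeated runs should give identical assignments.
@assert assignments(r1) == assignments(r2)
```

How you pick the indices is up to you; any rule that depends only on the data (for example, sorting by a feature and taking evenly spaced points) keeps the whole pipeline deterministic.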

Have you tried setting the random seed with Random.seed!() right before each call to kmeans?
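Something like the following sketch, assuming `kmeans` draws its initial seeds from Julia's default global RNG when no other generator is passed. The data `X` and the seed value 1234 are arbitrary:

```julia
using Clustering, Random

# Fixed toy data: 2 features × 10 points.
X = rand(MersenneTwister(42), 2, 10)

Random.seed!(1234)                 # put the global RNG into a known state
a = kmeans(X, 2).assignments

Random.seed!(1234)                 # reset to the same state before the second run
b = kmeans(X, 2).assignments

@assert a == b                     # same seed, same data, same clusters
```

Note that the seed has to be reset before every call, and that the random stream produced by a given seed is not guaranteed to stay the same between Julia releases.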

I think that hdbscan is deterministic.
I have a wrapper around the Python package here:
https://github.com/baggepinnen/HDBSCAN.jl

Sure there is :)

There is also kmeans in BetaML (disclaimer: I am the author). For the initStrategy parameter, choose grid (the default) or given.

With given you provide your own initial points; with grid it scans the input space and places the starting points at regular grid intervals:

julia> (dataAllocations,clusterMeans) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)

Also, em is deterministic by default, although initialising the mixtures with the result of kmeans is a much better approach.

(If kmeans doesn’t converge, try master… I committed a fix for a corner case just yesterday.)

You can also use ParallelKMeans.jl. All algorithms support an rng parameter, so you can call it like this:

using Random, ParallelKMeans

rng = Random.seed!(2020)
kmeans(X, 10; rng = rng)

Or you can use StableRNGs.jl if you want the same results across all Julia versions:

using StableRNGs, ParallelKMeans

rng = StableRNG(2020)
kmeans(X, 10; rng = rng)

Also, for the Lloyd(), Hamerly() and Elkan() algorithms, random number generation is used only for the initial seeding, so if you prepare init yourself, you get reproducible results with

kmeans(Hamerly(), X, 10; init = init)

In addition, this implementation is really fast, which can be useful.
