# Deterministic Clustering in Julia?

Are there any deterministic clustering methods implemented in Julia? For example:

```julia
using Clustering
using DataFrames

df = DataFrame(subject=1:10, height=rand(10), weight=rand(10))

julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
 1
 1
 2
 1
 1
 1
 2
 1
 1
 1

julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
 1
 1
 1
 2
 2
 2
 1
 1
 1
 1
```

Is there a clustering algorithm in Julia that will result in the same clusters for this problem each time? Or is there a way to set the seed via `Random.seed!()` that would result in the same clusters with k-means?

Any recommendations as to how to approach a problem like this where a deterministic outcome is very important would be much appreciated!

So I looked at the Clustering.jl docs… and found that you can control the seeding algorithm:
https://juliastats.org/Clustering.jl/stable/kmeans.html
https://juliastats.org/Clustering.jl/stable/init.html#Seeding-1

You can provide your own vector of k indices of points to use for the seeds. So if you have some way to assign cluster seeds that way, it should be deterministic after that.
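A minimal sketch of that idea, assuming `kmeans` in Clustering.jl accepts an integer vector for its `init` keyword as the docs pages above describe. The toy data and the indices `[1, 5, 8]` are made up for illustration; in practice you would pick the indices with whatever deterministic rule suits your data.

```julia
using Clustering

# Toy data: 2 features × 9 points (Clustering.jl expects one point per column).
X = permutedims([1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40;
                 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4])

# Hypothetical choice: use points 1, 5 and 8 as the initial cluster centers.
seed_idx = [1, 5, 8]

r1 = kmeans(X, 3; init = seed_idx)
r2 = kmeans(X, 3; init = seed_idx)

# With fixed starting centers, repeated runs should give the same assignments.
r1.assignments == r2.assignments
```

I believe the only randomness left after this is in how empty clusters get re-seeded, which should not come up on well-separated data like the above.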

Have you tried setting the random seed?
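Something like the following is what I have in mind. It is only a sketch, and it assumes that `kmeans` draws its initial centers from Julia's global RNG by default (my understanding of Clustering.jl, not something confirmed in this thread). The seed value `1234` and the toy data are arbitrary.

```julia
using Clustering, Random

# Toy data: 2 features × 9 points, one point per column.
X = permutedims([1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40;
                 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4])

Random.seed!(1234)             # put the global RNG in a known state
a = kmeans(X, 3).assignments

Random.seed!(1234)             # reset to the same state before the second run
b = kmeans(X, 3).assignments

a == b                         # same seed, same clusters
```

Note that the seed has to be reset before every call, and that Julia does not guarantee the same random stream for a given seed across Julia versions, so this is reproducible only within one setup.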

I think that HDBSCAN is deterministic.
I have a wrapper around the Python package here

Sure there is

There is also a kmeans in BetaML (disclaimer: I am the author); choose `grid` (the default) or `given` as the `initStrategy` parameter.

With `given` you provide your own initial points; with `grid` it scans the input space and places the starting points at regular grid intervals:

```julia
julia> (dataAllocations, clusterMeans) = kmeans([1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4], 3)
```

Also, em is deterministic by default, although initialising the mixtures with the result of kmeans is a much better approach.

(If kmeans doesn't converge, try master… I committed a fix for a corner case just yesterday.)

You can also use ParallelKMeans.jl. All algorithms support an `rng` parameter, so you can call it like this:

```julia
using Random, ParallelKMeans

rng = Random.seed!(2020)
kmeans(X, 10; rng = rng)
```

Or you can use StableRNGs.jl if you want the same results across all Julia versions:

```julia
using StableRNGs, ParallelKMeans

rng = StableRNG(2020)
kmeans(X, 10; rng = rng)
```

Also, for the `Lloyd()`, `Hamerly()` and `Elkan()` algorithms, random generation is used only for the initial seeding, so if you can prepare `init` yourself, you can get reproducible results with

```julia
kmeans(Hamerly(), X, 10; init = init)
```

In addition, this implementation is really fast, which can be useful.
