Deterministic Clustering in Julia?

mthelm85 · June 25, 2020, 9:05pm

Are there any deterministic clustering methods implemented in Julia? For example:

using Clustering
using DataFrames

df = DataFrame(subject=1:10, height=rand(10), weight=rand(10))

julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
 1
 1
 2
 1
 1
 1
 2
 1
 1
 1

julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
 1
 1
 1
 2
 2
 2
 1
 1
 1
 1

Is there a clustering algorithm in Julia that will result in the same clusters for this problem each time? Or is there a way to set the seed via Random.seed!() that would result in the same clusters with k-means?

Any recommendations as to how to approach a problem like this where a deterministic outcome is very important would be much appreciated!

dlakelan · June 26, 2020, 2:07am

So I looked at the Clustering.jl docs… and found that you can control the Seeding algorithm:
https://juliastats.org/Clustering.jl/stable/kmeans.html
https://juliastats.org/Clustering.jl/stable/init.html#Seeding-1

You can provide your own vector of k indices of points to use for the seeds. So if you have some way to assign cluster seeds that way, it should be deterministic after that.
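
Here is a minimal sketch of that idea, assuming the init keyword of kmeans accepts a vector of column indices as the seeding docs describe. The toy matrix and the choice of columns 1 and 4 as seeds are made up for illustration:

```julia
using Clustering

# Toy data: each column is one observation (height, weight).
X = [0.10 0.15 0.20 0.80 0.85 0.90;
     0.20 0.10 0.15 0.90 0.80 0.85]

# Assumption: columns 1 and 4 are sensible starting centers for this data.
seeds = [1, 4]

r1 = kmeans(X, 2; init = seeds)
r2 = kmeans(X, 2; init = seeds)

# With fixed seeds there is no random step left, so both runs agree.
r1.assignments == r2.assignments
```

You still need a non-random rule for picking the seed indices (domain knowledge, sorting on one feature, etc.), otherwise the randomness just moves to that step.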

Have you tried setting the random seed with Random.seed!() right before calling kmeans?
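
Something like the following should show whether seeding is enough. It is only a sketch, and it assumes kmeans draws from Julia's global RNG when no other generator is passed. The data matrix and the seed value 1234 are arbitrary:

```julia
using Clustering, Random

# Toy data: each column is one observation (height, weight).
X = [0.10 0.15 0.20 0.80 0.85 0.90;
     0.20 0.10 0.15 0.90 0.80 0.85]

# Reset the global RNG to the same state before each call.
Random.seed!(1234)
a = kmeans(X, 2).assignments

Random.seed!(1234)
b = kmeans(X, 2).assignments

# Same RNG state, same data: the seeding step picks the same starting points.
a == b
```

Note that this only promises the same result on the same Julia and package versions; the stream behind a given seed can change between Julia releases.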

baggepinnen · June 26, 2020, 3:26am

I think that HDBSCAN is deterministic.
I have a wrapper around the Python package here:
https://github.com/baggepinnen/HDBSCAN.jl

sylvaticus · June 26, 2020, 6:05am

Sure there is

There is also kmeans in BetaML (disclaimer: I am the author). Choose grid (the default) or given as the initStrategy parameter.

With given you provide your own initial points; with grid it scans the input space and starts at regular grid intervals:

julia> (dataAllocations,clusterMeans) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)

Also, em is deterministic by default, although initializing the mixtures with the result of kmeans is a much better init approach.

(If kmeans doesn’t converge, try master… I committed a fix for a corner case just yesterday.)

Skoffer · June 26, 2020, 6:54am

You can also use ParallelKMeans.jl. All algorithms support an rng parameter, so you can call it like this:

using Random, ParallelKMeans

rng = Random.seed!(2020)
kmeans(X, 10; rng = rng)

Or you can use StableRNGs.jl if you want the same results across all Julia versions:

using StableRNGs, ParallelKMeans

rng = StableRNG(2020)
kmeans(X, 10; rng = rng)

Also, for the Lloyd(), Hamerly() and Elkan() algorithms, random generation is used only for the initial seeding, so if you can prepare init yourself, you can get reproducible results with

kmeans(Hamerly(), X, 10; init = init)
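
To illustrate why a fixed init is enough, here is a rough plain-Julia sketch of the Lloyd iteration. It is not the ParallelKMeans.jl implementation; the function name lloyd_fixed_init and the toy data are made up for this example. Once the starting centers are given, nothing in the loop is random:

```julia
using Statistics

# Hypothetical helper: Lloyd iteration from a given d×k matrix of starting centers.
function lloyd_fixed_init(X::AbstractMatrix, init::AbstractMatrix; maxiter::Int = 100)
    centers = float.(copy(init))
    n, k = size(X, 2), size(centers, 2)
    assignments = zeros(Int, n)
    for _ in 1:maxiter
        # Assign every column of X to its nearest center (squared Euclidean distance).
        new_assignments = [argmin([sum(abs2, X[:, i] .- centers[:, j]) for j in 1:k])
                           for i in 1:n]
        new_assignments == assignments && break   # converged
        assignments = new_assignments
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for j in 1:k
            members = findall(==(j), assignments)
            isempty(members) || (centers[:, j] = vec(mean(X[:, members], dims = 2)))
        end
    end
    return assignments, centers
end

# Toy data: each column is one observation.
X = [0.10 0.15 0.20 0.80 0.85 0.90;
     0.20 0.10 0.15 0.90 0.80 0.85]

init = X[:, [1, 4]]            # assumption: columns 1 and 4 as starting centers
a1, c1 = lloyd_fixed_init(X, init)
a2, c2 = lloyd_fixed_init(X, init)

a1 == a2 && c1 == c2           # identical on every run
```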

In addition, this implementation is really fast, which can be useful.
