Are there any deterministic clustering methods implemented in Julia? For example:
using Clustering
using DataFrames
df = DataFrame(subject=1:10, height=rand(10), weight=rand(10))
julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
1
1
2
1
1
1
2
1
1
1
julia> groups = kmeans(Matrix(df[:, 2:3])', 2).assignments
10-element Array{Int64,1}:
1
1
1
2
2
2
1
1
1
1
Is there a clustering algorithm in Julia that will result in the same clusters for this problem each time? Or is there a way to set the seed via Random.seed!() that would result in the same clusters with k-means?
Any recommendations as to how to approach a problem like this where a deterministic outcome is very important would be much appreciated!
So I looked at the Clustering.jl docs… and found that you can control the seeding algorithm:
https://juliastats.org/Clustering.jl/stable/kmeans.html
https://juliastats.org/Clustering.jl/stable/init.html#Seeding-1
You can provide your own vector of k point indices to use as the seeds. So if you have some way to pick the cluster seeds yourself, the result should be deterministic from there.
Have you tried the random-seed approach as well?
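To make the two options above concrete, here is a minimal sketch with Clustering.jl. The data is synthetic and the seed values and the choice of points 1 and 2 as initial centers are arbitrary assumptions for illustration, not a recommendation; in practice you would pick the seed points from domain knowledge.

```julia
using Clustering, Random

Random.seed!(42)
X = rand(2, 10)   # 2 features × 10 observations; each column is one point

# Option 1: pass a vector of k point indices as `init`, so the starting
# centers are fixed instead of being drawn at random.
a1 = kmeans(X, 2; init = [1, 2]).assignments
a2 = kmeans(X, 2; init = [1, 2]).assignments

# Option 2: keep the default random seeding, but reset the global RNG
# to the same state immediately before every call.
Random.seed!(1234)
b1 = kmeans(X, 2).assignments
Random.seed!(1234)
b2 = kmeans(X, 2).assignments

println(a1 == a2)   # compare the two runs with fixed initial centers
println(b1 == b2)   # compare the two runs with the same RNG seed
```

Note that with option 2 the seed has to be reset right before each kmeans call; any other random draw in between changes the RNG state and therefore the seeding.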
I think that HDBSCAN is deterministic.
I have a wrapper around the Python package here:
https://github.com/baggepinnen/HDBSCAN.jl
Sure there is
There is also kmeans in BetaML (disclaimer: I am the author). For the initStrategy parameter, choose either grid (the default) or given. With given you provide your own initial points; with grid it scans the input space and starts at regular grid intervals:
julia> (dataAllocations,clusterMeans) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)
em is also deterministic by default, although initializing the mixtures with the result of kmeans is a much better approach.
(If kmeans doesn’t converge, try master… I committed a correction for a corner case just yesterday.)
You can also use ParallelKMeans.jl. All of its algorithms support an rng parameter, so you can call it like this:
using Random, ParallelKMeans
rng = Random.seed!(2020)
kmeans(X, 10; rng = rng)
Or you can use StableRNGs.jl if you want the same results across all Julia versions:
using StableRNGs, ParallelKMeans
rng = StableRNG(2020)
kmeans(X, 10; rng = rng)
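One detail worth spelling out: an RNG object carries state, so reusing the same rng for a second call continues the random stream rather than restarting it. A minimal sketch of a reproducibility check, assuming synthetic data, an arbitrary seed, and that the result exposes an assignments field as in Clustering.jl:

```julia
using ParallelKMeans, StableRNGs

# Synthetic data (an assumption for illustration): 2 features × 100 samples,
# with samples stored as columns.
X = rand(StableRNG(1), 2, 100)

# Build a fresh RNG with the same seed for each run, so both runs
# start from an identical random stream.
r1 = kmeans(X, 3; rng = StableRNG(2020))
r2 = kmeans(X, 3; rng = StableRNG(2020))

println(r1.assignments == r2.assignments)   # compare the two runs
```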
Also, for the Lloyd(), Hamerly() and Elkan() algorithms, random number generation is used only for the initial seeding, so if you prepare init yourself, you can get reproducible results with
kmeans(Hamerly(), X, 10; init = init)
In addition, this implementation is really fast, which can be useful.