I’m currently trying to solve the following problem:
- Let P be a set of points in 3D space
- Split P into smaller sets of approximately equal size, with the maximum size being a hard constraint
Some properties of P:
- P is generally not convex or spherically shaped, and contains branch-like structures
- The spatial density of points in P is not homogeneous.
Here is a 2D example:
```julia
using GeometryBasics, Clustering, GLMakie

# Two regions with very different point densities
Pts = reduce(vcat, [
    Point2f.(0.5 .* rand(100), 5 * rand(100)),
    Point2f.(1 .- log.(2 .* rand(100)), 1 * rand(100)),
])

nmax = 20   # hard upper bound on partition size
nmin = 16   # soft lower bound, only used for flagging below
nclust = ceil(Int, length(Pts) / nmax)

scatter(Pts, axis = (aspect = DataAspect(),))

coords = reduce(hcat, Pts)    # 2 × N matrix, as expected by Clustering.jl
clusters = kmeans(coords, nclust)
assign = clusters.assignments
corder = [findall(isequal(i), assign) for i in unique(assign)]

f = Figure()
ax = Axis(f[1, 1])
cmap = cgrad(:tab20, length(corder), categorical = true);
f
for (p, part) in enumerate(corder)
    pts = @view Pts[part]
    # flag partitions that violate the size bounds with a different marker
    mk = length(part) > nmax ? :xcross : length(part) < nmin ? :utriangle : :circle
    plt = scatter!(ax, pts, color = cmap[p], marker = mk, label = "Partition $p | size = $(length(part))")
    # sleep(0.1)
end
l = Legend(f[1, 2], ax)
f
```
which returns a scatter plot of the partitions, one color per cluster, with the partition sizes in the legend.
In this example, I tried a naive k-means to form the clusters. However, since there is no size constraint, it fails as expected, even though the general shape of the clusters is roughly what I am after.
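Just to make the hard constraint explicit, one thing that can be bolted onto the k-means result is to keep its centroids and greedily re-assign each point to the nearest centroid that still has room. This is only a sketch (it reuses coords, clusters, nmax and nclust from the example above); it enforces nmax but does not produce particularly well-shaped partitions:

```julia
# Sketch of a hard-capped re-assignment on top of the k-means result above.
using LinearAlgebra

centers = clusters.centers    # 2 × nclust matrix of k-means centroids
npts    = size(coords, 2)

# Distance from every point to every centroid
dists = [norm(coords[:, i] - centers[:, j]) for i in 1:npts, j in 1:nclust]

# Assign the points with the smallest best-distance first,
# each one to the nearest centroid that is not yet full.
order  = sortperm([minimum(dists[i, :]) for i in 1:npts])
counts = zeros(Int, nclust)
newassign = zeros(Int, npts)
for i in order
    for j in sortperm(dists[i, :])    # centroids from nearest to farthest
        if counts[j] < nmax           # hard size cap
            newassign[i] = j
            counts[j] += 1
            break
        end
    end
end
```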
I looked for specialized techniques and found information about balanced clustering (e.g. the Balanced k-Means Revisited algorithm, uef-machine-learning/Balanced_k-Means_Revisited on GitHub), balanced assignment (Assignment.jl, matthewghgriffiths/Assignment.jl, which solves the linear assignment problem and the k-best assignment problem in pure Julia), and even graph partitioning or cuts (Cuts in Graphs.jl).
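To make the balanced-assignment idea concrete: each cluster can be expanded into nmax identical "slots", with the cost of pairing a point with a slot being the distance to that cluster's centroid, so that a linear assignment solver fills the slots and no cluster can receive more than nmax points. The sketch below uses Hungarian.jl purely as a stand-in solver (the same cost matrix should also be usable with Assignment.jl) and again reuses coords, clusters, nmax and nclust from the example:

```julia
# Sketch of the balanced-assignment formulation: nclust clusters × nmax slots each.
using Hungarian, LinearAlgebra

centers = clusters.centers
npts    = size(coords, 2)
nslots  = nclust * nmax          # slot s belongs to cluster cld(s, nmax)

# cost[i, s] = distance from point i to the centroid that owns slot s
cost = [norm(coords[:, i] - centers[:, cld(s, nmax)]) for i in 1:npts, s in 1:nslots]

slots, _ = hungarian(cost)       # slots[i] = slot assigned to point i
balanced = cld.(slots, nmax)     # map slots back to cluster indices
```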
However, after testing multiple methods, I could not really make progress on my problem. Clustering techniques from Clustering.jl seem to fail due to the variation in point density and/or inappropriate seeding, and none of the graph-based techniques I tested seemed to work.
Also, I used the Euclidean distance from Distances.jl to build relations between points, but in the end it is more important to split the clusters evenly than to minimize their spatial extent, so it may not be the proper measure to use.
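For reference, by "relations" I mean something along these lines (just a sketch; the neighbour threshold is an arbitrary illustrative value, not one I settled on):

```julia
# Sketch of point-to-point relations built from pairwise Euclidean distances.
using Distances, Graphs

D = pairwise(Euclidean(), coords, dims = 2)   # npts × npts distance matrix
g = SimpleGraph(size(D, 1))
for i in axes(D, 1), j in (i + 1):size(D, 2)
    D[i, j] < 0.5 && add_edge!(g, i, j)       # connect points closer than the (arbitrary) threshold
end
```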
Would someone have a suggestion about this?