Balanced clustering or assignment of 3D points in space

I’m currently trying to solve the following problem:

  • Let P be a set of points in 3D space
  • Split P into smaller sets of approximately equal size, the maximum size being a hard constraint

Some properties of P:

  • P is generally not convex or spherically shaped, and contains branch-like structures
  • The spatial density of points in P is not homogeneous.

Here is a 2D example:

using GeometryBasics, Clustering, GLMakie

Pts = reduce(vcat,[
    Point2f.(0.5.*rand(100),5*rand(100)),
    Point2f.(1 .-log.(2 .* rand(100)), 1 * rand(100)),
])

nmax = 20
nmin = 16
nclust = ceil(Int,length(Pts) / nmax)

scatter(Pts,axis=(aspect=DataAspect(),))

coords = reduce(hcat,Pts)

clusters = kmeans(coords, nclust)
assign = clusters.assignments
corder = [findall(isequal(i), assign) for i in unique(assign)]

f = Figure()
ax = Axis(f[1, 1])
cmap = cgrad(:tab20, length(corder), categorical=true);
f
for (p, part) in enumerate(corder)
    pts = @view Pts[part]
    mk = length(part) > nmax ? :xcross : length(part) < nmin ? :utriangle : :circle
    plt = scatter!(ax, pts, color=cmap[p], marker=mk, label="Partition $p | size = $(length(part))")
    # sleep(0.1)
end
l = Legend(f[1, 2], ax)
f

which returns
[figure: k-means partitioning of the example points]

In this example, I tried a naive k-means to form the clusters. However, since there is no size constraint, it fails as expected, although the general shape of the clusters is the expected one.

I looked for specialized techniques and found information about balanced clustering (e.g. GitHub - uef-machine-learning/Balanced_k-Means_Revisited: Balanced k-Means Revisited algorithm), balanced assignment (GitHub - matthewghgriffiths/Assignment.jl: Routines to solve the linear assignment problem, and the k-best assignment problem in pure Julia.), and even graph partitioning or cuts (Cuts · Graphs.jl).

However, after testing multiple methods, I could not really make progress on my problem. Clustering techniques in Clustering seem to fail due to the varying density of the points and/or inappropriate seeding, and no graph-based technique I tested seemed to work.
Also, I used the Euclidean distance from Distances to build relations between points, but in the end it is more important to evenly split the clusters than to reduce their spatial extent, so it may not be the proper measure to use.

Would someone have a suggestion about this?

Did you try density-based clustering methods such as DBSCAN? It does not assume a cluster shape, and may be exactly what you are looking for.

I guess there is something missing in your problem description. What do you mean by the size of the subsets? Isn’t it just the number of points? Or, put differently, why do you expect the outcome to look like your picture?

Given your problem description, an easy solution would be to assign the points (pseudo)randomly, e.g. if your maximum size is M, just assign the i-th point to subset j given by \text{mod}(i, M) = j.
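A minimal sketch of that idea (here I derive the number of subsets from the maximum size M, so the constraint is respected):

# Round-robin assignment sketch: with nsub = ceil(N / M) subsets,
# no subset ever receives more than M points.
M = 20                                 # maximum subset size, as an example
N = length(Pts)                        # Pts as defined in the original post
nsub = ceil(Int, N / M)                # number of subsets needed
assign = [mod1(i, nsub) for i in 1:N]  # subset index of the i-th point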

On a more serious note: clustering strives to maximize the ratio of inter-cluster distances to intra-cluster distances.
If you don’t care about this, maybe an approach similar to a (balanced) k-d tree would help?

I think that DBSCAN is really not a good choice in this situation. Since I’m looking for a balanced number of points, I know in advance how many clusters I’m looking for. Moreover, DBSCAN uses a density-based approach, and since my points are not uniformly distributed, it won’t work very well. HDBSCAN could be a better alternative, although I tested hierarchical clustering with somewhat disappointing results. In fact, what I would like is similar to cutting the tree based on cluster size instead of cluster height.
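For reference, the hierarchical clustering I tested looked roughly like this (a sketch reusing coords and nclust from my first snippet; cutree only cuts by number of clusters or by height, not by cluster size, which is the criterion I would actually like):

using Clustering, Distances

# Pairwise Euclidean distances between the 2×N coordinate columns
D = pairwise(Euclidean(), coords, dims=2)
hc = hclust(D, linkage=:ward)
# cutree accepts k (number of clusters) or h (height), but no size constraint
hclust_assign = cutree(hc; k=nclust)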

The size of the subsets would be the number of points contained in the subsets; I think the proper term in clustering is cardinality (correct me if I’m wrong).
Indeed, I think what is missing (but was definitely present in my mind) is: there should be no overlap between clusters, or equivalently, I think, the clusters should be convex.

A balanced k-d tree would indeed be something interesting, although I’m not sure where to start. I just had a look at Agglomerative clustering with and without structure — scikit-learn 1.5.2 documentation and it looks like it consists of combining a KNN search with a clustering step, but I’m not sure I totally understand what’s explained there.

No, convexity and being disjoint are not equivalent. I still struggle to understand what you mean by overlapping. If you assign each point to only one subset, then the subsets are trivially non-overlapping, no? Are you thinking about the convex hulls of the subsets?
In any case, I think that clustering is not the approach you are looking for; it sounds much more like partitioning. Maybe a good entry point would be Space partitioning - Wikipedia.
For the k-d trees you could have a look at GitHub - KristofferC/NearestNeighbors.jl: High performance nearest neighbor data structures (KDTree and BallTree) and algorithms for Julia.


Yes, the convex hull is probably the property that should not overlap.

I’ve just tested what you proposed and now have a KDTree. However, I don’t see what I can do with it, since in the end a coordinate should be provided to the knn or nn search.

I have to admit, I had a look yesterday at Partition · Meshes.jl and hoped to find help there, but I was not able to take advantage of this.

OK, I finally understood that KDTree.indices is actually the spatially ordered list of point indices, so the equivalent of clusters can be retrieved with order = collect(Iterators.partition(kdtree.indices, nmin)).
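Put together, the sketch is (reusing Pts and nmin from my first snippet):

using NearestNeighbors

coords = reduce(hcat, Pts)              # 2×N coordinate matrix
kdtree = KDTree(coords; leafsize=nmin)  # leafsize controls roughly how many points per leaf
# kdtree.indices lists the original point indices in tree order, so consecutive
# chunks of nmin indices give spatially grouped subsets
order = collect(Iterators.partition(kdtree.indices, nmin))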

The KDTree-sorted points give

[figure: partitions obtained from the KDTree-ordered indices]

which is pretty close to what I’m looking for. The only issue with this ordering is that some subsets are elongated, such as the last four.
I tried to perform another clustering step based on those sorted coordinates, but that did not help.
Would you have an idea to get more globular partitions?

Our work, AdaptiveHierarchicalRegularBinning.jl (AHRB for short), might interest you.

AHRB bins the input points into hierarchical non-overlapping (hyper-)cubes that can be seen as a tree. In 3-d it is an octree. The process is adaptive in the sense that the empty cubes are never formed or referenced, while the cubes with too many points are further subdivided.

Along with the input points, you specify the maximum depth of the subdivision and the maximum number of points per bin. The computation starts with a root cube enclosing all points. The root is subdivided into up to 8 subcubes. The subcubes with too many points are further subdivided, and so on.

The result is a tree that provides access to contiguous subvectors of your points; the points have been reordered to comply with the hierarchy.
The tree may be annotated with additional information in all nodes, and standard tree algorithms can be used to efficiently traverse the tree or just its leaves.

You won’t have “elongated” groups with an octree, but you may have many leaves with very few points.

There are other similar Julia packages, use “octree” in your search to find them.


Thanks for the proposal. I think there is something I’m missing, though. If I compare the results of partitioning the points using the indices provided by KDTree, BallTree and ahrb, I get the following results:

[figure: partitions from KDTree, BallTree, and AHRB indices]

I’m very surprised by the mixed point sets in the AHRB results. I’m wondering if I’m using the indices properly for the partition.
The code is the following.

using GLMakie
using GLMakie.GeometryBasics, NearestNeighbors, AdaptiveHierarchicalRegularBinning

Pts = reduce(vcat, [
    Point2f.(0.5 .* rand(100), 5 * rand(100)),
    Point2f.(1 .- log.(2 .* rand(100)), 1 * rand(100)),
])

nmax = 20
nmin = 16
nclust = ceil(Int, length(Pts) / nmax)

scatter(Pts, axis=(aspect=DataAspect(),))

coord = reduce(hcat, Pts)

kdtree = KDTree(coord; leafsize=nmin)
kd_corder = collect(Iterators.partition(kdtree.indices, nmin))

balltree = BallTree(coord; leafsize=nmin)
ball_corder = collect(Iterators.partition(balltree.indices, nmin))

ahrbtree = ahrb(Float64.(coord), nclust, nmax);
ahrb_corder = collect(Iterators.partition(ahrbtree.info.perm, nmin))

f = Figure()
for (c, (corder, label)) in enumerate(zip((kd_corder, ball_corder, ahrb_corder), ("KDTree", "BallTree", "AHRB")))
    ax = Axis(f[c, 1])
    cmap = cgrad(:tab20, length(corder), categorical=true)
    for (p, part) in enumerate(corder)
        pts = @view Pts[part]
        mk = length(part) > nmax ? :xcross : length(part) < nmin ? :utriangle : :circle
        plt = scatter!(ax, pts, color=cmap[p], marker=mk, label="Partition $p | size = $(length(part))")
        # sleep(0.1)
    end
    l = Legend(f[c, 2], ax, label; nbanks=2)
end
f

I am sorry I cannot follow your example, but here is what I think you are trying to do with AHRB:

using AdaptiveHierarchicalRegularBinning, AbstractTrees
using GLMakie
using GLMakie.GeometryBasics


X = randn(2,10_000);

tree = ahrb(X, 3, 1000; QT=UInt32);
L = collect(Leaves(tree));

f = Figure()
ax = Axis(f[1,1], aspect = DataAspect())

cmap = cgrad(:tab20, length(L), categorical=true)

for (i,l) = enumerate(L)
  scatter!(ax,points(l), color=cmap[i], markersize=5)
end
f

The resulting plot demonstrates boxes at different levels.


Hello,

It seems that you are using the ahrbtree.info.perm incorrectly. AdaptiveHierarchicalRegularBinning.jl provides an internal permutation of the input data, among other features. This internal permutation (in this case, ahrbtree.info.perm) can be used as a reference to retrieve clustering or segmentation information.

using AdaptiveHierarchicalRegularBinning
using GeometryBasics  # for Point2f

Pts = reduce(vcat,[
           Point2f.(0.5.*rand(10000),5*rand(10000)),
           Point2f.(1 .-log.(2 .* rand(10000)), 1 * rand(10000)),
])

coord = reduce(hcat, Pts)

nlevels = 3
npoints = 20
ahrbtree = ahrb(Float64.(coord), nlevels, npoints)

labels = Int.(ahrbtree.info.encoded[invperm(ahrbtree.info.perm)])

scatter(Pts, color=labels, markersize=0.5, markerstrokewidth=0, aspect_ratio=:equal)

In this example, each point is associated with a label, and nearby points are assigned the same or similar labels.

[figure: 3-levels-ahrb]

That said, it’s important to note that the primary use of AHRB is not clustering in the same sense as k-means. Instead, it is a tool for space segmentation (into regular hypercubes). To illustrate this, consider a similar example with a deeper recursion depth:

Pts = reduce(vcat,[
           Point2f.(0.5.*rand(10000),5*rand(10000)),
           Point2f.(1 .-log.(2 .* rand(10000)), 1 * rand(10000)),
])

coord = reduce(hcat, Pts)

nlevels = 6
npoints = 20
ahrbtree = ahrb(Float64.(coord), nlevels, npoints)

labels = Int.(ahrbtree.info.encoded[invperm(ahrbtree.info.perm)])

scatter(Pts, color=labels, markersize=0.5, markerstrokewidth=0, aspect_ratio=:equal)

The snippet is almost the same as the previous one, but this time the maximum tree depth (nlevels) is increased to 6. AHRB will still segment the space into squares, but it is now allowed to recurse deeper, forming more groups.

[figure: 6-levels-ahrb]

You should imagine that each little square is a cluster of its own. Any pattern that might arise (for example, the blue rectangle that forms vertically) is just an artifact of the colormap and not a real cluster.

A quick side note: in ahrb(X, nl, np), nl refers to the maximum tree depth, not the number of clusters. The number of clusters cannot be strictly controlled in AHRB. However, you can manage the population per node using np, which is the minimum number of points that any internal node is allowed to have.

You can select nl and np such that dense regions are subdivided extensively as opposed to sparse regions.

I’m also curious as to why you are grouping tree indices using Iterators.partition. Could you clarify this? Is it to ensure that each group has roughly the same size?

Let me know if my response helped or if you need further assistance with AHRB.


Thanks for trying and for the example, I definitely was not using the indices properly.

Yes, my two main constraints are that my groups should be of nearly identical size (between nmin and nmax) and non-overlapping.
Spatial partitioning with KDTree is not so bad, but it fails in edge cases, as shown above.
I also had a look at balanced clustering, but there does not seem to be a Julia implementation of these methods.
Anyway, trying ahrb on my actual dataset shows that it is not a good solution to my problem given my constraints, but thanks a lot for your suggestions and help!

So, after further reading and coding, I managed to find a suitable strategy for my needs. As discussed above, although trees work nicely most of the time, the cutting strategies lead to inappropriate results with my actual dataset, including the fact that the groups returned by the tree can be quite elongated.

So I looked further into balanced clustering and found this very detailed paper:

Rieke de Maeyer, Sami Sieranoja, Pasi Fränti. Balanced k-means revisited[J]. Applied Computing and Intelligence, 2023, 3(2): 145-179. doi: 10.3934/aci.2023008

which has a reference implementation, GitHub - uef-machine-learning/Balanced_k-Means_Revisited: Balanced k-Means Revisited algorithm (already mentioned above), although I did not use it.

Thanks to the help provided regarding trees and spatial partitioning, I could go back to this with more insight and a better understanding of how to implement things.
So, based on GitHub - JuliaStats/Clustering.jl: A Julia package for data clustering, I implemented the BKM+/++ algorithms described in the paper. This is only in a local branch at the moment, but I plan to do a PR soon.
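To give an idea of the core mechanism, here is a rough sketch of the size-penalized assignment (this is only my illustration of the idea, not the code from my branch; the actual BKM+/++ algorithms from the paper are more involved, and the ++ variant adds a swap step, cf. the swap keyword below):

using Statistics

# Sketch: assign each point to the centroid minimizing squared distance plus a
# penalty proportional to the current cluster size; this discourages any
# cluster from growing much larger than the others.
function balanced_kmeans_sketch(X::AbstractMatrix, k::Int; penalty=1.0, maxiter=50)
    n = size(X, 2)
    centers = X[:, rand(1:n, k)]          # naive random seeding, for illustration only
    assign = zeros(Int, n)
    for _ in 1:maxiter
        counts = zeros(Int, k)
        for i in 1:n
            costs = [sum(abs2, X[:, i] .- centers[:, j]) + penalty * counts[j] for j in 1:k]
            assign[i] = argmin(costs)
            counts[assign[i]] += 1
        end
        for j in 1:k                      # recompute centroids
            members = findall(==(j), assign)
            isempty(members) || (centers[:, j] = vec(mean(X[:, members], dims=2)))
        end
    end
    return assign
end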

Here are the demonstration results

obtained with the following code (not reproducible at the moment, since balanced_kmeans only exists in my local Clustering branch):

using GLMakie
using GLMakie.GeometryBasics, NearestNeighbors
using Statistics # for mean
using Clustering # not reproducible at the moment: balanced_kmeans is only in my local branch

n = 1000
Pts = reduce(vcat, [
    Point2f.(0.5 .* rand(n), 5 * rand(n)),
    Point2f.(1 .- log.(2 .* rand(n)), 1 * rand(n)),
])

nmax = 100
nmin = 16
nclust = ceil(Int, length(Pts) / nmax)

scatter(Pts, axis=(aspect=DataAspect(),))

coord = reduce(hcat, Pts)

balltree = BallTree(coord; leafsize=nmax)
ball_corder = collect(Iterators.partition(balltree.indices, nmax))

ball_cind = [first(nn(balltree, mean(Pts[cord]))) for cord in ball_corder]  # index of the point closest to each partition's centroid

clusters = kmeans(coord, nclust, init=ball_cind)
kmeans_assign = clusters.assignments

clusters = Clustering.balanced_kmeans(coord, nclust, dnmax=10, swap=false, init=ball_cind)
bkmeansp_assign = clusters.assignments

clusters = Clustering.balanced_kmeans(coord, nclust, dnmax=10, swap=true, init=ball_cind)
bkmeanspp_assign = clusters.assignments

f = Figure()
l = Label(f[0, 1:2], "Dnmax = 10")
for (c, (assign, label)) in enumerate(zip((kmeans_assign, bkmeansp_assign, bkmeanspp_assign), ("Kmeans", "Balanced Kmeans + ", "Balanced Kmeans ++")))
    ax = Axis(f[c, 1], aspect=DataAspect())
    corder = [findall(isequal(i), assign) for i in sort!(unique(assign))]
    cmap = cgrad(:tab20, length(corder), categorical=true)
    for (p, part) in enumerate(corder)
        pts = @view Pts[part]
        mk = length(part) > nmax ? :xcross : length(part) < nmin ? :utriangle : :circle
        plt = scatter!(ax, pts, color=cmap[p], marker=mk, label="Partition $p | size = $(length(part))")
        # sleep(0.1)
    end
    l = Legend(f[c, 2], ax, label; nbanks=2)
end
f

clusters = kmeans(coord, nclust, init=ball_cind)
kmeans_assign = clusters.assignments

clusters = Clustering.balanced_kmeans(coord, nclust, dnmax=1, swap=false, init=ball_cind)
bkmeansp_assign = clusters.assignments

clusters = Clustering.balanced_kmeans(coord, nclust, dnmax=1, swap=true, init=ball_cind)
bkmeanspp_assign = clusters.assignments

l = Label(f[0, 3:4], "Dnmax = 1")
for (c, (assign, label)) in enumerate(zip((kmeans_assign, bkmeansp_assign, bkmeanspp_assign), ("Kmeans", "Balanced Kmeans + ", "Balanced Kmeans ++")))
    ax = Axis(f[c, 3], aspect=DataAspect())
    corder = [findall(isequal(i), assign) for i in sort!(unique(assign))]
    cmap = cgrad(:tab20, length(corder), categorical=true)
    for (p, part) in enumerate(corder)
        pts = @view Pts[part]
        mk = length(part) > nmax ? :xcross : length(part) < nmin ? :utriangle : :circle
        plt = scatter!(ax, pts, color=cmap[p], marker=mk, label="Partition $p | size = $(length(part))")
        # sleep(0.1)
    end
    l = Legend(f[c, 4], ax, label; nbanks=2)
end
f

NearestNeighbors is, however, very useful in this case for finding good initial guesses for the clustering, by taking the point closest to each centroid.

If somebody stumbles on this, e.g. @Datseris, @juliohm, @Rudi79, please let me know if this could be of interest, for instance for GitHub - JuliaML/ClusteringAPI.jl: Common API for clustering algorithms in Julia, given the recent discussions about Clustering.