K-Means with penalty on (weighted) cluster size

nilshg · November 25, 2019, 2:01pm

As the title suggests, I’m trying to cluster a set of points in space into a given number of clusters based on their proximity, however I also want the clusters to be of roughly equal size based on the “weight” of each location that ends up in a cluster.

I’ve knocked up the simplest possible algorithm to do this based on an idea I found on CrossValidated here, and it appears to converge relatively quickly, but I’m wondering whether this is a known problem with a better solution.

From a pure Julia perspective, there seem to quite a lot of allocations so if anyone’s got suggestions to get those down that’d be welcome.

using LinearAlgebra, Random

# Create mock data
loc = 10*rand(150, 2)
dist = zeros(150, 150)
for i ∈ 1:150
    x1, y1 = loc[i, :]
    for j ∈ i:150
        x2, y2 = loc[j, :]
        dist[i, j] = √((x2 - x1)^2 + (y2 - y1)^2)
    end
end
w = 5*rand(150)
w̄ = sum(w)/length(w)

# Objective function
function obj(ass, dist, w, w̄, cs; λ = 50.0)
    # Cumulative distance between all points in cluster
    ϵ = sum(sum(@view dist[ass .== i, ass .== i]) for i ∈ 1:cs)
    # Deviation from average weight by cluster
    σ = sum(sum(@view w[ass .== i]) for i ∈ 1:cs) .- w̄

    # Objective is total distance within all clusters and scale of variation of weight across 
    sum(ϵ) + λ*norm(σ)
end

# Minimization algorithm
function get_clusters(dist, w, w̄, clusters::Integer)
    
    ass = shuffle(
        repeat(collect(1:clusters), Int(round(length(w)/clusters, RoundUp)))
    )[1:length(w)]

    ϵ = obj(ass, dist, w, w̄, clusters)

    for t ∈ 1:5 # Seems to converge within 5 iterations in practice
        for loc1 in 1:length(w)
            for loc2 in loc1:length(w)
                # perform swap
                a1 = ass[loc1]; a2 = ass[loc2]
                ass[loc1] = a2; ass[loc2] = a1

                # Calculate assignment objective value
                ϵ′ = obj(ass, dist, w, w̄, clusters)

                # Keep swap if objective went down
                if ϵ′ < ϵ
                    ϵ = ϵ′
                else
                    ass[loc1] = a1; ass[loc2] = a2
                end
            end
        end
        @show ϵ
    end

    return ass
end

# Try it out with 5 clusters
clusters = 5

# Get new assignment into clusters
ass_new = get_clusters(dist_mat, w, w̄, clusters)

zgornel · November 25, 2019, 3:31pm

Any particular reason for not using Clustering.jl’s kmeans primitives? Dropping a lot of code and asking for a better solution is bound to be …well, suboptimal.

nilshg · November 25, 2019, 3:33pm

My understanding is that k-means doesn’t support weighted penalties on unqueal cluster sizes, but I might have missed something in Clustering.jl?

baggepinnen · November 26, 2019, 12:42am

This code allocates, I would try rewrite the sum to use a function as the first argument. It might not be faster though, as the logical indexing is known to be fast, but it might be worth a try.

Edit: at the very least, reuse the bitvector created by the statement

zgornel · November 26, 2019, 10:43am

You can have a look at
https://github.com/JuliaStats/Clustering.jl/blob/master/src/kmeans.jl#L206
and
https://github.com/JuliaStats/Clustering.jl/blob/d2500396611bc742a6e3c5a9947115fde9b56938/src/kmeans.jl#L268
These may be enough I believe to alter kmeans behavior accordingly.

Topic		Replies	Views
HELP: Implementing K-means from scratch with Julia General Usage question	4	1485	February 6, 2020
Using Cosine Similarity for KMeans clustering General Usage	2	744	February 9, 2021
Balanced clustering or assignment of 3D points in space Machine Learning question , clustering , geometry	15	224	October 27, 2024
How to cluster points with a maximum cluster size? Or find points with maximum diversity? Data	12	155	November 14, 2024
A question about Clustering.jl General Usage	1	719	February 8, 2022

K-Means with penalty on (weighted) cluster size

Related topics