K-Means with penalty on (weighted) cluster size

As the title suggests, I’m trying to cluster a set of points in space into a given number of clusters based on their proximity, however I also want the clusters to be of roughly equal size based on the “weight” of each location that ends up in a cluster.

I’ve knocked up the simplest possible algorithm to do this based on an idea I found on CrossValidated here, and it appears to converge relatively quickly, but I’m wondering whether this is a known problem with a better solution.

From a pure Julia perspective, there seem to quite a lot of allocations so if anyone’s got suggestions to get those down that’d be welcome.

using LinearAlgebra, Random

# Create mock data
loc = 10*rand(150, 2)
dist = zeros(150, 150)
for i ∈ 1:150
    x1, y1 = loc[i, :]
    for j ∈ i:150
        x2, y2 = loc[j, :]
        dist[i, j] = √((x2 - x1)^2 + (y2 - y1)^2)
    end
end
w = 5*rand(150)
w̄ = sum(w)/length(w)

# Objective function
function obj(ass, dist, w, w̄, cs; λ = 50.0)
    # Cumulative distance between all points in cluster
    ϵ = sum(sum(@view dist[ass .== i, ass .== i]) for i ∈ 1:cs)
    # Deviation from average weight by cluster
    σ = sum(sum(@view w[ass .== i]) for i ∈ 1:cs) .- w̄

    # Objective is total distance within all clusters and scale of variation of weight across 
    sum(ϵ) + λ*norm(σ)
end

# Minimization algorithm
function get_clusters(dist, w, w̄, clusters::Integer)
    
    ass = shuffle(
        repeat(collect(1:clusters), Int(round(length(w)/clusters, RoundUp)))
    )[1:length(w)]

    ϵ = obj(ass, dist, w, w̄, clusters)

    for t ∈ 1:5 # Seems to converge within 5 iterations in practice
        for loc1 in 1:length(w)
            for loc2 in loc1:length(w)
                # perform swap
                a1 = ass[loc1]; a2 = ass[loc2]
                ass[loc1] = a2; ass[loc2] = a1

                # Calculate assignment objective value
                ϵ′ = obj(ass, dist, w, w̄, clusters)

                # Keep swap if objective went down
                if ϵ′ < ϵ
                    ϵ = ϵ′
                else
                    ass[loc1] = a1; ass[loc2] = a2
                end
            end
        end
        @show ϵ
    end

    return ass
end

# Try it out with 5 clusters
clusters = 5

# Get new assignment into clusters
ass_new = get_clusters(dist_mat, w, w̄, clusters)

Any particular reason for not using Clustering.jl’s kmeans primitives? Dropping a lot of code and asking for a better solution is bound to be …well, suboptimal.

My understanding is that k-means doesn’t support weighted penalties on unqueal cluster sizes, but I might have missed something in Clustering.jl?

This code allocates, I would try rewrite the sum to use a function as the first argument. It might not be faster though, as the logical indexing is known to be fast, but it might be worth a try.

Edit: at the very least, reuse the bitvector created by the statement

2 Likes

You can have a look at
https://github.com/JuliaStats/Clustering.jl/blob/master/src/kmeans.jl#L206
and
https://github.com/JuliaStats/Clustering.jl/blob/d2500396611bc742a6e3c5a9947115fde9b56938/src/kmeans.jl#L268
These may be enough I believe to alter kmeans behavior accordingly.

1 Like