# K-Means with penalty on (weighted) cluster size

As the title suggests, I’m trying to cluster a set of points in space into a given number of clusters based on their proximity. However, I also want the clusters to be of roughly equal size, measured by the “weight” of each location that ends up in a cluster.

I’ve knocked up the simplest possible algorithm to do this based on an idea I found on CrossValidated here, and it appears to converge relatively quickly, but I’m wondering whether this is a known problem with a better solution.

From a pure Julia perspective, there seem to be quite a lot of allocations, so if anyone’s got suggestions to get those down that’d be welcome.

```julia
using LinearAlgebra, Random

# Create mock data: 150 random points and their pairwise distances
# (only the upper triangle is filled, so each pair is counted once)
loc = 10*rand(150, 2)
dist = zeros(150, 150)
for i ∈ 1:150
    x1, y1 = loc[i, :]
    for j ∈ i:150
        x2, y2 = loc[j, :]
        dist[i, j] = √((x2 - x1)^2 + (y2 - y1)^2)
    end
end
w = 5*rand(150)
w̄ = sum(w)/length(w)

# Objective function
function obj(ass, dist, w, w̄, cs; λ = 50.0)
    # Cumulative distance between all points in each cluster
    ϵ = sum(sum(@view dist[ass .== i, ass .== i]) for i ∈ 1:cs)
    # Deviation of each cluster's total weight from the average weight
    σ = [sum(@view w[ass .== i]) for i ∈ 1:cs] .- w̄

    # Objective is total within-cluster distance plus scaled weight variation
    ϵ + λ*norm(σ)
end

# Minimization algorithm: random initial assignment, then greedy pairwise swaps
function get_clusters(dist, w, w̄, clusters::Integer)

    ass = shuffle(
        repeat(collect(1:clusters), cld(length(w), clusters))
    )[1:length(w)]

    ϵ = obj(ass, dist, w, w̄, clusters)

    for t ∈ 1:5 # Seems to converge within 5 iterations in practice
        for loc1 in 1:length(w)
            for loc2 in loc1:length(w)
                # Perform swap
                a1 = ass[loc1]; a2 = ass[loc2]
                ass[loc1] = a2; ass[loc2] = a1

                # Calculate objective value for the new assignment
                ϵ′ = obj(ass, dist, w, w̄, clusters)

                # Keep swap if objective went down, otherwise revert
                if ϵ′ < ϵ
                    ϵ = ϵ′
                else
                    ass[loc1] = a1; ass[loc2] = a2
                end
            end
        end
        @show ϵ
    end

    return ass
end

# Try it out with 5 clusters
clusters = 5

# Get new assignment into clusters
ass_new = get_clusters(dist, w, w̄, clusters)
```
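One standard speed-up for this kind of pairwise-swap search is delta evaluation: instead of recomputing the full objective after every trial swap, compute only the change that swap causes. A minimal sketch, assuming a symmetrized distance matrix `D` (e.g. `D = dist + dist'`, so the within-cluster double sum counts each pair twice) and a vector `W` of per-cluster weight totals maintained alongside the assignment; `dist_delta` and `weight_delta` are illustrative names, not part of the code above:

```julia
# Change in the within-cluster distance term when points p and q
# (in different clusters) swap assignments. D is assumed symmetric with a
# zero diagonal; the factor 2 reflects that the full double sum over a
# symmetric matrix counts each unordered pair twice.
function dist_delta(ass, D, p, q)
    a, b = ass[p], ass[q]
    a == b && return 0.0   # swapping within one cluster changes nothing
    Δ = 0.0
    for r in eachindex(ass)
        (r == p || r == q) && continue
        if ass[r] == a          # p leaves cluster a, q joins it
            Δ += 2*(D[q, r] - D[p, r])
        elseif ass[r] == b      # q leaves cluster b, p joins it
            Δ += 2*(D[p, r] - D[q, r])
        end
    end
    return Δ
end

# Change in the sum of squared weight deviations (i.e. norm(σ)^2):
# only clusters a and b are affected by the swap.
function weight_delta(W, w, w̄, ass, p, q)
    a, b = ass[p], ass[q]
    a == b && return 0.0
    old = (W[a] - w̄)^2 + (W[b] - w̄)^2
    new = (W[a] - w[p] + w[q] - w̄)^2 + (W[b] - w[q] + w[p] - w̄)^2
    return new - old
end
```

On an accepted swap you would update the running totals: the distance term by `dist_delta`, a running squared-deviation total `S` by `weight_delta` (the objective then being `ϵ + λ*√S`), and `W[a]`, `W[b]` by the exchanged weights.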

Any particular reason for not using `Clustering.jl`'s kmeans primitives? Dropping a lot of code and asking for a better solution is bound to be …well, suboptimal.

My understanding is that k-means doesn’t support weighted penalties on unequal cluster sizes, but I might have missed something in Clustering.jl?

This code allocates; I would try rewriting the sum to use a function as its first argument. It might not end up faster, though, since logical indexing is known to be fast, but it’s worth a try.

Edit: at the very least, reuse the bitvector created by the `ass .== i` statement rather than recomputing it for each indexing expression.
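Along those lines, here is a sketch of an allocation-free variant of `obj` that replaces the logical indexing with plain loops entirely. The name `obj_noalloc` is illustrative, and it assumes the per-cluster weight deviation σ is meant as a vector with one entry per cluster:

```julia
# Allocation-free objective: no temporary BitVectors or index arrays.
function obj_noalloc(ass, dist, w, w̄, cs; λ = 50.0)
    # Within-cluster distance: add dist[j, k] whenever j and k share a cluster
    ϵ = 0.0
    for j in eachindex(ass), k in eachindex(ass)
        if ass[j] == ass[k]
            ϵ += dist[j, k]
        end
    end

    # Accumulate squared deviations of each cluster's weight total from w̄
    s = 0.0
    for i in 1:cs
        wi = 0.0
        for j in eachindex(ass)
            ass[j] == i && (wi += w[j])
        end
        s += (wi - w̄)^2
    end

    return ϵ + λ*sqrt(s)   # sqrt(s) plays the role of norm(σ)
end
```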


You can have a look at

and

These may be enough, I believe, to alter the kmeans behavior accordingly.
