To be honest, this optimizations quite the opposite of advanced in this case. Instead of writing sophisticated code, you go to more and more low level. This is rather sad, may be someone knows better way (and it should be, for sure)
So, somewhere at the beginning of your function you define
centroids_cnt = Vector{Float64}(undef, size(centroids, 1))
and instead of the comprehensions you use this
centroids .= 0.0
centroids_cnt .= 0.0
for i in axes(design_matrix, 1)
for j in axes(design_matrix, 2)
centroids[labels[i], j] += design_matrix[i, j]
centroids_cnt[labels[i]] += 1.0
end
end
for i in axes(centroids, 1)
for j in axes(centroids, 2)
centroids[i, j] /= centroids_cnt[i]
end
end
It may be a good idea, to pack into something like update_centroids!(centroids, X, labels, centroids_cnt).
Re-use of centroids_cnt is overkill of course, I just eliminate all allocations.
I think, that it is possible to get something like ~1s of execution time in fully optimized code.