Ha, never trust a random guy from the internet. Yes, you are right: centroids_cnt should be moved to the outer loop. With this change the results are less amazing, around ~9.5s, which is still good.
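To make the fix concrete, here is a minimal sketch of what hoisting the counter out of the inner loop looks like. The names (`labels`, `centroids_cnt`, `k`) are my guesses at the gist's variables, not the actual code:

```julia
# Hypothetical sketch: the per-centroid counter is allocated and filled once
# per iteration (outer loop), instead of being recomputed inside the inner
# distance loop. `labels` maps each point to its nearest centroid index.
function centroid_counts(labels::Vector{Int}, k::Int)
    centroids_cnt = zeros(Int, k)   # allocated once, in the outer scope
    @inbounds for l in labels       # a single pass over the assignments
        centroids_cnt[l] += 1
    end
    return centroids_cnt
end

centroid_counts([1, 2, 2, 1, 2], 2)  # -> [2, 3]
```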
That’s the last one
After implementing a custom pairwise function, refactoring sum_of_squares, and reordering the loops to respect memory layout, I was finally able to achieve ~5s of execution time (now for real). It's in the same gist, so I won't repeat the link here. What is even more interesting, the new version can beat the Clustering.jl implementation.
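The layout point is the key one: Julia arrays are column-major, so the innermost loop should walk down a column. A minimal sketch of a layout-aware pairwise squared-distance kernel, assuming points are stored as columns of a d×n matrix (the function name and signature are illustrative, not the gist's actual API):

```julia
# Sketch of a column-major-friendly pairwise kernel: the inner loop over
# features `f` reads X[f, j] and C[f, i] with stride 1, so memory access
# is contiguous. D[i, j] receives the squared distance from point j to
# centroid i.
function pairwise_sqdist!(D::Matrix{Float64}, X::Matrix{Float64}, C::Matrix{Float64})
    d, n = size(X)
    k = size(C, 2)
    @inbounds for j in 1:n, i in 1:k
        s = 0.0
        @simd for f in 1:d          # contiguous, SIMD-friendly access
            s += (X[f, j] - C[f, i])^2
        end
        D[i, j] = s
    end
    return D
end

X = [0.0 1.0; 0.0 1.0]            # two 2-D points as columns
C = reshape([0.0, 0.0], 2, 1)     # one centroid at the origin
D = zeros(1, 2)
pairwise_sqdist!(D, X, C)         # D == [0.0 2.0]
```

Iterating the other way (features in the outer loop) touches memory with stride d and is what the "reordering loops" step avoids.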
using Clustering
@btime kmeans(transpose($X), 2) # 187.700 ms (95 allocations: 83.93 MiB)
@btime Kmeans($X, 2, verbose = false, tol=1e-10) # 118.517 ms (27 allocations: 45.78 MiB)
The next interesting step is the parallelization of this problem; it should be easy enough. Even more interesting would be to try to parallelize it on the GPU; something tells me that is going to be spectacular.
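The assignment step is indeed embarrassingly parallel: each point's nearest-centroid search is independent, so the outer loop over points can be split across threads without any locking. A rough sketch of what that could look like (names are illustrative; run with `julia -t N` to get multiple threads):

```julia
using Base.Threads

# Hypothetical sketch: parallelizing the label-assignment step with
# Threads.@threads. Points are columns of X, centroids are columns of C;
# each thread writes to disjoint entries of `labels`, so no locks needed.
function assign_labels!(labels::Vector{Int}, X::Matrix{Float64}, C::Matrix{Float64})
    d, n = size(X)
    k = size(C, 2)
    @threads for j in 1:n
        best, bestdist = 1, Inf
        for i in 1:k
            s = 0.0
            @inbounds for f in 1:d
                s += (X[f, j] - C[f, i])^2
            end
            if s < bestdist
                best, bestdist = i, s
            end
        end
        labels[j] = best
    end
    return labels
end

X = [0.0 1.0; 0.0 1.0]   # points (0,0) and (1,1) as columns
C = [0.0 1.0; 0.0 1.0]   # centroids (0,0) and (1,1) as columns
labels = zeros(Int, 2)
assign_labels!(labels, X, C)  # -> [1, 2]
```

The update step (recomputing centroids) needs more care under threading, since multiple points accumulate into the same centroid; per-thread partial sums merged at the end is the usual pattern.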
Clustering.jl is an existing and popular package for clustering.
I suggest benchmarking against its current implementation, and if your code is faster, making a PR.
@Skoffer would be the best person to make the PR, since he came up with the optimizations. Judging from the benchmark results, this implementation is both faster and more memory-efficient than the current one in Clustering.jl.