We are finally ready to announce ParallelKMeans is finally and stable enough for daily use. Before I go into the details of this package, I would like to apologize to @Skoffer for the little commotion my initial outburst about this idea caused 1 year ago. In hindsight, I could (and should have) handled it better.
Like the big bang, this package has not only evolved as the fastest KMeans implementation around but also gained me an amazing friendship. Enough about how amazing @Skoffer is!
The main features of this package are:
- Lightening fast implementation of K-Means clustering algorithm even on a single thread in native Julia.
- Support for multi-theading implementation of K-Means clustering algorithm.
- Implementation of classic & contemporary variants of the K-Means algorithm.
- Support for all distance metrics available at Distances.jl
- Supported interface as an MLJ model.
Current benchmark look really promising as this package is lightning fast (over 76x faster than Python’s implementation via scikit-learn) compared to other implementations in other languages.
There are still many features planned for future releases so feel free to contribute in any form or shape. This community created and pushed this idea beyond my wildest imagination.
Update: A minor bug has now been fixed in the plot for YingYang 100k sample
This is great timing! I’m just about to do some clustering for a project, so I’ll try this out.
That’s great to hear. Feedback will be greatly appreciated!
Hello! This looks super cool, also the YY implementation is great to have in Julia for sure. Is the code for the benchmark available somewhere? Last year we’ve implemented batch SOMs for huge datasets (which is basically the same as kMeans, sans one small (model) matrix multiplication per iteration), and I’d like to see how we would stand in this comparison. (The package is here: GitHub - LCSB-BioCore/GigaSOM.jl: Huge-scale, high-performance flow cytometry clustering in Julia )
Yes, the benchmarks can be found here
It would be also interesting to compare with the
Coreset approach (maybe with the internal YingYang algorithm). In my experiments,
Coreset gave high-quality clusters in a fraction of full Lloyd-like algorithms time.
noticed that R was not in the list of benchmarks.
wonder how it will perform.
We used R implementation of knor which is as far as I know faster than other R implementations of means.
Here’s a table of all the benchmark results including the languages.
The next set of expanded benchmarks will definitely have this benchmark.
Thanks for the link! What is the used dimensionality and the (final) number of clusters? I didn’t find that even from the notebook, and well, the performance varies wildly with that.
(We did something similar for GPUs here cuda-kmeans/results at master · krulis-martin/cuda-kmeans · GitHub – you can see the performance is basically ndk, so not reporting d*k makes the benchmark much less useful than it could be.)
The data used for this benchmark can be found here
The next iteration of benchmarks will give various dimension levels.
The elbow method was used as practical benchmarking criteria for the ‘selection of the final number of clusters’ as this is a common utility that practitioners use for such decisions.
In your benchmarks, all the algorithms get about 10x slower for a 10x increase in data, which makes sense, except that they each have a point where they get 1000x slower (except Knor which is at the slower level throughout). Do you know why this is? Is it a cache thing?
It looks like your implementations avoid this barrier for longer somehow, can you say why? Though even ignoring the jumps you appear to be about 8x faster anyway.
Can you please re-evaluate your observations given the update in the plot now?
I don’t understand, my observations are about the plot currently in the readme.
The plot has been updated hence the request for any potential re-evaluation.