[ANN] ParallelKMeans v1.0.0 - KMeans In Super Sonic Mode

PyDataBlog · May 25, 2021, 2:37pm

Hello everyone.

We are finally ready to announce that ParallelKMeans is stable enough for daily use. Before I go into the details of this package, I would like to apologize to @Skoffer for the little commotion my initial outburst about this idea caused a year ago. In hindsight, I could (and should) have handled it better.

Like the big bang, this package has not only evolved into the fastest KMeans implementation around but also gained me an amazing friendship. Enough about how amazing @Skoffer is!

The main features of this package are:

Lightning-fast implementation of the K-Means clustering algorithm, even on a single thread, in native Julia.
Support for a multi-threaded implementation of the K-Means clustering algorithm.
Implementation of classic & contemporary variants of the K-Means algorithm.
Support for all distance metrics available in Distances.jl.
Supported interface as an MLJ model.

Current benchmarks look really promising: this package is lightning fast (over 76x faster than Python’s implementation via scikit-learn) compared to implementations in other languages.

There are still many features planned for future releases so feel free to contribute in any form or shape. This community created and pushed this idea beyond my wildest imagination.

Update: A minor bug has now been fixed in the plot for the YingYang 100k sample.

Satvik · May 25, 2021, 3:13pm

This is great timing! I’m just about to do some clustering for a project, so I’ll try this out.

PyDataBlog · May 25, 2021, 9:15pm

That’s great to hear. Feedback will be greatly appreciated!

MirekKratochvil · May 25, 2021, 9:53pm

Hello! This looks super cool, and the YY implementation is great to have in Julia for sure. Is the code for the benchmark available somewhere? Last year we implemented batch SOMs for huge datasets (which is basically the same as kMeans, sans one small (model) matrix multiplication per iteration), and I’d like to see how we would stand in this comparison. (The package is here: GitHub - LCSB-BioCore/GigaSOM.jl: Huge-scale, high-performance flow cytometry clustering in Julia)

PyDataBlog · May 25, 2021, 10:02pm

Yes, the benchmarks can be found here

Skoffer · May 26, 2021, 6:02am

It would also be interesting to compare with the Coreset approach (maybe with the internal YingYang algorithm). In my experiments, Coreset gave high-quality clusters in a fraction of the time that full Lloyd-like algorithms need.

xiaodai · May 26, 2021, 6:22am

I noticed that R was not in the list of benchmarks.

I wonder how it would perform.

Skoffer · May 26, 2021, 6:29am

We used the R implementation of knor, which is, as far as I know, faster than other R implementations of k-means.

PyDataBlog · May 26, 2021, 7:42am

Here’s a table of all the benchmark results including the languages.

PyDataBlog · May 26, 2021, 8:01am

The next set of expanded benchmarks will definitely have this benchmark.

MirekKratochvil · May 26, 2021, 8:07am

Thanks for the link! What dimensionality and (final) number of clusters were used? I couldn’t find that even in the notebook, and the performance varies wildly with both.

(We did something similar for GPUs here: cuda-kmeans/results at master · krulis-martin/cuda-kmeans · GitHub – you can see the performance is basically n*d*k, so not reporting d*k makes the benchmark much less useful than it could be.)
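For readers wondering where that cost figure comes from, here is a minimal sketch in plain Python of a naive Lloyd assignment step. It is an illustration only, not code from ParallelKMeans or cuda-kmeans; the function name `assign_step` and the sizes `n`, `d`, `k` are made up for the example. With n points, k centers and d dimensions, the step performs exactly n*d*k squared-difference operations.

```python
import random

def assign_step(points, centers):
    """Naive Lloyd assignment step: returns (labels, squared-difference count)."""
    ops = 0
    labels = []
    for p in points:                        # n points
        best, best_dist = -1, float("inf")
        for j, c in enumerate(centers):     # k centers
            dist = 0.0
            for a, b in zip(p, c):          # d coordinates
                dist += (a - b) ** 2
                ops += 1
            if dist < best_dist:
                best, best_dist = j, dist
        labels.append(best)
    return labels, ops

random.seed(0)
n, d, k = 200, 5, 4                          # hypothetical sizes
points = [[random.random() for _ in range(d)] for _ in range(n)]
centers = random.sample(points, k)
labels, ops = assign_step(points, centers)
print(ops, n * d * k)
```

Accelerated variants skip many of these distance computations using bounds, which is one reason timings without d and k are hard to compare.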

PyDataBlog · May 26, 2021, 8:18am

The data used for this benchmark can be found here

The next iteration of benchmarks will give various dimension levels.

The elbow method was used as a practical benchmarking criterion for the ‘selection of the final number of clusters’, as this is a common utility that practitioners use for such decisions.
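As a rough illustration of the elbow method (a sketch in plain Python under simplifying assumptions, not the benchmark code): run k-means for a range of k and look for the point where the total within-cluster cost stops dropping sharply. The helper names `farthest_first` and `lloyd_cost`, the deterministic farthest-first initialization, and the toy data of three well-separated blobs are all choices made for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Deterministic init: start at points[0], then repeatedly add the
    point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def lloyd_cost(points, k, max_iter=100):
    """Run plain Lloyd iterations and return the total within-cluster cost."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center if a cluster empties
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data: three tight blobs of four points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(bx + ox, by + oy) for bx, by in blobs for ox, oy in offsets]

costs = {k: lloyd_cost(points, k) for k in range(1, 5)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

On this toy data the cost falls steeply up to k = 3 and barely changes at k = 4, so the elbow sits at 3. Note that each candidate k is a full clustering run, so a benchmark built this way measures several fits rather than one.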

cjdoris · May 26, 2021, 8:41am

Very cool!

In your benchmarks, all the algorithms get about 10x slower for a 10x increase in data, which makes sense, except that they each have a point where they get 1000x slower (except Knor which is at the slower level throughout). Do you know why this is? Is it a cache thing?

It looks like your implementations avoid this barrier for longer somehow, can you say why? Though even ignoring the jumps you appear to be about 8x faster anyway.

PyDataBlog · May 26, 2021, 9:47am

Can you please re-evaluate your observations given the update in the plot now?

cjdoris · May 26, 2021, 1:18pm

I don’t understand, my observations are about the plot currently in the readme.

PyDataBlog · May 26, 2021, 1:22pm

The plot has been updated hence the request for any potential re-evaluation.

cjdoris · May 26, 2021, 7:16pm

I don’t see anything different, my observations/questions pertain to this plot: ParallelKMeans.jl/benchmark_image.png at e14c139d68e530571b97931105da5bfcecc612b5 · PyDataBlog/ParallelKMeans.jl · GitHub
