Looking for Julia implementations of online clustering algorithms

… to no avail.

The use case is quite common: real-time clustering of streamed data.
I envision a fusion of the functionality of the excellent OnlineStats.jl and Clustering.jl, but so far I couldn’t find any ready-made implementations.

I’d appreciate any pointer to relevant Julia project(s). Thanks!

It is probably not efficient for very large datasets (which, I guess, is why you need online training), but my BetaML.GaussianMixtureClustering model has some support for online fitting:

“Online fitting (re-fitting with new data) is supported by setting the old learned mixtures as the starting values”

Give OnlineStats.KMeans a try.


Yeah, the usual problem with k-means is the need to specify the number of clusters beforehand. In my problem, the data stream is expected to bifurcate into several clusters, which can then go stale, i.e., disappear for a while, reappear, etc.

OnlineStats.jl seems to offer many building blocks useful for tackling the problem and I will give it a try, but I had hoped to find something ready-made, perhaps something like DenStream or CluStream from Python’s River library. I haven’t worked with those yet, but their features look promising.
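
To make the requirement concrete, here is a minimal, dependency-free sketch of the behavior described above: clusters are created on demand when a point is farther than a threshold from every existing center, their weights decay over time so that unused clusters go stale, and a stale cluster is reactivated when a point lands near it again. This is a simple threshold-based ("leader"-style) scheme with exponential decay, not DenStream or CluStream, and every name in it (`MicroCluster`, `StreamClusterer`, `observe!`, `active`, and the parameters `radius`, `decay`, `min_weight`) is made up for illustration rather than taken from OnlineStats.jl or Clustering.jl.

```julia
# Hypothetical sketch: none of these names belong to an existing package.
mutable struct MicroCluster
    center::Vector{Float64}
    weight::Float64   # decayed count of the points absorbed so far
    last_seen::Int    # time index of the last point assigned to this cluster
end

struct StreamClusterer
    radius::Float64       # maximum distance for joining an existing cluster
    decay::Float64        # multiplicative weight decay applied at every step
    min_weight::Float64   # clusters below this weight are considered stale
    clusters::Vector{MicroCluster}
end

StreamClusterer(; radius=1.0, decay=0.99, min_weight=0.5) =
    StreamClusterer(radius, decay, min_weight, MicroCluster[])

euclid(a, b) = sqrt(sum(abs2, a .- b))

# Process one observation `x` arriving at time index `t`.
# Returns the index of the cluster the point was assigned to.
function observe!(sc::StreamClusterer, x::AbstractVector{<:Real}, t::Integer)
    # Age every cluster, so clusters that stop receiving points fade out.
    for c in sc.clusters
        c.weight *= sc.decay
    end
    # Find the nearest center (stale clusters are included on purpose,
    # which is what lets a cluster "reappear").
    best, bestd = 0, Inf
    for (i, c) in enumerate(sc.clusters)
        d = euclid(c.center, x)
        if d < bestd
            best, bestd = i, d
        end
    end
    # Too far from everything seen so far: open a new cluster.
    if best == 0 || bestd > sc.radius
        push!(sc.clusters, MicroCluster(collect(Float64, x), 1.0, t))
        return length(sc.clusters)
    end
    # Otherwise absorb the point and move the center towards it.
    c = sc.clusters[best]
    c.weight += 1.0
    c.center .+= (x .- c.center) ./ c.weight
    c.last_seen = t
    return best
end

# Indices of the clusters that currently carry enough weight to count as active.
active(sc::StreamClusterer) =
    [i for (i, c) in enumerate(sc.clusters) if c.weight >= sc.min_weight]

# Usage: feed the stream one point at a time.
sc = StreamClusterer(radius=1.0, decay=0.9, min_weight=0.2)
for (t, x) in enumerate([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
    observe!(sc, x, t)
end
```

Each update costs one pass over the current centers, so latency stays low as long as the number of clusters is modest. The obvious weaknesses are the fixed `radius` and the lack of any merging or pruning of clusters, which is presumably where the density-based methods earn their keep.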

Thanks for the reference!
The need for online clustering comes not so much from the data size per se as from the low-latency requirement of the real-time system I’m developing.