Package for clustering data points

Any package recommendations for clustering data points? I see that Clustering.jl exists, but I don't see any recent activity on it. Is it still being maintained? Is there another package that is actively maintained?

My BetaML.jl package provides kmeans and kmedoids (hard clustering) and gmm (soft clustering), and these are also available through the MLJ interface…

Thanks. I was not aware of your package. This should work for my purpose. So, I am assuming your package covers a lot of ML functionality without focusing too much on computational efficiency, and that for more focused and tailored algorithms you refer to the alternative packages listed on your GitHub page. Is that right? (Just trying to get an idea of how your package differs from the existing ones.)

Also, how do I access it through the MLJ interface? Is there any reference I can look into? I have never used it before.

Yes, that's correct… though everything is relative.
I am away from my PC right now, but I have some benchmarks showing that, e.g., for missing-value imputation the running time compares quite well with R's mice.
The random forests are really slow compared to DecisionTree.jl, but that is because they use an algorithm that accepts almost any input, including unordered and missing data.

For the MLJ interface, refer to its documentation… it adds a bunch of concepts to grasp (machines, scientific types, …), but in return it provides a common API for all ML models…

You may also like HorseML.jl by @QGMW22 (see the announcement threads "[ANN] HorseML.jl v0.4.0" and "[ANN] HorseML.jl v0.4.1: Many ML algorithms").

@sylvaticus I have to admit that I only went very briefly through the BetaML documentation, and there seem to be some areas where I need to do additional reading. So I would like to take some of your time and ask you directly: does BetaML choose the number of clusters automatically?

Also, @math_opt, I cannot comment on timings for large data sets, but I took a look at my notes and I see that I have also been using ParallelKMeans.jl (GitHub: PyDataBlog/ParallelKMeans.jl, "Parallel & lightning fast implementation of available classic and contemporary variants of the KMeans clustering algorithm"). Hope it helps (maybe).

Hi, it doesn't "automatically" provide the number of clusters, but GMM returns both the Bayesian information criterion (BIC) and the Akaike information criterion (AIC), which can be used to choose it…
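
To make the idea concrete, here is a minimal, library-independent sketch of how BIC can be used to pick the number of clusters: fit the mixture for several candidate values of K, compute the criterion for each fit, and keep the K with the lowest value. It is written in plain Python and does not call BetaML. The function names, the parameter count (which assumes full covariance matrices), and the log-likelihood values are all hypothetical and only for illustration; BetaML may count parameters differently.

```python
import math


def gmm_free_parameters(k, d):
    """Free parameters of a k-component, d-dimensional Gaussian mixture,
    assuming full covariance matrices (an assumption of this sketch)."""
    weights = k - 1                      # mixing weights sum to 1
    means = k * d                        # one mean vector per component
    covariances = k * d * (d + 1) // 2   # symmetric d x d matrix per component
    return weights + means + covariances


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2p - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: p ln(n) - 2 ln(L)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def best_k_by_bic(log_likelihoods, n_samples, d):
    """Return the candidate K with the lowest BIC, plus all BIC scores.

    log_likelihoods maps each candidate K to the maximised log-likelihood
    of the mixture fitted with K components.
    """
    scores = {
        k: bic(ll, gmm_free_parameters(k, d), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Made-up log-likelihoods for K = 1..4 on 200 two-dimensional points.
# In practice these would come from the fitted models themselves.
example_ll = {1: -520.0, 2: -410.0, 3: -400.0, 4: -396.0}
chosen_k, bic_scores = best_k_by_bic(example_ll, n_samples=200, d=2)
print(chosen_k, bic_scores)
```

The log-likelihood alone keeps improving as K grows, so it cannot be used to choose K by itself; both criteria add a penalty for the extra parameters, and BIC penalises them more heavily than AIC once the sample has more than a handful of points.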

Hi, I got it, thanks! (BTW, also many thanks for Introduction to Scientific Programming and Machine Learning with Julia)

Thanks for directing me to these packages! I need to look into them. It looks like there are a lot of existing packages I was not aware of. Glad I asked here.
