Package for clustering data points

math_opt · June 21, 2022, 11:08pm

Any package recommendations for clustering data points? I see that Clustering.jl exists, but I don’t see any recent activity on it. Is it still being maintained? Is there any other package that is actively maintained?

sylvaticus · June 22, 2022, 10:20pm

My BetaML.jl package provides kmeans, kmedoids (hard clustering) and gmm (soft clustering), and these are available also through the MLJ interface…

math_opt · June 22, 2022, 10:57pm

Thanks. I was not aware of your package. This should work for my purpose. So, I am assuming your package does a lot of ML stuff without focusing too much on computational efficiency. For more focused and tailored algorithms you refer to the alternative packages you list on your GitHub page. Is that right? (Just trying to get an idea of how your package differs from the existing ones).

Also, how do I access it through the MLJ interface? Any reference that I can look into? I have never used it before.

sylvaticus · June 23, 2022, 4:02am

Yes, that’s correct… though everything is relative.
I am away from my PC right now, but I have some benchmarks showing that, e.g., for missing-value imputation the timing is quite good compared to R’s mice.
The random forests are really slow compared to DecisionTree.jl, but that’s because they use an algorithm that accepts almost everything, including unordered and missing data.

For the MLJ interface, refer to its documentation… it adds a bunch of concepts to grasp (machines, scientific types,…) but in return it gives you a common API for all ML models…

j_u · June 23, 2022, 10:12am

You may also like HorseML.jl by @QGMW22 ([ANN] HorseML.jl v0.4.0 and [ANN] HorseML.jl v0.4.1: Many ML algorithms)

j_u · June 23, 2022, 1:08pm

@sylvaticus I have to admit that I only went very briefly through the BetaML documentation, and there seem to be some areas where I need to do additional reading. So I would like to take some of your time and ask you directly: does BetaML choose the number of clusters automatically?

Also @math_opt, I cannot comment on timings re large data sets but I took a look at my notes and I see that I have been also using ParallelKMeans.jl (GitHub - PyDataBlog/ParallelKMeans.jl: Parallel & lightning fast implementation of available classic and contemporary variants of the KMeans clustering algorithm). Hope it helps (maybe).

sylvaticus · June 23, 2022, 1:17pm

Hi, it doesn’t “automatically” provide the number of clusters, but GMM returns both the Bayesian information criterion (BIC) and the Akaike information criterion (AIC), which can be used to choose it…
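
To make the BIC/AIC idea concrete, here is a minimal, library-neutral sketch in plain Python. It is not BetaML's API: the function names, the log-likelihood values, and the data size are all made up for illustration, and the parameter count assumes full-covariance Gaussian components. The only fixed parts are the standard definitions BIC = k·ln(n) − 2·ln(L) and AIC = 2k − 2·ln(L), where lower is better.

```python
import math

def gmm_param_count(n_clusters, n_dims):
    """Free parameters of a Gaussian mixture, assuming full covariance matrices."""
    weights = n_clusters - 1                              # mixing weights sum to 1
    means = n_clusters * n_dims                           # one mean vector per component
    covs = n_clusters * n_dims * (n_dims + 1) // 2        # symmetric covariance per component
    return weights + means + covs

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2.0 * n_params - 2.0 * log_lik

# Hypothetical fitted log-likelihoods for K = 1..5 (invented numbers, not real results).
log_liks = {1: -1250.0, 2: -1100.0, 3: -1040.0, 4: -1035.0, 5: -1032.0}
n_obs, n_dims = 500, 2  # assumed data set size and dimensionality

bic_scores = {k: bic(ll, gmm_param_count(k, n_dims), n_obs) for k, ll in log_liks.items()}
aic_scores = {k: aic(ll, gmm_param_count(k, n_dims)) for k, ll in log_liks.items()}

# Pick the number of clusters with the smallest criterion value.
best_k_bic = min(bic_scores, key=bic_scores.get)
best_k_aic = min(aic_scores, key=aic_scores.get)
print(best_k_bic, best_k_aic)  # both are 3 for these made-up numbers
```

In practice you would fit the mixture once per candidate K, read off the criterion each fit reports, and keep the K with the lowest value. BIC penalizes extra parameters more heavily than AIC once n is larger than about 8, so it tends to favor fewer clusters.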

j_u · June 23, 2022, 1:40pm

Hi, I got it, thanks! (BTW, also many thanks for Introduction to Scientific Programming and Machine Learning with Julia)

math_opt · June 23, 2022, 5:22pm

Thanks for directing me to these packages! I need to look into this. It looks like I am not aware of a lot of existing packages. Glad I asked here.
