Hi all,
this could go in any of the statistics, data, or machine learning domains, so please share.
Alright, so I was trying to test various clustering algorithms to see which one best clusters my data. In doing so, I once again encountered the odd/weird/messy interface that is currently in Clustering.jl: every single clustering algorithm has a different API. Some expect the input data as a matrix with each column a data point. Some expect a distance matrix. Some expect an adjacency matrix. Some expect configuration options as positional arguments, others as keywords.
There is no harmony. Having such harmony would allow us to test different codes much more efficiently. The fact that the script for the clustering algorithm benchmarks is so long is a testament that an established common interface would help. So I decided to tackle the issue: Common clustering API (i.e., why aren't KShiftsClustering.jl, QuickShiftClustering.jl, QuickShiftClustering.jl SpectralClustering.jl here...?) · Issue #256 · JuliaStats/Clustering.jl · GitHub
I’ve coded up a starting ClusteringAPI.jl: GitHub - JuliaDynamics/ClusteringAPI: Common API for clustering algorithms in Julia. The idea is very simple. In essence there is always a single function `result = cluster(algorithm, data)` and a second function `cluster_labels(result)`. All configuration options are given as keyword arguments when constructing `algorithm`. Full specification:
`cluster(ca::ClusteringAlgorithm, data) → cr::ClusteringResults`
Cluster input `data` according to the algorithm specified by `ca`. All options related to the algorithm are given as keyword arguments when constructing `ca`. The input data can be specified two ways:
- as a (d, m) matrix, with d the dimension of the data points and m the number of data points (i.e., each column is a data point).
- as a length-m vector of length-d vectors (i.e., each inner vector is a data point).
The cluster labels are always the positive integers `1:n` with `n::Int` the number of created clusters.
The output is always a subtype of `ClusteringResults`, which always extends the following two methods:
- `cluster_number(cr)` returns `n`.
- `cluster_labels(cr)` returns `labels::Vector{Int}`, a length-m vector of labels mapping each data point to its cluster (`1:n`).
and always includes `ca` in the field `algorithm`.
Other algorithm-related output can be obtained as a field of the result type, or via other specific functions of the result type. This is described in the individual algorithm implementations.
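To make the specification concrete, here is a minimal sketch of what implementing it could look like for a toy algorithm. This is not code from ClusteringAPI.jl: the abstract types are redefined locally so that the block runs on its own, and `RadiusLinkage` and `RadiusLinkageResults` are hypothetical names invented for illustration. The toy algorithm simply puts points that are connected through neighbors closer than `radius` into the same cluster.

```julia
# Local stand-ins for the abstract types named in the specification
# (assumption: defined here only so the sketch is self-contained).
abstract type ClusteringAlgorithm end
abstract type ClusteringResults end

# Hypothetical toy algorithm: options are given as keywords at construction.
Base.@kwdef struct RadiusLinkage <: ClusteringAlgorithm
    radius::Float64 = 1.0
end

# Hypothetical result type; it stores the algorithm in the field `algorithm`.
struct RadiusLinkageResults <: ClusteringResults
    algorithm::RadiusLinkage
    labels::Vector{Int}
end

# The two methods every result type extends.
cluster_labels(cr::RadiusLinkageResults) = cr.labels
cluster_number(cr::RadiusLinkageResults) = maximum(cr.labels; init = 0)

# Input format 1: (d, m) matrix, each column is a data point.
cluster(ca::RadiusLinkage, data::AbstractMatrix) =
    cluster(ca, collect(eachcol(data)))

# Input format 2: length-m vector of length-d vectors.
function cluster(ca::RadiusLinkage, data::AbstractVector{<:AbstractVector})
    m = length(data)
    labels = zeros(Int, m)   # 0 means "not yet assigned"
    n = 0                    # number of clusters created so far
    for i in 1:m
        labels[i] == 0 || continue
        n += 1
        labels[i] = n
        stack = [i]
        # Flood fill: add every unassigned point within `radius` of a member.
        while !isempty(stack)
            j = pop!(stack)
            for k in 1:m
                if labels[k] == 0 && sqrt(sum(abs2, data[j] .- data[k])) <= ca.radius
                    labels[k] = n
                    push!(stack, k)
                end
            end
        end
    end
    return RadiusLinkageResults(ca, labels)
end

# Usage: two well-separated pairs of points in 2D.
X = [0.0 0.1 5.0 5.1;
     0.0 0.0 5.0 5.0]
cr = cluster(RadiusLinkage(radius = 0.5), X)
cluster_labels(cr)   # [1, 1, 2, 2]
cluster_number(cr)   # 2
```

The matrix method only converts columns to vectors and forwards to the vector-of-vectors method, so an implementation needs to handle just one input format internally.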
In the folder ClusteringAPI/examples at main · JuliaDynamics/ClusteringAPI · GitHub I’ve coded up three example implementations. One is an automated `HClust` taken from Tim Holy’s clustering benchmarks. The second is a heavily improved version of DBSCAN that we use in Attractors.jl but haven’t yet had the chance to put it in Clustering.jl, and the last is an update of QuickShiftClustering.jl. Unfortunately, the code of QuickShiftClustering.jl is incorrect. I updated it (it was written 9 years ago), and it runs, but it doesn’t give correct results (see comments at the end of the file).
My first RFC is: is this API enough? It is enough for all clustering algorithms I am aware of. But I don’t know everything.
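One way to probe this question is to write downstream code that relies on nothing but `cluster`, `cluster_labels`, and `cluster_number`. The sketch below does that with two deliberately trivial toy algorithms. Everything in it is an assumption made for illustration: `OneCluster`, `Singletons`, `ToyResults`, `npoints`, and `cluster_sizes` are hypothetical names, and the abstract types are redefined locally so the block runs without the package.

```julia
# Local stand-ins for the API's abstract types (assumption, see above).
abstract type ClusteringAlgorithm end
abstract type ClusteringResults end

# Two trivial toy algorithms, for illustration only.
struct OneCluster <: ClusteringAlgorithm end   # everything in one cluster
struct Singletons <: ClusteringAlgorithm end   # every point is its own cluster

struct ToyResults{A<:ClusteringAlgorithm} <: ClusteringResults
    algorithm::A
    labels::Vector{Int}
end
cluster_labels(cr::ToyResults) = cr.labels
cluster_number(cr::ToyResults) = length(unique(cr.labels))

# Number of data points for the two supported input formats.
npoints(data::AbstractMatrix) = size(data, 2)
npoints(data::AbstractVector) = length(data)

cluster(ca::OneCluster, data) = ToyResults(ca, ones(Int, npoints(data)))
cluster(ca::Singletons, data) = ToyResults(ca, collect(1:npoints(data)))

# Generic downstream code: it knows nothing about the individual algorithms.
function cluster_sizes(ca::ClusteringAlgorithm, data)
    cr = cluster(ca, data)
    labels = cluster_labels(cr)
    return [count(==(k), labels) for k in 1:cluster_number(cr)]
end

X = rand(2, 5)   # five 2D points, one per column
for ca in (OneCluster(), Singletons())
    println(nameof(typeof(ca)), ": ", cluster_sizes(ca, X))
end
```

If a comparison script for real algorithms needs more than these three functions, that would be a hint that the API is missing something.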
My second RFC is a poll on how the community would prefer the package to be structured when it comes to extending it:
- ClusteringAPI.jl is pure without any algorithm source code. Packages that implement clustering algorithms have it as dependency and extend the methods.
- ClusteringAPI.jl adds extensions via the new Julia package extensions system and therefore has only wrapper code to extend `cluster()`. There is an `ext` folder, and when a user does `using ClusteringAPI, Clustering`, some algorithms of Clustering.jl become available via `cluster()`. This is awkward to make work nicely because it would require defining types in the `src` folder but only extending the `cluster` method if the corresponding package is loaded.
- ClusteringAPI.jl becomes/replaces Clustering.jl and has the source code of all clustering algorithms from all packages, which would require cooperation from everyone. This is definitely my personal favorite, because it minimizes the overall amount of code by removing redundancies/duplications.
My third RFC is: where should ClusteringAPI.jl be? I started it in an org I trust, but I am happy to move it to any other org where it may fit better (provided a team is made so that I also keep admin access to improve the package; we will use it at Attractors.jl).