Hi all,

This could go in the statistics, data, or machine learning domains, so please share.

Alright, so I was trying to test various clustering algorithms to see which one best clusters my data. In doing so, I once again encountered the odd/weird/messy interface that is currently in Clustering.jl: every single clustering algorithm has a different API. Some expect the input data as a matrix with each column a data point. Some expect the distance matrix. Some expect an adjacency matrix. Some expect configuration options as positional arguments, others as keywords.

There is no harmony. A harmonized interface would allow us to test different algorithms much more efficiently. The fact that the script of the benchmarks of clustering algorithms is so long is a testament that an established common interface would help. So I decided to tackle the issue "Common clustering API (i.e., why aren't KShiftsClustering.jl, QuickShiftClustering.jl, QuickShiftClustering.jl SpectralClustering.jl here...?)" (Issue #256, JuliaStats/Clustering.jl, GitHub).

I've coded up a starting ClusteringAPI.jl (GitHub: JuliaDynamics/ClusteringAPI, "Common API for clustering algorithms in Julia"). The idea is very simple. In essence there is always a single function `result = cluster(algorithm, data)` and a second function `cluster_labels(result)`. All configuration options are given as *keyword* arguments when constructing `algorithm`. Full specification:

`cluster(ca::ClusteringAlgorithm, data) → cr::ClusteringResults`

Cluster input `data` according to the algorithm specified by `ca`. All options related to the algorithm are given as keyword arguments when constructing `ca`. The input data can be specified two ways:

- as a (d, m) matrix, with d the dimension of the data points and m the number of data points (i.e., each column is a data point).
- as a length-m vector of length-d vectors (i.e., each inner vector is a data point).

The cluster labels are always the positive integers `1:n` with `n::Int` the number of created clusters.

The output is always a subtype of `ClusteringResults`, which always extends the following two methods:

- `cluster_number(cr)` returns `n`.
- `cluster_labels(cr)` returns `labels::Vector{Int}`, a length-m vector of labels mapping each data point to its cluster (`1:n`).

and always includes `ca` in the field `algorithm`.

Other algorithm-related output can be obtained as a field of the result type, or via other specific functions of the result type. This is described in the individual algorithm implementations.
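To make the specification concrete, here is a minimal sketch of what an implementation could look like. It is not the code in ClusteringAPI.jl: the toy algorithm `SeedRadius`, its result type `SeedRadiusResults`, the `seeds` field, and the helper `each_data_point` are all hypothetical names made up for illustration. Only `cluster`, `cluster_number`, `cluster_labels`, the two abstract types, and the `algorithm` field come from the specification above.

```julia
# Abstract supertypes and generic functions named in the specification.
abstract type ClusteringAlgorithm end
abstract type ClusteringResults end
function cluster end
function cluster_number end
function cluster_labels end

# Hypothetical helper: normalize both accepted input forms to a vector of points.
each_data_point(data::AbstractMatrix) = [collect(c) for c in eachcol(data)]
each_data_point(data::AbstractVector{<:AbstractVector}) = data

# Hypothetical toy algorithm: a point joins the first cluster whose seed point
# lies within `radius`; otherwise it seeds a new cluster.
# All options are keyword arguments of the constructor.
Base.@kwdef struct SeedRadius <: ClusteringAlgorithm
    radius::Float64 = 1.0
end

struct SeedRadiusResults <: ClusteringResults
    algorithm::SeedRadius            # the spec requires the field `algorithm`
    labels::Vector{Int}
    seeds::Vector{Vector{Float64}}   # extra algorithm-specific output
end

function cluster(ca::SeedRadius, data)
    seeds = Vector{Vector{Float64}}()
    labels = Int[]
    for p in each_data_point(data)
        j = findfirst(s -> sqrt(sum(abs2, p .- s)) <= ca.radius, seeds)
        if j === nothing
            push!(seeds, collect(Float64, p))
            j = length(seeds)
        end
        push!(labels, j)
    end
    return SeedRadiusResults(ca, labels, seeds)
end

cluster_number(cr::SeedRadiusResults) = length(cr.seeds)
cluster_labels(cr::SeedRadiusResults) = cr.labels

# Usage: four 2-dimensional points, once as a (d, m) matrix, once as vectors.
data = [0.0 0.1 5.0 5.2;
        0.0 0.1 5.0 4.9]
cr = cluster(SeedRadius(radius = 1.0), data)
cluster_labels(cr)   # [1, 1, 2, 2]
cluster_number(cr)   # 2
cr_vec = cluster(SeedRadius(radius = 1.0), [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.2, 4.9]])
```

Because every algorithm is called the same way, comparing algorithms reduces to looping over a list of algorithm instances and calling `cluster(alg, data)` on each.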

In the folder ClusteringAPI/examples (at main, JuliaDynamics/ClusteringAPI on GitHub) I've coded up three example implementations. One is an automated `HClust` taken from Tim Holy's clustering benchmarks. The second is a heavily improved version of DBSCAN that we use in Attractors.jl but haven't yet had the chance to put in Clustering.jl, and the last is an update of QuickShiftClustering.jl. Unfortunately, the code of QuickShiftClustering.jl is incorrect. I updated it (it was written 9 years ago), and it runs, but it doesn't give correct results (see comments at the end of the file).

My first RFC is: is this API enough? It is enough for all clustering algorithms I am aware of. But I don't know everything.

My second RFC is a poll on how the community prefers to have the structure of the package when it comes to extending:

- ClusteringAPI.jl is pure, without any algorithm source code. Packages that implement clustering algorithms have it as a dependency and extend the methods.
- ClusteringAPI.jl adds extensions via the new Julia package extensions system and therefore has only wrapper code to extend `cluster()`. There is an `ext` folder, and when a user does `using ClusteringAPI, Clustering`, some algorithms of Clustering.jl become available via `cluster()`. This is awkward to make work nicely because it would require defining types in the `src` folder but only extending the `cluster` method if the corresponding package is loaded.
- ClusteringAPI.jl becomes/replaces Clustering.jl and has *all* clustering algorithms' source code from all packages, which would require cooperation from everyone. This is definitely my personal favorite, because it minimizes the overall amount of code by removing redundancies/duplications.
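For anyone unsure what the first option means in practice, here is a rough single-file sketch that uses two modules as stand-ins for two packages. The module names `MockClusteringAPI` and `MockAlgorithms` and the trivial `SingleCluster` algorithm are hypothetical and exist only for illustration; in reality these would be separate packages, with the second listing the first as a dependency.

```julia
# Stand-in for the "pure" API package: only abstract types and function stubs.
module MockClusteringAPI
    abstract type ClusteringAlgorithm end
    abstract type ClusteringResults end
    function cluster end
    function cluster_number end
    function cluster_labels end
end

# Stand-in for a downstream package that depends on the API and adds methods.
module MockAlgorithms
    import ..MockClusteringAPI: ClusteringAlgorithm, ClusteringResults,
        cluster, cluster_number, cluster_labels

    # Hypothetical trivial algorithm: every data point goes into cluster 1.
    struct SingleCluster <: ClusteringAlgorithm end

    struct SingleClusterResults <: ClusteringResults
        algorithm::SingleCluster
        labels::Vector{Int}
    end

    # Number of data points for both accepted input forms.
    npoints(data::AbstractMatrix) = size(data, 2)
    npoints(data::AbstractVector) = length(data)

    cluster(ca::SingleCluster, data) = SingleClusterResults(ca, ones(Int, npoints(data)))
    cluster_number(cr::SingleClusterResults) = 1
    cluster_labels(cr::SingleClusterResults) = cr.labels
end

# User code only needs the API functions plus the algorithm type.
const API = MockClusteringAPI
cr_mock = API.cluster(MockAlgorithms.SingleCluster(), rand(2, 5))
API.cluster_labels(cr_mock)   # five labels, all equal to 1
```

The API module contains no algorithm code at all; the downstream module owns both the algorithm type and the methods, so no types have to live in one place while their methods live in another.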

My third RFC is: where should ClusteringAPI.jl live? I started it in an org I trust, but I am happy to move it to any other org where it may fit better (provided a team is set up so that I keep admin access to maintain and improve the package, since we will use it in Attractors.jl).