RFC: ClusteringAPI.jl

Hi all,

This could go in either the statistics, data, or machine learning domains; please share.

Alright, so I was trying to test various different clustering algorithms to see which one clusters my data best. In doing so, I encountered once again the odd/weird/messy interface that is currently in Clustering.jl: every single clustering algorithm has a different API. Some expect the input data as a matrix with each column a data point, some expect a distance matrix, some expect an adjacency matrix. Some expect configuration options as positional arguments, others as keywords.

There is no harmony. Having such harmony would allow us to test different algorithms much more efficiently. The fact that the benchmark script for clustering algorithms is so long is a testament that an established common interface would help. So I decided to tackle Common clustering API (i.e., why aren't KShiftsClustering.jl, QuickShiftClustering.jl, SpectralClustering.jl here...?) · Issue #256 · JuliaStats/Clustering.jl · GitHub

I’ve coded up a starting ClusteringAPI.jl: GitHub - JuliaDynamics/ClusteringAPI: Common API for clustering algorithms in Julia. The idea is very simple. In essence there is always a single function result = cluster(algorithm, data) and a second function cluster_labels(result). All configuration options are given as keyword arguments when constructing algorithm. Full specification:

cluster(ca::ClusteringAlgorithm, data) → cr::ClusteringResults

Cluster input data according to the algorithm specified by ca.
All options related to the algorithm are given as keyword arguments when
constructing ca. The input data can be specified in two ways:

  • as a (d, m) matrix, with d the dimension of the data points and m the number of
    data points (i.e., each column is a data point).
  • as a length-m vector of length-d vectors (i.e., each inner vector is a data point).

The cluster labels are always the
positive integers 1:n with n::Int the number of created clusters.

The output is always a subtype of ClusteringResults,
which always extends the following two methods:

  • cluster_number(cr) returns n.
  • cluster_labels(cr) returns labels::Vector{Int}, a length-m vector of labels
    mapping each data point to its cluster (an integer in 1:n).

and always includes ca in the field algorithm.

Other algorithm-related output can be obtained as a field of the result type,
or other specific functions of the result type.
This is described in the individual algorithm implementations.
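
For concreteness, here is a minimal usage sketch; SomeAlgorithm and its keyword are placeholder names, not an actual implementation in the package:

using ClusteringAPI

data = rand(3, 1000)                # (d, m) matrix: 1000 three-dimensional points
ca = SomeAlgorithm(; radius = 0.1)  # all options are keywords of the algorithm struct
cr = cluster(ca, data)              # cr isa ClusteringResults

n = cluster_number(cr)       # number of created clusters
labels = cluster_labels(cr)  # length-1000 Vector{Int} with entries in 1:n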

In the folder ClusteringAPI/examples at main · JuliaDynamics/ClusteringAPI · GitHub I’ve coded up three example implementations. One is an automated HClust taken from Tim Holy’s clustering benchmarks. The second is a heavily improved version of DBSCAN that we use in Attractors.jl but haven’t yet had the chance to put into Clustering.jl, and the last is an update of QuickShiftClustering.jl. Unfortunately, the code of QuickShiftClustering.jl is incorrect. I updated it (it was written 9 years ago), and it runs, but it doesn’t give correct results (see comments at the end of the file).

My first RFC is: is this API enough? It is enough for all clustering algorithms I am aware of. But I don’t know everything.

My second RFC is a poll on how the community prefers to structure the package when it comes to extending it:

  • ClusteringAPI.jl is pure without any algorithm source code. Packages that implement clustering algorithms have it as dependency and extend the methods.
  • ClusteringAPI.jl adds extensions via the new Julia package extensions system and therefore contains only wrapper code extending cluster(). There is an ext folder, and when a user does using ClusteringAPI, Clustering, some algorithms of Clustering.jl become available via cluster(). This is awkward to make work nicely, because it would require defining the algorithm types in the src folder but only extending the cluster method when the corresponding package is loaded (a rough sketch of this wiring is given below the poll).
  • ClusteringAPI.jl becomes/replaces Clustering.jl and contains the source code of all clustering algorithms from all packages, which would require cooperation from everyone. This is definitely my personal favorite, because it minimizes the overall amount of code by removing redundancies/duplications.
0 voters
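
For the second option, the wiring would go through the standard [weakdeps]/[extensions] entries of Project.toml. A rough sketch of the extension module is below; KMeansAlg and KMeansResults are hypothetical names that would have to be defined in src so they exist even without Clustering.jl loaded:

# ext/ClusteringAPIClusteringExt.jl (loaded only when Clustering.jl is also loaded)
module ClusteringAPIClusteringExt

using ClusteringAPI, Clustering

function ClusteringAPI.cluster(ca::ClusteringAPI.KMeansAlg, data::AbstractMatrix)
    res = Clustering.kmeans(data, ca.k)  # call the underlying implementation
    # wrap into the common results type: store the algorithm and the integer labels
    return ClusteringAPI.KMeansResults(ca, Clustering.assignments(res))
end

end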

My third RFC is: where should ClusteringAPI.jl live? I started it in an org I trust, but I am happy to move it to any other org where it may fit (provided a team is made so that I keep admin access and can co-maintain the package; we will use it in Attractors.jl).

10 Likes

Great! Thanks for the initiative! FYI, there has also been some discussion about clustering in this long discussion on ML API.

2 Likes

Thanks for letting me know, but this other post is too long to parse in detail :frowning: I tried to search for comments containing “clustering” but I didn’t find anything specifically related, only general comments about the larger ecosystem. Would you mind linking specific comments you think are noteworthy for the current post?

I’d add a field for score / probability.
Many clustering algorithms are fuzzy / soft.

Additional point: some algorithms map a point to an outlier, so the labels cannot be limited to 1:n.

What about clustering arbitrary objects, when one can compute distances between them?

An alternative API is the scikit-learn like API we employ in the clustering algorithms of BetaML:

model = ClusterAlgo(options)
fit!(model, data)
classes = predict(model, [optional data])

The optional data is for algorithms that can generalize to new data.
For example, with new data you may want to continue fitting (moving the cluster centers/mixtures), or just predict the class of the new data.

Also for a common interface one could use MLJ…

1 Like

The returned results type may have arbitrary algorithm-specific fields that the algorithms describe, so this is covered. If we believe there is a universal quantity that should be API-supported, we can make a new function such as cluster_scores(results).

Very good point. In Attractors.jl we follow the convention of mapping such points to the special label -1. We could use the special integer -1 (or 0) here as well. Alternatively, we can make the returned labels a Vector{Union{Int, Nothing}}, using nothing for points that can’t be clustered. I don’t think either has a particular benefit, it is just a convention we need to agree on. I prefer -1 because it keeps the container concretely typed.
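
Just to illustrate the -1 convention (the labels are made up):

labels = cluster_labels(cr)          # e.g. [1, 1, -1, 2, 2, 1], still a Vector{Int}
outliers = findall(==(-1), labels)   # points that could not be assigned to any cluster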

I refer to “vectors” in the general sense of objects of vector spaces, so this is covered. Any arbitrary object on which you can define a distance from another object is likely to be representable as a “vector”. I will update the description to reflect that.

@sylvaticus I guess so, but why…? I’ve never used MLJ, and in fact I’ve never even considered clustering as machine learning. I guess you could interpret clustering as machine learning, but why use such a huge package instead of having a very focused API for a very specific task? I don’t see the benefit.

Some algorithms can generalize to new data, but to my knowledge most can’t (please educate me if I am wrong). So at best this needs to be an optional API adjustment, but cannot be an argument to drive the mandatory API.

A solution I see is that nothing forbids the algorithm type from being re-usable. When you do ca = ReusableAlg(...), this can initialize internal fields that get adjusted to new data. Hence, the first cluster(ca, data) call is your training phase, which already produces labels. Then, an additional function specific to this algorithm (as it can’t be used with most algorithms), recluster(ca, newdata), can utilize new data. If more than one algorithm uses recluster, then recluster becomes part of the API as an optional extension and is simply not implemented for the majority of algorithms.
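
A rough sketch of what I have in mind; all names and fields here are hypothetical:

struct ReusableAlg <: ClusteringAlgorithm
    k::Int
    centers::Vector{Vector{Float64}}  # internal state, filled during the first cluster call
end
ReusableAlg(; k = 2) = ReusableAlg(k, Vector{Vector{Float64}}())

ca = ReusableAlg(; k = 3)
cr = cluster(ca, data)        # "training" phase: computes ca.centers and the labels
cr2 = recluster(ca, newdata)  # optional, algorithm-specific: assign newdata to the existing centers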

Thanks for organizing the house. Would you mind swapping the order of arguments to read as cluster(data, algo)? We adopted this convention in our stack, and it is spread across multiple packages in the ecosystem. As a non-native English speaker I tend to read it like “cluster the data with the given algorithm”, not the other way around.

Man, I’d love to see a julian HDBSCAN…

I don’t mind at all. I actually thought that this was the language-wide convention: first the algorithm-deciding argument, then the input data. This is what we have adopted on our side… I admit, I do not know now. Perhaps we could gather more data on how other Julia organizations handle this?

But I really don’t mind the arguments being the other way around; a consistent API is the only thing I really care about!

HDBSCAN is coming soon in this PR: Add HDBSCAN from HorseML.jl by MommaWatasu · Pull Request #273 · JuliaStats/Clustering.jl · GitHub

(note our improvement of DBSCAN is not HDBSCAN. It automates and improves the optimal radius choice, which has a huge impact on the outcome)

1 Like

Recalling why we made this choice: it came out of a very long and intense design discussion over the course of 6+ months when designing ComplexityMeasures.jl. The case with the algorithm first generalizes much better to multiple input arguments governing data: quantity(algorithm, x, y, ...), and quantity(algorithm, x) is just the version with one input dataset.

Great then!
Just keep in mind that “can compute distances” does not imply “vector space”.

As for argument order, totally agree with f(algorithm, datas...) – passing data last is very common in Julia for a variety of reasons.

2 Likes

SciML uses solve(problem, algorithm), so cluster(data, algorithm) would fit well.

1 Like

We usually wrap quantity(algo, x, y, z, …) into a concept represented with a struct:

quantify(MyConcept(x, y, z), algo)

I am not sure what its status is, but I thought LearnAPI.jl was attempting to encompass clustering algorithms as well.

Thanks for this unification initiative! I think the core concept of the proposed API is solid and matches what Clustering.jl has already (see Clustering.jl API). Some minor thoughts/suggestions:

  • cluster_number() – I think the Julia pattern for naming such methods is nclusters() (already adopted by Clustering.jl)
  • cluster_labels() – it’s a bit confusing to call it “label”; one might expect a label to be a string or a symbol. I think Julia pretty universally uses the term index for 1…n identifiers. The Clustering.jl name for this method is assignments(), which I also find not ideal. What about cluster_indices()?
  • cluster() – unfortunately, I suspect it will be a source of confusion: whether it’s a verb (so the method does the clustering) or a noun (so the method accesses cluster information). Since the method would dispatch on ClusteringAlgorithm, a more generic verb might work, e.g. fit(), predict(), evaluate() etc.
  • cluster(algo, data) vs cluster(data, algo) – I think it makes sense to have algo first, since it’s what the method is dispatched on. This scheme is also adopted by many Julia packages, including the closely related MultivariateStats.jl. Also it fits the map(fun, collection) pattern of the core language.

Another consideration for the common API – what about fuzzy and hierarchical clustering, do you plan to support them?

2 Likes

Thanks @Datseris for starting this discussion. I’ve not followed all of it, but thought it might be useful to say how clustering is expected to look in the broader ML API proposal, LearnAPI.

First note that LearnAPI provides two methods for generating output, predict and transform. The first is for “target” variables broadly understood, and the labels in clustering are an example of a target variable. The transform method is for other types of output, which for models like KMeans and KMedoids could be a dimension reduction.

The predict method dispatches on the kind of target you are predicting (using the singleton KindOfProxy types). For a regular classifier, the typical kinds are Distribution for probabilistic predictors and LiteralTarget for deterministic (point) predictors. The target in clustering is different because the labels assigned are only defined up to a permutation, and so we have separate kinds for these, including:

  • LearnAPI.LabelAmbiguous: collections of labels (in the case of a multi-class target) but without a known correspondence to the original target labels (and of possibly different number), as in, e.g., clustering.
  • LearnAPI.LabelAmbiguousDistribution: pdf/pmf version of LabelAmbiguous.

The second one is for “soft” labels. We probably need to add something here for fuzzy clusters (more than one label per point).

Note that clustering algorithms can be analyzed using a measure like the Rand index which only makes sense for these kinds of target.

So here’s a typical workflow for a clusterer that generalises to new data:

algorithm = KMeans() # a container for hyperparameters of the algorithm, like number of labels
model = fit(algorithm, X) # X is data

# to get labels on new data:
predict(model, LearnAPI.LabelAmbiguous(), Xnew)

To get labels on the training data (or any other byproduct of training, such as DBSCAN “boundary” points) you use an accessor function, in this case training_labels:

training_labels(model)

Some other algorithm-specific accessor functions might be needed.

For something with “soft” labels (GMM clustering), you’d do something like this:

# output soft labels:
predict(model, LearnAPI.LabelAmbiguousDistribution(), Xnew)

And of course a model could provide both “soft” and “hard” functionality with this API.

I’ve considered having a trait to implement a fallback value for the kind of target, to save needing to specify it every time.

Hierarchical clustering would probably pass the height as a keyword to predict, but there needs to be a fallback height, which could be controlled by a hyperparameter:

predict(hierarchical_model, X; height=2)

Unlike the current proposal (which looks like TableTransforms.jl’s “reapply” pattern), a model like DBSCAN, which does not generalize to new data, is currently handled this way:

algorithm = DBSCAN()
model = fit(algorithm) # no data provided, no operation performed
predict(model, LearnAPI.LabelAmbiguous(), X) # does the work

and here predict is allowed to mutate model to add byproducts of the calculation (e.g., boundary, core, noisy point classifications), to make them available to the accessor functions. To save the need for the second line, we might provide a fit_predict method.

The reason I prefer this pattern over the reapply pattern has to do with the needs, as I see them, for model composition (pipelines, and so forth). (I have some experience with such design in MLJ.)

Nothing in LearnAPI.jl is yet cast in stone, but there has already been quite a lot of discussion.

Problems with using integers for labels

While it might be okay for a low-level API, it may be preferable to avoid raw integers for label encoding in clustering algorithms. Sometimes you need to know the total number of labels, but when subsampling not all labels may be present. This can lead to crashes, or silently wrong behavior. Better would be CategoricalArrays, which are well optimised and store the complete class pool in every view. While this can be mitigated by passing the total number of labels as metadata, this is difficult to accommodate in complicated pipelines.
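
A small illustration of the subsampling point, using CategoricalArrays.jl:

using CategoricalArrays

raw = [1, 1, 2, 3, 3]               # plain integer labels
length(unique(raw[[1, 2, 4, 5]]))   # 2: cluster 2 vanished, the total cluster count is lost

clabels = categorical(raw)          # categorical labels carry the full class pool
levels(clabels[[1, 2, 4, 5]])       # still [1, 2, 3]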

I prefer “label” to “index”. In informal discussion, one says “Clustering is for unlabelled data.”, not “Clustering is for unindexed data.” Scikit-learn uses this terminology, for example. However, I understand these naming discussions are often contentious.

1 Like

I don’t think that dispatch on first argument is very helpful in this case. The main benefit of placing the algorithm/model in the second argument is that we can omit it sometimes, with good defaults:

cluster(data) # picks algorithm based on data properties
cluster(data, algo) # be explicit about algorithm

@ablaom I believe that this API is not beginner-friendly, nor does it help new users quickly get their jobs done. Maybe I am the only one with this opinion, but I’d rather stick to a simpler API for clustering models, which has two basic function names for deterministic versus probabilistic results:

cluster(data, algo) # returns vector of CategoricalValue (or CategoricalVector)
clusterprob(data, algo) # returns vector of categorical distribution

The return types of these functions are known before they are called. This is a feature that is underappreciated.

2 Likes