RFC: ClusteringAPI.jl

It may be important context that, AFAIK, LearnAPI.jl is not intended for direct use by beginners, but rather as a Tables.jl-like abstract API for package developers.

From what I have seen in the past, these attempts to create large complex APIs for ML tend to freeze development of basic features for most users.

I’d focus on getting a simple cluster/clusterprob pair working nicely with multiple models, and only then try to generalize things for more advanced use cases.
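As a rough illustration of what such a pair could look like, here is a minimal sketch. Everything in it is an assumption made for the example: the `ClusteringAlgorithm` supertype, the toy `NearestCenter` model, the `(algorithm, data)` argument order, and the choice to return plain labels and plain probability vectors.

```julia
# Hypothetical sketch; none of these names come from an existing package.
abstract type ClusteringAlgorithm end

# Toy model: assign each point to the nearest of a fixed set of centers.
struct NearestCenter <: ClusteringAlgorithm
    centers::Vector{Vector{Float64}}
end

sqdist(a, b) = sum(abs2, a .- b)

# Hard assignment: one integer label per data point.
function cluster(alg::NearestCenter, data::AbstractVector)
    return [argmin([sqdist(x, c) for c in alg.centers]) for x in data]
end

# Soft assignment: one probability vector per data point,
# here a softmax over negative squared distances to the centers.
function clusterprob(alg::NearestCenter, data::AbstractVector)
    return map(data) do x
        w = [exp(-sqdist(x, c)) for c in alg.centers]
        w ./ sum(w)
    end
end

alg = NearestCenter([[0.0, 0.0], [5.0, 5.0]])
data = [[0.1, -0.2], [4.8, 5.1], [0.3, 0.4]]

labels = cluster(alg, data)      # [1, 2, 1]
probs = clusterprob(alg, data)   # each entry sums to 1
```

Any other model would only need to add its own methods for the same two functions.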

The other thing missing from the existing clustering package is an optimal clustering assignment. As we all know, different clustering algorithms can give different numbers of clusters, and can also assign the same data point to different clusters.

Here is the doco for NbClust,

The R package NbClust has a really nice feature where it runs a bunch of clustering algorithms over the data. I think there are like 20 different methods. The NbClust package then spits out the optimal number of clusters and assignments, based upon different voting methods for each clustering method. This is one of the functions that I use a lot in R, and that I keep having to use RCall to pull into Julia.
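To make the voting idea concrete, here is a minimal sketch of a majority vote over the number of clusters suggested by several methods. It is not NbClust's implementation; the function name `majority_vote`, the input values, and the tie-breaking rule are all assumptions made for illustration.

```julia
# Hypothetical sketch of majority voting on the number of clusters.
function majority_vote(votes::AbstractVector{<:Integer})
    counts = Dict{Int,Int}()
    for k in votes
        counts[k] = get(counts, k, 0) + 1
    end
    top = maximum(values(counts))
    # Break ties in favor of the smallest number of clusters (an arbitrary choice).
    return minimum(k for (k, c) in counts if c == top)
end

# Made-up example: each entry is the number of clusters one method suggested.
suggested_k = [3, 2, 3, 4, 3, 2]
best_k = majority_vote(suggested_k)   # 3
```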

I plan to go ahead and finish this soon. It is only a matter of deciding how to do it.

Let’s focus on the low-level API for now. At the moment nobody stops you from wrapping the raw output in a categorical array. I would also second what @juliohm said. The focus here is just the pure clustering API itself. It is better to prioritize getting this API down than to try to future-proof the design and cater to ML applications. ML applications can always use a good API no matter how it is designed, provided it is good. But going the other way, using an ML-catered API for non-ML things, is certainly not as simple.


I think cluster_labels is a better name than cluster_indices. Furthermore, as was pointed out already, there is no reason to limit the labels to integers. They could be anything. Although, since they could be anything, they might as well be integers. Specific algorithms can provide vectors of anything to match the integers should they wish to. Personally I don’t see the point yet.

The question is what to use for representing “unclustered” data points that the algorithm failed to assign. If using the positive integers for the labels, the integer -1 or 0 could be used for them. So this is a simple argument in favor of integers as labels.
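A minimal sketch of how such a sentinel could be used downstream. The choice of 0 (rather than -1), the constant `UNCLUSTERED`, and the helper `is_clustered` are assumptions for this example, not something settled in the thread.

```julia
# Assumed convention: positive integers are cluster labels,
# and 0 marks a point the algorithm failed to assign.
const UNCLUSTERED = 0

is_clustered(label::Integer) = label != UNCLUSTERED

# Made-up label vector, as a clustering algorithm might return it.
labels = [1, 2, 0, 1, 3, 0, 2]

# Indices of the points that were not assigned to any cluster.
unassigned = findall(!is_clustered, labels)            # [3, 6]

# Cluster labels actually in use, ignoring the sentinel.
used = sort(unique(filter(is_clustered, labels)))      # [1, 2, 3]

# Point indices grouped by cluster label.
groups = Dict(k => findall(==(k), labels) for k in used)
```

With non-integer labels, a separate sentinel value (or `missing`) would be needed for the same purpose.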


At the moment the community seems split between cluster(algorithm, data) and cluster(data, algorithm) but the evidence so far shows that cluster(algorithm, data) is used in a larger part of the overall Julia ecosystem, so I’ll go with that unless substantial data are provided to favor the alternative.


That is a great suggestion, but also 100% orthogonal to the API discussion. Once the API is in place you should consider doing a PR that adds this functionality. Since you use it often enough, you are likely the most qualified!


@juliohm @sylvaticus @RoyiAvital @aplavin @alyst @ablaom thank you for participating in the discussion. Would you also mind voting on the 3 proposals for where and how to put this API into place (poll at the top post)? The community is 50-50 now.

The BetaML API has the advantage that the user can choose, when met with new data, to (a) continue the training using the new data by making another fit! (with the effect of adjusting the medoids/mixtures/representatives…) or (b) only “predict” the class assignments of the new data using the saved medoids/mixtures/representatives, without changing them.

This assumes that the clustering method used can utilize “previous training”. From my perspective, none of the methods I have used so far can do this. Take DBSCAN as an example: when faced with “new data” you have to recluster everything by merging the new and old data. Hence, this “retraining” option cannot be used as a basis for the low-level API, which should be usable with every single clustering algorithm.

It should be simple to provide a recluster function that takes in the output format and new data.

Remember that with my proposed design cluster(algorithm, data) does not return a vector of labels. It returns a result type with arbitrarily many fields. The function cluster_labels() takes that result and returns the labels. Hence, it should be rather trivial to define retype = recluster(out_type, new_data) for algorithms that have this possibility. We can make recluster part of the API if ML people think it is useful!
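A minimal sketch of this design, under stated assumptions: the abstract types, the toy `ThresholdSplit` algorithm, and the fields of its result type are all hypothetical, and `recluster` here simply merges old and new data and clusters from scratch, which is the fallback that works for any algorithm.

```julia
# Hypothetical sketch; type names and fields are illustrative only.
abstract type ClusteringAlgorithm end
abstract type ClusteringResults end

# Toy algorithm: split 1D data into two clusters at a threshold.
struct ThresholdSplit <: ClusteringAlgorithm
    threshold::Float64
end

# The result type can carry arbitrarily many fields.
struct ThresholdSplitResults <: ClusteringResults
    algorithm::ThresholdSplit
    data::Vector{Float64}
    labels::Vector{Int}
end

# cluster returns a result object, not a bare label vector.
function cluster(alg::ThresholdSplit, data::AbstractVector{<:Real})
    labels = [x < alg.threshold ? 1 : 2 for x in data]
    return ThresholdSplitResults(alg, collect(Float64, data), labels)
end

# Accessor that extracts the labels from the result object.
cluster_labels(res::ThresholdSplitResults) = res.labels

# Optional extension: merge old and new data, then cluster everything again.
function recluster(res::ThresholdSplitResults, new_data::AbstractVector{<:Real})
    return cluster(res.algorithm, vcat(res.data, new_data))
end

res = cluster(ThresholdSplit(0.5), [0.1, 0.9, 0.4])
labels = cluster_labels(res)          # [1, 2, 1]

res2 = recluster(res, [0.7, 0.2])
labels2 = cluster_labels(res2)        # [1, 2, 1, 2, 1]
```

An algorithm that can genuinely reuse earlier work could define a smarter `recluster` method for its own result type without changing the calling code.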

@Datseris I don’t have a strong opinion about where to house the repo but can arrange an invitation to either JuliaML or JuliaAI if needed.