It may be important context that, AFAIK, LearnAPI.jl is not intended for direct use by beginners, but rather as a Tables.jl-like abstract API for package developers.
From what I have seen in the past, these attempts to create large complex APIs for ML tend to freeze development of basic features for most users.
I'd focus on getting a simple cluster/clusterprob pair working nicely with multiple models, and only then try to generalize things for more advanced use cases.
The other thing missing in the existing clustering package is an optimal clustering assignment. As we all know, different clustering algorithms can give different numbers of clusters, and can also assign a single data point to different clusters.
Here is the doco for NbClust.
The R package NbClust has a really nice feature where it runs a bunch of clustering algorithms over the data. I think there are like 20 different methods. The NbClust package then spits out the optimal number of clusters and assignments, based on different voting methods for each clustering method. This is one of the functions that I use a lot in R, and that I keep having to use RCall to pull into Julia.
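To make the voting idea concrete, here is a minimal sketch of a simple majority vote over the number of clusters suggested by several methods. This is not NbClust's implementation; the function name `majority_vote_k` and the vote values are made up for illustration, and ties are broken in favor of the smaller number of clusters as an arbitrary choice.

```julia
# Hypothetical helper: pick the number of clusters that most methods agree on.
# `suggested_k` holds one suggested cluster count per method.
function majority_vote_k(suggested_k::AbstractVector{<:Integer})
    isempty(suggested_k) && throw(ArgumentError("need at least one vote"))
    counts = Dict{Int,Int}()
    for k in suggested_k
        counts[k] = get(counts, k, 0) + 1
    end
    # Visit candidates in increasing order so ties go to the smaller k.
    best_k, best_count = 0, 0
    for k in sort!(collect(keys(counts)))
        if counts[k] > best_count
            best_k, best_count = k, counts[k]
        end
    end
    return best_k, counts
end

votes = [3, 2, 3, 4, 3, 2]   # made-up votes, for illustration only
best_k, counts = majority_vote_k(votes)
```

A real implementation would also have to reconcile the label assignments of the different methods, which is harder than counting votes for the number of clusters.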
I plan to go ahead and finish this soon. It is only a matter of deciding how to do it.
Let's focus on the low-level API for now. At the moment nobody stops you from wrapping the raw output in a categorical array. I would also second what @juliohm said. The focus here is just the pure clustering API itself. It is better to prioritize getting this API right than to try to future-proof the design and cater to ML applications. ML applications can always use a good API, no matter how it is designed. But going the other way, using an ML-catered API for non-ML things, is certainly not as simple.
I think cluster_labels is a better name than cluster_indices. Furthermore, as was pointed out already, there is no reason to limit the labels to the integers. They could be anything. Although, since they could be anything, they might as well be the integers. Specific algorithms can provide a vector of anything to match the integers should they wish so. Personally I don't see the point yet.
The question is what to use for representing "unclustered" data points that the algorithm failed to assign. If using the positive integers for the labels, the integer -1 or 0 could be used for them. So this is a simple argument in favor of integers as labels.
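A small sketch of what the sentinel convention could look like in practice. It assumes positive integer labels with 0 marking unassigned points; the names `UNCLUSTERED` and `group_by_label` are hypothetical and the label vector is made up.

```julia
# Assumption: labels are positive integers and 0 marks an unassigned point.
UNCLUSTERED = 0

labels = [1, 2, 0, 1, 2, 2, 0]   # made-up output for seven data points

# Indices of the points the algorithm could not assign.
unassigned = findall(==(UNCLUSTERED), labels)

# Group point indices by cluster label, skipping unassigned points.
function group_by_label(labels::AbstractVector{<:Integer})
    groups = Dict{Int,Vector{Int}}()
    for (i, l) in pairs(labels)
        l == UNCLUSTERED && continue
        push!(get!(groups, l, Int[]), i)
    end
    return groups
end

groups = group_by_label(labels)
```

With arbitrary label types, each algorithm would need its own "missing" value, whereas a reserved integer works the same way for every algorithm.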
At the moment the community seems split between cluster(algorithm, data) and cluster(data, algorithm), but the evidence so far shows that cluster(algorithm, data) is used in a larger part of the overall Julia ecosystem, so I'll go with that unless substantial data are provided to favor the alternative.
That is a great suggestion, but also 100% orthogonal to the API discussion. Once the API is in place you should consider doing a PR that adds this functionality. Since you use it so often, you are likely the most qualified!
@juliohm @sylvaticus @RoyiAvital @aplavin @alyst @ablaom thank you for participating in the discussion. Would you also mind voting on the 3 proposals for where and how to put this API into place (poll at the top post)? The community is 50-50 now.
The BetaML API has the advantage that the user can choose, when met with new data, to (a) continue the training using the new data by calling fit! again (with the effect of adjusting the medoids/mixtures/representatives...) or (b) only "predict" the class assignments of the new data using the saved medoids/mixtures/representatives, without changing them.
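For illustration, option (b) can be sketched as a nearest-representative assignment. This is not the BetaML API; `assign_to_nearest` is a hypothetical helper, the data are made up, and squared Euclidean distance is assumed.

```julia
# Hypothetical helper, not the BetaML API: assign each new point to the
# nearest saved representative without modifying the representatives.
# Columns are points; rows are features.
function assign_to_nearest(representatives::AbstractMatrix{<:Real},
                           newdata::AbstractMatrix{<:Real})
    size(representatives, 1) == size(newdata, 1) ||
        throw(DimensionMismatch("feature dimensions differ"))
    return [argmin([sum(abs2, p .- r) for r in eachcol(representatives)])
            for p in eachcol(newdata)]
end

# Two saved representatives: (0, 0) and (10, 10).
reps = [0.0 10.0;
        0.0 10.0]
# Three new points: (1, 0.5), (9, 11) and (4, 4).
newdata = [1.0 9.0 4.0;
           0.5 11.0 4.0]
labels = assign_to_nearest(reps, newdata)
```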
This assumes that the clustering method used can utilize "previous training". From my perspective, none of the methods I have used so far can do this. DBSCAN, to take an example: when faced with "new data" you will have to recluster everything by merging new and old data. Hence, this "retraining" option cannot be used as a basis for the low-level API that should be usable with every single clustering algorithm.
It should be simple to provide a recluster function that takes in the output format and new data.
Remember, with my proposed design cluster(algorithm, data) does not return a vector of labels. It returns a result type with arbitrarily many fields. The function cluster_labels() takes the result type and returns the labels. Hence, it should be rather trivial to define retype = recluster(out_type, new_data) for algorithms that have this possibility. We can make recluster part of the API if ML people think it is useful!
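A rough sketch of how this design could fit together, using a toy algorithm. Every name here other than cluster, cluster_labels and recluster is hypothetical (`ClusteringAlgorithm`, `MeanSplit`, `MeanSplitResult`), and the recluster method simply merges old and new data and runs the algorithm again, which is only one possible strategy.

```julia
# All type names below are hypothetical illustrations of the proposed design.
abstract type ClusteringAlgorithm end

# Toy algorithm: split 1-D data at its mean.
struct MeanSplit <: ClusteringAlgorithm end

# Result type: free to carry arbitrarily many algorithm-specific fields.
struct MeanSplitResult
    labels::Vector{Int}
    data::Vector{Float64}
    threshold::Float64
end

# Algorithm first, data second; returns a result type, not a label vector.
function cluster(::MeanSplit, data::AbstractVector{<:Real})
    threshold = sum(data) / length(data)
    labels = [x <= threshold ? 1 : 2 for x in data]
    return MeanSplitResult(labels, collect(Float64, data), threshold)
end

# Accessor that every result type would implement.
cluster_labels(result::MeanSplitResult) = result.labels

# Optional extension: merge old and new data, then cluster from scratch.
function recluster(result::MeanSplitResult, newdata::AbstractVector{<:Real})
    return cluster(MeanSplit(), vcat(result.data, newdata))
end

result = cluster(MeanSplit(), [1.0, 2.0, 9.0, 10.0])
updated = recluster(result, [20.0, 21.0])
```

Algorithms that can genuinely reuse previous results could define a cheaper recluster method, while the rest either fall back to merging the data as above or do not define it at all.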
@Datseris I don't have a strong opinion about where to house the repo but can arrange an invitation to either JuliaML or JuliaAI if needed.