I am looking for a clustering algorithm (probably density-based) that supports custom distances (I have my own non-Euclidean distance metric) and scales to very large datasets. I can evaluate nearest neighbors quickly, but I cannot form the entire pairwise distance matrix, as it would be too large to fit in memory. Is anyone aware of such a clustering algorithm?
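For concreteness, this is the kind of neighbor query I can already do cheaply, sketched with NearestNeighbors.jl; `MyMetric` is just a placeholder for my real metric, which I assume can be wrapped as a `Distances.Metric`:

```julia
using Distances, NearestNeighbors

# Placeholder for the custom metric: any callable subtype of
# Distances.Metric can back a BallTree (unlike a KDTree, which
# requires a Minkowski-type metric).
struct MyMetric <: Metric end
(::MyMetric)(a, b) = sum(abs, a .- b)  # stand-in; the real one is non-Euclidean

X = rand(5, 100_000)                     # columns are points
tree = BallTree(X, MyMetric())           # no pairwise matrix is ever formed
neighbors = inrange(tree, X[:, 1], 0.1)  # neighbors of point 1 within radius 0.1
idxs, dists = knn(tree, X[:, 1], 10)     # 10 nearest neighbors of point 1
```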
I am currently considering:
- Markov Clustering Algorithm: This requires converting the partial distance matrix to a similarity matrix, which typically means choosing a length-scale, something I would rather avoid (see the kernel sketch after this list).
- Using the new and exciting DefaultArrays.jl to represent the full distance matrix, with the missing distances assigned a very large default value. Even if I can represent the matrix this way, I'm not sure algorithms that consume the full distance matrix will scale well. Any hope that Clustering.hclust will work in this case? (A lazy-matrix sketch follows the list.)
- Doing an initial clustering with, e.g., K-means with a large number of clusters (which I can do approximately for the custom metric), forming averages under the custom metric for each cluster, and then performing density-based clustering of these first-level clusters (see the two-stage sketch below).
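On the first option, the conversion I would have to do is something like a Gaussian kernel, and its bandwidth is exactly the length-scale I would rather not pick:

```julia
D = rand(100, 100)              # stand-in for a (partial) distance matrix
σ = 0.1                         # the length-scale choice I want to avoid
S = exp.(-(D .^ 2) ./ (2σ^2))   # Gaussian-kernel similarities that MCL would consume
```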
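On the second option, the same idea can be mocked up without the package as a lazy `AbstractMatrix` (this is just a stand-in for what I understand DefaultArrays.jl provides; I have not checked its exact API). My worry stands either way: `hclust` accepts any `AbstractMatrix`, but it still does at least O(n²) work over the entries, so a lazy representation may only fix the memory half of the problem:

```julia
# Lazy symmetric distance matrix: known entries live in a Dict keyed by
# (i, j) with i < j; every other entry reads as a large default value.
struct PartialDistMatrix{T} <: AbstractMatrix{T}
    known::Dict{Tuple{Int,Int},T}
    n::Int
    default::T
end
Base.size(D::PartialDistMatrix) = (D.n, D.n)
function Base.getindex(D::PartialDistMatrix, i::Int, j::Int)
    i == j && return zero(D.default)
    get(D.known, minmax(i, j), D.default)
end

D = PartialDistMatrix(Dict((1, 2) => 0.3), 4, 1e6)
D[1, 2], D[2, 1], D[3, 4]  # (0.3, 0.3, 1.0e6)
```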
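And on the third option, the two-stage pipeline I have in mind, reusing `MyMetric` from the first sketch. Here `my_approximate_kmeans` is a hypothetical stand-in for the approximate K-means I can already run, and I am assuming a recent Clustering.jl where `dbscan` accepts a precomputed distance matrix via `metric=nothing`:

```julia
using Clustering, Distances

# Hypothetical stand-in for stage 1: approximate K-means under the custom
# metric, returning k representatives (as columns) plus a coarse label per point.
my_approximate_kmeans(X, k) = (X[:, 1:k], rand(1:k, size(X, 2)))  # dummy stub

X = rand(5, 100_000)
centers, coarse = my_approximate_kmeans(X, 1_000)    # k = 1000 ≪ n

# Stage 2: with only k representatives, the full pairwise matrix is cheap.
Dk = pairwise(MyMetric(), centers, dims=2)           # k × k, fits in memory
res = dbscan(Dk, 0.1; metric=nothing, min_neighbors=3)
labels = res.assignments[coarse]   # propagate cluster ids to all n points (0 = noise)
```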