Density-based clustering with incomplete distance matrix

baggepinnen · March 7, 2020, 4:12am

I am looking for a clustering algorithm (probably density based) that works with custom distances (I have my own non-Euclidean distance metric) and can work on very large datasets. I can evaluate nearest neighbors quickly, but am not able to form the entire pairwise distance matrix as that would be too large to fit in memory. Is anyone aware of such a clustering algorithm?

I am currently considering:

Markov Clustering Algorithm: This requires conversion of the partial distance matrix to a similarity matrix, and thus typically selecting a length-scale, which I do not really want to do.
Using the new and exciting DefaultArrays.jl to represent the full distance matrix where the missing distances are just assigned a very large value. I’m not sure if algorithms that use the full distance matrix will scale well in this case, even if I’m able to represent the matrix. Any hope that Clustering.hclust will work in this case?
Doing an initial clustering with, e.g., K-means with a large number of clusters (which I can do approximately for the custom metric), form averages under the custom metric for each cluster and then perform density-based clustering of these first-level clusters.

Topic		Replies	Views
Dbscan clustering with distance matrix General Usage clustering	6	2118	December 7, 2021
Distance matrix + clustering with custom distance function New to Julia question , distances	6	1597	December 18, 2020
Speeding up findmin for a matrix for hierarchical clustering Machine Learning	4	585	June 11, 2020
DimensionMismatch: The size of a distance matrix ((1, 440)) doesn't match the length of assignment vector (440) Machine Learning clustering	3	253	November 25, 2022
DBSCAN using Hamming Distance Metric Statistics question	0	701	October 7, 2019

Density-based clustering with incomplete distance matrix

Related topics