I’ve written a hierarchical clustering algorithm and I would like to speed it up if possible. It seems that my limiting step is
D is my dissimilarity matrix. This is a
LowerTriangular matrix (since the order of the pairs doesn’t matter, i.e.
D[i,j] == D[j,i].
I was thinking it might help to convert this into a Vector, sort it and then do
findmax. However, two issues:
- How do I get back to my matrix from the vector?
- I update
D after each cluster merge. How can I insert the new dissimilarity values into the sorted vector representation of
D in an efficient way?
Any thoughts would be very much welcome!
PS I know there is a clustering package for Julia, but I opted to write my own because I also need to implement a particular type of hierarchical clustering where you can only merge adjacent clusters. It seemed a bit too involved to do this by extending the clustering package.
Sorting is O(n \log(n)) whereas
findmax is O(n) , so I doubt that’ll be any faster.
You may want to have a look at heaps if you want a datastructure that maintains easy access to an extreme value.
Hmmm, good point. I’ll try and rewrite my code with heaps in that case.
There’s not just one, there now the awesome new parallel clustering package: ParallelKMeans.jl/index.md at master · PyDataBlog/ParallelKMeans.jl · GitHub that I discovered in the (then latest, June now popped in) Julia newsletter ("‘orders of magnitude’ faster than Python’s scikit-learn, R and Clustering.jl").
that you could use with hierarchical, but I found something with better time complexity. “Ultrametric [Baire space]” was new to me (and unknown to the professor teaching clustering/data mining; and also “generalized ultrametric [spaces]”).
In this work a novel distance called the Baire distance is presented.
We show how this distance can be used to generate clusters in a way that is computationally inexpensive when compared with more traditional techniques. This approach therefore makes a good candidate for exploratory data analysis when data sets are very big or in cases where the dimensionality is large
What are you implementing? I see also recent from 2018: https://arxiv.org/pdf/1806.03432.pdf
This in the package above seemed the best (almost all the algorithms there new to me):
and also interesting:
Full Implementation of Triangle inequality based on Elkan - 2003 Using the Triangle Inequality to Accelerate K-Means".
Cool! I’m basically implementing the two hierarchical clustering algorithms that you can find here:
 S. Pineda and J. M. Morales, “Chronological time-period clustering for optimal capacity expansion planning with storage,” IEEE Trans. Power Syst., vol. 33, no. 6, pp. 7162–7170, 2018.
I’m not sure I want to do anything fancier - the clustering is only a means to an end for me.