Speeding up findmin for a matrix for hierarchical clustering

gobs · June 9, 2020, 8:40am

I’ve written a hierarchical clustering algorithm and I would like to speed it up if possible. It seems that my limiting step is findmax(D) where D is my dissimilarity matrix. This is a LowerTriangular matrix (since the order of the pairs doesn’t matter, i.e. D[i,j] == D[j,i].

I was thinking it might help to convert this into a Vector, sort it and then do findmax. However, two issues:

How do I get back to my matrix from the vector?
I update D after each cluster merge. How can I insert the new dissimilarity values into the sorted vector representation of D in an efficient way?

Any thoughts would be very much welcome!

PS I know there is a clustering package for Julia, but I opted to write my own because I also need to implement a particular type of hierarchical clustering where you can only merge adjacent clusters. It seemed a bit too involved to do this by extending the clustering package.

baggepinnen · June 9, 2020, 10:31am

Sorting is O(n \log(n)) whereas findmax is O(n) , so I doubt that’ll be any faster.

You may want to have a look at heaps if you want a datastructure that maintains easy access to an extreme value.

gobs · June 9, 2020, 12:15pm

Hmmm, good point. I’ll try and rewrite my code with heaps in that case.

Palli · June 10, 2020, 4:25pm

There’s not just one, there now the awesome new parallel clustering package: https://github.com/PyDataBlog/ParallelKMeans.jl/blob/master/docs/src/index.md that I discovered in the (then latest, June now popped in) Julia newsletter (“‘orders of magnitude’ faster than Python’s scikit-learn, R and Clustering.jl”).

that you could use with hierarchical, but I found something with better time complexity. “Ultrametric [Baire space]” was new to me (and unknown to the professor teaching clustering/data mining; and also “generalized ultrametric [spaces]”).

In this work a novel distance called the Baire distance is presented.
We show how this distance can be used to generate clusters in a way that is computationally inexpensive when compared with more traditional techniques. This approach therefore makes a good candidate for exploratory data analysis when data sets are very big or in cases where the dimensionality is large

What are you implementing? I see also recent from 2018: https://arxiv.org/pdf/1806.03432.pdf

This in the package above seemed the best (almost all the algorithms there new to me):

Yinyang K-Means

and also interesting:
Full Implementation of Triangle inequality based on Elkan - 2003 Using the Triangle Inequality to Accelerate K-Means".

gobs · June 11, 2020, 8:02am

Cool! I’m basically implementing the two hierarchical clustering algorithms that you can find here:

[1] S. Pineda and J. M. Morales, “Chronological time-period clustering for optimal capacity expansion planning with storage,” IEEE Trans. Power Syst., vol. 33, no. 6, pp. 7162–7170, 2018.

I’m not sure I want to do anything fancier - the clustering is only a means to an end for me.

Topic		Replies	Views
Find the max element of a numeric vector iteratively with mask and early break Performance question , algorithm	9	205	June 27, 2025
Density-based clustering with incomplete distance matrix Machine Learning big-data , clustering	0	945	March 7, 2020
Is it possible to make this function faster? Performance question	11	763	May 7, 2020
Optimize code by parallelization/GPU Performance	8	564	October 12, 2022
Speed up parallel maximum across columns Performance parallel , distributed , loops	1	406	August 18, 2020

Speeding up findmin for a matrix for hierarchical clustering

Related topics