# Speeding up findmin for a matrix for hierarchical clustering

I’ve written a hierarchical clustering algorithm and would like to speed it up if possible. It seems that my limiting step is `findmax(D)`, where `D` is my dissimilarity matrix. This is a `LowerTriangular` matrix (since the order of the pairs doesn’t matter, i.e. `D[i,j] == D[j,i]`).
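To make the step concrete, here is a minimal sketch of what that search looks like (`findmax_lower` is a hypothetical helper, not my actual code): it scans only the strict lower triangle, so the structural zeros in the upper part are never touched.

```julia
# Hypothetical sketch: scan only the strict lower triangle of D,
# skipping the structural zeros in the upper part.
function findmax_lower(D::AbstractMatrix)
    best = -Inf
    besti, bestj = 0, 0
    n = size(D, 1)
    for j in 1:n-1, i in j+1:n   # column-major order, fast for Julia arrays
        if D[i, j] > best
            best = D[i, j]
            besti, bestj = i, j
        end
    end
    return best, (besti, bestj)
end
```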

I was thinking it might help to convert this into a `Vector`, sort it, and then do `findmax`. However, there are two issues:

1. How do I get back to my matrix from the vector?
2. I update `D` after each cluster merge. How can I insert the new dissimilarity values into the sorted vector representation of `D` in an efficient way?
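To illustrate what I mean by "getting back to the matrix": before any sorting, a linear index into `vec(D)` can be mapped back to matrix coordinates with `CartesianIndices`, but sorting destroys that correspondence.

```julia
# Round trip between a linear index and matrix coordinates
# (this only works on the *unsorted* vector).
D = [0.0 0.0 0.0;
     1.5 0.0 0.0;
     2.5 0.7 0.0]
val, k = findmax(vec(D))              # k is a linear index into D
i, j = Tuple(CartesianIndices(D)[k])  # map it back to (row, column)
```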

Any thoughts would be very much welcome!

PS I know there is a clustering package for Julia, but I opted to write my own because I also need to implement a particular type of hierarchical clustering where you can only merge adjacent clusters. It seemed a bit too involved to do this by extending the clustering package.

Sorting is O(n log n) whereas `findmax` is O(n), so I doubt that’ll be any faster.

You may want to have a look at heaps if you want a data structure that maintains easy access to an extreme value.
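A sketch of how that could look, assuming DataStructures.jl is installed (the `handles` dictionary is my own illustrative bookkeeping, not part of the package): a mutable heap keeps the extreme value at the top in O(1), and each update after a cluster merge costs O(log n).

```julia
using DataStructures  # provides mutable binary heaps with stable handles

# Store each pairwise dissimilarity in a max-heap; remember the handle
# for each pair so its value can be revised after a merge.
h = MutableBinaryMaxHeap{Float64}()
handles = Dict{Tuple{Int,Int},Int}()   # (i, j) => heap handle

D = [0.0 0.0 0.0;
     1.5 0.0 0.0;
     2.5 0.7 0.0]
for j in 1:2, i in j+1:3
    handles[(i, j)] = push!(h, D[i, j])
end

val, hd = top_with_handle(h)      # current maximum without a full scan
update!(h, handles[(3, 1)], 0.1)  # after a merge, revise a pair's value
first(h)                          # new maximum, again O(1)
```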

Hmmm, good point. I’ll try to rewrite my code with heaps in that case.

There’s not just one: there’s now also the awesome new parallel clustering package ParallelKMeans.jl (ParallelKMeans.jl/index.md at master · PyDataBlog/ParallelKMeans.jl · GitHub), which I discovered in the (then latest; the June issue has now popped up) Julia newsletter (“‘orders of magnitude’ faster than Python’s scikit-learn, R, and Clustering.jl”).

You could use that with hierarchical clustering, but I found something with better time complexity. “Ultrametric [Baire space]” was new to me (and unknown to the professor teaching clustering/data mining), as was “generalized ultrametric [spaces]”:

> In this work a novel distance called the Baire distance is presented. We show how this distance can be used to generate clusters in a way that is computationally inexpensive when compared with more traditional techniques. This approach therefore makes a good candidate for exploratory data analysis when data sets are very big or in cases where the dimensionality is large.