I’ve written a hierarchical clustering algorithm and I would like to speed it up if possible. It seems that my limiting step is findmax(D) where D is my dissimilarity matrix. This is a LowerTriangular matrix (since the order of the pairs doesn’t matter, i.e. D[i,j] == D[j,i].
I was thinking it might help to convert this into a Vector, sort it and then do findmax. However, two issues:
How do I get back to my matrix from the vector?
I update D after each cluster merge. How can I insert the new dissimilarity values into the sorted vector representation of D in an efficient way?
Any thoughts would be very much welcome!
PS I know there is a clustering package for Julia, but I opted to write my own because I also need to implement a particular type of hierarchical clustering where you can only merge adjacent clusters. It seemed a bit too involved to do this by extending the clustering package.
There’s not just one, there now the awesome new parallel clustering package: https://github.com/PyDataBlog/ParallelKMeans.jl/blob/master/docs/src/index.md that I discovered in the (then latest, June now popped in) Julia newsletter (“‘orders of magnitude’ faster than Python’s scikit-learn, R and Clustering.jl”).
that you could use with hierarchical, but I found something with better time complexity. “Ultrametric [Baire space]” was new to me (and unknown to the professor teaching clustering/data mining; and also “generalized ultrametric [spaces]”).
In this work a novel distance called the Baire distance is presented.
We show how this distance can be used to generate clusters in a way that is computationally inexpensive when compared with more traditional techniques. This approach therefore makes a good candidate for exploratory data analysis when data sets are very big or in cases where the dimensionality is large
Cool! I’m basically implementing the two hierarchical clustering algorithms that you can find here:
[1] S. Pineda and J. M. Morales, “Chronological time-period clustering for optimal capacity expansion planning with storage,” IEEE Trans. Power Syst., vol. 33, no. 6, pp. 7162–7170, 2018.
I’m not sure I want to do anything fancier - the clustering is only a means to an end for me.