How to find the elbow of a graph in Julia

Hi how’s it going?

A common technique for finding the optimal number of clusters K in K-means clustering is the elbow method: you try different values of K, plot the average Euclidean distance from the points to their cluster centers, and find the location on the graph that resembles an “elbow”.
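That procedure can be sketched in a few lines of Julia. This is only an illustration, assuming the Clustering.jl package, whose `totalcost` field is the total squared distance from points to their assigned centers:

```julia
# Sketch of the elbow procedure, assuming the Clustering.jl package.
using Clustering
using Random

Random.seed!(1)
X = randn(2, 200)                # 200 two-dimensional points, one per column
ks = 2:8
# `totalcost` is the total squared distance from points to their centers;
# dividing by the number of points gives an average per point.
costs = [kmeans(X, k).totalcost / size(X, 2) for k in ks]
# Plot `ks` against `costs` and look for the bend ("elbow") in the curve.
```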

In Python there are some quick packages that implement this, and in Julia I’ve reached a point (no pun intended) where I have the elbow graphs implemented. Here is an image of one.

How would you recommend I extract the X-axis value at the elbow of this graph (in this case X=2) as the optimal number for K?

[Screenshot of the elbow plot]

Compute the distance to the line made up of the first and last points of your broken line. The elbow point is the furthest one.
Example with GMT.jl

using GMT
xy = [1.0 9.0; 2.9 6.1; 4.6 3.5; 7.0 2.3; 9.1 1.3];     # Example poly-line

mapproject(xy, dist2line=(line=[xy[1:1, :]; xy[end:end, :]], unit=:cartesian))
Array{GMT.GMTdataset,1} with 1 segments
First segment DATA
5×5 Array{Float64,2}:
 1.0  9.0  0.0       1.0      9.0
 2.9  6.1  0.79278   3.44621  6.67459
 4.6  3.5  1.50592   5.63756  4.59146
 7.0  2.3  0.722092  7.49751  2.82335
 9.1  1.3  0.0       9.1      1.3

The third column holds the distance to the line. The elbow is the third point.
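For anyone who wants the same computation without GMT, here is a minimal plain-Julia sketch of the point-to-line distance (the helper name `elbow_index` is made up for this example):

```julia
# Distance from each point to the straight line through the first and
# last points; the elbow is the point with the largest distance.
function elbow_index(xy::AbstractMatrix)
    p1 = xy[1, :]
    p2 = xy[end, :]
    d = p2 - p1
    # Perpendicular distance from p to the line through p1 and p2:
    # |cross(p2 - p1, p - p1)| / norm(p2 - p1)
    dists = [abs(d[1] * (p[2] - p1[2]) - d[2] * (p[1] - p1[1])) / hypot(d[1], d[2])
             for p in eachrow(xy)]
    argmax(dists)
end

xy = [1.0 9.0; 2.9 6.1; 4.6 3.5; 7.0 2.3; 9.1 1.3]
elbow_index(xy)   # → 3, matching the GMT result above
```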

Sorry for the unintuitive unit=:cartesian keyword. It will become cartesian=true in the next version.


Thank you! Do you know of any other packages or functions within Plots to do these calculations? I’m having a difficult time getting GMT up and running. I totally get the formula, it makes a lot of sense; I just want to know if there are any simpler implementations.

You could look at although it looks like it might be a bit more than you need.


Does anyone have good references for standard algorithms for this problem? It comes up a lot also when using PCA/SVD to identify a subspace in noisy data. I haven’t needed to automatically determine the dimensionality from the data (I often know it a priori) but it would be nice to have a good place to start looking.

Finding the point furthest from the line through the first and last points misses the purpose of the elbow method. I would warn against using this for anything other than finding the elbow point of two straight lines.

As per my limited understanding, the elbow method is about identifying the last number of clusters that leads to a significant (per subjective judgement) reduction in the sum of squared distances to the cluster centers. As you can see here, the author identifies 3 as the elbow point, not 2, which is at a sharper bend and furthest away from the above-mentioned line.

@Julia1 I am curious which Python packages you have used for this, and how their decision algorithms judge which point is the elbow.

If you had plotted my example data you would have seen that the 3rd point is obviously the furthest point. But note that my response was only based on the geometric question posted by the OP, and for that it works pretty well:

xy = [1.0 9.0; 2.9 6.1; 4.6 3.5; 7.0 2.3; 9.1 1.3];
plot(xy, savefig="elbow.png", lw=0.5, marker=:circle, ms=0.2, mc=:red, show=1)

Apologies for the confusion; by ‘the author’ I was referring to the author of the article I linked to.

Indeed, the code may work well for the present geometric question; I only wanted to caution against using this solution to determine the elbow point in the general case.


Can you please open an issue in GMT.jl? I know that on Unix, when GMT is not installed in a standard location, the OS fails to find the shared libraries.

Ok, I see your point and you are right. Perhaps compute the slopes and find the maximum variation? Or some smoothed variant of it if there are outliers.
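That slope-variation idea can be sketched like this (a rough illustration, not a robust implementation; the helper name `max_slope_change` is made up):

```julia
# Pick the interior point where the slope of the broken line changes
# the most, i.e. the largest absolute difference between consecutive
# segment slopes.
function max_slope_change(xy::AbstractMatrix)
    x, y = xy[:, 1], xy[:, 2]
    slopes  = diff(y) ./ diff(x)   # slope of each segment
    changes = abs.(diff(slopes))   # slope variation at each interior point
    argmax(changes) + 1            # change i happens at point i + 1
end

xy = [1.0 9.0; 2.9 6.1; 4.6 3.5; 7.0 2.3; 9.1 1.3]
max_slope_change(xy)   # → 3 for this example
```

For noisy data one would want to smooth the slopes first, as suggested above.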

I used the Yellowbrick Package. Here is the code.


Yes I’ll do that. That’s the exact error I’m getting.

Thanks for the link! What you can also try is to use the same “knee point detection” algorithm that Yellowbrick is using and call it from Julia via PyCall.jl.
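A sketch of that approach, assuming the standalone Python package `kneed` (an implementation of the Kneedle knee-detection algorithm) is installed in PyCall’s Python environment:

```julia
# Call a Python knee-point detector from Julia via PyCall.
# Assumes the Python package `kneed` is available.
using PyCall

kneed = pyimport("kneed")
x = 1:5
y = [9.0, 6.1, 3.5, 2.3, 1.3]   # e.g. average within-cluster distances per K
# An elbow curve like this is convex and decreasing.
kl = kneed.KneeLocator(x, y, curve="convex", direction="decreasing")
kl.knee                          # estimated elbow x-value
```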