Using Cosine Similarity for KMeans clustering

I am working on an implementation of k-means clustering in Julia.

The task reads: figure out and implement a modification of k-means that instead measures similarity by the angle between vectors.

I assumed that one could use cosine similarity for this. My code already works for regular k-means, where I calculate the squared Euclidean distance like this:

Distances[:,i] = sum((X.-C[[i],:]).^2, dims=2) # C holds the centers; column i gets each point's squared distance to the i-th center

I tried to do the same with cosine similarity, like this:

Distances[:, i] = sum(1 .- ((X*C[[i], :]).^2 /(sum(X.^2, dims=2).*(C[[i],:]'*C[[i],:]))))

But this does not seem to work.

Alternatively, I tried:

clust_center = C[[i], :]
curr_dist = 0
for currx = 1:size(X)[1]
    curr_dist = curr_dist+ evaluate(CosineDist(), X[currx, :], clust_center)
end
Distances[:,i] = curr_dist

But then I get the error:

ArgumentError: indexed assignment with a single value to many locations is not supported; perhaps use broadcasting `.=` instead?

Have I misunderstood the question, or am I implementing it wrong?

You might want to look at the code in Distances.jl for cosine distance: https://github.com/JuliaStats/Distances.jl/blob/fa867d59098dd848fd71bc48005a1bf858928a47/src/metrics.jl#L399-L412
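As a rough illustration of the same idea without the package, here is a minimal sketch of the cosine distance (one minus the cosine of the angle) between every row of X and one center. The helper name cosine_dist_to_center and the small X and C matrices are made up for this example, and the result matrix is called D rather than Distances so it does not clash with the package name. It assumes no row or center is the zero vector; a zero vector would give NaN.

```julia
using LinearAlgebra

# Hypothetical helper (not part of Distances.jl): cosine distance between
# each row of X (n×d) and a single center vector c (length d).
function cosine_dist_to_center(X::AbstractMatrix, c::AbstractVector)
    dots = X * c                                  # dot product of every row with c
    xnorms = sqrt.(vec(sum(abs2, X, dims=2)))     # Euclidean norm of every row
    return 1 .- dots ./ (xnorms .* norm(c))       # 1 - cos(angle), one value per row
end

# Made-up data: 3 points, 2 centers, 2 dimensions.
X = [1.0 0.0; 0.0 2.0; 3.0 3.0]
C = [1.0 0.0; 0.0 1.0]

D = zeros(size(X, 1), size(C, 1))
for i in 1:size(C, 1)
    # C[i, :] is a plain vector, so the right-hand side has one entry per point.
    D[:, i] = cosine_dist_to_center(X, C[i, :])
end
```

Each column of D then holds one distance per point, which is the same shape the squared Euclidean version produces.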

Also this:

Distances[:, i] = sum(1 .- ((X*C[[i], :]).^2 /(sum(X.^2, dims=2).*(C[[i],:]'*C[[i],:]))))

Use .= here, and maybe check the docs for broadcasting.
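To make the difference concrete, here is a small sketch with made-up data (X, c, D and total are illustrative names only). Assigning a scalar to a whole column with = raises the ArgumentError quoted in the question, while .= broadcasts it. Note that broadcasting fills the column with one repeated value, so for k-means it is probably more useful to store one distance per point, as the loop at the end does.

```julia
using LinearAlgebra

# Made-up data for illustration.
X = [1.0 0.0; 0.0 2.0; 3.0 3.0]
c = [1.0, 0.0]
D = zeros(size(X, 1), 2)

total = 1.5             # a scalar, e.g. what sum(...) returns without dims
# D[:, 2] = total       # would throw: indexed assignment with a single value to many locations
D[:, 2] .= total        # broadcasting writes the same scalar into every row of column 2

# Storing one cosine distance per point instead of a single accumulated sum:
for r in 1:size(X, 1)
    x = X[r, :]
    D[r, 1] = 1 - dot(x, c) / (norm(x) * norm(c))
end
```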

Thank you! I had been looking through the Distances.jl GitHub repository, but I was unable to find the code for cosine distance. That was my fault. I managed to get it to work after taking a look at the code.