Julia : Cosine Similarity algorithm implementation

vishwanath · June 5, 2018, 12:41pm

As a beginner, I am trying to implement the cosine similarity algorithm in Julia. Below is a piece of code that I got from one of the resources. I just want to understand the flow of the code.

    # Cosine dist
    a = [2, 0, 1, 1, 0, 2, 1, 1]
    b = [2, 1, 1, 0, 1, 1, 1, 1]

    # Initial accumulators: sum of a.*b, sum of a.^2, sum of b.^2
    @inline function eval_start(::CosineDist, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Real}
        zero(T), zero(T), zero(T)
    end

    # Contribution of one pair of elements to the three sums
    @inline eval_op(::CosineDist, ai, bi) = ai * bi, ai * ai, bi * bi

    # Combine two partial results by adding them component-wise
    @inline function eval_reduce(::CosineDist, s1, s2)
        a1, b1, c1 = s1
        a2, b2, c2 = s2
        return a1 + a2, b1 + b2, c1 + c2
    end

    # Turn the three sums into the distance, clamped at zero
    function eval_end(::CosineDist, s)
        ab, a2, b2 = s
        max(1 - ab / (sqrt(a2) * sqrt(b2)), zero(eltype(ab)))
    end

    cosine_dist(a::AbstractArray, b::AbstractArray) = evaluate(CosineDist(), a, b)

But we can do the same with the built-in function:

    cosine_dist(a, b)

How does this scale? The data may contain crores (tens of millions) of records. In that case, what would be the most efficient way to implement it?
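To make the flow of the quoted functions concrete, here is a minimal self-contained sketch. The `MyCosineDist` type and the `my_evaluate` driver are hypothetical stand-ins, not the Distances.jl internals; the sketch only assumes that the library's generic `evaluate` makes one pass over the elements and calls the four hooks in this order.

```julia
# Hypothetical stand-in for the library's distance type.
struct MyCosineDist end

# Start with three zero accumulators: sum(a .* b), sum(a .^ 2), sum(b .^ 2).
eval_start(::MyCosineDist, a::AbstractArray{T}, b::AbstractArray{T}) where {T<:Real} =
    (zero(T), zero(T), zero(T))

# Contribution of one pair of elements to each accumulator.
eval_op(::MyCosineDist, ai, bi) = (ai * bi, ai * ai, bi * bi)

# Add one contribution to the running totals.
eval_reduce(::MyCosineDist, s1, s2) = (s1[1] + s2[1], s1[2] + s2[2], s1[3] + s2[3])

# Turn the totals into a distance; max clamps tiny negative rounding errors.
function eval_end(::MyCosineDist, s)
    ab, a2, b2 = s
    return max(1 - ab / (sqrt(a2) * sqrt(b2)), zero(ab))
end

# Assumed driver: a single loop over both vectors, calling the hooks in order.
function my_evaluate(d::MyCosineDist, a::AbstractArray, b::AbstractArray)
    length(a) == length(b) || throw(DimensionMismatch("a and b must have the same length"))
    s = eval_start(d, a, b)
    for i in eachindex(a, b)
        s = eval_reduce(d, s, eval_op(d, a[i], b[i]))
    end
    return eval_end(d, s)
end

a = [2, 0, 1, 1, 0, 2, 1, 1]
b = [2, 1, 1, 0, 1, 1, 1, 1]
dist = my_evaluate(MyCosineDist(), a, b)
```

For these two vectors the three sums are 9, 12 and 10, so the result is 1 - 9 / sqrt(120). The work is a single pass with three accumulators, so the cost grows linearly with the vector length.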

fredrikekre · June 5, 2018, 12:43pm

If you post the same question on multiple forums it is good to cross-reference them: https://stackoverflow.com/questions/50699048/julia-cosine-similarity-algorithm-implementation

vishwanath · June 5, 2018, 1:01pm

Okay. My intention is to get help from the online community.

foobar_lv2 · June 5, 2018, 1:32pm

Unless your vectors are tiny, you are probably limited by memory bandwidth. In other words, look at Distances.jl and figure out how to cast your problem (e.g. results[i,j] = dist(A[:,i], B[:,j])) as a matrix multiply (because BLAS authors know your machine better than you do).
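One way to read the matrix-multiply suggestion, as a sketch only: store each record as a column, compute all dot products with a single `A' * B`, and divide by the column norms. The helper name `pairwise_cosine_dist` is hypothetical, and the code assumes dense matrices with no all-zero columns.

```julia
using LinearAlgebra

# Hypothetical helper: cosine distance between every column of A and every column of B.
function pairwise_cosine_dist(A::AbstractMatrix{<:Real}, B::AbstractMatrix{<:Real})
    size(A, 1) == size(B, 1) || throw(DimensionMismatch("A and B need the same number of rows"))
    # Column norms, computed once per matrix (assumes no all-zero columns).
    na = [norm(col) for col in eachcol(A)]
    nb = [norm(col) for col in eachcol(B)]
    # All dot products at once; for Float64 matrices this is a BLAS matrix multiply.
    G = A' * B
    # 1 - cos(angle) for every pair, clamped at zero against rounding error.
    return max.(1 .- G ./ (na .* nb'), 0)
end

# The two vectors from the question, stored as the columns of A.
A = Float64[2 2; 0 1; 1 1; 1 0; 0 1; 2 1; 1 1; 1 1]
R = pairwise_cosine_dist(A, A)
```

Here `R[i, j]` is the distance between column `i` and column `j`. The full result has one entry per pair, so for very many records it would have to be computed in blocks of columns rather than all at once.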

For example, if you are very lazy you can normalize your data, compute Euclidean distances and get your cosine dissimilarity from that. If you are less lazy, then you work with the squared Euclidean distance directly (no reason to take the square root and then square it again).
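The normalization trick rests on an identity for unit vectors: if `u` and `v` have length 1, then `sum((u - v).^2) = 2 - 2 * dot(u, v)`, so half the squared Euclidean distance equals the cosine distance. A small check of that identity on the vectors from the question (variable names are illustrative only):

```julia
using LinearAlgebra

a = Float64[2, 0, 1, 1, 0, 2, 1, 1]
b = Float64[2, 1, 1, 0, 1, 1, 1, 1]

# Scale both vectors to unit length.
u = a / norm(a)
v = b / norm(b)

# Cosine distance computed directly from its definition.
direct = 1 - dot(a, b) / (norm(a) * norm(b))

# Half the squared Euclidean distance between the normalized vectors;
# no square root is taken at any point after normalization.
via_sqeuclidean = sum(abs2, u - v) / 2
```

Both `direct` and `via_sqeuclidean` agree up to floating-point rounding.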
