Julia : Cosine Similarity algorithm implementation

question

#1

As a beginner, I am trying to implement the cosine similarity algorithm in Julia. Below is a piece of code that I got from one of the resources. I just want to understand the flow of the code.

    # Cosine dist
    # a: [2, 0, 1, 1, 0, 2, 1, 1]
    # b: [2, 1, 1, 0, 1, 1, 1, 1]

    @inline function eval_start(::CosineDist, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Real}
        zero(T), zero(T), zero(T)
    end

    @inline eval_op(::CosineDist, ai, bi) = ai * bi, ai * ai, bi * bi
    @inline function eval_reduce(::CosineDist, s1, s2)
        a1, b1, c1 = s1
        a2, b2, c2 = s2
        return a1 + a2, b1 + b2, c1 + c2
    end

    function eval_end(::CosineDist, s)
        ab, a2, b2 = s
        max(1 - ab / (sqrt(a2) * sqrt(b2)), zero(eltype(ab)))
    end
    cosine_dist(a::AbstractArray, b::AbstractArray) = evaluate(CosineDist(), a, b)

But we can do the same thing with the built-in function:

    cosine_dist(a, b)

What about scalability? The data may contain crores (tens of millions) of records. In that case, what would be the most efficient way to implement this?


#2

If you post the same question on multiple forums it is good to cross-reference them: https://stackoverflow.com/questions/50699048/julia-cosine-similarity-algorithm-implementation


#3

Okay. My intention is to get help from the online community.


#4

Unless your vectors are tiny, you are probably limited by memory bandwidth. In other words, look at Distances.jl and figure out how to cast your problem (e.g. results[i,j] = dist(A[:,i], B[:,j])) as a matrix multiplication (because BLAS authors know your machine better than you do).
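As a rough sketch of what "cast it as a matrix multiply" could look like: the helper below is hypothetical (the name `pairwise_cosine_dist` is made up for illustration and is not the Distances.jl API), it uses only the LinearAlgebra standard library, and it assumes one vector per column with no all-zero columns.

```julia
using LinearAlgebra

# Hypothetical helper (not part of Distances.jl): cosine distance between
# every column of A and every column of B.
function pairwise_cosine_dist(A::AbstractMatrix{<:Real}, B::AbstractMatrix{<:Real})
    # Euclidean norm of each column; assumes no column is all zeros.
    na = [norm(c) for c in eachcol(A)]
    nb = [norm(c) for c in eachcol(B)]
    # A single matrix multiply gives all dot products:
    # G[i, j] == dot(A[:, i], B[:, j]).
    G = A' * B
    # Divide by the norm products and clamp at zero, as eval_end does.
    return max.(1 .- G ./ (na .* nb'), 0)
end

a = [2, 0, 1, 1, 0, 2, 1, 1]
b = [2, 1, 1, 0, 1, 1, 1, 1]
A = Float64[a b]                 # 8×2 matrix, one vector per column
D = pairwise_cosine_dist(A, A)   # D[1, 2] is the cosine distance between a and b
```

The point of this shape is that the O(n·m·d) work happens inside `A' * B`, which is handed to BLAS for floating-point matrices; the norms and the final broadcast are cheap by comparison.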

For example, if you are very lazy you can normalize your data, compute Euclidean distances, and get your cosine dissimilarity from that, since for unit vectors the squared Euclidean distance is twice the cosine distance. If you are less lazy, you drop the square root (there is no reason to take the square root and then square it again).
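A small check of that relation, using the two vectors from the question. This is only a sketch: the variable names are illustrative, and it assumes neither vector is all zeros.

```julia
using LinearAlgebra

a = Float64[2, 0, 1, 1, 0, 2, 1, 1]
b = Float64[2, 1, 1, 0, 1, 1, 1, 1]

# Normalize to unit length (assumes neither vector is all zeros).
u = a / norm(a)
v = b / norm(b)

# For unit vectors, |u - v|^2 = 2 - 2 * dot(u, v), so half the squared
# Euclidean distance is the cosine distance. sum(abs2, ...) computes the
# squared distance directly, so no square root is taken and then undone.
d_from_euclid = sum(abs2, u - v) / 2

# Direct formula, for comparison.
d_direct = 1 - dot(a, b) / (norm(a) * norm(b))
```

Both values should agree up to floating-point rounding.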