Vector multiply vector vs for loop, type and size dependent performance

Donut_Meepo · November 7, 2018, 6:05am

Hi,

I am trying to do x’*y vector multiplication. Then I have the function

 function vectorMvector_Int8(x::Array{Int8,1}, y::Array{Int8,1})
     result = x'*y
     result
 end

I have another version that using for loop to do the multiplication

 function vectorMvector2_Int8(x::Array{Int8,1}, y::Array{Int8,1})
     result=0
     n = size(x)[1]
     @inbounds @simd for i in 1:n
         result += x[i]*y[i]
     end
     result
 end

When the vectors are small, vector multiplication are slower than for loop
e.g. a=rand([Int8(0),Int8(1)], 100);b=rand([Int8(0),Int8(1)], 100)
benchmark median for vectorMvector is 75 ns while vectorMvector2 is 29 ns.

When vectors are large, for example a=rand([Int8(0),Int8(1)], 100000);b=rand([Int8(0),Int8(1)], 100000), benchmark median for vectorMvector is 11 μs while vectorMvector2 is 24 μs.

However when x,y is Int64 type, vectorMvector2 seems always a little bit faster than vectorMvector

 function vectorMvector_Int64(x::Array{Int64,1}, y::Array{Int64,1})
     result = x'*y
     result
 end

 function vectorMvector2_Int64(x::Array{Int64,1}, y::Array{Int64,1})
     result=0
     n = size(x)[1]
     @inbounds @simd for i in 1:n
         result += x[i]*y[i]
     end
     result
 end

a=rand([Int64(0),Int64(1)], 100);b=rand([Int64(0),Int64(1)], 100), 31 ns vs 28 ns.
a=rand([Int64(0),Int64(1)], 100000);b=rand([Int64(0),Int64(1)], 100000) 31.734 μs vs 28.916 μs

I prefer to use x’*y because it is more meaningful and less typing. But the performance is complicating my choices. Am I doing something wrong for the x’*y operation to cause slow performance?

Thank you!

y4lu · November 7, 2018, 7:44am

If it is always multiplying just 0’s and 1’s, you might try sum( x .& y)

Donut_Meepo · November 7, 2018, 8:17am

It is 3X slower in the Int64 datatype situation. Although it might relates to another issue, I feel doing element-wise operation then using the sum function is slower than explicitly sum them up like the for loop.

I realize that the Int8 version functions is not suitable for large vector, since it is likely to overthrow. I tested the Int64 version on more general vectors

a=rand(Int64(0):Int64(100), 100);b=rand(Int64(0):Int64(100), 100)
and larger size of the vector
a=rand(Int64(0):Int64(100), 100000);b=rand(Int64(0):Int64(100), 100000)

As expected the timing are exactly the same as the 0 & 1 vectors.

Tamas_Papp · November 7, 2018, 8:18am

Unless this is a learning exercise, can’t you use LinearAlgebra.dot?

DNF · November 7, 2018, 8:40am

You know, your codes for the int8 and int64 input are identical, so it’s a bit weird to name them vectorMvector_Int8 and vectorMvector_Int64, etc. It just makes it more likely that some typo will introduce discrepancies. It’s also a bit grating, because it so flagrantly flies in the face of multiple dispatch.

Can’t you just write your code like this

function vectorMvector(x, y)
     x'*y
end

function vectorMvector2(x, y)
     result=0
     n = size(x)[1]
     @inbounds @simd for i in 1:n
         result += x[i]*y[i]
     end
     result
 end

and then run it on the various input types?

You should perhaps consider, though, to replace the initialization results = 0 with result = zero(eltype(x)), unless you specifically need the output to be Int64.

Donut_Meepo · November 7, 2018, 8:44am

x’*y is not dot multiply, right?

Tamas_Papp · November 7, 2018, 8:48am

julia> using LinearAlgebra

julia> N = 10
10

julia> x, y = rand(Int, N), rand(Int, N);

julia> x'*y == dot(x, y)
true

Donut_Meepo · November 7, 2018, 7:28pm

I am sorry I am new for using Julia. I will refer more to the exist library. Thank you.

My actual problem is calling the dot operation multiple times in a for loop. Code manually the dot operation seems run faster than simply calling dot(). Again I am sure I did miss something. I really appreciate it if you can tell me how to do it right, since using the dot() function will make the code more clear.

function test(mat)
    n, m = size(mat)
    count = 0
    @inbounds for i in 1:m, j in i:m
        n, m = size(mat)
        temp_sum = 0
        #@views temp_sum = dot(mat[:,i], mat[:,j])
        for k in 1:n
            temp_sum += (mat[k,i]*mat[k,j])
        end
        if temp_sum==0
            count += 1
        end
    end
    count
end

I am using this to generate input a=rand([0,1], 22, 1000).

Donut_Meepo · November 7, 2018, 7:30pm

Thank you. I am just switching from python+numba. It requires tell the type signatures. The ones you pointed out really work the same performance for different argument type combinations!

Tamas_Papp · November 7, 2018, 7:38pm

Try views.

Donut_Meepo · November 7, 2018, 7:41pm

Do you mean code as following?

function test(mat)
    n, m = size(mat)
    count = 0
    @inbounds for i in 1:m, j in i:m
        temp_sum = 0
        @views temp_sum = dot(mat[:,i], mat[:,j])
        if temp_sum==0
            count += 1
        end
    end
    count
end

Tamas_Papp · November 7, 2018, 7:45pm

Yes, that should work.

Incidentally, I don’t see the purpose of

n, m = size(mat)

inside the loop.

tkoolen · November 7, 2018, 7:50pm

Note that x' * y actually calls dot:

github.com

JuliaLang/julia/blob/7454dc9e54a3009a0be2860759735c849d647b39/stdlib/LinearAlgebra/src/adjtrans.jl#L197


      
          # TODO unify and allow mixed combinations
          
          
### linear algebra
          
          
(-)(A::Adjoint)   = Adjoint(  -A.parent)
          (-)(A::Transpose) = Transpose(-A.parent)
          
          
## multiplication *
          
          
# Adjoint/Transpose-vector * vector
          *(u::AdjointAbsVec, v::AbstractVector) = dot(u.parent, v)
          *(u::TransposeAbsVec{T}, v::AbstractVector{T}) where {T<:Real} = dot(u.parent, v)
          function *(u::TransposeAbsVec, v::AbstractVector)
              @assert !has_offset_axes(u, v)
              @boundscheck length(u) == length(v) || throw(DimensionMismatch())
              return sum(@inbounds(u[k]*v[k]) for k in 1:length(u))
          end
          # vector * Adjoint/Transpose-vector
          *(u::AbstractVector, v::AdjOrTransAbsVec) = broadcast(*, u, v)
          # Adjoint/Transpose-vector * Adjoint/Transpose-vector
          # (necessary for disambiguation with fallback methods in linalg/matmul)

Donut_Meepo · November 7, 2018, 8:07pm

Sorry that was my typo. I corrected. By using for inside, benchmark shows 3.8 ms with 16 bytes and 1 allocation. By using @views dot(), benchmark shows 14.654 ms with 45.82 MB and 1001001

Topic		Replies	Views
Matrix-vector multiplication slower than a 'naive' for loop? Performance vector	7	1647	July 30, 2020
Why for-loop of vector multiplication is slower that dot product? General Usage question	11	452	August 11, 2022
Effect of Column Major Storage on Multiplication Performance	3	446	November 7, 2020
For loop in function and multiplication of larger matrices, slow speed in parallel Performance performance , parallel , loops	3	1298	November 22, 2019
Multiplication of vector of matrices and vector of vectors Performance linearalgebra , tullio	3	416	November 7, 2022

Vector multiply vector vs for loop, type and size dependent performance

Related topics