 # Matrix-vector multiplication slower than a 'naive' for loop?

Hi,

I’m new to Julia and have been playing around with it to test its performance claims.

I am looking to multiply a matrix `X` by a sparse vector `β`, generated via the following code:

```julia
n = 500
p = 100000

β = rand(p);
zero_indices = Int[]
nonzero_indices = Int[]
for i in 1:p
    if rand() > 0.3
        push!(zero_indices, i)
    else
        push!(nonzero_indices, i)
    end
end
for i in zero_indices
    β[i] = 0
end
X = rand(n, p);
```

I tried three ways to do this:

1. Using the `*` operator, i.e. computing `v = X * β`.
2. Using a for loop:
```julia
v = zeros(size(X, 1))
for i in 1:length(β)
    v += X[:, i] * β[i]
end
```
3. Using a for loop, but iterating only over those indices for which `β[i]` is non-zero (stored in an integer array named `nonzero_indices`):
```julia
v = zeros(size(X, 1))
for i in nonzero_indices
    v += X[:, i] * β[i]
end
```

I used the `@benchmark` macro to time each of these approaches. Here are the mean times:

1. 28.182 ms
2. 1.272 μs
3. 1.132 μs

It makes sense to me that method 3 is faster than method 2: we’re looping over fewer columns. I suspected it might even be faster than method 1, since it leverages the sparsity of the vector `β`. But why on earth is method 2 so much faster than method 1? That is, why is a for-loop implementation of matrix-vector multiplication so much faster than an operator dedicated to this computation?

Thanks!

Double-check your benchmarking code. I’m getting:

```julia
julia> @btime f1($X, $β);
  19.875 ms (1 allocation: 4.06 KiB)

julia> @btime f2($X, $β);
  186.214 ms (300001 allocations: 1.16 GiB)

julia> @btime f3($X, $β, $nonzero_indices);
  56.557 ms (89530 allocations: 355.19 MiB)
```

which makes sense given your observations. My function definitions are:

```julia
julia> function f1(X, β)
           X * β
       end
f1 (generic function with 1 method)

julia> function f2(X, β)
           v = zeros(size(X, 1))
           for i in 1:length(β)
               v += X[:, i] * β[i]
           end
           v
       end
f2 (generic function with 1 method)

julia> function f3(X, β, nonzero_indices)
           v = zeros(size(X, 1))
           for i in nonzero_indices
               v += X[:, i] * β[i]
           end
           v
       end
f3 (generic function with 1 method)
```

Now if you want a faster implementation, you just need to follow the performance tips. Here’s an easy improvement of `f3` which makes it faster than `f1` by more than a factor of 2:

```julia
julia> function f4(X, β, nonzero_indices)
           v = zeros(size(X, 1))
           for i in nonzero_indices
               v .+= @view(X[:, i]) .* β[i]
           end
           v
       end
f4 (generic function with 1 method)

julia> @btime f4($X, $β, $nonzero_indices);
  7.871 ms (29844 allocations: 1.37 MiB)
```

By the way, Julia 1.5 (to be released soon) will make `f4` even faster by avoiding memory allocation for the views:

```julia
julia-1.5> @btime f4($X, $β, $nonzero_indices);
  7.289 ms (1 allocation: 4.06 KiB)
```

Thanks!

Your timing results certainly make more sense. I re-ran my experiment, but unfortunately I’m still getting similar numbers. I generated the matrix and vector exactly as above and timed each approach in Jupyter notebook cells, wrapping each computation in a `quote` block.

Do you have any idea what might be going wrong here?

`quote` creates an expression when metaprogramming; it doesn’t actually do anything with that expression. You’re essentially benchmarking the time it takes to turn the code into an expression object, not how long it takes to run it.

Get rid of the `quote` and use `let` or `begin` to introduce a new block. Better yet, use an actual function (see the very first performance tip).
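A minimal sketch of the pitfall (using `X` and `β` from above):

```julia
using BenchmarkTools

# `quote` only *constructs* an expression object; nothing inside it runs.
ex = quote
    X * β   # stored as syntax, never executed here
end
typeof(ex)  # Expr — benchmarking this block times expression construction only

# Benchmark the actual computation instead, interpolating the variables
# with `$` so they aren't treated as non-constant globals:
@btime $X * $β;
```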


Ah, I see… very silly of me! Thank you.

This is a huge reduction in memory. From the documentation I understand that the point of views is to avoid copying the data, instead referencing it in place. I’m curious to know why these views involve so much memory allocation, and why they don’t in Julia 1.5. Would you be able to enlighten me?

A view is a lightweight reference to another array, so when you create a view you don’t have to copy the data in that array, but you still have to construct the view itself. In Julia versions before 1.5, the view itself was often allocated on the heap, so code using views would still show a large number of allocations, but each allocation would be very small (just a pointer to the original array plus some extra information about the size). That’s what happened here. Compare `f3` (no views), which had:

```
89530 allocations: 355.19 MiB
```

with `f4` (using views) which had:

```
29844 allocations: 1.37 MiB
```

The number of allocations only drops by about 3×, but the total amount of allocated memory drops by roughly 260× (355.19 MiB down to 1.37 MiB), because we’re no longer copying the data for each slice of `X`. The average allocation size (1.37 MiB / 29844 allocations) is under 50 bytes, which is tiny (just a few `Int64`s or pointers). That makes sense, since we’re allocating a large number of lightweight views.

The improvement in Julia 1.5 is that the view itself can now be constructed on the stack and therefore does not require memory allocation. The details are pretty low-level, but the relevant PR is here: https://github.com/JuliaLang/julia/pull/33886
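As a minimal sketch of the difference (sizes as in the example above; the exact byte counts are version-dependent):

```julia
X = rand(500, 100000)

# A slice copies the column: it allocates a fresh 500-element
# Float64 vector (about 4 KiB).
@allocated X[:, 1]

# A view wraps the same memory without copying it. Before Julia 1.5 the
# wrapper itself was often heap-allocated (a few dozen bytes); from 1.5 on
# it can live on the stack, so this typically reports 0 bytes.
@allocated @view X[:, 1]
```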


Hi, I propose another version, which is as fast as `f4` and requires far fewer allocations.

```julia
function f5(X, β, nonzero_indices)
    v = zeros(size(X, 1))
    @inbounds for i in nonzero_indices
        # `@simd` belongs on the innermost loop
        @simd for j in 1:size(X, 1)
            v[j] += X[j, i] * β[i]
        end
    end
    v
end
```

Below are the timings on my machine:

```julia
@btime f4($X, $β, $nonzero_indices);
  8.376 ms (29681 allocations: 1.36 MiB)

@btime f5($X, $β, $nonzero_indices);
  8.288 ms (1 allocation: 4.06 KiB)
```