 # Matrix-vector multiplication slower than a 'naive' for loop?

Hi,

I’m new to Julia and have been playing around with it to test its performance claims.

I am looking to multiply a matrix `X` by a sparse vector `β`, generated via the following code:

```julia
n = 500
p = 100000

β = rand(p);
zero_indices = Int[]
nonzero_indices = Int[]
for i in 1:p
    if rand() > 0.3
        push!(zero_indices, i)
    else
        push!(nonzero_indices, i)
    end
end
for i in zero_indices
    β[i] = 0
end
X = rand(n, p);
```

I tried three ways to do this:

1. Using the `*` operator, i.e. computing `v = X * β`.
2. Using a for loop:
```julia
v = zeros(size(X, 1))
for i in 1:length(β)
    v += X[:, i] * β[i]
end
```
3. Using a for loop, but iterating only over those indices for which `β[i]` is non-zero (stored in an integer array named `nonzero_indices`):
```julia
v = zeros(size(X, 1))
for i in nonzero_indices
    v += X[:, i] * β[i]
end
```

I used the `@benchmark` macro to time each of these approaches. Here are the mean times:

1. 28.182 ms
2. 1.272 μs
3. 1.132 μs

It makes sense to me that method 3 is faster than method 2: we’re looping over fewer columns. I suspected it might even be faster than method 1, since it leverages the sparsity of the vector `β`. But why on earth is method 2 so much faster than method 1? That is, why is a for-loop implementation of matrix-vector multiplication so much faster than an operator dedicated to this computation?

Thanks!

Double-check your benchmarking code. I’m getting:

```julia
julia> @btime f1($X, $β);
  19.875 ms (1 allocation: 4.06 KiB)

julia> @btime f2($X, $β);
  186.214 ms (300001 allocations: 1.16 GiB)

julia> @btime f3($X, $β, $nonzero_indices);
  56.557 ms (89530 allocations: 355.19 MiB)
```

which makes sense given your observations. My function definitions are:

```julia
julia> function f1(X, β)
           X * β
       end
f1 (generic function with 1 method)

julia> function f2(X, β)
           v = zeros(size(X, 1))
           for i in 1:length(β)
               v += X[:, i] * β[i]
           end
           v
       end
f2 (generic function with 1 method)

julia> function f3(X, β, nonzero_indices)
           v = zeros(size(X, 1))
           for i in nonzero_indices
               v += X[:, i] * β[i]
           end
           v
       end
f3 (generic function with 1 method)
```

Now if you want a faster implementation, you just need to follow the performance tips. Here’s an easy improvement of `f3` which makes it faster than `f1` by more than a factor of 2:

```julia
julia> function f4(X, β, nonzero_indices)
           v = zeros(size(X, 1))
           for i in nonzero_indices
               v .+= @view(X[:, i]) .* β[i]
           end
           v
       end
f4 (generic function with 1 method)

julia> @btime f4($X, $β, $nonzero_indices);
  7.871 ms (29844 allocations: 1.37 MiB)
```

By the way, Julia 1.5 (to be released soon) will make `f4` even faster by avoiding memory allocation for the views:

```julia
julia-1.5> @btime f4($X, $β, $nonzero_indices);
  7.289 ms (1 allocation: 4.06 KiB)
```

Thanks!

Your timing results certainly make more sense. I re-ran my experiment, but unfortunately I’m still getting similar numbers. I generated the matrix and vector exactly as above and timed each approach in Jupyter notebook cells, wrapping each computation in a `quote` block.

Do you have any idea what might be going wrong here?

`quote` creates an expression when metaprogramming; it doesn’t actually do anything with that expression. You’re essentially benchmarking the time it takes to turn the code into an expression object, not how long it takes to run it.

Get rid of the `quote` and use `let` or `begin` to introduce a new block. Better yet, use an actual function (see the very first performance tip).
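A minimal sketch of the pitfall (using `X` and `β` from above):

```julia
using BenchmarkTools

# `quote` only *constructs* an expression object; nothing inside it runs.
ex = quote
    X * β   # stored as syntax, never executed here
end
typeof(ex)  # Expr — benchmarking this block times expression construction only

# Benchmark the actual computation instead, interpolating the variables
# with `$` so they aren't treated as non-constant globals:
@btime $X * $β;
```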


Ah, I see… very silly of me! Thank you.

This is a huge reduction in memory. From the documentation I understand that the point of views is to avoid copying the data, instead referencing it in place. I’m curious to know why these views involve so much memory allocation, and why they don’t in Julia 1.5. Would you be able to enlighten me?

A view is a lightweight reference to another array, so when you create a view you don’t have to copy the data in that array, but you still have to construct the view itself. In Julia versions before 1.5, the view itself was often allocated on the heap, so code using views would still show a large number of allocations, but each allocation would be very small (just a pointer to the original array plus some extra information about the size). That’s what happened here. Compare `f3` (no views), which had:

```
89530 allocations: 355.19 MiB
```

with `f4` (using views) which had:

```
29844 allocations: 1.37 MiB
```

The number of allocations only drops by about 3×, but the total amount of allocated memory drops by roughly 260× (355.19 MiB down to 1.37 MiB), because we’re no longer copying the data for each slice of `X`. The average allocation size (1.37 MiB / 29844 allocations) is under 50 bytes, which is tiny (just a few `Int64`s or pointers). That makes sense, since we’re allocating a large number of lightweight views.

The improvement in Julia 1.5 is that the view itself can now be constructed on the stack and therefore does not require memory allocation. The details are pretty low-level, but the relevant PR is here: https://github.com/JuliaLang/julia/pull/33886
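As a minimal sketch of the difference (sizes as in the example above; the exact byte counts are version-dependent):

```julia
X = rand(500, 100000)

# A slice copies the column: it allocates a fresh 500-element
# Float64 vector (about 4 KiB).
@allocated X[:, 1]

# A view wraps the same memory without copying it. Before Julia 1.5 the
# wrapper itself was often heap-allocated (a few dozen bytes); from 1.5 on
# it can live on the stack, so this typically reports 0 bytes.
@allocated @view X[:, 1]
```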


Hi, I propose another version, which is as fast as `f4` and requires far fewer allocations.

```julia
function f5(X, β, nonzero_indices)
    v = zeros(size(X, 1))
    @inbounds for i in nonzero_indices
        # `@simd` belongs on the innermost loop
        @simd for j in 1:size(X, 1)
            v[j] += X[j, i] * β[i]
        end
    end
    v
end
```

Below are the timings on my machine:

```julia
@btime f4($X, $β, $nonzero_indices);
  8.376 ms (29681 allocations: 1.36 MiB)

@btime f5($X, $β, $nonzero_indices);
  8.288 ms (1 allocation: 4.06 KiB)
```