I’m new to Julia and have been playing around with it to test its performance claims.
I am looking to multiply a matrix X by a sparse vector β, generated via the following code:
n = 500
p = 100000
β = rand(p);
zero_indices = Int[]
nonzero_indices = Int[]
for i in 1:p
    if rand() > 0.3
        push!(zero_indices, i)
    else
        push!(nonzero_indices, i)
    end
end
for i in zero_indices
    β[i] = 0
end
X = rand(n, p);
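(Equivalently, the same kind of β and index sets can be built more compactly; a small sketch, not the code I benchmarked:)

β = rand(p)
β[rand(p) .> 0.3] .= 0                 # zero out ≈70% of the entries
nonzero_indices = findall(!iszero, β)
zero_indices = findall(iszero, β)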
I tried three ways to do this:
1. Using the * operator, i.e. computing v = X * β.
2. Using a for loop:
v = zeros(size(X)[1])
for i in 1:length(β)
    v += X[:, i] * β[i]
end
3. Using a for loop, but iterating only over those indices for which β[i] is non-zero (stored in an integer array named nonzero_indices):
v = zeros(size(X)[1])
for i in nonzero_indices
    v += X[:, i] * β[i]
end
I used the @benchmark macro to time each of these approaches. Here are the mean times:
Method 1: 28.182 ms
Method 2: 1.272 μs
Method 3: 1.132 μs
It makes sense to me that method 3 is faster than method 2: we’re looping over fewer things. I suspected it may even be faster than method 1, since we’re leveraging the sparsity of the vector β. But why on earth is method 2 so much faster than method 1? That is, why is a for loop implementation of matrix-vector multiplication so much faster than an operator dedicated to this computation?
When I benchmark these as proper functions, f1 is the fastest and f3 beats f2, which makes sense given your observations. My function definitions are:
julia> function f1(X, β)
           X * β
       end
f1 (generic function with 1 method)

julia> function f2(X, β)
           v = zeros(size(X)[1])
           for i in 1:length(β)
               v += X[:, i] * β[i]
           end
           v
       end
f2 (generic function with 1 method)

julia> function f3(X, β, nonzero_indices)
           v = zeros(size(X)[1])
           for i in nonzero_indices
               v += X[:, i] * β[i]
           end
           v
       end
f3 (generic function with 1 method)
Now if you want a faster implementation, you just need to follow the performance tips (https://docs.julialang.org/en/v1/manual/performance-tips/). Here’s an easy improvement of f3 which makes it faster than f1 by more than a factor of 2:
julia> function f4(X, β, nonzero_indices)
           v = zeros(size(X, 1))
           for i in nonzero_indices
               v .+= @view(X[:, i]) .* β[i]
           end
           v
       end
f4 (generic function with 1 method)
julia> @btime f4($X, $β, $nonzero_indices);
7.871 ms (29844 allocations: 1.37 MiB)
By the way, Julia 1.5 (to be released soon) will make f4 even faster by avoiding heap allocation for the views.
Your timing results sure make more sense. I re-ran my experiment but unfortunately I’m still getting similar results. I generated the matrix and vector exactly as above, and ran the following cells in a Jupyter notebook:
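# one representative cell (a reconstruction; the quote wrapper is the detail the next reply diagnoses)
@benchmark quote
    v = zeros(size(X)[1])
    for i in 1:length(β)
        v += X[:, i] * β[i]
    end
end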
quote is a metaprogramming construct: it creates an expression object but doesn’t actually do anything with it. You’re essentially benchmarking the time it takes to build the code into an expression object, not how long the code takes to run.
Get rid of the quote and use let or begin to introduce a new block. Even better, use an actual function (see the very first performance tip).
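To see the difference at the REPL:

julia> typeof(quote 1 + 1 end)   # quote only builds an Expr; nothing is executed
Expr

julia> typeof(begin 1 + 1 end)   # begin evaluates the block and returns 2
Int64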
This is a huge reduction in memory. From the documentation I understand that the point of views is to avoid copying the data, instead referencing it in place. I’m curious to know why these views involve so much memory allocation, and why they don’t in Julia 1.5. Would you be able to enlighten me?
A view is a lightweight reference to another array, so when you create a view you don’t have to copy the data in that array, but you still have to construct the view itself. In Julia versions before 1.5, the view itself was often allocated on the heap, so code using views would still show a large number of allocations, but each allocation would be very small (just a pointer to the original array and some extra information about the size). That’s what happened here. Compare f3 (no views) which had:
89530 allocations: 355.19 MiB
with f4 (using views) which had:
29844 allocations: 1.37 MiB
The number of allocations is only reduced by about 3X, but the total amount of allocated memory is reduced by about 260X, because we’re no longer copying the data for each column slice of X. The average allocation size in f4 (1.37 MiB / 29844) is less than 50 bytes, which is tiny (just a few Int64s or pointers). That makes sense, since we’re allocating a large number of lightweight views.
The improvement in Julia 1.5 is that the view itself can now be constructed on the stack and therefore does not require memory allocation. The details are pretty low-level, but the relevant PR is here: https://github.com/JuliaLang/julia/pull/33886
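To see the copy-versus-view difference in isolation, here’s a minimal sketch (exact timings and allocation counts will of course vary by machine and Julia version):

using BenchmarkTools
M = rand(500, 1000)
col_copy(A) = sum(A[:, 1])        # A[:, 1] allocates a fresh copy of the column on every call
col_view(A) = sum(@view A[:, 1])  # the view reads the column in place; no data is copied
@btime col_copy($M);
@btime col_view($M);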
Hi, I propose another version which is as fast as f4 and requires far fewer allocations.
function f5(X, β, nonzero_indices)
    v = zeros(size(X, 1))
    @inbounds @simd for i in nonzero_indices
        for j in 1:size(X, 1)
            v[j] += X[j, i] * β[i]
        end
    end
    v
end
Below are the timings on my machine:
@btime f4($X, $β, $nonzero_indices);
8.376 ms (29681 allocations: 1.36 MiB)
@btime f5($X, $β, $nonzero_indices);
8.288 ms (1 allocation: 4.06 KiB)
The single allocation is just the output vector v itself (500 Float64s, about 4 KiB); the loop body allocates nothing, since it indexes X directly instead of materializing a slice or a view for each column.
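As a quick sanity check, all the versions compute the same product up to floating-point roundoff (the summation order differs between implementations):

f1(X, β) ≈ f4(X, β, nonzero_indices) ≈ f5(X, β, nonzero_indices)  # true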