I’ve always been told to reduce redundant memory allocations, especially when the given operation is to be executed many times; and loops are fast, so vectorised code is not always necessary.
However, in practice, I have found that this is not always the case. For example, I calculated the matrix product transpose(X) * Diagonal(y) * X, where y is a vector and X a matrix, and stored the result in a pre-allocated matrix D:
using LinearAlgebra

function func1(X, y)
    transpose(X) * Diagonal(y) * X
end
function func2!(X, y, D)
    Xt = transpose(X)
    fill!(D, 0)
    for n = 1:size(X, 1)
        # accumulate the rank-1 term y[n] * X[n, :] * X[n, :]'
        @inbounds @views D .+= Xt[:, n] .* transpose(Xt[:, n]) .* y[n]
    end
    D
end
And the results are:
julia> D = Matrix{Float64}(undef, 500, 500);
julia> @benchmark D .= func1(randn(1000, 500), randn(1000))
BenchmarkTools.Trial:
memory estimate: 9.54 MiB
allocs estimate: 10
--------------
minimum time: 11.245 ms (0.00% GC)
median time: 15.246 ms (0.00% GC)
mean time: 21.155 ms (6.78% GC)
maximum time: 163.981 ms (9.29% GC)
--------------
samples: 236
evals/sample: 1
julia> @benchmark func2!(randn(1000, 500), randn(1000), $D)
BenchmarkTools.Trial:
memory estimate: 3.82 MiB
allocs estimate: 3
--------------
minimum time: 353.599 ms (0.00% GC)
median time: 383.437 ms (0.00% GC)
mean time: 506.716 ms (0.00% GC)
maximum time: 1.159 s (0.00% GC)
--------------
samples: 11
evals/sample: 1
I was wondering what made func2! slower than func1, even though the former yielded fewer memory allocations. And more generally, is there a rule of thumb to help decide when to use loops and when to use built-in matrix operations?
Beating BLAS routines is always a tough ask because they are already very optimized.
Reducing allocations helps most when your code is allocating a lot. Since func1 doesn’t allocate that much, the impact of those allocations isn’t that large. You can see that by writing a version of func1 that doesn’t allocate:
function func3!(X, y, D, E)
    mul!(E, Diagonal(y), X)      # E .= Diagonal(y) * X, written in place
    mul!(D, transpose(X), E)     # D .= X' * E, written in place
    nothing
end
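The a, b, and E interpolated into the benchmarks below aren’t shown being created; a setup consistent with the sizes from the question would look something like this (the exact values used here are an assumption on my part):
a = randn(1000, 500); b = randn(1000)
D = Matrix{Float64}(undef, 500, 500)   # output buffer
E = similar(a)                         # workspace for Diagonal(b) * a, same 1000×500 shape as a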
julia> @benchmark func1($a,$b)
BenchmarkTools.Trial:
memory estimate: 5.72 MiB
allocs estimate: 6
--------------
minimum time: 25.907 ms (0.00% GC)
median time: 26.195 ms (0.00% GC)
mean time: 26.289 ms (0.28% GC)
maximum time: 27.058 ms (2.35% GC)
--------------
samples: 191
evals/sample: 1
julia> @benchmark func3!($a,$b, $D, $E)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 25.388 ms (0.00% GC)
median time: 25.914 ms (0.00% GC)
mean time: 25.884 ms (0.00% GC)
maximum time: 26.236 ms (0.00% GC)
--------------
samples: 194
evals/sample: 1
Another thing to take a look at: it’s very likely that the BLAS operation is using multiple threads, while your version isn’t.
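For context (my own sketch, not from the original answer), one way to check and control the BLAS thread count for an apples-to-apples comparison; note that BLAS.get_num_threads requires a reasonably recent Julia:
using LinearAlgebra

BLAS.get_num_threads()   # how many threads the BLAS library is currently using
BLAS.set_num_threads(1)  # pin BLAS to one thread for a single-threaded comparison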
Thank you for your answer - it’s very helpful. I was wondering whether macros like @turbo introduce multi-threading overhead, especially when I have many sequential matrix operations?
@turbo is single-threaded by default. When calling @tturbo, however, the package doesn’t use the normal Julia threading and instead uses https://github.com/JuliaSIMD/Polyester.jl, which has lower overhead. I wouldn’t worry about the threading overhead though; it’s on the order of microseconds. Check the package for some more information about it.
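As a rough illustration (my own sketch, not from the thread), the only source-level difference between the two is the macro, e.g. on a simple reduction:
using LoopVectorization

# Single-threaded SIMD version.
function mysum_turbo(x)
    s = zero(eltype(x))
    @turbo for i in eachindex(x)
        s += x[i]
    end
    s
end

# Same loop, but threaded via Polyester.
function mysum_tturbo(x)
    s = zero(eltype(x))
    @tturbo for i in eachindex(x)
        s += x[i]
    end
    s
end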
julia> using LoopVectorization, BenchmarkHistograms, LinearAlgebra
julia> D = Matrix{Float64}(undef, 500, 500); D2 = similar(D); D3 = similar(D);
julia> X = randn(1000,500); y = randn(1000);
julia> function func5!(D, X, y)
           @tturbo for m in axes(D,1), k in axes(D,2)
               Dmk = zero(eltype(D))
               for n in axes(X, 1)
                   Dmk += X[n, m] * X[n, k] * y[n]
               end
               D[m, k] = Dmk
           end
           D
       end
func5! (generic function with 1 method)
julia> function func2!(D, X, y)
           Xt = transpose(X)
           fill!(D, 0)
           for n = 1:size(X, 1)
               @inbounds @views D .+= Xt[:, n] .* transpose(Xt[:, n]) .* y[n]
           end
           D
       end
func2! (generic function with 1 method)
julia> func2!(D2,X,y);
julia> func5!(D,X,y) ≈ D2
true
julia> @benchmark func5!($D,$X,$y)
samples: 4270; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(1.162e6 - 1.17e6 ] ██████████████████████████████ 3551
(1.17e6 - 1.178e6] █████▊ 680
(1.178e6 - 1.186e6] ▎ 15
(1.186e6 - 1.194e6] ▏ 3
(1.194e6 - 1.202e6] ▏ 3
(1.202e6 - 1.21e6 ] 0
(1.21e6 - 1.218e6] 0
(1.218e6 - 1.225e6] ▏ 2
(1.225e6 - 1.233e6] ▏ 2
(1.233e6 - 1.241e6] ▏ 1
(1.241e6 - 1.249e6] ▏ 1
(1.249e6 - 1.257e6] ▏ 5
(1.257e6 - 1.265e6] ▏ 2
(1.265e6 - 1.334e6] ▏ 5
Counts
min: 1.162 ms (0.00% GC); mean: 1.168 ms (0.00% GC); median: 1.167 ms (0.00% GC); max: 1.334 ms (0.00% GC).
julia> @benchmark func2!($D3,$X,$y)
samples: 30; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(1.703e8 - 1.705e8] █████████████████████▌ 5
(1.705e8 - 1.708e8] █████████████████████▌ 5
(1.708e8 - 1.71e8 ] █████████████████████████▊ 6
(1.71e8 - 1.713e8] ██████████████████████████████ 7
(1.713e8 - 1.716e8] █████████████████████████▊ 6
(1.716e8 - 1.718e8] ████▍ 1
Counts
min: 170.258 ms (0.00% GC); mean: 170.990 ms (0.00% GC); median: 171.029 ms (0.00% GC); max: 171.823 ms (0.00% GC).
Note that I benchmarked on a computer with 18 cores and 1 MiB of L2 cache per core, so LoopVectorization can handle much larger arrays on this computer before performance falls off a cliff (e.g., see the README plot). I will fix that eventually, but for now it’d be worth manually blocking the arrays or using Tullio.
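To make the Tullio suggestion concrete, here is a minimal sketch (the name func6! and the exact formulation are mine, not from the thread) of the same transpose(X) * Diagonal(y) * X kernel; the repeated index n on the right-hand side is summed over, and Tullio blocks and threads the loops itself:
using Tullio, LoopVectorization   # Tullio uses @turbo-style kernels when LoopVectorization is loaded

function func6!(D, X, y)
    @tullio D[m, k] = X[n, m] * X[n, k] * y[n]   # sum over n
    D
end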