I’m trying to optimize a multiplication of a vector of Matrices A
of size (L, L, L)
and a vector of vectors x
of size (L,L)
, which will result in a vector of vectors c
of size (L, L)
. That is, each c[i,:]
is the result of multiplying A[i,:,:] * x[i,:]
. An example of code snippet doing this is:
using LinearAlgebra, Random, BenchmarkTools
L = 200
A = randn(L, L, L)
x = randn(L, L)
c = zeros(L, L)
function mult1!(c, A, x, L)
c .= zero(eltype(c))
for i in 1:L
@views LinearAlgebra.mul!(c[i,:], A[i,:,:], x[i,:])
end
end
When benchmarking this I got concerned about memory allocation:
julia> @benchmark mult1!(c, A, x, L)
BenchmarkTools.Trial: 265 samples with 1 evaluation.
Range (min … max): 18.432 ms … 28.804 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 18.794 ms ┊ GC (median): 0.00%
Time (mean ± σ): 18.908 ms ± 801.765 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄█ ▆▄▃▄▃
▃▃▅▅▆▄▇▇███▆██████▅▅▆▅▃▃▄▁▄▃▄▃▄▁▁▃▁▄▁▃▁▁▃▃▁▁▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▃ ▃
18.4 ms Histogram: frequency by time 20.1 ms <
Memory estimate: 34.38 KiB, allocs estimate: 800.
I thought that this operation shouldn’t need any memory allocation, but it is doing 800 allocations for L=200
. Therefore I decided to code the Matrix-vector multiplication myself to see if I could avoid those allocations and speed things up, as the Julia documentation suggests. Instead of LinearAlgebra.mul!
above, I coded a function mult_add
performing a multiplication v1 = M * v2
:
function mult2!(c, A, x, L)
c .= zero(eltype(c))
for i in 1:L
@views mult_add!(c[i,:], A[i,:,:], x[i,:], L)
end
end
function mult_add!(v1, M, v2, L)
for j in 1:L
for k in 1:L
v1[j] += M[j,k] * v2[k]
end
end
end
Running the benchmark I check that indeed this avoids memory allocation:
julia> @benchmark mult2!(c, A, x, L)
BenchmarkTools.Trial: 146 samples with 1 evaluation.
Range (min … max): 32.477 ms … 110.520 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 32.913 ms ┊ GC (median): 0.00%
Time (mean ± σ): 34.410 ms ± 7.208 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▃
████▄▁▁▁▁▁▄▁▄▁▁▁▁▁▁▁▁▄▁▆▁▄▁▁▁▄▄▁▁▁▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▄ ▄
32.5 ms Histogram: log(frequency) by time 53 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
However, the computing time increased from 18ms to 32ms. I guess that the LinearAlgebra.mul!
function is much more optimized than my function mult_add!
, but I have a couple of questions:
- By reducing allocations I was expecting to reduce the peak in the benchmark of
mult1!
. However I got a more prominent peak at the beginning of the histogram ofmult2!
. Why? - Can the first version of the code with
mult1!
be easily optimized so that it avoids memory allocations and reduces execution time?