I’m trying to optimize a multiplication of a vector of matrices `A` of size `(L, L, L)` by a vector of vectors `x` of size `(L, L)`, which results in a vector of vectors `c` of size `(L, L)`. That is, each `c[i,:]` is the result of multiplying `A[i,:,:] * x[i,:]`. A code snippet doing this is:

```
using LinearAlgebra, Random, BenchmarkTools

L = 200
A = randn(L, L, L)
x = randn(L, L)
c = zeros(L, L)

function mult1!(c, A, x, L)
    c .= zero(eltype(c))
    for i in 1:L
        @views LinearAlgebra.mul!(c[i,:], A[i,:,:], x[i,:])
    end
end
```
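For reference, a small standalone sanity check (illustrative sizes of my own choosing, not the benchmark setup) confirming that `mult1!` computes the intended per-row products:

```
using LinearAlgebra

# Definition from the snippet above, repeated so this check runs standalone.
function mult1!(c, A, x, L)
    c .= zero(eltype(c))
    for i in 1:L
        @views LinearAlgebra.mul!(c[i,:], A[i,:,:], x[i,:])
    end
end

# Small-scale check: each row of c should equal A[i,:,:] * x[i,:].
Ls = 4
As = randn(Ls, Ls, Ls)
xs = randn(Ls, Ls)
cs = zeros(Ls, Ls)
mult1!(cs, As, xs, Ls)
for i in 1:Ls
    @assert cs[i, :] ≈ As[i, :, :] * xs[i, :]
end
```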

When benchmarking this I got concerned about memory allocation:

```
julia> @benchmark mult1!(c, A, x, L)
BenchmarkTools.Trial: 265 samples with 1 evaluation.
Range (min … max): 18.432 ms … 28.804 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 18.794 ms ┊ GC (median): 0.00%
Time (mean ± σ): 18.908 ms ± 801.765 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄█ ▆▄▃▄▃
▃▃▅▅▆▄▇▇███▆██████▅▅▆▅▃▃▄▁▄▃▄▃▄▁▁▃▁▄▁▃▁▁▃▃▁▁▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▃ ▃
18.4 ms Histogram: frequency by time 20.1 ms <
Memory estimate: 34.38 KiB, allocs estimate: 800.
```

I thought that this operation shouldn’t need any memory allocation, but it is doing 800 allocations for `L = 200`. Therefore I decided to code the matrix-vector multiplication myself to see if I could avoid those allocations and speed things up, as the Julia documentation suggests. Instead of `LinearAlgebra.mul!` above, I coded a function `mult_add!` performing the multiplication `v1 = M * v2`:

```
function mult2!(c, A, x, L)
    c .= zero(eltype(c))
    for i in 1:L
        @views mult_add!(c[i,:], A[i,:,:], x[i,:], L)
    end
end

function mult_add!(v1, M, v2, L)
    for j in 1:L
        for k in 1:L
            v1[j] += M[j,k] * v2[k]
        end
    end
end
```
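One variant I suspect might help with the hand-written kernel (a guess on my part, not benchmarked): Julia stores arrays column-major, so the inner loop over `k` above strides across a row of `M`. Swapping the loop order so the inner loop walks down a column, and adding `@inbounds`, is a common way to tighten such a kernel, though with the strided view `A[i,:,:]` neither order is truly contiguous:

```
# Variant of mult_add! with the loops swapped so the inner loop over j
# walks down column k of M (contiguous for a plain column-major Matrix),
# and with bounds checking disabled.
function mult_add2!(v1, M, v2, L)
    @inbounds for k in 1:L
        vk = v2[k]
        for j in 1:L
            v1[j] += M[j, k] * vk
        end
    end
    return v1
end
```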

Running the benchmark confirms that this indeed avoids memory allocation:

```
julia> @benchmark mult2!(c, A, x, L)
BenchmarkTools.Trial: 146 samples with 1 evaluation.
Range (min … max): 32.477 ms … 110.520 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 32.913 ms ┊ GC (median): 0.00%
Time (mean ± σ): 34.410 ms ± 7.208 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▃
████▄▁▁▁▁▁▄▁▄▁▁▁▁▁▁▁▁▄▁▆▁▄▁▁▁▄▄▁▁▁▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▄ ▄
32.5 ms Histogram: log(frequency) by time 53 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
```

However, the computing time increased from 18 ms to 32 ms. I guess that the `LinearAlgebra.mul!` function is much more optimized than my function `mult_add!`, but I have a couple of questions:

- By reducing allocations I was expecting to reduce the peak in the benchmark of `mult1!`. However, I got a more prominent peak at the beginning of the histogram of `mult2!`. Why?
- Can the first version of the code with `mult1!` be easily optimized so that it avoids memory allocations and reduces execution time?
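Regarding the second question, one direction I have been considering (a sketch, untested and unbenchmarked here): since Julia arrays are column-major, the row slices `A[i,:,:]` and `x[i,:]` are strided views, which can prevent `mul!` from taking an allocation-free BLAS path. Permuting the dimensions once, so that slice `i` is contiguous in memory, might let `mul!` dispatch to BLAS `gemv` directly:

```
using LinearAlgebra

# Sketch: same computation, but with the data laid out so that slice i is
# contiguous. Ap[:, :, i] corresponds to A[i, :, :] and xp[:, i] to x[i, :].
function mult3!(cp, Ap, xp, L)
    for i in 1:L
        @views LinearAlgebra.mul!(cp[:, i], Ap[:, :, i], xp[:, i])
    end
end

L = 200
A = randn(L, L, L)
x = randn(L, L)

# One-time conversion (these copies allocate, but only once, outside the
# hot loop); afterwards the result c[i, :] lives in cp[:, i].
Ap = permutedims(A, (2, 3, 1))
xp = permutedims(x)
cp = zeros(L, L)

mult3!(cp, Ap, xp, L)
```

Whether this actually removes the per-iteration allocations is something I would still need to verify with `@benchmark`.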