Multiplication of vector of matrices and vector of vectors

I’m trying to optimize a multiplication of a vector of matrices A, stored as an array of size (L, L, L), by a vector of vectors x, stored as a matrix of size (L, L), producing a vector of vectors c of size (L, L). That is, each c[i,:] is the result of multiplying A[i,:,:] * x[i,:]. A code snippet doing this is:

using LinearAlgebra, Random, BenchmarkTools

L = 200
A = randn(L, L, L)
x = randn(L, L)

c = zeros(L, L)

function mult1!(c, A, x, L)
    c .= zero(eltype(c))
    for i in 1:L
        @views LinearAlgebra.mul!(c[i,:], A[i,:,:], x[i,:])
    end
    return c
end

When benchmarking this I got concerned about memory allocation:

julia> @benchmark mult1!(c, A, x, L)
BenchmarkTools.Trial: 265 samples with 1 evaluation.
 Range (min … max):  18.432 ms …  28.804 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.794 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.908 ms ± 801.765 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▄▄█ ▆▄▃▄▃                                             
  ▃▃▅▅▆▄▇▇███▆██████▅▅▆▅▃▃▄▁▄▃▄▃▄▁▁▃▁▄▁▃▁▁▃▃▁▁▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▃ ▃
  18.4 ms         Histogram: frequency by time         20.1 ms <

 Memory estimate: 34.38 KiB, allocs estimate: 800.

I thought that this operation shouldn’t need any memory allocation, but it performs 800 allocations for L = 200. Therefore I decided to code the matrix-vector multiplication myself to see if I could avoid those allocations and speed things up, as the Julia documentation suggests. Instead of LinearAlgebra.mul! above, I coded a function mult_add! performing the accumulation v1 += M * v2:

function mult2!(c, A, x, L)
    c .= zero(eltype(c))
    for i in 1:L
        @views mult_add!(c[i,:], A[i,:,:], x[i,:], L)
    end
    return c
end

function mult_add!(v1, M, v2, L)
    for j in 1:L
        for k in 1:L
            v1[j] += M[j,k] * v2[k]
        end
    end
    return v1
end

Running the benchmark confirms that this indeed avoids memory allocation:

julia> @benchmark mult2!(c, A, x, L)
BenchmarkTools.Trial: 146 samples with 1 evaluation.
 Range (min … max):  32.477 ms … 110.520 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     32.913 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.410 ms ±   7.208 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ████▄▁▁▁▁▁▄▁▄▁▁▁▁▁▁▁▁▄▁▆▁▄▁▁▁▄▄▁▁▁▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▄ ▄
  32.5 ms       Histogram: log(frequency) by time        53 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, the computing time increased from ~18 ms to ~32 ms. I guess that the LinearAlgebra.mul! function is much more optimized than my mult_add!, but I have a couple of questions:

  1. By reducing allocations I was expecting to reduce the tail in the benchmark of mult1!. However, I got a more prominent peak at the beginning of the histogram of mult2!. Why?
  2. Can the first version of the code with mult1! be easily optimized so that it avoids memory allocations and reduces execution time?

Try to organize your data so that you can slice on the last index instead, keeping your views stride-1 (contiguous).
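For instance, a minimal sketch of what that looks like (the permuted arrays `Ap`, `xp`, `cp` and the function name `mult_perm!` are mine, assuming you can build the data in this layout from the start):

```julia
using LinearAlgebra

L = 200
Ap = randn(L, L, L)   # permuted layout: Ap[:,:,i] is the i-th matrix
xp = randn(L, L)      # xp[:,i] is the i-th vector
cp = zeros(L, L)

function mult_perm!(cp, Ap, xp, L)
    for i in 1:L
        # each slice is a contiguous stride-1 view, so mul! allocates nothing
        @views mul!(cp[:,i], Ap[:,:,i], xp[:,i])
    end
    return cp
end
```

Since Julia arrays are column-major, slicing on the last index gives contiguous memory, which lets mul! dispatch to fast BLAS kernels without allocating.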


To put numbers on that:

julia> @btime mult1!($c, $A, $x, $L);  # as above, my computer
  13.159 ms (800 allocations: 34.38 KiB)

julia> using NNlib  # to do a different computation!

julia> @btime batched_mul!(reshape($c,$L,1,$L), $A, reshape($x,$L,1,$L));
  1.985 ms (39 allocations: 3.23 KiB)

julia> @btime permutedims!($(similar(A)), $A, (2,3,1));  # cost of re-ordering
  13.308 ms (0 allocations: 0 bytes)

Here batched_mul! is computing c[a,i] = A[a,b,i] * x[b,i], in which each matrix slice has stride 1. If you cannot re-organise your calculation to have this throughout, then the cost of permuting is quite high, and in this case another approach is:

julia> using Tullio

julia> mult3!(c, A, x, L=0) = @tullio c[i,a] = A[i,a,b] * x[i,b] avx=false;  # without LV

julia> @btime mult3!($c, $A, $x, $L);
  2.328 ms (49 allocations: 2.53 KiB)

julia> using LoopVectorization

julia> mult4!(c, A, x, L=0) = @tullio c[i,a] = A[i,a,b] * x[i,b];  # avx=true is the default once LoopVectorization is loaded

julia> @btime mult4!($c, $A, $x, $L);
  1.565 ms (49 allocations: 2.53 KiB)

Just to be crystal clear, what you are saying is that I should organize my data so that my multiplication reads:

@views LinearAlgebra.mul!(c[:,i], A[:,:,i], x[:,i])


This way I indeed get 0 allocations and faster code. Thank you @GunnarFarneback and @mcabbott!
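For anyone following along, here is a quick self-contained check (my own sketch, at a small L) that the re-ordered, last-index layout computes the same thing as the original row-sliced version:

```julia
using LinearAlgebra

L = 8
A = randn(L, L, L)
x = randn(L, L)

# original layout: c1[i,:] = A[i,:,:] * x[i,:]
c1 = zeros(L, L)
for i in 1:L
    @views mul!(c1[i, :], A[i, :, :], x[i, :])
end

# re-ordered layout: slice on the last index, so every view is stride-1
Ap = permutedims(A, (2, 3, 1))   # Ap[:,:,i] == A[i,:,:]
xp = permutedims(x)              # xp[:,i]  == x[i,:]
c2 = zeros(L, L)
for i in 1:L
    @views mul!(c2[:, i], Ap[:, :, i], xp[:, i])
end

@show permutedims(c2) ≈ c1   # true
```

Of course, as noted above, paying for permutedims inside the hot loop would eat the gains; the point is to store the data in the last-index layout from the start.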