Matrix-by-(slice of)vector multiplication with limited allocation

I am trying to multiply a small matrix by a slice of a vector (subvector), i.e., z[3:7] .= A * x[3:7], and it allocates memory for a copy of the slice. A view is a lightweight option, but it still needs some memory and takes time when I do it many times (10 million times or more). Is there any way to perform such an operation with limited memory usage? (like GEMM or SYMM for slices?)

Examples:

julia> z = zeros(7);

julia> x = ones(7);

julia> A=rand(5,5);

# naive case
julia> @time z[3:7] .= A * x[3:7];
  0.000012 seconds (6 allocations: 384 bytes)

# view for x
julia> @time z[3:7] .= A * (@view x[3:7]);
  0.000012 seconds (6 allocations: 304 bytes)

# view for z
julia> @time vz = @view z[3:7];
  0.000002 seconds (1 allocation: 48 bytes)

julia> @time vz .= A * (@view x[3:7]);
  0.000006 seconds (2 allocations: 176 bytes)

Some ways:

julia> @btime $z[3:7] .= $A * $x[3:7];
  131.082 ns (2 allocations: 256 bytes)

julia> using LinearAlgebra, Einsum

julia> @btime @views mul!($z[3:7], $A, $x[3:7]);
  69.175 ns (0 allocations: 0 bytes)

julia> emul!(C,A,B) = @einsum C[i] = A[i,k] * B[k]

julia> @btime @views emul!($z[3:7], $A, $x[3:7]);
  26.772 ns (0 allocations: 0 bytes)

Thank you for your quick reply! I tried it but it seems like it doesn’t allocate with @btime but does allocate in actual situation. I am afraid I am missing the point.

using LinearAlgebra, Einsum, TimerOutputs
function alloc_test1()
   reset_timer!()
   for i=1:10000000
      @timeit "time" @views mul!(z[3:7], A, x[3:7]);
   end
end
alloc_test1(); print_timer();

emul!(C,A,B) = @einsum C[i] = A[i,k] * B[k]
function alloc_test2()
   reset_timer!()
   for i=1:10000000
      @timeit "time" @views emul!(z[3:7], A, x[3:7])
   end
end
alloc_test2(); print_timer();
# alloc_test1()
 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      2.83s / 81.2%           1.34GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 time       10.0M    2.30s   100%   230ns   1.34GiB  100%      144B
 ──────────────────────────────────────────────────────────────────

# alloc_test2()
 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      2.21s / 76.4%            917MiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 time       10.0M    1.69s   100%   169ns    917MiB  100%     96.2B
 ──────────────────────────────────────────────────────────────────

I guess this means @timeit is unsuitable for measuring nanoseconds. @btime has its quirks but it works well, and is already running things many times for you. If you want to check:

julia> function alloc_test_n(z,A,x,n)
          for i=1:n
             @views mul!(z[3:7], A, x[3:7]);
          end
       end;

julia> alloc_test_n(z,A,x,10^4);

julia> @time alloc_test_n(z,A,x,10^5)
  0.006389 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z,A,x,10^6)
  0.063919 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z,A,x,10^7)
  0.637590 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z,A,x,10^8)
  6.398111 seconds (1 allocation: 16 bytes)
1 Like

If your matrices and vectors are this small (< 10 \times 10) and you are doing millions of operations, then you are may be better off with StaticArrays.jl so that it can inline and unroll the operations (and will also allow you to eliminate allocations, assuming the sizes are fixed). BLAS gemm operations (which can also avoid allocations if you drop down to a low enough level) are going to be suboptimal here because they incur a lot of overhead for such small matrices.

e.g.

julia> using StaticArrays, BenchmarkTools

julia> A = SMatrix{5,5}(rand(5,5));

julia> z = zeros(7); x = ones(7);

julia> @btime $z[3:7] .= $A * SVector(ntuple(i -> $x[2+i], Val{5}()));
  6.219 ns (0 allocations: 0 bytes)

(The business with ntuple is so I can write out the equivalent of SVector(x[3],x[4],x[5],x[6],x[7]), i.e. SVector(x[3:7]...) but with the loop unrolled statically. However, if you are working with StaticArrays then you may want to re-think your other data structures to use SVector as well.)

4 Likes

Yes, see GitHub - KristofferC/TimerOutputs.jl: Formatted output of timed sections in Julia.

There is a small overhead in timing a section (0.25 ΞΌs) which means that this package is not suitable for measuring sections that finish very quickly. For proper benchmarking you want to use a more suitable tool like BenchmarkTools .

3 Likes

I appreciate all your replies.

@mcabbott Your example is helpful to understand Julia’s behavior! Defining functions often removes the extra allocations.

@stevengj Thank you for your suggestion and explanation. I am still learning a way of optimization in Julia. StaticArray.jl is a definite option for me.

@kristoffer.carlsson Good to know! Thank you for the information.