Matrix-by-(slice of)vector multiplication with limited allocation

yutaka · September 11, 2020, 3:24pm

I am trying to multiply a small matrix by a slice of a vector (subvector), i.e., z[3:7] .= A * x[3:7], and it allocates memory for a copy of the slice. A view is a lightweight option, but it still needs some memory and takes time when I do it many times (10 million times or more). Is there any way to perform such an operation with limited memory usage? (like GEMM or SYMM for slices?)

Examples:

julia> z = zeros(7);

julia> x = ones(7);

julia> A=rand(5,5);

# naive case
julia> @time z[3:7] .= A * x[3:7];
  0.000012 seconds (6 allocations: 384 bytes)

# view for x
julia> @time z[3:7] .= A * (@view x[3:7]);
  0.000012 seconds (6 allocations: 304 bytes)

# view for z
julia> @time vz = @view z[3:7];
  0.000002 seconds (1 allocation: 48 bytes)

julia> @time vz .= A * (@view x[3:7]);
  0.000006 seconds (2 allocations: 176 bytes)

mcabbott · September 11, 2020, 3:32pm

Some ways:

julia> @btime $z[3:7] .= $A * $x[3:7];
  131.082 ns (2 allocations: 256 bytes)

julia> using LinearAlgebra, Einsum

julia> @btime @views mul!($z[3:7], $A, $x[3:7]);
  69.175 ns (0 allocations: 0 bytes)

julia> emul!(C,A,B) = @einsum C[i] = A[i,k] * B[k]

julia> @btime @views emul!($z[3:7], $A, $x[3:7]);
  26.772 ns (0 allocations: 0 bytes)

yutaka · September 11, 2020, 4:14pm

Thank you for your quick reply! I tried it but it seems like it doesn’t allocate with @btime but does allocate in actual situation. I am afraid I am missing the point.

using LinearAlgebra, Einsum, TimerOutputs
function alloc_test1()
   reset_timer!()
   for i=1:10000000
      @timeit "time" @views mul!(z[3:7], A, x[3:7]);
   end
end
alloc_test1(); print_timer();

emul!(C,A,B) = @einsum C[i] = A[i,k] * B[k]
function alloc_test2()
   reset_timer!()
   for i=1:10000000
      @timeit "time" @views emul!(z[3:7], A, x[3:7])
   end
end
alloc_test2(); print_timer();

# alloc_test1()
 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      2.83s / 81.2%           1.34GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 time       10.0M    2.30s   100%   230ns   1.34GiB  100%      144B
 ──────────────────────────────────────────────────────────────────

# alloc_test2()
 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      2.21s / 76.4%            917MiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 time       10.0M    1.69s   100%   169ns    917MiB  100%     96.2B
 ──────────────────────────────────────────────────────────────────

mcabbott · September 11, 2020, 4:27pm

I guess this means @timeit is unsuitable for measuring nanoseconds. @btime has its quirks but it works well, and is already running things many times for you. If you want to check:

julia> function alloc_test_n(z,A,x,n)
          for i=1:n
             @views mul!(z[3:7], A, x[3:7]);
          end
       end;

julia> alloc_test_n(z,A,x,10^4);

julia> @time alloc_test_n(z,A,x,10^5)
  0.006389 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z,A,x,10^6)
  0.063919 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z,A,x,10^7)
  0.637590 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z,A,x,10^8)
  6.398111 seconds (1 allocation: 16 bytes)

stevengj · September 11, 2020, 4:35pm

If your matrices and vectors are this small (< 10 \times 10) and you are doing millions of operations, then you are may be better off with StaticArrays.jl so that it can inline and unroll the operations (and will also allow you to eliminate allocations, assuming the sizes are fixed). BLAS gemm operations (which can also avoid allocations if you drop down to a low enough level) are going to be suboptimal here because they incur a lot of overhead for such small matrices.

e.g.

julia> using StaticArrays, BenchmarkTools

julia> A = SMatrix{5,5}(rand(5,5));

julia> z = zeros(7); x = ones(7);

julia> @btime $z[3:7] .= $A * SVector(ntuple(i -> $x[2+i], Val{5}()));
  6.219 ns (0 allocations: 0 bytes)

(The business with ntuple is so I can write out the equivalent of SVector(x[3],x[4],x[5],x[6],x[7]), i.e. SVector(x[3:7]...) but with the loop unrolled statically. However, if you are working with StaticArrays then you may want to re-think your other data structures to use SVector as well.)

kristoffer.carlsson · September 11, 2020, 5:28pm

Yes, see GitHub - KristofferC/TimerOutputs.jl: Formatted output of timed sections in Julia.

There is a small overhead in timing a section (0.25 μs) which means that this package is not suitable for measuring sections that finish very quickly. For proper benchmarking you want to use a more suitable tool like BenchmarkTools .

yutaka · September 11, 2020, 7:21pm

I appreciate all your replies.

@mcabbott Your example is helpful to understand Julia’s behavior! Defining functions often removes the extra allocations.

@stevengj Thank you for your suggestion and explanation. I am still learning a way of optimization in Julia. StaticArray.jl is a definite option for me.

@kristoffer.carlsson Good to know! Thank you for the information.

Topic		Replies	Views
Multiplication of vector of matrices and vector of vectors Performance linearalgebra , tullio	3	411	November 7, 2022
Is it possible to cut a vector into 2 pieces without allocations? Performance vector , views	11	739	November 27, 2023
Problem with creating view of vectors in vector of vectors (it keeps allocating) Performance memory-allocation , vector	2	105	July 9, 2024
Passing views to function without allocation New to Julia	14	1446	December 18, 2020
Allocation when assigning to a slice General Usage memory-allocation	13	928	December 21, 2021

Matrix-by-(slice of)vector multiplication with limited allocation

Related topics