Matrix-by-(slice of) vector multiplication with limited allocation

I am trying to multiply a small matrix by a slice of a vector (subvector), i.e., z[3:7] .= A * x[3:7], and it allocates memory for a copy of the slice. A view is a lightweight option, but it still needs some memory and takes time when I do it many times (10 million times or more). Is there any way to perform such an operation with limited memory usage? (like GEMM or SYMM for slices?)

Examples:

julia> z = zeros(7);

julia> x = ones(7);

julia> A=rand(5,5);

# naive case
julia> @time z[3:7] .= A * x[3:7];
0.000012 seconds (6 allocations: 384 bytes)

# view for x
julia> @time z[3:7] .= A * (@view x[3:7]);
0.000012 seconds (6 allocations: 304 bytes)

# view for z
julia> @time vz = @view z[3:7];
0.000002 seconds (1 allocation: 48 bytes)

julia> @time vz .= A * (@view x[3:7]);
0.000006 seconds (2 allocations: 176 bytes)

Some ways:

julia> @btime $z[3:7] .= $A * $x[3:7];
  131.082 ns (2 allocations: 256 bytes)

julia> using LinearAlgebra, Einsum

julia> @btime @views mul!($z[3:7], $A, $x[3:7]);
69.175 ns (0 allocations: 0 bytes)

julia> emul!(C,A,B) = @einsum C[i] = A[i,k] * B[k]
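For intuition, `@einsum` here just generates a plain nested loop over the indices. A hand-written equivalent (the name `loop_mul!` is invented for this sketch, not part of Einsum.jl) would be roughly:

```julia
# Sketch of the kind of loop @einsum generates for C[i] = A[i,k] * B[k].
# loop_mul! is a hypothetical name used only for illustration.
function loop_mul!(C, A, B)
    @inbounds for i in axes(A, 1)
        s = zero(eltype(C))
        for k in axes(A, 2)
            s += A[i, k] * B[k]  # accumulate the dot product of row i with B
        end
        C[i] = s
    end
    return C
end
```

Called on views, e.g. `@views loop_mul!(z[3:7], A, x[3:7])`, this allocates nothing, just like the `@einsum` version.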

julia> @btime @views emul!($z[3:7], $A, $x[3:7]);
  26.772 ns (0 allocations: 0 bytes)

Thank you for your quick reply! I tried it, but it seems like it doesn't allocate with @btime yet does allocate in the actual situation. I am afraid I am missing the point.

using LinearAlgebra, Einsum, TimerOutputs

function alloc_test1()
    reset_timer!()
    for i = 1:10000000
        @timeit "time" @views mul!(z[3:7], A, x[3:7])
    end
end
alloc_test1(); print_timer();

emul!(C, A, B) = @einsum C[i] = A[i,k] * B[k]

function alloc_test2()
    reset_timer!()
    for i = 1:10000000
        @timeit "time" @views emul!(z[3:7], A, x[3:7])
    end
end
alloc_test2(); print_timer();

# alloc_test1()
──────────────────────────────────────────────────────────────────
                            Time                  Allocations
                    ──────────────────────  ──────────────────────
  Tot / % measured:      2.83s / 81.2%          1.34GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot    avg
──────────────────────────────────────────────────────────────────
 time       10.0M    2.30s   100%   230ns   1.34GiB   100%   144B
──────────────────────────────────────────────────────────────────

# alloc_test2()
──────────────────────────────────────────────────────────────────
                            Time                  Allocations
                    ──────────────────────  ──────────────────────
  Tot / % measured:      2.21s / 76.4%           917MiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot    avg
──────────────────────────────────────────────────────────────────
 time       10.0M    1.69s   100%   169ns    917MiB   100%  96.2B
──────────────────────────────────────────────────────────────────

I guess this means @timeit is unsuitable for measuring nanoseconds.

@btime has its quirks, but it works well and is already running things many times for you. If you want to check:

julia> function alloc_test_n(z, A, x, n)
           for i = 1:n
               @views mul!(z[3:7], A, x[3:7])
           end
       end;

julia> alloc_test_n(z, A, x, 10^4);

julia> @time alloc_test_n(z, A, x, 10^5)
  0.006389 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z, A, x, 10^6)
  0.063919 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z, A, x, 10^7)
  0.637590 seconds (1 allocation: 16 bytes)

julia> @time alloc_test_n(z, A, x, 10^8)
  6.398111 seconds (1 allocation: 16 bytes)

1 Like

If your matrices and vectors are this small (< 10×10) and you are doing millions of operations, then you may be better off with StaticArrays.jl, so that it can inline and unroll the operations (and will also allow you to eliminate allocations, assuming the sizes are fixed). BLAS gemm operations (which can also avoid allocations if you drop down to a low enough level) are going to be suboptimal here because they incur a lot of overhead for such small matrices. E.g.:

julia> using StaticArrays, BenchmarkTools

julia> A = SMatrix{5,5}(rand(5,5));

julia> z = zeros(7); x = ones(7);

julia> @btime $z[3:7] .= $A * SVector(ntuple(i -> $x[2+i], Val{5}()));
6.219 ns (0 allocations: 0 bytes)

(The business with ntuple is so I can write out the equivalent of SVector(x[3],x[4],x[5],x[6],x[7]), i.e. SVector(x[3:7]...) but with the loop unrolled statically. However, if you are working with StaticArrays then you may want to re-think your other data structures to use SVector as well.)
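Packaging that one-liner into a helper keeps the call sites readable; `static_slice_mul!` below is a name invented for this sketch (not a StaticArrays.jl function), and it hard-codes the 3:7 slice for illustration:

```julia
using StaticArrays

# Hypothetical helper: multiply a static 5x5 matrix by x[3:7] (gathered
# into an SVector with a statically unrolled ntuple) and write the
# result back into z[3:7], with no heap allocation.
function static_slice_mul!(z::AbstractVector, A::SMatrix{5,5}, x::AbstractVector)
    xv = SVector(ntuple(i -> x[2 + i], Val(5)))  # x[3:7] as an SVector
    yv = A * xv                                  # fully unrolled 5x5 product
    @inbounds for i in 1:5
        z[2 + i] = yv[i]                         # write back into z[3:7]
    end
    return z
end
```

Usage: `static_slice_mul!(z, A, x)` with `z`, `x` ordinary length-7 vectors and `A` an `SMatrix{5,5}`.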

4 Likes

There is a small overhead in timing a section (0.25 μs), which means that TimerOutputs is not suitable for measuring sections that finish very quickly. For proper benchmarking you want a more suitable tool like BenchmarkTools.
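A framework-free way to confirm the zero-allocation claim is Base's `@allocated`, measured on a warmed-up call from inside a function (so global-variable overhead doesn't pollute the count); a sketch, with `kernel!` and `check_alloc` as names invented here:

```julia
using LinearAlgebra

# Wrap the operation so we can measure just this call.
kernel!(z, A, x) = @views mul!(z[3:7], A, x[3:7])

function check_alloc()
    z = zeros(7); A = rand(5, 5); x = ones(7)
    kernel!(z, A, x)                 # first call triggers compilation
    return @allocated kernel!(z, A, x)  # bytes allocated by a warm call
end
```

`check_alloc()` should return 0 for this allocation-free kernel, matching the `@btime` results above.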

3 Likes