I am trying to multiply a small matrix by a slice of a vector (a subvector), i.e., z[3:7] .= A * x[3:7], but this allocates memory for a copy of the slice. A view is a lightweight option, but it still needs some memory and takes time when I do it many times (10 million times or more). Is there any way to perform such an operation with limited memory usage? (Something like GEMM or SYMM for slices?)

Examples:

julia> z = zeros(7);
julia> x = ones(7);
julia> A=rand(5,5);
# naive case
julia> @time z[3:7] .= A * x[3:7];
0.000012 seconds (6 allocations: 384 bytes)
# view for x
julia> @time z[3:7] .= A * (@view x[3:7]);
0.000012 seconds (6 allocations: 304 bytes)
# view for z
julia> @time vz = @view z[3:7];
0.000002 seconds (1 allocation: 48 bytes)
julia> @time vz .= A * (@view x[3:7]);
0.000006 seconds (2 allocations: 176 bytes)
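The remaining allocations above come from the product `A * x[3:7]` itself, which materializes a new vector. `LinearAlgebra.mul!(C, A, B)` writes the product into a preallocated destination, so combined with views it can avoid the temporary entirely. A minimal sketch:

```julia
using LinearAlgebra

z = zeros(7); x = ones(7); A = rand(5, 5)

# @views turns the slices into views, and mul! writes A*x[3:7]
# directly into z[3:7] without allocating an intermediate vector:
@views mul!(z[3:7], A, x[3:7])
```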

Thank you for your quick reply! I tried it, but it seems like it doesn't allocate with @btime, yet it does allocate in my actual situation. I am afraid I am missing the point.

using LinearAlgebra, Einsum, TimerOutputs

function alloc_test1()
    reset_timer!()
    for i = 1:10_000_000
        @timeit "time" @views mul!(z[3:7], A, x[3:7])
    end
end
alloc_test1(); print_timer();

emul!(C, A, B) = @einsum C[i] = A[i, k] * B[k]

function alloc_test2()
    reset_timer!()
    for i = 1:10_000_000
        @timeit "time" @views emul!(z[3:7], A, x[3:7])
    end
end
alloc_test2(); print_timer();

# alloc_test1()
 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      2.83s / 81.2%           1.34GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 time       10.0M    2.30s   100%   230ns   1.34GiB   100%     144B
 ──────────────────────────────────────────────────────────────────
# alloc_test2()
 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      2.21s / 76.4%            917MiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 time       10.0M    1.69s   100%   169ns    917MiB   100%    96.2B
 ──────────────────────────────────────────────────────────────────
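The per-iteration allocations in these loops most likely come from `z`, `A`, and `x` being non-const globals, whose types the compiler cannot infer, rather than from `mul!` itself. A hedged sketch of the fix (the function name `alloc_test3!` is mine): pass the arrays in as arguments so each iteration is type-stable:

```julia
using LinearAlgebra

# Passing the arrays as arguments (instead of closing over non-const
# globals) lets the compiler infer their types, which typically
# removes the per-iteration allocations seen above:
function alloc_test3!(z, A, x, n)
    for i = 1:n
        @views mul!(z[3:7], A, x[3:7])
    end
    return z
end

z = zeros(7); x = ones(7); A = rand(5, 5)
alloc_test3!(z, A, x, 10)
```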

I guess this means @timeit is unsuitable for measuring nanosecond-scale operations. @btime has its quirks, but it works well and already runs things many times for you, reporting allocations alongside the time.
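For example, a sketch of checking the in-place call with BenchmarkTools (interpolating the globals with `$` so the measurement covers only the call itself, not global-variable lookup):

```julia
using LinearAlgebra, BenchmarkTools

z = zeros(7); x = ones(7); A = rand(5, 5)

# $-interpolation keeps the non-const globals out of the measurement;
# the in-place form should then report zero allocations:
@btime mul!(view($z, 3:7), $A, view($x, 3:7));
```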

If your matrices and vectors are this small (smaller than 10×10) and you are doing millions of operations, then you may be better off with StaticArrays.jl, so that it can inline and unroll the operations (and it will also allow you to eliminate allocations, assuming the sizes are fixed). BLAS gemm operations (which can also avoid allocations if you drop down to a low enough level) are going to be suboptimal here because they incur a lot of overhead for such small matrices.

e.g.

julia> using StaticArrays, BenchmarkTools
julia> A = SMatrix{5,5}(rand(5,5));
julia> z = zeros(7); x = ones(7);
julia> @btime $z[3:7] .= $A * SVector(ntuple(i -> $x[2+i], Val{5}()));
6.219 ns (0 allocations: 0 bytes)

(The business with ntuple is so that I can write out the equivalent of SVector(x[3],x[4],x[5],x[6],x[7]), i.e. SVector(x[3:7]...), but with the loop unrolled statically. However, if you are working with StaticArrays then you may want to re-think your other data structures to use SVector as well.)
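One way to follow that suggestion, as a sketch under the assumption that the surrounding code can store the 5-element slice in its own variable: keep the small arrays static end to end, using an `MVector` for the mutable destination, so no slicing (and no `ntuple` trick) is needed at all:

```julia
using StaticArrays

A  = SMatrix{5,5}(rand(5, 5))
xs = SVector{5}(ones(5))    # the slice x[3:7], stored statically
zs = MVector{5}(zeros(5))   # mutable, fixed-size destination

# The multiply is fully unrolled at compile time and allocation-free:
zs .= A * xs
```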

There is a small overhead in timing a section (0.25 μs), which means that this package is not suitable for measuring sections that finish very quickly. For proper benchmarking you want to use a more suitable tool like BenchmarkTools.

@mcabbott Your example is helpful for understanding Julia's behavior! Defining functions often removes the extra allocations.

@stevengj Thank you for your suggestion and explanation. I am still learning how to optimize Julia code. StaticArrays.jl is a definite option for me.