Inplace multiplication of sub-matrices without allocations

I am writing a performance-critical part and would like to avoid any memory allocations in that part by creating caches beforehand. I then essentially need to multiply sub-arrays in a loop. Is there any way to do these matrix multiplications of sub-arrays without any allocations?
Here is a MWE:

using LinearAlgebra

A = rand(4,10,10)
B = rand(10,10)

C = zeros(4,10,10)

# This is the function I plan to loop over
function foo(A,B,C,dim)
    mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
    nothing
end

foo(A,B,C,1)

@time foo(A,B,C,1)
# 0.000010 seconds (2 allocations: 1.750 KiB)

# This is the function that loops over the first dimension
function wrapper_foo(A,B,C)
    for i in 1:4
        foo(A,B,C,i)
    end
    nothing
end

wrapper_foo(A,B,C)
@time wrapper_foo(A,B,C)
# 0.000037 seconds (12 allocations: 85.500 KiB)

In situations where I don’t use sub-arrays, mul!() works without allocating any temporary arrays. Any ideas?

Use view or @view with the submatrix expression.

Also, if possible, make dim the third dimension in your arrays, to avoid strides.

Thanks a lot for the fast reply. I already tried to use @view but it just increases the number of allocations from 2 to 3 while increasing the amount of allocated memory substantially.
Thanks for the suggestion about dim. My actual problem has a six-dimensional array. With the last two ones not being looped over. Should I change the dimensions such that I loop over dim 3-6 instead?

For multi-dimensional tensor contractions like this, you should probably use Tullio with LoopVectorization. Decomposing the operation into 2D operations leaves a bunch of performance on the table.

Here is the code about using @view:

using LinearAlgebra, BenchmarkTools
function foo(A,B,C,dim)
    mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
    nothing
end

function foo2(A,B,C,dim)
    mul!(C[dim,1:10,1:10], @view(A[dim,1:10,1:10]), @view(B[1:10,1:10]))
    nothing
end

foo(A,B,C,1)
foo2(A,B,C,1)

@btime foo(A,B,C,1)
#  504.167 ns (2 allocations: 1.75 KiB)
@btime foo2(A,B,C,1)
#  1.020 μs (3 allocations: 21.38 KiB)  

Thanks for the suggestion. I parallelize the wrapping loop(s) with Threads.@spawn. Do you think LoopVectorization would be a better fit? Inside the foo function happen a few more things than just the matrix multiplication. But most of this results in zero allocations. Toolio.jl looks great but I am not sure how I would write the program with it. Would you mind giving me an example with my MVE?
Thanks a lot!

Tullio works by using Einstein notation, so this program would turn into

using LoopVectorization, Tullio
function wrapper_foo(A,B,C)
    @tullio C[i, j, k] = A[i, j, p] * B[p, k]
end

View is less performant when taking strides, but more performant when not.

julia> function foo!(A,B,C,dim)
           mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
           nothing
       end
foo! (generic function with 1 method)

julia> function foo2!(A,B,C,dim)
           mul!(C[dim,1:10,1:10], @view(A[dim,1:10,1:10]), @view(B[1:10,1:10]))
           nothing
       end
foo2! (generic function with 1 method)

julia> function foo3!(A,B,C,dim)
           mul!(C[:,:,dim], @view(A[:,:,dim]), B)
           nothing
       end
foo3! (generic function with 1 method)

julia> function foo4!(A,B,C,dim)
           mul!(@view(C[:,:,dim]), @view(A[:,:,dim]), B)
           nothing
       end
foo4! (generic function with 1 method)

julia> function foo5!(A,B,C,dim)
           mul!(C[:,:,dim], A[:,:,dim], B)
           nothing
       end
foo5! (generic function with 1 method)

julia> @btime foo!(A,B,C,1)
  1.140 μs (2 allocations: 1.75 KiB)

julia> @btime foo2!(A,B,C,1)
  2.022 μs (3 allocations: 21.38 KiB)

julia> @btime foo3!(A,B,C,1)
  450.289 ns (1 allocation: 400 bytes)

julia> @btime foo4!(A,B,C,1)
  324.885 ns (0 allocations: 0 bytes)

julia> @btime foo5!(A,B,C,1)
  526.380 ns (2 allocations: 800 bytes)

Out of curiosity, why do we use view(A, :, :, dim) instead of View(A)[:, :, dim]? It seems like much of the motivation for the @view macro is to allow the begin and end keywords which are valid in [] square brackets, but if we simply had an indexable type we get those back.

Playing around, it seems like this should be doable:

struct View{A<:AbstractArray} a::A end
Base.getindex(v::View, args...) = view(v.a, args...)
Base.axes(v::View, args...) = axes(v.a, args...)
View(A)[:, :, end] # this just works

Applying it here:

julia> function foo6!(A,B,C,dim)
           mul!(View(C)[:,:,dim], View(A)[:,:,dim], B)
           nothing
       end
foo6! (generic function with 1 method)

julia> @btime foo6!(A,B,C,1)
  330.808 ns (0 allocations: 0 bytes)

This one can be written as reshape & then one matrix multiplication. Whether that’s faster than Tullio depends on the size (and will depend on your machine & BLAS library).

function reshape!(A,B,C)
    s1, s2 = size(B)
    mul!(reshape(C, :, s2), reshape(A, :, s1), B)
    C
end

function loop!(A,B,C)
    @views for d in axes(C,1)
        @inbounds mul!(C[d,:,:], A[d,:,:], B)
    end
    C
end

using LoopVectorization, Tullio
function tullio!(A,B,C)
    @tullio C[i, j, k] = A[i, j, p] * B[p, k]
end

using BenchmarkTools, LinearAlgebra
let n = 1  # n=10 changes the ranking?
    A = rand(4n,10n,10n)
    B = rand(10n,10n)
    C1 = zeros(4n,10n,10n)
    C2 = zeros(4n,10n,10n)
    C3 = zeros(4n,10n,10n)
    @btime loop!($A,$B,$C1)
    @btime tullio!($A,$B,$C2)
    @btime reshape!($A,$B,$C3)
    C1 ā‰ˆ C2 ā‰ˆ C3
end

Note that quite a few variants above which do mul!(C[:,:,dim], ... etc, without view, aren’t changing C at all, but just mutating a temporary copy.

Can you write the formula for this?

If you must slice then slicing the last dimensions is generally best. But better to avoid it if you can.

Awesome! Thanks! But do I understand correctly that I get zero allocations in foo4!() only because of switching the dimensions of A and C and taking the slice with respect to the last and not the first dimension? If so: Why?

mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)

cannot update the original matrix C. Only (@view C[…]) here can, because C[…] without @view just makes a copy of (part of) C.

As @mcabbott and @photor have brought up, taking a slice without view and passing it to map! causes a new matrix to be allocated, so that mul!(C[dim,:,:],... means that the original matrix C isn’t even modified (so the code is functionally incorrect). Among the examples I listed, only foo4! and foo6! are functional. The rest don’t even work.

You should not accept my comment as an answer to your question; I was just comparing timings. @mcabbott presents some much better solutions. On my machine, for the n=1 case tullio! is fastest, and for the n=10 case reshape! is fastest (even though it causes some small allocations).

A multi-dimensional array like

[ 1 4 7
  2 5 8
  3 6 9 ]

is stored in memory as a sequence of numbers, and information regarding its ā€œdimensionalityā€ isn’t in the data but in how the code treats it. In Julia, it’s stored in a column-major format, meaning that columns are contiguous like 1 2 3 4 5 6 7 8 9. Accessing a slice like A[:,1] (which gives [1,2,3]) is more performant than accessing A[1,:] (which gives [1,4,7]), because accessing contiguous data from memory is more efficient than taking strides to access non-contiguous data.

As for why exactly a view of the non-contiguous data would cause so much allocation, I don’t know the mechanics to answer this.