Inplace multiplication of sub-matrices without allocations

I am writing a performance-critical part and would like to avoid any memory allocations in that part by creating caches beforehand. I then essentially need to multiply sub-arrays in a loop. Is there any way to do these matrix multiplications of sub-arrays without any allocations?
Here is a MWE:

using LinearAlgebra

A = rand(4,10,10)
B = rand(10,10)

C = zeros(4,10,10)

# This is the function I plan to loop over
function foo(A,B,C,dim)
    mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
    nothing
end

foo(A,B,C,1)

@time foo(A,B,C,1)
# 0.000010 seconds (2 allocations: 1.750 KiB)

# This is the function that loops over the first dimension
function wrapper_foo(A,B,C)
    for i in 1:4
        foo(A,B,C,i)
    end
    nothing
end

wrapper_foo(A,B,C)
@time wrapper_foo(A,B,C)
# 0.000037 seconds (12 allocations: 85.500 KiB)

In situations where I don’t use sub-arrays, mul!() works without allocating any temporary arrays. Any ideas?

Use view or @view with the submatrix expression.

Also, if possible, make dim the third dimension in your arrays, to avoid strides.

2 Likes

Thanks a lot for the fast reply. I already tried to use @view but it just increases the number of allocations from 2 to 3 while increasing the amount of allocated memory substantially.
Thanks for the suggestion about dim. My actual problem has a six-dimensional array. With the last two ones not being looped over. Should I change the dimensions such that I loop over dim 3-6 instead?

For multi-dimensional tensor contractions like this, you should probably use Tullio with LoopVectorization. Decomposing the operation into 2D operations leaves a bunch of performance on the table.

1 Like

Here is the code about using @view:

using LinearAlgebra, BenchmarkTools
function foo(A,B,C,dim)
    mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
    nothing
end

function foo2(A,B,C,dim)
    mul!(C[dim,1:10,1:10], @view(A[dim,1:10,1:10]), @view(B[1:10,1:10]))
    nothing
end

foo(A,B,C,1)
foo2(A,B,C,1)

@btime foo(A,B,C,1)
#  504.167 ns (2 allocations: 1.75 KiB)
@btime foo2(A,B,C,1)
#  1.020 μs (3 allocations: 21.38 KiB)  

Thanks for the suggestion. I parallelize the wrapping loop(s) with Threads.@spawn. Do you think LoopVectorization would be a better fit? Inside the foo function happen a few more things than just the matrix multiplication. But most of this results in zero allocations. Toolio.jl looks great but I am not sure how I would write the program with it. Would you mind giving me an example with my MVE?
Thanks a lot!

Tullio works by using Einstein notation, so this program would turn into

using LoopVectorization, Tullio
function wrapper_foo(A,B,C)
    @tullio C[i, j, k] = A[i, j, p] * B[p, k]
end
1 Like

View is less performant when taking strides, but more performant when not.

julia> function foo!(A,B,C,dim)
           mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
           nothing
       end
foo! (generic function with 1 method)

julia> function foo2!(A,B,C,dim)
           mul!(C[dim,1:10,1:10], @view(A[dim,1:10,1:10]), @view(B[1:10,1:10]))
           nothing
       end
foo2! (generic function with 1 method)

julia> function foo3!(A,B,C,dim)
           mul!(C[:,:,dim], @view(A[:,:,dim]), B)
           nothing
       end
foo3! (generic function with 1 method)

julia> function foo4!(A,B,C,dim)
           mul!(@view(C[:,:,dim]), @view(A[:,:,dim]), B)
           nothing
       end
foo4! (generic function with 1 method)

julia> function foo5!(A,B,C,dim)
           mul!(C[:,:,dim], A[:,:,dim], B)
           nothing
       end
foo5! (generic function with 1 method)

julia> @btime foo!(A,B,C,1)
  1.140 μs (2 allocations: 1.75 KiB)

julia> @btime foo2!(A,B,C,1)
  2.022 μs (3 allocations: 21.38 KiB)

julia> @btime foo3!(A,B,C,1)
  450.289 ns (1 allocation: 400 bytes)

julia> @btime foo4!(A,B,C,1)
  324.885 ns (0 allocations: 0 bytes)

julia> @btime foo5!(A,B,C,1)
  526.380 ns (2 allocations: 800 bytes)
2 Likes

Out of curiosity, why do we use view(A, :, :, dim) instead of View(A)[:, :, dim]? It seems like much of the motivation for the @view macro is to allow the begin and end keywords which are valid in [] square brackets, but if we simply had an indexable type we get those back.

Playing around, it seems like this should be doable:

struct View{A<:AbstractArray} a::A end
Base.getindex(v::View, args...) = view(v.a, args...)
Base.axes(v::View, args...) = axes(v.a, args...)
View(A)[:, :, end] # this just works

Applying it here:

julia> function foo6!(A,B,C,dim)
           mul!(View(C)[:,:,dim], View(A)[:,:,dim], B)
           nothing
       end
foo6! (generic function with 1 method)

julia> @btime foo6!(A,B,C,1)
  330.808 ns (0 allocations: 0 bytes)
2 Likes

This one can be written as reshape & then one matrix multiplication. Whether that’s faster than Tullio depends on the size (and will depend on your machine & BLAS library).

function reshape!(A,B,C)
    s1, s2 = size(B)
    mul!(reshape(C, :, s2), reshape(A, :, s1), B)
    C
end

function loop!(A,B,C)
    @views for d in axes(C,1)
        @inbounds mul!(C[d,:,:], A[d,:,:], B)
    end
    C
end

using LoopVectorization, Tullio
function tullio!(A,B,C)
    @tullio C[i, j, k] = A[i, j, p] * B[p, k]
end

using BenchmarkTools, LinearAlgebra
let n = 1  # n=10 changes the ranking?
    A = rand(4n,10n,10n)
    B = rand(10n,10n)
    C1 = zeros(4n,10n,10n)
    C2 = zeros(4n,10n,10n)
    C3 = zeros(4n,10n,10n)
    @btime loop!($A,$B,$C1)
    @btime tullio!($A,$B,$C2)
    @btime reshape!($A,$B,$C3)
    C1 ≈ C2 ≈ C3
end

Note that quite a few variants above which do mul!(C[:,:,dim], ... etc, without view, aren’t changing C at all, but just mutating a temporary copy.

Can you write the formula for this?

If you must slice then slicing the last dimensions is generally best. But better to avoid it if you can.

3 Likes

Awesome! Thanks! But do I understand correctly that I get zero allocations in foo4!() only because of switching the dimensions of A and C and taking the slice with respect to the last and not the first dimension? If so: Why?

mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)

cannot update the original matrix C. Only (@view C[…]) here can, because C[…] without @view just makes a copy of (part of) C.

1 Like

As @mcabbott and @photor have brought up, taking a slice without view and passing it to map! causes a new matrix to be allocated, so that mul!(C[dim,:,:],... means that the original matrix C isn’t even modified (so the code is functionally incorrect). Among the examples I listed, only foo4! and foo6! are functional. The rest don’t even work.

You should not accept my comment as an answer to your question; I was just comparing timings. @mcabbott presents some much better solutions. On my machine, for the n=1 case tullio! is fastest, and for the n=10 case reshape! is fastest (even though it causes some small allocations).

A multi-dimensional array like

[ 1 4 7
  2 5 8
  3 6 9 ]

is stored in memory as a sequence of numbers, and information regarding its “dimensionality” isn’t in the data but in how the code treats it. In Julia, it’s stored in a column-major format, meaning that columns are contiguous like 1 2 3 4 5 6 7 8 9. Accessing a slice like A[:,1] (which gives [1,2,3]) is more performant than accessing A[1,:] (which gives [1,4,7]), because accessing contiguous data from memory is more efficient than taking strides to access non-contiguous data.

As for why exactly a view of the non-contiguous data would cause so much allocation, I don’t know the mechanics to answer this.