# Inplace multiplication of sub-matrices without allocations

I am writing a performance-critical piece of code and would like to avoid all memory allocations in it by creating caches beforehand. I then essentially need to multiply sub-arrays in a loop. Is there any way to do these matrix multiplications of sub-arrays without any allocations?
Here is an MWE:

```julia
using LinearAlgebra

A = rand(4, 10, 10)
B = rand(10, 10)

C = zeros(4, 10, 10)

# This is the function I plan to loop over
function foo(A, B, C, dim)
    mul!(C[dim, 1:10, 1:10], A[dim, 1:10, 1:10], B)
    nothing
end

foo(A, B, C, 1)

@time foo(A, B, C, 1)
# 0.000010 seconds (2 allocations: 1.750 KiB)

# This is the function that loops over the first dimension
function wrapper_foo(A, B, C)
    for i in 1:4
        foo(A, B, C, i)
    end
    nothing
end

wrapper_foo(A, B, C)
@time wrapper_foo(A, B, C)
# 0.000037 seconds (12 allocations: 85.500 KiB)
```

In situations where I don’t use sub-arrays, `mul!` works without allocating any temporary arrays. Any ideas?

Use `view` or `@view` with the submatrix expression.

Also, if possible, make `dim` the third dimension in your arrays, to avoid strides.
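A minimal sketch combining both suggestions (the name `foo_view!` and the rearranged array shapes are hypothetical, not from the original post): move `dim` to the last axis so each slice is contiguous, and wrap both slices in `@view` so `mul!` writes through to the original array instead of a copy.

```julia
using LinearAlgebra

# Hypothetical rearrangement: dim is now the *last* axis, so each
# slice A[:, :, dim] is one contiguous block of memory.
A = rand(10, 10, 4)
B = rand(10, 10)
C = zeros(10, 10, 4)

function foo_view!(A, B, C, dim)
    # @view avoids copying the slices; mul! then updates C in place
    mul!(@view(C[:, :, dim]), @view(A[:, :, dim]), B)
    nothing
end

foo_view!(A, B, C, 1)
```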


Thanks a lot for the fast reply. I already tried to use `@view`, but it just increases the number of allocations from 2 to 3 while increasing the amount of allocated memory substantially.
Thanks for the suggestion about `dim`. My actual problem has a six-dimensional array, with the last two dimensions not being looped over. Should I change the dimensions so that I loop over dims 3–6 instead?

For multi-dimensional tensor contractions like this, you should probably use `Tullio` with `LoopVectorization`. Decomposing the operation into 2D operations leaves a bunch of performance on the table.


Here is the code using `@view`:

```julia
using LinearAlgebra, BenchmarkTools

function foo(A, B, C, dim)
    mul!(C[dim, 1:10, 1:10], A[dim, 1:10, 1:10], B)
    nothing
end

function foo2(A, B, C, dim)
    mul!(C[dim, 1:10, 1:10], @view(A[dim, 1:10, 1:10]), @view(B[1:10, 1:10]))
    nothing
end

foo(A, B, C, 1)
foo2(A, B, C, 1)

@btime foo(A, B, C, 1)
#  504.167 ns (2 allocations: 1.75 KiB)
@btime foo2(A, B, C, 1)
#  1.020 μs (3 allocations: 21.38 KiB)
```

Thanks for the suggestion. I parallelize the wrapping loop(s) with `Threads.@spawn`. Do you think `LoopVectorization` would be a better fit? A few more things happen inside the `foo` function besides the matrix multiplication, but most of them cause zero allocations. `Tullio.jl` looks great, but I am not sure how I would write the program with it. Would you mind giving me an example with my MWE?
Thanks a lot!

Tullio works with Einstein notation, so this program would turn into:

```julia
using LoopVectorization, Tullio

function wrapper_foo(A, B, C)
    @tullio C[i, j, k] = A[i, j, p] * B[p, k]
end
```

A view is less performant when it has to take strides, but more performant when it doesn’t.

```julia
julia> function foo!(A,B,C,dim)
           mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
           nothing
       end
foo! (generic function with 1 method)

julia> function foo2!(A,B,C,dim)
           mul!(C[dim,1:10,1:10], @view(A[dim,1:10,1:10]), @view(B[1:10,1:10]))
           nothing
       end
foo2! (generic function with 1 method)

julia> function foo3!(A,B,C,dim)
           mul!(C[:,:,dim], @view(A[:,:,dim]), B)
           nothing
       end
foo3! (generic function with 1 method)

julia> function foo4!(A,B,C,dim)
           mul!(@view(C[:,:,dim]), @view(A[:,:,dim]), B)
           nothing
       end
foo4! (generic function with 1 method)

julia> function foo5!(A,B,C,dim)
           mul!(C[:,:,dim], A[:,:,dim], B)
           nothing
       end
foo5! (generic function with 1 method)

julia> @btime foo!(A,B,C,1)
  1.140 μs (2 allocations: 1.75 KiB)

julia> @btime foo2!(A,B,C,1)
  2.022 μs (3 allocations: 21.38 KiB)

julia> @btime foo3!(A,B,C,1)
  450.289 ns (1 allocation: 400 bytes)

julia> @btime foo4!(A,B,C,1)
  324.885 ns (0 allocations: 0 bytes)

julia> @btime foo5!(A,B,C,1)
  526.380 ns (2 allocations: 800 bytes)
```

Out of curiosity, why do we use `view(A, :, :, dim)` instead of `View(A)[:, :, dim]`? It seems like much of the motivation for the `@view` macro is to allow the `begin` and `end` keywords, which are valid inside `[]` square brackets, but if we simply had an indexable type we would get those back.

Playing around, it seems like this should be doable:

```julia
struct View{A<:AbstractArray}
    a::A
end
Base.getindex(v::View, args...) = view(v.a, args...)
Base.axes(v::View, args...) = axes(v.a, args...)

View(A)[:, :, end]  # this just works
```

Applying it here:

```julia
julia> function foo6!(A,B,C,dim)
           mul!(View(C)[:,:,dim], View(A)[:,:,dim], B)
           nothing
       end
foo6! (generic function with 1 method)

julia> @btime foo6!(A,B,C,1)
  330.808 ns (0 allocations: 0 bytes)
```

This one can be written as a reshape followed by one matrix multiplication. Whether that’s faster than Tullio depends on the size (and will depend on your machine and BLAS library).

```julia
function reshape!(A, B, C)
    s1, s2 = size(B)
    mul!(reshape(C, :, s2), reshape(A, :, s1), B)
    C
end

function loop!(A, B, C)
    @views for d in axes(C, 1)
        @inbounds mul!(C[d, :, :], A[d, :, :], B)
    end
    C
end

using LoopVectorization, Tullio
function tullio!(A, B, C)
    @tullio C[i, j, k] = A[i, j, p] * B[p, k]
end

using BenchmarkTools, LinearAlgebra
let n = 1  # n=10 changes the ranking?
    A = rand(4n, 10n, 10n)
    B = rand(10n, 10n)
    C1 = zeros(4n, 10n, 10n)
    C2 = zeros(4n, 10n, 10n)
    C3 = zeros(4n, 10n, 10n)
    @btime loop!($A, $B, $C1)
    @btime tullio!($A, $B, $C2)
    @btime reshape!($A, $B, $C3)
    C1 ≈ C2 ≈ C3
end
```

Note that quite a few of the variants above that do `mul!(C[:,:,dim], ...)` etc. without `view` aren’t changing `C` at all, but are just mutating a temporary copy.

Can you write the formula for this?

If you must slice, then slicing the last dimensions is generally best. But it’s better to avoid slicing if you can.


Awesome! Thanks! But do I understand correctly that I get zero allocations in `foo4!()` only because of switching the dimensions of `A` and `C` and taking the slice with respect to the last rather than the first dimension? If so, why?

```julia
mul!(C[dim,1:10,1:10], A[dim,1:10,1:10], B)
```

cannot update the original matrix `C`. Only `@view C[…]` here can, because `C[…]` without `@view` just makes a copy of (part of) `C`.
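A quick way to see this (a small sketch, not from the original posts):

```julia
using LinearAlgebra

C = zeros(3, 3)
A = ones(3, 3)

# Without @view: the slice C[1:3, 1:3] is a freshly allocated copy,
# so mul! fills the copy and C itself stays all zeros.
mul!(C[1:3, 1:3], A, A)
@assert all(iszero, C)

# With @view: mul! writes through to C's own memory.
mul!(@view(C[1:3, 1:3]), A, A)
@assert all(==(3.0), C)   # every entry of ones(3,3) * ones(3,3) is 3
```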


As @mcabbott and @photor have brought up, taking a slice without `view` and passing it to `mul!` causes a new matrix to be allocated, so `mul!(C[dim,:,:], ...)` means that the original matrix `C` isn’t even modified (the code is functionally incorrect). Among the examples I listed, only `foo4!` and `foo6!` are correct; the rest don’t actually update `C`.

You should not accept my comment as an answer to your question; I was just comparing timings. @mcabbott presents some much better solutions. On my machine, `tullio!` is fastest for the `n=1` case, and `reshape!` is fastest for the `n=10` case (even though it causes some small allocations).

A multi-dimensional array like

```
[ 1  4  7
  2  5  8
  3  6  9 ]
```

is stored in memory as a sequence of numbers, and information regarding its “dimensionality” isn’t in the data but in how the code treats it. In Julia, it’s stored in a column-major format, meaning that columns are contiguous like `1 2 3 4 5 6 7 8 9`. Accessing a slice like `A[:,1]` (which gives `[1,2,3]`) is more performant than accessing `A[1,:]` (which gives `[1,4,7]`), because accessing contiguous data from memory is more efficient than taking strides to access non-contiguous data.

As for why exactly a `view` of the non-contiguous data would cause so much allocation, I don’t know the mechanics well enough to answer.
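One plausible way to probe this (a sketch; the exact fallback path `mul!` takes for strided views is an implementation detail of LinearAlgebra): a view sliced along the first dimension has a non-unit first stride, so it cannot be handed to BLAS directly, whereas a last-dimension slice starts at stride 1.

```julia
A = rand(4, 10, 10)

v_first = @view A[1, :, :]   # slice along the first dimension
v_last  = @view A[:, :, 1]   # slice along the last dimension

# The last-dimension slice walks memory with stride 1 (contiguous columns);
# the first-dimension slice skips 4 elements between consecutive entries.
@assert strides(v_first) == (4, 40)
@assert strides(v_last) == (1, 4)
```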