# Matrix multiplication on multiple dimensions

I want to multiply each matrix along one dimension of a multi-dimensional array by another matrix. Is there a way to do it quickly in one go? See the example below.

```julia
# first factor: multi-dimensional array
A = zeros(3,2,2);
A[:,:,1] = [1 2;3 4;5 6];
A[:,:,2] = [7 8;9 10;11 12];

# second factor
B = [0.2 0.8; 0.4 0.6];

# result:
C = zeros(3,2,2);
C[:,:,1] = A[:,:,1]*B;
C[:,:,2] = A[:,:,2]*B;
```

Why not a loop? Something like:

```julia
function f(A, B)
    C = similar(A)
    for i in axes(A, 3)
        C[:, :, i] = @view(A[:, :, i]) * B
    end
    C
end
```

should work just fine.

Are your actual data as small as `3×2`? If so, then replacing `A` with a `Vector` of `SMatrix` from https://github.com/JuliaArrays/StaticArrays.jl could be even faster.
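For instance, a minimal sketch of that idea using the data from the original question (the names `As`, `Bs`, and `Cs` are just placeholders; note that `SMatrix` constructors take their elements in column-major order):

```julia
using StaticArrays

# The two 3×2 slices of A, as stack-allocated static matrices.
As = [SMatrix{3,2}(1.0, 3.0, 5.0, 2.0, 4.0, 6.0),     # == [1 2; 3 4; 5 6]
      SMatrix{3,2}(7.0, 9.0, 11.0, 8.0, 10.0, 12.0)]  # == [7 8; 9 10; 11 12]
Bs = SMatrix{2,2}(0.2, 0.4, 0.8, 0.6)                 # == [0.2 0.8; 0.4 0.6]

# Each product is itself a small SMatrix, computed without heap-allocating
# a temporary for every slice.
Cs = [a * Bs for a in As]
```

Since the sizes are encoded in the types, the compiler can fully unroll each small product.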


A loop is fine, but I was hoping there would be something faster, since I have to do this operation many, many times. No, my dimensions are much larger (400×49×5).

A loop usually is the fastest approach (in Julia or any other fast language). Any “in one go” approach you can think of is just a loop under the hood.


Oh, although there is a pretty easy improvement to the code I wrote that would help quite a bit. My code computes `@view(A[:, :, i]) * B` (which allocates a new matrix for the result) and then stores that in `C`. You can avoid that temporary allocation with:

```julia
mul!(@view(C[:, :, i]), @view(A[:, :, i]), B)
```

You’ll need `using LinearAlgebra` to get `mul!`.


This is an excellent use case for Tullio.jl, which can be way faster than a naive loop while offering quite nice and surprisingly flexible syntax.

```julia
julia> using Tullio, LoopVectorization, LinearAlgebra, BenchmarkTools

julia> function f(A, B)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           for i in axes(A, 3)
               mul!(@view(C[:, :, i]), @view(A[:, :, i]), B)
           end
           C
       end
f (generic function with 1 method)

julia> g(A, B) = @tullio C[i, j, k] := A[i, l, k] * B[l, j]
g (generic function with 1 method)

julia> let A = randn(20, 21, 22), B = randn(21, 23)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  34.380 μs (2 allocations: 79.14 KiB)
  17.120 μs (2 allocations: 79.14 KiB)
```

Could you clarify the size of `B` versus `A` here? If the matrix is much bigger than this, @rdeits’s method will become faster than Tullio, since Tullio eventually loses to BLAS for very large matrices.

If I understand you correctly though, you’re still at sizes where Tullio wins, at least on my machine.

```julia
julia> let A = randn(400, 49, 5), B = randn(49, 49)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  195.528 μs (2 allocations: 765.70 KiB)
  113.658 μs (75 allocations: 770.89 KiB)

julia> let A = randn(400, 49, 5), B = randn(49, 400)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  1.225 ms (2 allocations: 6.10 MiB)
  645.042 μs (75 allocations: 6.11 MiB)
```

Here’s an example of sizes where Tullio begins to lose:

```julia
julia> let A = randn(400, 495, 500), B = randn(495, 400)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  1.009 s (2 allocations: 610.35 MiB)
  1.113 s (76 allocations: 610.36 MiB)
```

Even when it does lose, it does so pretty gracefully.


One way to help `mul!` out is to thread the outermost loop yourself, rather than relying on the threading inside each BLAS call.

```julia
julia> using Compat, LinearAlgebra

julia> function h(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           Threads.@threads for i in axes(A, 3)
               @views mul!(C[:, :, i], A[:, :, i], B)
           end
           C
       end;

julia> let A = randn(200, 201, 202), B = randn(201, 203)
           @btime f($A, $B)  # simple mul!
           @btime g($A, $B)  # @tullio
           @btime h($A, $B)  # threaded mul!
       end;
  75.093 ms (2 allocations: 62.57 MiB)
  63.242 ms (52 allocations: 62.57 MiB)
  43.078 ms (23 allocations: 62.57 MiB)

julia> BLAS.vendor()
:mkl
```

But much will depend on your computer and the precise problem size.

This is almost `NNlib.batched_mul`, but not quite. If the matrices are big and you have a GPU, you could look into using that, since `batched_gemm` handles this case.
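For reference, a sketch of the `batched_mul` route. This assumes an NNlib version whose `batched_mul` re-uses an argument with a trivial batch dimension (length 1) across every batch of the other argument; if yours does not, you would have to materialize `B` into a full 3-array first:

```julia
using NNlib

A = randn(400, 49, 5)
B = randn(49, 400)

# Give B a batch dimension of length 1; batched_mul then computes
# C[:, :, k] = A[:, :, k] * B for every k.
C = batched_mul(A, reshape(B, size(B, 1), size(B, 2), 1))
```

On a GPU this dispatches to a batched GEMM, which avoids launching one kernel per slice.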


Perhaps needs a Tullio issue:

```julia
julia> using LoopVectorization, Tullio, BenchmarkTools, Compat

julia> using LinearAlgebra; BLAS.vendor()
:mkl

julia> function f(A, B)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           for i in axes(A, 3)
               mul!(@view(C[:, :, i]), @view(A[:, :, i]), B)
           end
           C
       end
f (generic function with 1 method)

julia> g(A, B) = @tullio C[i, j, k] := A[i, l, k] * B[l, j]
g (generic function with 1 method)

julia> function h(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           Threads.@threads for i in axes(A, 3)
               @views mul!(C[:, :, i], A[:, :, i], B)
           end
           C
       end;

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> function k(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           Threads.@threads for i in axes(A, 3)
               @views AmulB!(C[:, :, i], A[:, :, i], B)
           end
           C
       end;

julia> versioninfo()
Julia Version 1.5.3-pre.0
Commit 5905edebb1 (2020-09-24 05:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, tigerlake)

julia> let A = randn(200, 201, 202), B = randn(201, 203)
           @btime f($A, $B)  # simple mul!
           @btime g($A, $B)  # @tullio
           @btime h($A, $B)  # threaded mul!
           @btime k($A, $B)  # threaded loopvectorization
       end;
  33.038 ms (2 allocations: 62.57 MiB)
  45.259 ms (52 allocations: 62.57 MiB)
  26.197 ms (23 allocations: 62.57 MiB)
  25.306 ms (23 allocations: 62.57 MiB)
```

LoopVectorization with `@threads` takes 25 ms, vs 45 ms for Tullio.

On a desktop, with the same `f`, `g`, `h`, and `k` definitions as above:

```julia
julia> using LoopVectorization, Tullio, BenchmarkTools, Compat
[ Info: Precompiling Compat [34da2185-b29b-5c13-b0c7-acf172513d20]

julia> using LinearAlgebra; BLAS.vendor()
:mkl

julia> versioninfo()
Julia Version 1.5.3-pre.0
Commit 5905edebb1* (2020-09-24 05:09 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
Environment:
```