Matrix multiplication on multiple dimensions

I want to multiply each matrix along one dimension of a mutli-dimensional matrix by another matrix. Is there a way to do it quickly in one go? See example below.

Thank you in advance!

#first factor: multi-dimensional matrix
A = zeros(3,2,2);
A[:,:,1] = [1 2;3 4;5 6];
A[:,:,2] = [7 8;9 10;11 12];
#second factor
B = [0.2 0.8; 0.4 0.6];

#result:
C = zeros(3,2,2);
C[:,:,1] = A[:,:,1]*B;
C[:,:,2] = A[:,:,2]*B;

Why not a loop? Something like:

function f(A, B)
  C = similar(A)
  for i in axes(A, 3)
    C[:,  :, i] = @view(A[:, :, i]) * B
  end
  C
end

should work just fine.

Are your actual data as small as 3x2 ? If so, then replacing A with a Vector of SMatrix from https://github.com/JuliaArrays/StaticArrays.jl could be even faster.

2 Likes

A loop is fine but I was hoping there would be something faster since I have to do this operation many many times. No, my dimensions are much larger (400495).

A loop usually is the fastest approach (in Julia or any other fast language). Any “in one go” approach you can think of is just a loop under the hood.

1 Like

Oh, although there is a pretty easy improvement to the code I wrote that would help quite a bit. My code computes @view(A[:, :, i]) * B (which allocates a new matrix for the result) and then stores that in C. You can avoid that temporary allocation with:

mul!(@view(C[:, :, i]), @view(A[:, :, i]), B)

You’ll need using LinearAlgebra to get mul!

2 Likes

This is an excellent usecase for Tullio.jl, which can be way faster than a naive loop while offering quite nice and surprisingly flexible syntax.

julia> using Tullio, LoopVectorization

julia> function f(A, B)
         C = similar(A, size(A, 1), size(B, 2), size(A, 3))
         for i in axes(A, 3)
           mul!(@view(C[:, :, i]), @view(A[:, :, i]), B)
         end
         C
       end
f (generic function with 1 method)

julia> g(A, B) = @tullio C[i, j, k] := A[i, l, k] * B[l, j]
g (generic function with 1 method)

julia> let A = randn(20, 21, 22), B = randn(21, 23)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  34.380 μs (2 allocations: 79.14 KiB)
  17.120 μs (2 allocations: 79.14 KiB)
7 Likes

Could you clarify the size of B versus A here? If the matrix is much bigger than this, @rdeits’s method will become faster than Tullio since Tullio eventually loses to BLAS for very large matrices.

If I understand your correctly though, you’re still at sizes where Tullio wins, at least on my machine.

julia> let A = randn(400, 49, 5), B = randn(49, 49)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  195.528 μs (2 allocations: 765.70 KiB)
  113.658 μs (75 allocations: 770.89 KiB)

julia> let A = randn(400, 49, 5), B = randn(49, 400)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  1.225 ms (2 allocations: 6.10 MiB)
  645.042 μs (75 allocations: 6.11 MiB)

Here’s an example of sizes where Tullio begins to lose:

julia> let A = randn(400, 495, 500), B = randn(495, 400)
           @btime f($A, $B)
           @btime g($A, $B)
       end;
  1.009 s (2 allocations: 610.35 MiB)
  1.113 s (76 allocations: 610.36 MiB)

Even when it does lose, it does so pretty gracefully.

2 Likes

One way to help mul! out is to thread the outermost loop, instead of the ones it sees.

julia> using Compat, LinearAlgebra

julia> function h(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           old_threads = Compat.get_num_threads() # BLAS.get_num_threads(), soon
           Compat.set_num_threads(1)
           Threads.@threads for i in axes(A, 3)
               @views mul!(C[:,  :, i], A[:, :, i], B)
           end
           Compat.set_num_threads(old_threads)
           C
       end;

julia> let A = randn(200, 201, 202), B = randn(201, 203)
           @btime f($A, $B)  # simple mul!
           @btime g($A, $B)  # @tullio
           @btime h($A, $B)  # threaded
       end;
  75.093 ms (2 allocations: 62.57 MiB)
  63.242 ms (52 allocations: 62.57 MiB)
  43.078 ms (23 allocations: 62.57 MiB)

julia> BLAS.vendor()
:mkl

But much will depend on your computer and the precise problem size.

This is almost NNlib.batched_mul, but not quite. If the matrices are big and you have a GPU, you could look into using that, since batched_gemm handles this case.

4 Likes

Perhaps needs a Tullio issue:

julia> using LoopVectorization, Tullio, BenchmarkTools, Compat

julia> using LinearAlgebra; BLAS.vendor()
:mkl

julia> function f(A, B)
         C = similar(A, size(A, 1), size(B, 2), size(A, 3))
         for i in axes(A, 3)
           mul!(@view(C[:, :, i]), @view(A[:, :, i]), B)
         end
         C
       end
f (generic function with 1 method)

julia> g(A, B) = @tullio C[i, j, k] := A[i, l, k] * B[l, j]
g (generic function with 1 method)

julia> function h(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           old_threads = Compat.get_num_threads() # BLAS.get_num_threads(), soon
           Compat.set_num_threads(1)
           Threads.@threads for i in axes(A, 3)
               @views mul!(C[:,  :, i], A[:, :, i], B)
           end
           Compat.set_num_threads(old_threads)
           C
       end;

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> function k(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           Threads.@threads for i in axes(A, 3)
               @views AmulB!(C[:,  :, i], A[:, :, i], B)
           end
           C
       end;

julia> versioninfo()
Julia Version 1.5.3-pre.0
Commit 5905edebb1 (2020-09-24 05:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, tigerlake)

julia> let A = randn(200, 201, 202), B = randn(201, 203)
           @btime f($A, $B)  # simple mul!
           @btime g($A, $B)  # @tullio
           @btime h($A, $B)  # threaded
           @btime k($A, $B)  # threaded loopvectorization
       end;
  33.038 ms (2 allocations: 62.57 MiB)
  45.259 ms (52 allocations: 62.57 MiB)
  26.197 ms (23 allocations: 62.57 MiB)
  25.306 ms (23 allocations: 62.57 MiB)

LoopVectorization with @threads takes 25 ms, vs 45 ms for Tullio.

On a desktop:

julia> using LoopVectorization, Tullio, BenchmarkTools, Compat
[ Info: Precompiling Compat [34da2185-b29b-5c13-b0c7-acf172513d20]

julia> using LinearAlgebra; BLAS.vendor()
:mkl

julia> function f(A, B)
         C = similar(A, size(A, 1), size(B, 2), size(A, 3))
         for i in axes(A, 3)
           mul!(@view(C[:, :, i]), @view(A[:, :, i]), B)
         end
         C
       end
f (generic function with 1 method)

julia> g(A, B) = @tullio C[i, j, k] := A[i, l, k] * B[l, j]
g (generic function with 1 method)

julia> function h(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           old_threads = Compat.get_num_threads() # BLAS.get_num_threads(), soon
           Compat.set_num_threads(1)
           Threads.@threads for i in axes(A, 3)
               @views mul!(C[:,  :, i], A[:, :, i], B)
           end
           Compat.set_num_threads(old_threads)
           C
       end;

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> function k(A::AbstractArray{<:Any, 3}, B::AbstractMatrix)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3))
           Threads.@threads for i in axes(A, 3)
               @views AmulB!(C[:,  :, i], A[:, :, i], B)
           end
           C
       end;

julia> versioninfo()
Julia Version 1.5.3-pre.0
Commit 5905edebb1* (2020-09-24 05:09 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 18

julia> let A = randn(200, 201, 202), B = randn(201, 203)
           @btime f($A, $B)  # simple mul!
           @btime g($A, $B)  # @tullio
           @btime h($A, $B)  # threaded
           @btime k($A, $B)  # threaded loopvectorization
       end;
  16.672 ms (2 allocations: 62.57 MiB)
  16.414 ms (244 allocations: 62.59 MiB)
  3.724 ms (93 allocations: 62.58 MiB)
  3.382 ms (93 allocations: 62.58 MiB)
3 Likes