Which matrix multiplication is more efficient? A*B and C*B or [A; C]*B

I have pairs of square matrix multiplications which share one of the matrices, for example:

AB,
CB,

where A, B and C are real square matrices. My question is: is it more efficient to compute both multiplications separately or is it more efficient to do the following?

[A; C]*B

that is, to create a rectangular matrix, [A; C], and then multiply it by B.
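For concreteness, the two options can be sketched like this (a minimal example; `n` and the variable names are illustrative):

```julia
n = 4
A, B, C = randn(n, n), randn(n, n), randn(n, n)

# Option 1: two separate square-square products
AB = A * B
CB = C * B

# Option 2: stack A and C vertically, multiply once, then split the result
ABCB = [A; C] * B
AB2 = ABCB[1:n, :]       # top block equals A*B
CB2 = ABCB[n+1:2n, :]    # bottom block equals C*B

AB2 ≈ AB && CB2 ≈ CB     # both approaches give the same blocks
```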

There’s built-in logic in * which works this sort of thing out for you, at least for chained products like A*B*C:

julia> @less rand(2, 3) * rand(3, 2) * rand(2, 20)
*(A::AbstractMatrix, B::AbstractMatrix, C::AbstractMatrix) = _tri_matmul(A,B,C)
*(tv::AdjOrTransAbsVec, B::AbstractMatrix, C::AbstractMatrix) = (tv*B) * C

function _tri_matmul(A,B,C,δ=nothing)
    n,m = size(A)
    # m,k == size(B)
    k,l = size(C)
    costAB_C = n*m*k + n*k*l  # multiplications, allocations n*k + n*l
    costA_BC = m*k*l + n*m*l  #                              m*l + n*l
    if costA_BC < costAB_C
        isnothing(δ) ? A * (B*C) : A * mat_mat_scalar(B,C,δ)
    else
        isnothing(δ) ? (A*B) * C : mat_mat_scalar(A*B, C, δ)
    end
end
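To illustrate that heuristic, here is the cost comparison worked out by hand for one set of made-up sizes (A is n×m, B is m×k, C is k×l):

```julia
# Example sizes only; the formulas are those used in _tri_matmul above
n, m, k, l = 2, 3, 3, 20
costAB_C = n*m*k + n*k*l   # (A*B)*C: 18 + 120 = 138 scalar multiplications
costA_BC = m*k*l + n*m*l   # A*(B*C): 180 + 120 = 300 scalar multiplications
costAB_C < costA_BC        # left-to-right grouping is cheaper here
```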
julia> A = randn(3, 4); C = randn(3, 4); B = randn(4, 5);

julia> A * B
3×5 Matrix{Float64}:
  0.399677  -1.15814   -3.0248    1.41393    2.29081
 -0.356186   1.94796    1.70922  -1.702     -2.89905
  0.340445  -0.463124  -1.95381   0.800185   0.8865

julia> C * B
3×5 Matrix{Float64}:
 0.971753  -2.56369   -1.49199     1.77191    2.76175
 0.434628  -0.535996   0.0465123   0.791034  -0.247286
 0.506344   0.422407  -0.406272   -1.00503   -0.888917

julia> [A C] * B
ERROR: DimensionMismatch: A has dimensions (3,8) but B has dimensions (4,5)

Probably I didn’t make myself clear. I do want the results of the two products separately, AB and CB. What I’m asking is whether it is more efficient to compute the conventional square-square multiplications separately, or to use a rectangular multiplication, do the whole thing in a single step, and then retrieve the two square parts from the rectangular result.

You cannot do that. If A and B have compatible sizes, and C and B as well, then [A C] will not be compatible with B.

[A; C] can work, though.

You can test it out for yourself with BenchmarkTools.jl, it probably depends on the actual sizes of the matrices involved.


Yes, thank you for the correction!

The answer is: in almost all cases it is better to do the separate multiplications.

The crucial factor is memory access: the concatenation, at least when done with [A; C], requires a lot of memory movement and allocation.
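That allocation is easy to observe directly: the vcat alone copies both inputs into a fresh buffer before any multiplication happens (a quick sketch; the sizes are arbitrary):

```julia
A = randn(256, 256); C = randn(256, 256)

# The concatenation allocates a fresh 512×256 matrix and copies A and C into it
bytes = @allocated [A; C]
bytes >= 2 * sizeof(A)   # at least one full copy of each input
```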


Thank you!

This mostly depends on the multithreading of your BLAS algorithm. If you’re using multiple BLAS threads, then for [A; C] * B it’ll be able to start using those threads at smaller matrix sizes than if you did the separate multiplications, but once you get to big enough matrix sizes then it’ll become more advantageous again to do the separate multiplications. E.g.:

julia> for N ∈ round.(Int, 2 .^ (5:0.5:9))
           B = randn(N, N)
           A = randn(N, N)
           C = randn(N, N)
           println("N = $N")
           print("(A*B; C*B): "); @btime ($A * $B; $C * $B)
           print("[A; C]*B:   "); @btime [$A; $C] * $B
           println("----------------------")
       end
N = 32
(A*B; C*B):   4.265 μs (2 allocations: 16.25 KiB)
[A; C]*B:     4.819 μs (2 allocations: 32.25 KiB)
----------------------
N = 45
(A*B; C*B):   10.510 μs (2 allocations: 32.00 KiB)
[A; C]*B:     11.241 μs (4 allocations: 63.47 KiB)
----------------------
N = 64
(A*B; C*B):   23.675 μs (4 allocations: 64.09 KiB)
[A; C]*B:     78.760 μs (4 allocations: 128.09 KiB)
----------------------
N = 91
(A*B; C*B):   154.162 μs (4 allocations: 129.59 KiB)
[A; C]*B:     97.836 μs (4 allocations: 258.97 KiB)
----------------------
N = 128
(A*B; C*B):   186.674 μs (4 allocations: 256.09 KiB)
[A; C]*B:     144.354 μs (4 allocations: 512.09 KiB)
----------------------
N = 181
(A*B; C*B):   292.494 μs (4 allocations: 512.09 KiB)
[A; C]*B:     249.723 μs (4 allocations: 1023.97 KiB)
----------------------
N = 256
(A*B; C*B):   474.490 μs (4 allocations: 1.00 MiB)
[A; C]*B:     472.706 μs (4 allocations: 2.00 MiB)
----------------------
N = 362
(A*B; C*B):   976.973 μs (4 allocations: 2.00 MiB)
[A; C]*B:     1.005 ms (4 allocations: 4.00 MiB)
----------------------
N = 512
(A*B; C*B):   2.342 ms (4 allocations: 4.00 MiB)
[A; C]*B:     2.513 ms (4 allocations: 8.00 MiB)
----------------------

Here we see that for N around 100, it’s a bit better (on my machine) to do the concatenated multiplication, but outside of that relatively narrow range it’s preferable to do the separate muls.

The other thing, though, is what exactly you mean by doing the multiplications separately. If you are just doing them separately and storing the results in separate locations, e.g. (A * B, C * B), then yes, that’s usually faster; but if you’re going to concatenate the results at the end, e.g. [A*B; C*B], then it’s basically always best to do [A; C] * B.
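If the same shapes recur many times, one way to keep the single BLAS call without paying for a fresh concatenation on every iteration is to preallocate the stacked input and the output buffer and use mul!. This is just a sketch under that assumption, not something from the benchmarks above:

```julia
using LinearAlgebra

n = 128
A, B, C = randn(n, n), randn(n, n), randn(n, n)

AC = [A; C]                          # stacked input, built once and reusable
out = Matrix{Float64}(undef, 2n, n)  # preallocated result buffer
mul!(out, AC, B)                     # single in-place product, no fresh allocation

# The two blocks are then just slices of `out`
out[1:n, :] ≈ A * B && out[n+1:end, :] ≈ C * B
```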

julia> for N ∈ round.(Int, 2 .^ (5:0.5:9))
           B = randn(N, N)
           A = randn(N, N)
           C = randn(N, N)
           println("N = $N")
           print("[A*B; C*B]: "); @btime [$A * $B; $C * $B]
           print("[A; C]*B:   "); @btime [$A; $C] * $B
           println("----------------------")
       end
N = 32
[A*B; C*B]:   5.426 μs (3 allocations: 32.38 KiB)
[A; C]*B:     4.869 μs (2 allocations: 32.25 KiB)
----------------------
N = 45
[A*B; C*B]:   12.634 μs (4 allocations: 63.73 KiB)
[A; C]*B:     11.231 μs (4 allocations: 63.47 KiB)
----------------------
N = 64
[A*B; C*B]:   28.504 μs (6 allocations: 128.14 KiB)
[A; C]*B:     79.531 μs (4 allocations: 128.09 KiB)
----------------------
N = 91
[A*B; C*B]:   161.616 μs (6 allocations: 259.08 KiB)
[A; C]*B:     97.024 μs (4 allocations: 258.97 KiB)
----------------------
N = 128
[A*B; C*B]:   209.808 μs (6 allocations: 512.14 KiB)
[A; C]*B:     145.886 μs (4 allocations: 512.09 KiB)
----------------------
N = 181
[A*B; C*B]:   342.960 μs (6 allocations: 1.00 MiB)
[A; C]*B:     263.659 μs (4 allocations: 1023.97 KiB)
----------------------
N = 256
[A*B; C*B]:   444.623 μs (6 allocations: 2.00 MiB)
[A; C]*B:     412.511 μs (4 allocations: 2.00 MiB)
----------------------
N = 362
[A*B; C*B]:   1.063 ms (6 allocations: 4.00 MiB)
[A; C]*B:     1.005 ms (4 allocations: 4.00 MiB)
----------------------
N = 512
[A*B; C*B]:   2.574 ms (6 allocations: 8.00 MiB)
[A; C]*B:     2.566 ms (4 allocations: 8.00 MiB)
----------------------

Thank you for your time and effort! I need the results separately (A*B and C*B), so, as you’ve shown, computing A*B and C*B separately is the way to go, instead of computing [A; C]*B and then retrieving the two results from the concatenated output. Thanks again!