Hi

Is there an idiomatic way of calculating the matrix product A*B*C and storing the resulting matrix (in place) in a pre-allocated one? Something similar to `mul!(Y, A, B) -> Y`, which stores `A * B` in `Y`.

Any help would be much appreciated!


I believe that the most efficient way to handle this is actually to have two buffers and do two separate matmuls, i.e. preallocate a `D` and an `E` matrix and then do

```
mul!(E, mul!(D, A, B), C)
```

(which is the same as `A * B * C`, but if you’re doing this many times you’ll want to re-use the `D` and `E` matrices).
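For instance, here is a minimal sketch of that buffer re-use pattern in a hot loop (the sizes and the loop count are just illustrative):

```julia
using LinearAlgebra

n = 4
A, B, C = randn(n, n), randn(n, n), randn(n, n)
D = similar(A, n, n)   # buffer for A*B
E = similar(A, n, n)   # buffer for (A*B)*C

for _ in 1:1000                  # e.g. inside a hot loop
    mul!(E, mul!(D, A, B), C)    # no allocations once the buffers exist
end

@assert E ≈ A * B * C
```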

It might seem surprising that this is faster than a quadruple-loop approach that only needs one buffer, but there are *huge* advantages in data locality to splitting it into two matmuls, so it turns out to be faster to just allocate more:

```
julia> using LoopVectorization

julia> function triple_mul!(D, A, B, C)
           # D[i, l] = A[i, j] * B[j, k] * C[k, l]
           @assert size(D, 1) == size(A, 1) && size(D, 2) == size(C, 2)
           @assert size(A, 2) == size(B, 1)
           @assert size(B, 2) == size(C, 1)
           @tturbo for i ∈ axes(D, 1), l ∈ axes(D, 2)
               Dil = zero(eltype(D))
               for j ∈ axes(A, 2), k ∈ axes(B, 2)
                   Dil += A[i, j] * B[j, k] * C[k, l]
               end
               D[i, l] = Dil
           end
           D
       end
triple_mul! (generic function with 1 method)
```

```
julia> using BenchmarkTools

julia> foreach((10, 50, 100)) do N
           A, B, C = randn(N, N+1), randn(N+1, N+2), randn(N+2, N+3)
           D = Matrix{Float64}(undef, N, N+3)
           @show N
           o1 = @btime triple_mul!($D, $A, $B, $C)
           o2 = @btime $A * $B * $C
           @assert o1 ≈ o2
           println()
       end
N = 10
  1.611 μs (0 allocations: 0 bytes)
  922.543 ns (2 allocations: 2.20 KiB)
N = 50
  117.658 μs (0 allocations: 0 bytes)
  31.210 μs (4 allocations: 41.22 KiB)
N = 100
  1.738 ms (0 allocations: 0 bytes)
  75.819 μs (4 allocations: 160.34 KiB)
```

For small matrices, `LoopVectorization.@tturbo` is usually approximately the fastest thing out there, so if it’s losing to BLAS this badly even while BLAS spends extra time allocating, that’s a very bad sign.


You probably know this, but if the three matrices are different sizes, then an easy optimisation is to make sure you multiply them in the best order. E.g., as an extreme example, if `C` is `n x 1` and the other two are square, then `A*(B*C)` is basically two matrix-vector products, while `(A*B)*C` is a matrix-matrix product followed by a matrix-vector product (so would be much slower). In the general case of `N` matrices, this is called matrix chain multiplication, and the optimal order can be found efficiently with dynamic programming.

(I love bringing this up because I think it provides great intuition for mixed-mode auto-differentiation which is also about choosing the best order in which to compose linear operators).
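As an illustrative aside (not code from the thread), the dynamic program mentioned above can be sketched in a few lines; the name `chain_cost` and its scalar-multiplication cost model are my own assumptions:

```julia
# Classic O(n^3) dynamic program for the cheapest multiplication order of a
# matrix chain. `dims` holds the chain dimensions: matrix i is dims[i] × dims[i+1].
function chain_cost(dims::Vector{Int})
    n = length(dims) - 1            # number of matrices in the chain
    cost = zeros(Int, n, n)         # cost[i, j] = min scalar mults for Aᵢ⋯Aⱼ
    for len in 2:n, i in 1:(n - len + 1)
        j = i + len - 1
        # try every split point k: (Aᵢ⋯Aₖ) * (Aₖ₊₁⋯Aⱼ)
        cost[i, j] = minimum(
            cost[i, k] + cost[k+1, j] + dims[i] * dims[k+1] * dims[j+1]
            for k in i:(j-1)
        )
    end
    return cost[1, n]
end

# The extreme example above: square A and B (n × n), C is n × 1.
n = 100
chain_cost([n, n, n, 1])   # → 20_000, i.e. 2n²: two matrix-vector products
```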


Thank you for bringing this up - it’s very helpful.

I’m thinking about different cases… What if B is symmetric (either positive-definite or not)? Is there an efficient way to solve this? And what if C is the transpose of A? In our field, we encounter XΣX’ a lot, where Σ is (semi-)positive definite…


I think others here know a lot more about these cases than I do, but something I’ve heard is to take the Cholesky factorization of `Σ`, so `Σ = L * L'`, and then set `Y = X*L` so that `X*Σ*X' == Y*Y'`. I’m not sure of the best way to make use of this numerically, though; probably others can say more.
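A minimal sketch of that identity, assuming a dense positive-definite `Σ` (the names follow the post; the sizes are illustrative):

```julia
using LinearAlgebra

n, p = 5, 3
X = randn(n, p)
Σ = let A = randn(p, p); A * A' + p * I; end   # an illustrative positive-definite Σ
L = cholesky(Σ).L                              # Σ == L * L'

Y = X * L
@assert Y * Y' ≈ X * Σ * X'

# One benefit: Y*Y' is symmetric by construction, so it can be formed with a
# single symmetric rank-k update instead of two general matmuls.
```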

Matrix multiplication is O(N^3).

Quadruple loops are O(N^4).


Yeah, I guess it’s not really about locality at all.

Use a decomposition for `Σ` (LDL or Cholesky, depending on definiteness), and avoid calculating the product for as long as you can. E.g. if `Σ = L * L'`, work with `XL` etc. I do this for the multivariate normal in
