Hi
Is there a idiomatic way of calculating matrix product ABC, and store the resulted matrix (inplace) to a pre-allocated one? Something similar to mul!(Y, A, B) -> Y
, which stores A * B
in Y
.
Any help would be much appreciated!
Hi
Is there a idiomatic way of calculating matrix product ABC, and store the resulted matrix (inplace) to a pre-allocated one? Something similar to mul!(Y, A, B) -> Y
, which stores A * B
in Y
.
Any help would be much appreciated!
I beleive that the most efficient way to handle this is actually to have two buffers and do two separate matmuls, i.e. preallocate a D
and E
matrix and then do
mul!(E, mul!(D, A, B), C)
(which is the same as A * B * C
but if you’re doing this many times you’ll want to re-use the D
and E
matrices).
It might seem surprising that this is faster than a quadruple loop approach that only needs one buffer, but there are huge advantages in data locality to only using one single buffer, so it turns out to be faster to just allocate more:
julia> using LoopVectorization
julia> function triple_mul!(D, A, B, C)
# D[i, l] = A[i, j] * B[j, k] * C[k, l]
@assert size(D, 1) == size(A, 1) && size(D, 2) == size(C, 2)
@assert size(A, 2) == size(B, 1)
@assert size(B, 2) == size(C, 1)
@tturbo for i ∈ axes(D, 1), l ∈ axes(D, 2)
Dil = zero(eltype(D))
for j ∈ axes(A, 2), k ∈ axes(B, 2)
Dil += A[i, j] * B[j, k] * C[k, l]
end
D[i,l] = Dil
end
D
end
triple_mul! (generic function with 1 method)
julia> foreach((10, 50, 100)) do N
A, B, C = randn(N, N+1), randn(N+1, N+2), randn(N+2, N+3)
D = Matrix{Float64}(undef, N, N+3)
@show N
o1 = @btime triple_mul!($D, $A, $B, $C)
o2 = @btime $A * $B * $C
@assert o1 ≈ o2
println()
end
N = 10
1.611 μs (0 allocations: 0 bytes)
922.543 ns (2 allocations: 2.20 KiB)
N = 50
117.658 μs (0 allocations: 0 bytes)
31.210 μs (4 allocations: 41.22 KiB)
N = 100
1.738 ms (0 allocations: 0 bytes)
75.819 μs (4 allocations: 160.34 KiB)
For small matrices, LoopVectorization.@tturbo
is usually approximately the fastest thing out there, so if it’s losing against BLAS this badly even when BLAS is spending more time allocating, that’s a very bad sign.
You probably know this, but if the three matrices are different sizes, then an easy optimisation is to make sure you multiply them in the best order. E.g. as an extreme example if C
is n x 1
, and say the other two are square, then doing A*(B*C)
is basically two matrix-vector products, while (A*B)*C
is a matrix-matrix product followed by a matrix-vector product (so would be much slower). In the general case of N
matrices, this called matrix chain multiplication and the optimal order can be found efficiently with dynamic programming.
(I love bringing this up because I think it provides great intuition for mixed-mode auto-differentiation which is also about choosing the best order in which to compose linear operators).
Thank you for bring this up - it’s very helpful.
I’m thinking about different cases… What if B is symmetric (either positive-definite or not)? is there an efficient way to solve this? And what if C is the transpose of A? In our field, we encounter a lot XΣX’, where Σ is (semi-)positive definite…
I think others here know a lot more about these cases than I do, but something I’ve heard is to take the Cholesky factorization of Σ
, so Σ = L * L'
, and then you can set Y = X*L
so that X*Σ*X' == Y*Y'
. I’m not sure the best way numerically to do make use of this though, probably others can say more.
Matrix multiplication is O(N^3).
Quadruple loops are O(N^4).
Yeah, I guess it’s not really about locality at all
Use a decomposition for Σ
(LDL or Cholesky, depending on definiteness), and avoid calculating the product as long as you can. Eg if Σ = L * L'
, work with XL
etc. I do this for the multivariate normal in