Efficient approach to multiplying three matrices (M1*M2*M3) and a vector-matrix-vector product (x*M*y)

I am trying to optimise a small function in which I am performing linear algebra operations. This function is called in a loop about a thousand times and thus speed is critical.

The main points that I would like to optimise are the following:

  1. M2 = M1*M2*M3 where all terms are float dense matrices (in a specialised method M2 is Symmetric, but in most cases it isn’t);
  2. M = x*M*y (again, in a specialised method M is Symmetric, but in most cases it isn’t).

I reckon that an in-place update might help, but I am unsure how to perform it with pre- and post-multiplications by matrices/vectors.

The second case can be handled by dot and the first case by preallocating result arrays outside the loop and using mul! twice.
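
For instance, a minimal sketch of both (the sizes and the buffer names tmp and out are illustrative):

using LinearAlgebra

x, y = rand(100), rand(100)
M = rand(100, 100)
M1, M2, M3 = rand(100, 100), rand(100, 100), rand(100, 100)

s = dot(x, M, y)    # 3-arg dot computes x' * M * y without allocating

tmp = similar(M1)   # preallocate once, outside the hot loop
out = similar(M1)
mul!(tmp, M1, M2)   # tmp = M1 * M2, in place
mul!(out, tmp, M3)  # out = (M1 * M2) * M3, in place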

5 Likes

If the matrices and vectors are small, probably just using static arrays is a good alternative:

julia> A = rand(10,10); B = rand(10,10);

julia> @btime $A*$B;
  321.728 ns (1 allocation: 896 bytes)

julia> using StaticArrays

julia> A = @SMatrix rand(10,10);

julia> B = @SMatrix rand(10,10);

julia> @btime $A*$B
  65.206 ns (0 allocations: 0 bytes)

If they are larger, calling the in-place mul! will probably be a good alternative:

julia> using LinearAlgebra

julia> A = rand(1000,1000); B = rand(1000,1000); C = similar(A);

julia> @btime $A*$B;
  18.840 ms (2 allocations: 7.63 MiB)

julia> @btime LinearAlgebra.mul!($C,$A,$B)
  18.650 ms (0 allocations: 0 bytes)
1 Like

Thank you.

The arrays are too large to benefit from StaticArrays. I tried LinearAlgebra.mul!, but I haven’t noticed major performance improvements. I was also planning to write a series of nested loops with @views, but the comments in the Tullio.jl readme (# Fast & slow) made me think that’s not a great strategy.

Trying to write custom matrix multiplication routines is probably a waste of time. These functions are far too optimized for one to write something better, unless the problem has a very specific structure we know how to exploit.

mul! will probably not be faster than the out-of-place multiplication in general (I guess not having the constraint of working in place can even allow a faster algorithm). If the time spent on allocation is small relative to the time of computation, just keeping the standard multiplication will be fine. Beyond that, you would probably need to try parallel/GPU-accelerated versions of the multiplication routines to get faster.
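
For instance, a minimal sketch of the GPU route (this assumes a CUDA-capable GPU with CUDA.jl installed; the sizes are illustrative):

using LinearAlgebra, CUDA

A = CUDA.rand(1000, 1000)   # Float32 arrays living on the GPU
B = CUDA.rand(1000, 1000)
C = CUDA.zeros(1000, 1000)
mul!(C, A, B)               # dispatches to cuBLAS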

4 Likes

dot seems promising (for the second case). My only problem with it is that I could not find a dot product function able to update M in place or to benefit from the symmetry of this matrix when possible. Would you please point me to the relevant LinearAlgebra code?

I would highly suggest checking out Tullio.jl or Octavian.jl

2 Likes

Thank you, I did not know about Octavian.jl - I will check it out!

I have started playing with Tullio.jl, but as said above (see Tullio.jl > readme > Fast & slow) it does not seem to be the best option for the first problem (maybe I should benchmark it for the generalised dot product).

This should just happen: @less dot(rand(3), Symmetric(rand(3,3)), rand(3)) shows you the code, which exploits symmetry & doesn’t allocate anything. However, it won’t always be faster than x' * M * y, which does allocate. Something like dot(x, mul!(tmp, M, y)) can avoid that, if you can re-use tmp.

I think @tullio z := x[i] * M[i,j] * y[j] should be pretty efficient, it’s more or less the same loops as dot with some @turbo magic. (It’s for three matrices A * B * C that it will do something algorithmically less efficient than sequential multiplication.) However, it does not know how to exploit Symmetric(M).
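
For instance, a quick sketch of those variants (M is made symmetric here so that all three agree; tmp and the sizes are illustrative):

using LinearAlgebra, Tullio, LoopVectorization

x, y = rand(100), rand(100)
A = rand(100, 100); M = (A + A') / 2  # symmetric, so Symmetric(M) == M
tmp = similar(y)                      # reusable buffer

s1 = dot(x, Symmetric(M), y)          # exploits symmetry, allocation-free
s2 = dot(x, mul!(tmp, M, y))          # plain dot, reusing tmp
@tullio s3 := x[i] * M[i,j] * y[j]    # generalized dot via Tullio

s1 ≈ s2 ≈ s3  # true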

2 Likes

Thank you. I will benchmark it now and see how it looks! :slight_smile:

Out of curiosity - are these options (@turbo, Tullio.jl and Octavian.jl) compatible with distributed computing? I might be able to speed up these operations further if that’s the case.

Can I take a wild guess that your two operations are for the Gaussian log-likelihood, and that the first thing is for a conditional covariance that looks like \Sigma_1 - \Sigma_{1,2}^T \Sigma_{2}^{-1} \Sigma_{1,2}? And that your second thing is a quadratic form that is actually x^T \Sigma_{2}^{-1} x?

If that happens to be correct, then you can go reasonably far by saving F = cholesky(Sigma2), then calling an in-place ldiv!(F.L, Sigma12), and then using the five-argument mul! to get the final conditional covariance with one extra buffer.

You could also then compute the quadratic form with sum(abs2, some_cholesky_factored_matrix.L\x).
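
If the guess is right, a rough sketch of that recipe might look like this (all the names and sizes are illustrative, as are the SPD test matrices; Sigma12 is taken to be the n2 × n1 cross-covariance block):

using LinearAlgebra

n1, n2 = 50, 80
Sigma12 = rand(n2, n1)
Sigma1 = let A = rand(n1, n1); A * A' + n1 * I end            # SPD, for illustration
Sigma2 = let A = rand(n2, n2); Symmetric(A * A' + n2 * I) end # SPD, for illustration
x = rand(n2)

F = cholesky(Sigma2)             # factorize once and reuse it
W = copy(Sigma12)                # one extra buffer
ldiv!(F.L, W)                    # W = F.L \ Sigma12, in place

condcov = copy(Sigma1)
mul!(condcov, W', W, -1.0, 1.0)  # condcov = Sigma1 - W' * W, via 5-arg mul!
                                 # W' * W == Sigma12' * inv(Sigma2) * Sigma12

qform = sum(abs2, F.L \ x)       # x' * inv(Sigma2) * x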

Again, just a wild guess, so maybe this isn’t all that helpful.

4 Likes

It’s not really the same problem, but the structure is similar! Also, I might apply your approach in TSAnalysis.jl to speed up the backwards_pass function (in dev).

2 Likes

Hi!
I just came across this thread. Did you find the most efficient way to multiply three matrices? I would highly appreciate it if you could tell me about your results. Thanks!

I recently encountered a similar problem: I need to multiply three large matrices, X * Y * Z. It looks like Tullio and Octavian have the best performance, although Octavian takes much longer to precompile.

using BenchmarkTools

# generate data 

const N = 500
const J = 10
const K = 8
const M = 200_000

A = rand(N, J)
B = rand(J, K)
C = rand(K, M)


# 1. naive code, slow
m1_naive(X, Y, Z) = X * Y * Z

# 2. still slow
function m2(X, Y, Z)
    res = Matrix{Float64}(undef, N, M)
    res = (X * Y) * Z
    return res
end

# 3. preallocate, use mul! twice
using LinearAlgebra
function m3_mul(X, Y, Z)
    res = Matrix{Float64}(undef, N, M)
    XY = Matrix{Float64}(undef, N, K)
    mul!(XY, X, Y)
    mul!(res, XY, Z)
    return res
end

# 4. Tullio
using Tullio
m4_tullio(X, Y, Z) = @tullio res[n, m] := X[n, j] * Y[j, k] * Z[k, m]

# 5. using Octavian
using Octavian
function m5_octavian(X, Y, Z)
    res = Matrix{Float64}(undef, N, M)
    XY = Matrix{Float64}(undef, N, K)
    matmul!(XY, X, Y)
    matmul!(res, XY, Z)
    return res
end

julia> @time m1_naive(A, B, C);
  1.756877 seconds (1.35 M allocations: 854.938 MiB, 3.84% gc time, 51.98% compilation time)

julia> @time m2(A, B, C);
  0.833115 seconds (1.35 k allocations: 763.058 MiB, 1.19% gc time, 0.45% compilation time)

julia> @time m3_mul(A, B, C);
  0.876475 seconds (2.22 k allocations: 763.112 MiB, 7.13% gc time, 0.60% compilation time)

julia> @time m4_tullio(A, B, C);
  1.453623 seconds (394.21 k allocations: 788.928 MiB, 4.07% gc time, 177.89% compilation time)

julia> @time m5_octavian(A, B, C);
 10.832562 seconds (8.95 M allocations: 1.271 GiB, 3.59% gc time, 95.59% compilation time)

julia> @btime m1_naive($A, $B, $C);
  803.073 ms (4 allocations: 762.97 MiB)

julia> @btime m2($A, $B, $C);
  799.319 ms (4 allocations: 762.97 MiB)

julia> @btime m3_mul($A, $B, $C);
  805.715 ms (4 allocations: 762.97 MiB)

julia> @btime m4_tullio($A, $B, $C);
  644.766 ms (316 allocations: 762.96 MiB)

julia> @btime m5_octavian($A, $B, $C);
  563.200 ms (4 allocations: 762.97 MiB)
1 Like

This is not preallocation. You are allocating a res array inside m2, and then you are discarding it and replacing it with X * Y * Z. To pre-allocate, you need to (a) use mul! as in m3_mul, and (b) pass the allocated res and XY arrays in as arguments rather than re-allocating them each time you call the function.
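
A sketch of what that looks like, reusing the example above (the function name and the dummy loop are illustrative):

using LinearAlgebra

function m3_inplace!(res, XY, X, Y, Z)  # buffers are arguments, not allocated here
    mul!(XY, X, Y)
    mul!(res, XY, Z)
    return res
end

res = Matrix{Float64}(undef, N, M)  # allocated once, outside the loop
XY  = Matrix{Float64}(undef, N, K)
for _ in 1:1000
    m3_inplace!(res, XY, A, B, C)   # no per-call allocations
end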

1 Like

Maybe it’s worth noting that this m4_tullio makes 4 nested loops, so is O(N^4) for large square-ish matrices, while pairwise * is O(N^3). Tullio does not try to spot such things.

For non-square matrices, the order of multiplication may also matter. * will try to choose the faster grouping, although here it makes little difference.

julia> @btime ($A * $B) * $C;
  min 77.828 ms, mean 86.254 ms (4 allocations, 762.97 MiB)

julia> @btime $A * ($B * $C);
  min 79.966 ms, mean 89.249 ms (4 allocations, 778.20 MiB)

julia> @btime m4_tullio($A, $B, $C);  # N^4 alg.
  min 117.788 ms, mean 129.025 ms (52 allocations, 762.94 MiB)

julia> using Tullio, LoopVectorization

julia> function m6_tullio(X, Y, Z)  # 2 * N^3 alg., faster
         @tullio XY[n, k] := X[n, j] * Y[j, k]
         @tullio res[n, m] := XY[n, k] * Z[k, m]
       end;

julia> m6_tullio(A, B, C) ≈ A * B * C
true

julia> @btime m6_tullio($A, $B, $C);
  min 32.998 ms, mean 44.807 ms (54 allocations, 762.97 MiB)
5 Likes

Thank you so much! The new function runs much faster! I also noticed that the same code runs much slower on my machine, with over 300 allocations. What could be the cause of the difference?

julia> @benchmark m4_tullio($A, $B, $C)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  581.894 ms … 777.365 ms  ┊ GC (min … max): 0.25% … 8.08%
 Time  (median):     645.025 ms               ┊ GC (median):    4.74%
 Time  (mean ± σ):   649.141 ms ±  69.037 ms  ┊ GC (mean ± σ):  5.12% ± 4.77%

  ▁█       ▁                    ▁▁▁                           ▁  
  ██▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  582 ms           Histogram: frequency by time          777 ms <

 Memory estimate: 762.96 MiB, allocs estimate: 314.

julia> @benchmark m6_tullio($A, $B, $C)
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):   76.264 ms … 210.616 ms  ┊ GC (min … max):  1.06% … 36.74%
 Time  (median):     145.524 ms               ┊ GC (median):    33.50%
 Time  (mean ± σ):   141.689 ms ±  42.856 ms  ┊ GC (mean ± σ):  26.89% ± 17.62%

      ▃ █  ▃                 ▃       ▃     ▃  ▃           ▃      
  ▇▁▇▁█▇█▁▁█▇▁▁▁▁▁▁▁▇▇▇▇▁▁▁▇▁█▁▁▁▁▁▁▁█▁▇▇▁▁█▁▇█▇▇▁▁▇▁▇▁▁▁▇█▁▁▇▇ ▁
  76.3 ms          Histogram: frequency by time          211 ms <

 Memory estimate: 762.99 MiB, allocs estimate: 316.

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428cd (2023-06-07 08:27 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 28 on 32 virtual cores
Environment:
  LD_LIBRARY_PATH = /share/pkg.7/julia/1.9.1/install/lib
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 28

I was surprised to read about this optimization; it is not mentioned in the manual. Where can I find out more?

Maybe it’s a bug that this isn’t printed in the manual, but it is described in help:

help?> *
search: *

[...]

  *(A, B::AbstractMatrix, C)
  A * B * C * D

  Chained multiplication of 3 or 4 matrices is done in the most efficient sequence, based on the
  sizes of the arrays. That is, the number of scalar multiplications needed for (A * B) * C (with
  3 dense matrices) is compared to that for A * (B * C) to choose which of these to execute.

  If the last factor is a vector, or the first a transposed vector, then it is efficient to deal
  with these first. In particular x' * B * y means (x' * B) * y for an ordinary column-major
  B::Matrix. Unlike dot(x, B, y), this allocates an intermediate array.

  If the first or last factor is a number, this will be fused with the matrix multiplication,
  using 5-arg mul!.

  See also muladd, dot.

  │ Julia 1.7
  │
  │  These optimisations require at least Julia 1.7.
3 Likes