Efficient approach to multiplying three matrices (M1*M2*M3) and computing a vector–matrix–vector product (x'*M*y)

I am trying to optimise a small function in which I am performing linear algebra operations. This function is called in a loop about a thousand times and thus speed is critical.

The main points that I would like to optimise are the following:

  1. M2 = M1*M2*M3, where all terms are dense Float64 matrices (in a specialised method M2 is Symmetric, but in most cases it isn’t);
  2. z = x'*M*y, where x and y are vectors and M is a dense Float64 matrix (again, in a specialised method M is Symmetric, but in most cases it isn’t).

I reckon that an in-place update might help, but I am unsure how to perform it with pre- and post-multiplications by matrices/vectors.

The second case can be handled by dot and the first case by preallocating result arrays outside the loop and using mul! twice.
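A minimal sketch of that suggestion (the matrix/vector names follow the question; the sizes are placeholders):

```julia
using LinearAlgebra

n = 100
M1, M2, M3 = rand(n, n), rand(n, n), rand(n, n)
M = rand(n, n)
x, y = rand(n), rand(n)

# Point 1: preallocate the buffers once, outside the hot loop
tmp = similar(M2)   # will hold M1*M2
out = similar(M2)   # will hold (M1*M2)*M3

# Inside the loop: two in-place multiplications, no per-iteration allocation
mul!(tmp, M1, M2)
mul!(out, tmp, M3)

# Point 2: the three-argument dot computes x' * M * y without allocating
z = dot(x, M, y)
```

Note that `mul!` cannot alias its output with an input, hence the two separate buffers.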


If the matrices and vectors are small, probably just using static arrays is a good alternative:

julia> A = rand(10,10); B = rand(10,10);

julia> @btime $A*$B;
  321.728 ns (1 allocation: 896 bytes)

julia> using StaticArrays

julia> A = @SMatrix rand(10,10);

julia> B = @SMatrix rand(10,10);

julia> @btime $A*$B
  65.206 ns (0 allocations: 0 bytes)

If they are larger, probably calling in place mul! will be a good alternative:

julia> using LinearAlgebra

julia> A = rand(1000,1000); B = rand(1000,1000); C = similar(A);

julia> @btime $A*$B;
  18.840 ms (2 allocations: 7.63 MiB)

julia> @btime LinearAlgebra.mul!($C,$A,$B)
  18.650 ms (0 allocations: 0 bytes)

Thank you.

The arrays are too large to benefit from StaticArrays. I tried LinearAlgebra.mul!, but I haven’t noticed major performance improvements. I was also considering writing a series of nested loops with @views, but the comments in the Tullio.jl readme (# Fast & slow) made me think that’s not a great strategy.

Trying to write custom matrix multiplication routines is probably a waste of time. These functions are far too optimised for one to be able to write something better, unless the problem has a very specific structure we know how to exploit.

mul! will probably not be faster than the out-of-place version in general (I guess not having the constraint of working in place can even allow a faster algorithm). If the time spent allocating is small relative to the time spent computing, just keeping the standard multiplication will be fine. Beyond that, you would probably need to try parallel/GPU-accelerated versions of the multiplication routines to get faster.


dot seems promising (for the second case). My only problem with it is that I could not find a dot product function able to update M in-place or to benefit from the symmetry of this matrix when possible. Would you please point me to the relevant LinearAlgebra code?

I would highly suggest checking out Tullio.jl or Octavian.jl


Thank you, I did not know about Octavian.jl - I will check it out!

I have started playing with Tullio.jl, but as noted above (see the Tullio.jl readme, “Fast & slow”), it does not seem to be the best option for the first problem (though maybe I should benchmark it for the generalised dot product).

This should just happen: @less dot(rand(3), Symmetric(rand(3,3)), rand(3)) shows you the code, which exploits symmetry & doesn’t allocate anything. However, it won’t always be faster than x' * M * y, which does allocate. Something like dot(x, mul!(tmp, M, y)) can avoid that, if you can re-use tmp.
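A quick sketch of both variants (the matrix here is made symmetric positive just for illustration; the sizes are placeholders):

```julia
using LinearAlgebra

n = 200
x, y = rand(n), rand(n)
A = rand(n, n)
M = Symmetric(A + A')   # the Symmetric wrapper dispatches to the specialised dot method

# Allocation-free quadratic form; @less dot(x, M, y) shows the code it calls
z1 = dot(x, M, y)

# Alternative: reuse a preallocated buffer for M*y, then an ordinary dot
tmp = similar(y)
z2 = dot(x, mul!(tmp, M, y))
```

Both return the same scalar; the second trades one `mul!` into a reusable buffer for a potentially faster BLAS-backed matrix–vector product.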

I think @tullio z := x[i] * M[i,j] * y[j] should be pretty efficient, it’s more or less the same loops as dot with some @turbo magic. (It’s for three matrices A * B * C that it will do something algorithmically less efficient than sequential multiplication.) However, it does not know how to exploit Symmetric(M).


Thank you. I will benchmark it now and see how it looks! :slight_smile:

Out of curiosity - are these options (@turbo, Tullio.jl and Octavian.jl) compatible with distributed computing? I might speed these operations further if that’s the case.

Can I take a wild guess that your two operations are for the Gaussian log-likelihood, and that the first thing is for a conditional covariance that looks like \Sigma_1 - \Sigma_{1,2}^T \Sigma_{2}^{-1} \Sigma_{1,2}? And that your second thing is a quadratic form that is actually x^T \Sigma_{2}^{-1} x?

If that happens to be correct, then you can go reasonably far by saving a cholesky(Sigma2), then calling an in-place ldiv!(Sigma2.L, Sigma12) and then using the five-argument mul! to get the final conditional covariance using one extra buffer.

You could also then compute the quadratic form with sum(abs2, some_cholesky_factored_matrix.L\x).
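A sketch of that Cholesky-based approach (all names and sizes here are hypothetical; Sigma2 must be positive definite):

```julia
using LinearAlgebra

n1, n2 = 50, 80
A1 = rand(n1, n1); Sigma1 = A1 * A1' + n1 * I   # SPD covariance blocks
A2 = rand(n2, n2); Sigma2 = A2 * A2' + n2 * I
Sigma12 = rand(n2, n1)

# Factor once, reuse everywhere
C = cholesky(Symmetric(Sigma2))

# B = L \ Sigma12, computed in place (on a copy, so Sigma12 survives)
B = copy(Sigma12)
ldiv!(C.L, B)

# Five-argument mul!: cond = 1.0*cond + (-1.0)*(B'B)
# i.e. cond = Sigma1 - Sigma12' * inv(Sigma2) * Sigma12
cond = copy(Sigma1)
mul!(cond, B', B, -1.0, 1.0)

# Quadratic form x' * inv(Sigma2) * x from the same factor
x = rand(n2)
q = sum(abs2, C.L \ x)
```

The single factorisation replaces every explicit inverse, and the five-argument `mul!` fuses the subtraction into the multiplication with only the one extra buffer `B`.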

Again, just a wild guess, so maybe this isn’t all that helpful.


It’s not really the same problem, but the structure is similar! Also, I might apply your approach in TSAnalysis.jl to speed up the backwards_pass function (in dev).


I just came across this issue. Did you find the most efficient way to multiply three matrices? I would highly appreciate it if you could tell me about your results. Thanks!