I want to calculate xMy, where M is a complex matrix given by M = (AB - CD)' * (AB - CD), and A, B, C, D are complex or real matrices, not all of the same dimension, i.e. in general not square matrices. Some example matrices are given below.

A = randn(ComplexF64, 1000, 3000)
B = randn(ComplexF64, 3000, 2000)
C = randn(ComplexF64, 1000, 3000)
D = randn(ComplexF64, 3000, 2000)
x = randn(ComplexF64, 1, 2000)
y = randn(ComplexF64, 2000)

Note that the last version is 10x faster than the first two. Why is that?

I want to make the calculation of xMy fast. I will evaluate this expression many times (approx. 100-1000 times). I guess I can save some time by preallocating memory for intermediate results. How can I achieve both?

You should take a look at the mul! function in LinearAlgebra, and you should use a vector x = randn(ComplexF64, 2000) together with x', rather than a 1xN matrix.
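For reference, the two mul! variants mentioned can be sketched like this (sizes borrowed from the question; the buffer name out is mine):

```julia
using LinearAlgebra

A   = randn(ComplexF64, 1000, 3000)
v   = randn(ComplexF64, 3000)
out = Vector{ComplexF64}(undef, 1000)   # preallocated once, reused every call

mul!(out, A, v)          # 3-arg: out .= A*v, no allocation
mul!(out, A, v, 2, 1)    # 5-arg: out .= 2*(A*v) + 1*out, so out is now 3*(A*v)
```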

I am aware of the mul! function, in both its 3- and 5-argument versions.

But how should I compose it to mimic my above expressions?

In which sequence should I do matrix-matrix multiplications?

Or should I just calculate matrix-vector multiplications from right to left?

The third version I posted above does something fancy, which results in a 10x speed-up. I would like to understand what is going on here, or at least what rule of thumb one can follow when composing matrix/vector multiplications to get decent speed.

The best I could come up with is to use version 3 of my initial algorithm, calculate only matrix-vector products, and preallocate every intermediate result. Something like
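The snippet isn't shown above, but a minimal sketch of that idea (only matrix-vector products, 3- and 5-argument mul!, and preallocated buffers) could look like the following. The helper name xMy!, the buffer names, and the assumption M = (A*B - C*D)'*(A*B - C*D) are mine:

```julia
using LinearAlgebra

# Computes x' * M * y with M = (A*B - C*D)'*(A*B - C*D),
# using only matrix-vector products and preallocated buffers.
function xMy!(tB, tD, u, v, x, y, A, B, C, D)
    # u = (A*B - C*D) * x, built right-to-left
    mul!(tB, B, x)             # tB = B*x        (length 3000)
    mul!(u, A, tB)             # u  = A*(B*x)    (length 1000)
    mul!(tD, D, x)             # tD = D*x
    mul!(u, C, tD, -1, 1)      # 5-arg mul!: u = u - C*(D*x)
    # v = (A*B - C*D) * y, same pattern
    mul!(tB, B, y)
    mul!(v, A, tB)
    mul!(tD, D, y)
    mul!(v, C, tD, -1, 1)
    return dot(u, v)           # x' * M * y = u' * v
end

A = randn(ComplexF64, 1000, 3000); B = randn(ComplexF64, 3000, 2000)
C = randn(ComplexF64, 1000, 3000); D = randn(ComplexF64, 3000, 2000)
x = randn(ComplexF64, 2000); y = randn(ComplexF64, 2000)

tB = similar(x, 3000); tD = similar(x, 3000)   # allocated once
u  = similar(x, 1000); v  = similar(x, 1000)
xMy!(tB, tD, u, v, x, y, A, B, C, D)           # allocation-free per call
```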

Since the matrices have different sizes the sequence matters. You can go through and count roughly the number of operations for each multiplication pair to get the best complexity.
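Counting goes roughly like this: an (m×n) times (n×p) product costs about 2mnp flops (n multiplies and roughly n adds per output entry). Plugging in the sizes from the question (helper name matcost is mine):

```julia
# ~2*m*n*p flops for an (m×n) × (n×p) product
matcost(m, n, p) = 2 * m * n * p

# A is 1000×3000, B is 3000×2000, y is a 2000-vector (p = 1).
cost_mat_first = matcost(1000, 3000, 2000) + matcost(1000, 2000, 1)  # (A*B)*y
cost_vec_first = matcost(3000, 2000, 1) + matcost(1000, 3000, 1)     # A*(B*y)

cost_mat_first  # 12_004_000_000
cost_vec_first  # 18_000_000, roughly 670x fewer operations
```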

I believe there is actually a package that can help you with this, but I forget its name.

@DNF could you be a bit more explicit on how to count? My intuition tells me that because there are vectors at the beginning and the end, one should start computing from there. Is this correct?

This is a bad idea: you want to avoid matrix-matrix products and only perform matrix-vector products.

Multiplying two m \times m matrices requires \Theta(m^3) operations, while multiplying an m\times m matrix by an m-component vector requires only \Theta(m^2) operations.

This is not specific to Julia; it is a well-known fact in computational linear algebra.

Because the last version computes only matrix-vector products: * is left-associative, so x*B'*A'*A*B*y is equivalent to ((((x*B')*A')*A)*B)*y.
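A quick way to convince yourself that the two orders agree numerically (sizes shrunk; this is just an illustration, not the thread's benchmark):

```julia
using LinearAlgebra

A = randn(200, 300)
B = randn(300, 100)
y = randn(100)

left  = (A*B)*y    # forms a 200×100 matrix first: O(m*n*p) work
right = A*(B*y)    # two matrix-vector products only
left ≈ right       # same result up to floating-point rounding
```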

@stevengj thanks for the explanation! Do you think Julia will be able to "unroll" (A+B)x into Ax + Bx automatically at some point? I am not a computer science person, and until today this is not something I would have expected to give such a huge performance improvement.

The performance problem is not in adding the matrices, but in multiplying them. A+B is m^2 additions; multiplying by x then adds another ~m^2 operations, so (A+B)x costs ~m^2, as does Ax + Bx. It is the inner matrix-matrix multiplications in (A*B .- C*D) that are expensive.
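In rough numbers, for m = 1000 and counting one flop per scalar add or multiply (my own back-of-the-envelope sketch):

```julia
m = 1000
# (A+B)*x: m^2 adds to form A+B, then ~2m^2 flops for the matrix-vector product
cost_grouped  = m^2 + 2*m^2      # 3_000_000
# A*x + B*x: two matrix-vector products plus m final adds
cost_unrolled = 4*m^2 + m        # 4_001_000
# Both are Θ(m^2); compare with a single m×m matrix-matrix product:
cost_matmat   = 2*m^3            # 2_000_000_000
```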

Were you thinking about GFlops.jl?
Unfortunately it won't work too well here, as all the actual work is delegated to the BLAS library, whose code is written in C and is not seen by Cassette.jl (on which GFlops.jl relies).

Rearranging expressions like this can make floating-point computations less stable, though. I am not sure this is something that you want to do automatically.
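A one-line reminder of why such rearrangements are not value-preserving in floating point (toy example, mine):

```julia
# Floating-point addition is not associative, so algebraically
# equivalent groupings can round differently.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
a == b   # false: 0.6000000000000001 vs 0.6
```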

Come to think of it, there are ways to "fool" Julia into not using BLAS for the linear algebra operations. For example, using "fake" floating-point numbers that are handled by the generic algorithms, which are implemented in Julia and therefore seen by tools like GFlops.jl.

With this, comparing the algorithmic complexities of various (equivalent) expressions becomes easy:

julia> using GFlops

# Convert matrices to "fake" floats
julia> A = rand(20, 50) .|> FakeFloat64;
julia> B = rand(50, 70) .|> FakeFloat64;
julia> c = rand(70) .|> FakeFloat64;

# Check that both expressions give similar results
julia> @assert (A*B)*c ≈ A*(B*c)

# Count operations
# the first expression involves a matrix-matrix multiplication => large number of ops
julia> @count_ops (A*B)*c
Flop Counter: 145665 flop
┌──────┬─────────┐
│      │ Float64 │
├──────┼─────────┤
│ add  │   74221 │
│ mul  │   71442 │
│ div  │       1 │
│ sqrt │       1 │
└──────┴─────────┘

# the second expression uses only matrix-vector products => fewer ops
julia> @count_ops A*(B*c)
Flop Counter: 9210 flop
┌─────┬─────────┐
│     │ Float64 │
├─────┼─────────┤
│ add │    4570 │
│ mul │    4640 │
└─────┴─────────┘

I could add such "fake" floats to GFlops.jl if this seems useful in more general cases.

There is a Julia function named fastmatmul, at this site, that optimally associates the matrices and performs the multiplications as a cascade of matrix products. It uses dynamic programming, as documented in the accompanying book Julia 1.0 Programming Cookbook, and is free to use (MIT license).
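The fastmatmul code itself is at the linked site; a minimal sketch of the underlying dynamic-programming recurrence (the classic matrix-chain ordering, my own code, not the book's) looks like:

```julia
# dims[i] × dims[i+1] is the size of the i-th matrix in the chain.
function chain_order(dims::Vector{Int})
    n = length(dims) - 1
    cost  = zeros(Int, n, n)   # cost[i,j]: min scalar multiplies for matrices i..j
    split = zeros(Int, n, n)   # split[i,j]: best split point k, i.e. (i..k)(k+1..j)
    for len in 2:n, i in 1:(n - len + 1)
        j = i + len - 1
        cost[i, j] = typemax(Int)
        for k in i:(j - 1)
            c = cost[i, k] + cost[k+1, j] + dims[i] * dims[k+1] * dims[j+1]
            if c < cost[i, j]
                cost[i, j] = c
                split[i, j] = k
            end
        end
    end
    return cost, split
end

# Chain (1000×3000)(3000×2000)(2000×1): the optimal order does the
# rightmost (matrix-vector) product first.
cost, split = chain_order([1000, 3000, 2000, 1])
```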

julia> A, B = rand(2000,2000), rand(2000,2000)
julia> C, D = rand(2000,2000), rand(2000,2000)
julia> x,y = rand(2000), rand(2000)
julia> @btime $y'*($A*$B - $C*$D)*$x
195.672 ms (7 allocations: 91.57 MiB)
223144.76575300365
julia> @btime @tensoropt $y'*($A*$B - $C*$D)*$x
198.883 ms (7 allocations: 91.57 MiB)
223144.76575300365

@tensoropt does not improve performance over Julia's native y'*(A*B - C*D)*x. The documentation says: "It will however not break apart expressions that have been explicitly grouped with parenthesis." Could this be the reason?