I want to calculate xMy, where M is a complex matrix given by M = (AB-CD)' * (AB-CD), and A, B, C, D are complex or real matrices that do not all have the same dimensions, i.e. they are in general not square. Some example matrices are given below.
A = randn(ComplexF64, 1000, 3000)
B = randn(ComplexF64, 3000, 2000)
C = randn(ComplexF64, 1000, 3000)
D = randn(ComplexF64, 3000, 2000)
x = randn(ComplexF64, 1, 2000)
y = randn(ComplexF64, 2000)
Note that the last version is 10x faster than the first two. Why is that?
I want to make the calculation of xMy fast. I will evaluate this expression several times (approx. 100-1000 times), so I guess I can save some time by preallocating memory for intermediate results. How can I achieve both speed and memory reuse?
I am aware of the mul! function and its 3- and 5-argument versions.
But how should I compose calls to it to mimic my expressions above?
In which sequence should I do the matrix-matrix multiplications?
Or should I just compute matrix-vector products from right to left?
The third version I posted above does something fancy that results in a 10x speed-up. I would like to understand what is going on here, or at least what rule of thumb one can follow when composing matrix/vector multiplications to get decent speed.
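For illustration, here is one way this could be composed (a sketch only; the function name xMy! and the buffer layout are mine, not from the thread): never form A*B or C*D, work from right to left with matrix-vector products, and write the intermediate results into preallocated buffers using 3- and 5-argument mul!.

using LinearAlgebra

# Sketch: computes x*(A*B - C*D)'*(A*B - C*D)*y using only matrix-vector products.
# u and v have length size(A, 1) (1000 above); t has length size(B, 1) (3000 above).
function xMy!(u, v, t, A, B, C, D, x, y)
    # v = (A*B - C*D) * y, evaluated right to left
    mul!(t, B, y)              # t = B*y
    mul!(v, A, t)              # v = A*(B*y)
    mul!(t, D, y)              # t = D*y
    mul!(v, C, t, -1, 1)       # v = v - C*(D*y)   (5-argument mul!)
    # u = (A*B - C*D) * x'   (x is a 1×n row, so x' is the matching column)
    xc = conj.(vec(x))         # small allocation; could also be preallocated
    mul!(t, B, xc)
    mul!(u, A, t)
    mul!(t, D, xc)
    mul!(u, C, t, -1, 1)
    # x*M*y = (K*x')' * (K*y) = dot(u, v), with K = A*B - C*D
    return dot(u, v)
end

# Allocate the buffers once and reuse them across the 100-1000 evaluations:
u = similar(y, size(A, 1)); v = similar(y, size(A, 1)); t = similar(y, size(B, 1))
xMy!(u, v, t, A, B, C, D, x, y)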
@stevengj thanks for the explanation! Do you think Julia will be able to “unroll” (A+B)x into Ax + Bx automatically at some point? I am not a computer science person, and until today I would not have expected this to give such a huge performance improvement.
The performance problem is not in adding the matrices but in multiplying them. A+B costs ~m^2 additions, and multiplying the result by x adds another ~m^2 operations, so (A+B)x is ~m^2 work, just like Ax + Bx. It is the inner matrix-matrix multiplications in (AB .- CD) that are expensive.
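To put rough numbers on that for the sizes in the original post (a back-of-the-envelope count; complex arithmetic adds a constant factor, but the ratio is what matters):

m, k, n = 1000, 3000, 2000        # A is m×k, B is k×n
matmat = 2 * m * k * n            # forming A*B explicitly  ≈ 1.2e10 flops
matvec = 2 * k * n + 2 * m * k    # B*y, then A*(B*y)       ≈ 1.8e7 flops
matmat / matvec                   # ≈ 667: each explicit A*B or C*D dwarfs the matrix-vector passes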
Were you thinking about GFlops.jl?
Unfortunately it won’t work too well here, as all the actual work will be delegated to the BLAS library, whose code is written in C and is not seen by Cassette.jl (on which GFlops.jl relies).
Come to think of it, there are ways to “fool” Julia into not using BLAS for the linear algebra operations: for example, using “fake” floating-point numbers that are handled by the generic algorithms, which are implemented in Julia and therefore visible to tools like GFlops.jl.
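For concreteness, here is a rough sketch of what such a “fake” float could look like (a hypothetical type written from scratch here, not something GFlops.jl exports; a real implementation would need a few more methods, e.g. comparisons, abs and sqrt, for things like ≈ on arrays to work):

# Because FakeFloat64 is not a BLAS element type, A*B falls back to the generic
# pure-Julia matmul, and the inner Float64 operations are visible to Cassette.
struct FakeFloat64 <: Real
    x::Float64
end
FakeFloat64(a::FakeFloat64) = a

Base.:+(a::FakeFloat64, b::FakeFloat64) = FakeFloat64(a.x + b.x)
Base.:-(a::FakeFloat64, b::FakeFloat64) = FakeFloat64(a.x - b.x)
Base.:*(a::FakeFloat64, b::FakeFloat64) = FakeFloat64(a.x * b.x)
Base.:/(a::FakeFloat64, b::FakeFloat64) = FakeFloat64(a.x / b.x)
Base.zero(::Type{FakeFloat64}) = FakeFloat64(0.0)
Base.convert(::Type{FakeFloat64}, x::Real) = FakeFloat64(Float64(x))
Base.promote_rule(::Type{FakeFloat64}, ::Type{T}) where {T<:Real} = FakeFloat64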
With this, comparing the algorithmic complexities of various (equivalent) expressions becomes easy:
julia> using GFlops
# Convert matrices to "fake" floats
julia> A = rand(20, 50) .|> FakeFloat64;
julia> B = rand(50, 70) .|> FakeFloat64;
julia> c = rand(70) .|> FakeFloat64;
# Check that both expressions give similar results
julia> @assert (A*B)*c ≈ A*(B*c)
# Count operations
# this first one involves a matrix-matrix multiplication => large number of ops
julia> @count_ops (A*B)*c
Flop Counter: 145665 flop
┌──────┬─────────┐
│      │ Float64 │
├──────┼─────────┤
│  add │   74221 │
│  mul │   71442 │
│  div │       1 │
│ sqrt │       1 │
└──────┴─────────┘
# second expression with only matrix-vector products => fewer ops
julia> @count_ops A*(B*c)
Flop Counter: 9210 flop
┌─────┬─────────┐
│     │ Float64 │
├─────┼─────────┤
│ add │    4570 │
│ mul │    4640 │
└─────┴─────────┘
I could add such “fake” floats to GFlops.jl if it seems to be of interest in more general cases.
There is a Julia function named fastmatmul at this site that finds the optimal association order for a cascade of matrix products and then performs the multiplications. It uses dynamic programming, is documented in the accompanying book Julia 1.0 Programming Cookbook, and is free to use (MIT license).
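For reference, the underlying idea is the classic matrix-chain-ordering problem; a minimal sketch of that dynamic program (my own illustration, not the cookbook’s fastmatmul code) could look like this:

# dims has length n+1 for n matrices; matrix i is dims[i] × dims[i+1]
function chain_order(dims::Vector{Int})
    n = length(dims) - 1
    cost  = zeros(Int, n, n)       # minimal scalar multiplications for matrices i..j
    split = zeros(Int, n, n)       # best split point k for matrices i..j
    for len in 2:n, i in 1:n-len+1
        j = i + len - 1
        cost[i, j] = typemax(Int)
        for k in i:j-1
            c = cost[i, k] + cost[k+1, j] + dims[i] * dims[k+1] * dims[j+1]
            if c < cost[i, j]
                cost[i, j] = c
                split[i, j] = k
            end
        end
    end
    return split
end

# Multiply following the recorded split points
function chain_mul(mats, split, i=1, j=length(mats))
    i == j && return mats[i]
    k = split[i, j]
    return chain_mul(mats, split, i, k) * chain_mul(mats, split, k+1, j)
end

# e.g. for A (1000×3000), B (3000×2000) and y treated as a 2000×1 column:
# split = chain_order([1000, 3000, 2000, 1]); chain_mul([A, B, reshape(y, :, 1)], split)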
julia> A, B = rand(2000, 2000), rand(2000, 2000)
julia> C, D = rand(2000, 2000), rand(2000, 2000)
julia> x, y = rand(2000), rand(2000)
julia> @btime $y'*($A*$B - $C*$D)*$x
195.672 ms (7 allocations: 91.57 MiB)
julia> @btime @tensoropt $y'*($A*$B - $C*$D)*$x
198.883 ms (7 allocations: 91.57 MiB)
@tensoropt does not improve performance over Julia's native y'*(A*B - C*D)*x. The documentation says: “It will however not break apart expressions that have been explicitly grouped with parenthesis.” Could this be the reason?
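One quick way to test that suspicion (just a sketch of the manual rewrite, not a benchmark result) is to distribute the products by hand so that only matrix-vector products remain; this computes the same scalar as y'*(A*B - C*D)*x without ever forming a 2000×2000 matrix product:

julia> @btime ($y' * $A) * ($B * $x) - ($y' * $C) * ($D * $x)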