Vector - Matrix - Vector multiplication

I want to calculate xMy, where M is a complex matrix given by M = (AB-CD)’ * (AB-BC), and A,B,C,D are complex or real matrices, not of the same dimension, i.e. in general not square matrices. Some example matrices are given below.

A = randn(ComplexF64, 1000,3000)
B = randn(ComplexF64, 3000,2000)
C = randn(ComplexF64, 1000, 3000)
D = randn(ComplexF64, 3000,2000)

x = randn(ComplexF64, 1, 2000)
y = randn(ComplexF64, 2000)

I tried

julia> @btime $x*($A*$B.-$C*$D)'*($A*$B.-$C*$D)*$y
  1.127 s (16 allocations: 183.15 MiB)
1-element Array{Complex{Float64},1}:
 1.8664186843782222e8 - 4.788266272274668e7im

julia> @btime $x*($B'*$A'.-$D'*$C')*($A*$B.-$C*$D)*$y
  1.091 s (16 allocations: 183.15 MiB)
1-element Array{Complex{Float64},1}:
 1.8664186843782255e8 - 4.788266272274663e7im

julia> @btime $x*$B'*$A'*$A*$B*$y .- $x*$B'*$A'*$C*$D*$y .- $x*$D'*$C'*$A*$B*$y .+ $x*$D'*$C'*$C*$D*$y
  98.567 ms (33 allocations: 564.41 KiB)
1-element Array{Complex{Float64},1}:
 1.8664186843782327e8 - 4.788266272274631e7im

Note that the last version is 10x faster than the first two. Why is that?

I want to make the calculation of xMy fast. I will calculate this expression several times ( approx. 100-1000 times). I guess I can save some time by preallocating memory for intermediary results. How can I achieve both?

Cheers!

You should take a look at the mul! function in LinearAlgebra, and you should rather use x = randn(ComplexF64, 2000) vector and then x', instead of a 1xN matrix.

2 Likes

I am aware of the mul! function, its 3 and 5 argument version.

  • But how should I compose it to mimick my above expressions?
  • In which sequence should I do matrix-matrix multiplications?
  • Or should I just calculate from right to left matrix-vector multiplications?

The third version I posted above does something fancy, which results in 10x speed up. I would like to understand what is going on here, or at least what rule of thumb one can follow to compose matrix/vector multiplications to get a decent speed.

The best I could come up with is to use version 3 of my initial algorithm and to only calculate matrix-vector products and preallocate every intermediate result. Something like

cache = [Vector{ComplexF64}(undef, 3000), Vector{ComplexF64}(undef, 1000), Vector{ComplexF64}(undef, 2000)]

julia> @btime mul!($cache[1], $B,$y); mul!($cache[2], $A, $cache[1]); mul!($cache[1], $A', $cache[2]); mul!($cache[3], $B', $cache[1]); dot($x,$cache[3])
  3.280 ms (0 allocations: 0 bytes)

Since the matrices have different sizes the sequence matters. You can go through and count roughly the number of operations for each multiplication pair to get the best complexity.

I believe there is actually a package that can help you with this, but I forget it’s name.

1 Like

@DNF could you be a bit more explicit on how to count? My intuition would tell me that because at the beginning and end are vectors, one should start computing from there. Is this correct?

This is a bad idea: you want to avoid matrix–matrix products, and only perform matrix–vector products.

Multiplying two m \times m matrices requires \Theta(m^3) operations, while multiplying an m\times m matrix by an m-component vector requires only \Theta(m^2) operations.

This is not specific to Juliaβ€”it is a well-known fact in computational linear algebra.

Because the last version computes only matrix–vector products: * is left-associative, so x*B'*A'*A*B*y is equivalent to ((((x*B')*A')*A)*B)*y.

10 Likes

@stevengj thanks for explanation! Would you think Julia will be able to β€œunroll” (A+B)x into Ax + Bx at some point automatically? I am not a computer science person and this is something I would not have seen to give such a huge performance improvement, until today.

The performance problem is not in adding the matrices, but multiplying them. A+B is m^2 additions. Then multipling by x adds another m^2 operations. (A+B)x ~ m^2 as does (Ax +Bx). It is the inner multiplications in (AB .- CD) that are expensive.

3 Likes

I think in general @avx might be smart enough to re-write these types of expressions in an optimal way. Not 100% sure though.

@Jeff_Emanuel Thank you! This now makes sense to me.

@Oscar_Smith @avx does not seem to solve that. Or am I using it wrong?

julia> A, B = rand(2000,2000), rand(2000,2000)
julia> C, D = rand(2000,2000), rand(2000,2000)
julia> x,y = rand(2000), rand(2000)

julia> @btime $y'*($A*$B - $C*$D)*$x
  195.672 ms (7 allocations: 91.57 MiB)
223144.76575300365

julia> @btime $y'*$A*$B*$x - $y'*$C*$D*$x
  4.155 ms (4 allocations: 63.00 KiB)
223144.76575291157

julia> @btime @avx $y'*($A*$B - $C*$D)*$x
  196.520 ms (7 allocations: 91.57 MiB)
223144.76575300365

Were you thinking about GFlops.jl?
Unfortunately it won’t work too well here, as all the actual work will be delegated to the BLAS library, whose code is written in C and is not seen by Cassette.jl (on which GFlops.jl relies)

Rearranging expressions like this can make floating-point computations more unstable though. I am not sure this is something that you want to do automatically.

2 Likes

I’m pretty sure (A+B)x and Ax+Bx should be pretty similar numerically.

Come to think of it, there are ways to β€œfool” Julia into not using BLAS to perform the linear algebra operations. For example using β€œfake” floating-point numbers, that would be handled by generic algorithms, which are implemented in Julia and therefore seen by tools like GFlops.jl

Proof of concept:

primitive type FakeFloat64 <: AbstractFloat 64 end

FakeFloat64(x::Float64) = reinterpret(FakeFloat64, x)
FakeFloat64(x::Number) = FakeFloat64(Float64(x))
as_float64(x::FakeFloat64) = reinterpret(Float64, x)
Base.promote_rule(::Type{FakeFloat64}, ::Type{T}) where {T<:Integer} = FakeFloat64

Base.:+(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) + as_float64(y))
Base.:-(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) - as_float64(y))
Base.:*(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) * as_float64(y))
Base.:/(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) / as_float64(y))
Base.:<(x::FakeFloat64, y::FakeFloat64) = as_float64(x) < as_float64(y)
Base.:<=(x::FakeFloat64, y::FakeFloat64) = as_float64(x) <= as_float64(y)
Base.:-(x::FakeFloat64) = FakeFloat64(-as_float64(x))
Base.sqrt(x::FakeFloat64) = FakeFloat64(sqrt(as_float64(x)))
Base.eps(::Type{FakeFloat64}) = FakeFloat64(eps(Float64))
Base.show(io::IO, x::FakeFloat64) = show(io, as_float64(x))

With this, comparing the algorithmic complexities of various (equivalent) expressions becomes easy:

julia> using GFlops

# Convert matrices to "fake" floats
julia> A = rand(20, 50) .|> FakeFloat64;
julia> B = rand(50, 70) .|> FakeFloat64;
julia> c = rand(70)     .|> FakeFloat64;

# Check that both expressions give similar results
julia> @assert (A*B)*c β‰ˆ A*(B*c)
# Count operations
# this first one involves a matrix-matrix multiplication => large number of ops
julia> @count_ops (A*B)*c
Flop Counter: 145665 flop
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      β”‚ Float64 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  add β”‚   74221 β”‚
β”‚  mul β”‚   71442 β”‚
β”‚  div β”‚       1 β”‚
β”‚ sqrt β”‚       1 β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

# second expression with only matrix-vector products => fewer ops
julia> @count_ops A*(B*c)
Flop Counter: 9210 flop
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     β”‚ Float64 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ add β”‚    4570 β”‚
β”‚ mul β”‚    4640 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I could add such β€œfake” floats to GFlops.jl if it seems to be of interest in more general cases.

6 Likes

One case where bad things could happen is when there are large cancellations in the A+B sum.

A (very) simplified example could look like:

julia> N = 3
3

# Identity matrix
julia> A = diagm(fill(1.0,   N))
3Γ—3 Array{Float64,2}:
 1.0  0.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0

# Nearly (but not exactly) -A
julia> B = diagm(fill(eps(), N)) .- A
3Γ—3 Array{Float64,2}:
 -1.0   0.0   0.0
  0.0  -1.0   0.0
  0.0   0.0  -1.0

in this particular case, two mathematically equivalent expressions yield results differing in their second decimal digit:

# Test on some random vector
julia> x = rand(N);

julia> (A+B)*x
3-element Array{Float64,1}:
 2.5187054247781594e-17
 1.2659280856994421e-16
 1.856277798773992e-18

julia> A*x + B*x
3-element Array{Float64,1}:
 2.7755575615628914e-17
 1.1102230246251565e-16
 1.734723475976807e-18
2 Likes

There is a Julia function named fastmatmul to optimally associate the matrices and perform the multiplications in a cascade of matrix products at this site. It uses dynamic programming as documented in the accompanying book Julia 1.0 Programming Cookbook and is free to use (MIT license).

2 Likes

I think it was GitHub - AustinPrivett/MatrixChainMultiply.jl: Find the fastest way to multiply a chain of matrices and do it.

3 Likes

There is also a PR to add rules for up to 4 matrices, where it’s simple enough to solve by brute force: Add 3-arg * methods by mcabbott Β· Pull Request #37898 Β· JuliaLang/julia Β· GitHub . But it won’t automatically re-organise the brackets in this example.

Also worth mentioning, TensorOperations.jl has a @tensoropt macro to think about such things, docs: Index notation with macros Β· TensorOperations.jl .

2 Likes

I did some further quick benchmarks.

julia> A, B = rand(2000,2000), rand(2000,2000)
julia> C, D = rand(2000,2000), rand(2000,2000)
julia> x,y = rand(2000), rand(2000)

julia> @btime $y'*($A*$B - $C*$D)*$x
  195.672 ms (7 allocations: 91.57 MiB)
223144.76575300365

julia> @btime @tensoropt $y'*($A*$B - $C*$D)*$x
  198.883 ms (7 allocations: 91.57 MiB)
223144.76575300365

@tensoropt does not improve performance over Julias native y'*(A*B-C*D)*x. In the documentation it says: It will however not break apart expressions that have been explicitly grouped with parenthesis. Could this be a reason?