Vector - Matrix - Vector multiplication

jamblejoe · March 13, 2021, 6:43pm

I want to calculate xMy, where M is a complex matrix given by M = (AB-CD)’ * (AB-BC), and A,B,C,D are complex or real matrices, not of the same dimension, i.e. in general not square matrices. Some example matrices are given below.

A = randn(ComplexF64, 1000,3000)
B = randn(ComplexF64, 3000,2000)
C = randn(ComplexF64, 1000, 3000)
D = randn(ComplexF64, 3000,2000)

x = randn(ComplexF64, 1, 2000)
y = randn(ComplexF64, 2000)

I tried

julia> @btime $x*($A*$B.-$C*$D)'*($A*$B.-$C*$D)*$y
  1.127 s (16 allocations: 183.15 MiB)
1-element Array{Complex{Float64},1}:
 1.8664186843782222e8 - 4.788266272274668e7im

julia> @btime $x*($B'*$A'.-$D'*$C')*($A*$B.-$C*$D)*$y
  1.091 s (16 allocations: 183.15 MiB)
1-element Array{Complex{Float64},1}:
 1.8664186843782255e8 - 4.788266272274663e7im

julia> @btime $x*$B'*$A'*$A*$B*$y .- $x*$B'*$A'*$C*$D*$y .- $x*$D'*$C'*$A*$B*$y .+ $x*$D'*$C'*$C*$D*$y
  98.567 ms (33 allocations: 564.41 KiB)
1-element Array{Complex{Float64},1}:
 1.8664186843782327e8 - 4.788266272274631e7im

Note that the last version is 10x faster than the first two. Why is that?

I want to make the calculation of xMy fast. I will calculate this expression several times ( approx. 100-1000 times). I guess I can save some time by preallocating memory for intermediary results. How can I achieve both?

Cheers!

DNF · March 13, 2021, 7:16pm

You should take a look at the mul! function in LinearAlgebra, and you should rather use x = randn(ComplexF64, 2000) vector and then x', instead of a 1xN matrix.

jamblejoe · March 13, 2021, 7:58pm

I am aware of the mul! function, its 3 and 5 argument version.

But how should I compose it to mimick my above expressions?
In which sequence should I do matrix-matrix multiplications?
Or should I just calculate from right to left matrix-vector multiplications?

The third version I posted above does something fancy, which results in 10x speed up. I would like to understand what is going on here, or at least what rule of thumb one can follow to compose matrix/vector multiplications to get a decent speed.

jamblejoe · March 13, 2021, 8:03pm

The best I could come up with is to use version 3 of my initial algorithm and to only calculate matrix-vector products and preallocate every intermediate result. Something like

cache = [Vector{ComplexF64}(undef, 3000), Vector{ComplexF64}(undef, 1000), Vector{ComplexF64}(undef, 2000)]

julia> @btime mul!($cache[1], $B,$y); mul!($cache[2], $A, $cache[1]); mul!($cache[1], $A', $cache[2]); mul!($cache[3], $B', $cache[1]); dot($x,$cache[3])
  3.280 ms (0 allocations: 0 bytes)

DNF · March 13, 2021, 8:03pm

Since the matrices have different sizes the sequence matters. You can go through and count roughly the number of operations for each multiplication pair to get the best complexity.

I believe there is actually a package that can help you with this, but I forget it’s name.

jamblejoe · March 13, 2021, 8:07pm

@DNF could you be a bit more explicit on how to count? My intuition would tell me that because at the beginning and end are vectors, one should start computing from there. Is this correct?

stevengj · March 13, 2021, 8:08pm

This is a bad idea: you want to avoid matrix–matrix products, and only perform matrix–vector products.

Multiplying two m \times m matrices requires \Theta(m^3) operations, while multiplying an m\times m matrix by an m-component vector requires only \Theta(m^2) operations.

This is not specific to Julia—it is a well-known fact in computational linear algebra.

Because the last version computes only matrix–vector products: * is left-associative, so x*B'*A'*A*B*y is equivalent to ((((x*B')*A')*A)*B)*y.

jamblejoe · March 13, 2021, 8:12pm

@stevengj thanks for explanation! Would you think Julia will be able to “unroll” (A+B)x into Ax + Bx at some point automatically? I am not a computer science person and this is something I would not have seen to give such a huge performance improvement, until today.

Jeff_Emanuel · March 13, 2021, 8:42pm

The performance problem is not in adding the matrices, but multiplying them. A+B is m^2 additions. Then multipling by x adds another m^2 operations. (A+B)x ~ m^2 as does (Ax +Bx). It is the inner multiplications in (AB .- CD) that are expensive.

Oscar_Smith · March 13, 2021, 8:51pm

I think in general @avx might be smart enough to re-write these types of expressions in an optimal way. Not 100% sure though.

jamblejoe · March 13, 2021, 9:08pm

@Jeff_Emanuel Thank you! This now makes sense to me.

@Oscar_Smith @avx does not seem to solve that. Or am I using it wrong?

julia> A, B = rand(2000,2000), rand(2000,2000)
julia> C, D = rand(2000,2000), rand(2000,2000)
julia> x,y = rand(2000), rand(2000)

julia> @btime $y'*($A*$B - $C*$D)*$x
  195.672 ms (7 allocations: 91.57 MiB)
223144.76575300365

julia> @btime $y'*$A*$B*$x - $y'*$C*$D*$x
  4.155 ms (4 allocations: 63.00 KiB)
223144.76575291157

julia> @btime @avx $y'*($A*$B - $C*$D)*$x
  196.520 ms (7 allocations: 91.57 MiB)
223144.76575300365

ffevotte · March 13, 2021, 9:55pm

Were you thinking about GFlops.jl?
Unfortunately it won’t work too well here, as all the actual work will be delegated to the BLAS library, whose code is written in C and is not seen by Cassette.jl (on which GFlops.jl relies)

fph · March 13, 2021, 10:11pm

Rearranging expressions like this can make floating-point computations more unstable though. I am not sure this is something that you want to do automatically.

Oscar_Smith · March 13, 2021, 10:35pm

I’m pretty sure (A+B)x and Ax+Bx should be pretty similar numerically.

ffevotte · March 13, 2021, 10:38pm

Come to think of it, there are ways to “fool” Julia into not using BLAS to perform the linear algebra operations. For example using “fake” floating-point numbers, that would be handled by generic algorithms, which are implemented in Julia and therefore seen by tools like GFlops.jl

Proof of concept:

primitive type FakeFloat64 <: AbstractFloat 64 end

FakeFloat64(x::Float64) = reinterpret(FakeFloat64, x)
FakeFloat64(x::Number) = FakeFloat64(Float64(x))
as_float64(x::FakeFloat64) = reinterpret(Float64, x)
Base.promote_rule(::Type{FakeFloat64}, ::Type{T}) where {T<:Integer} = FakeFloat64

Base.:+(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) + as_float64(y))
Base.:-(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) - as_float64(y))
Base.:*(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) * as_float64(y))
Base.:/(x::FakeFloat64, y::FakeFloat64) = FakeFloat64(as_float64(x) / as_float64(y))
Base.:<(x::FakeFloat64, y::FakeFloat64) = as_float64(x) < as_float64(y)
Base.:<=(x::FakeFloat64, y::FakeFloat64) = as_float64(x) <= as_float64(y)
Base.:-(x::FakeFloat64) = FakeFloat64(-as_float64(x))
Base.sqrt(x::FakeFloat64) = FakeFloat64(sqrt(as_float64(x)))
Base.eps(::Type{FakeFloat64}) = FakeFloat64(eps(Float64))
Base.show(io::IO, x::FakeFloat64) = show(io, as_float64(x))

With this, comparing the algorithmic complexities of various (equivalent) expressions becomes easy:

julia> using GFlops

# Convert matrices to "fake" floats
julia> A = rand(20, 50) .|> FakeFloat64;
julia> B = rand(50, 70) .|> FakeFloat64;
julia> c = rand(70)     .|> FakeFloat64;

# Check that both expressions give similar results
julia> @assert (A*B)*c ≈ A*(B*c)

# Count operations
# this first one involves a matrix-matrix multiplication => large number of ops
julia> @count_ops (A*B)*c
Flop Counter: 145665 flop
┌──────┬─────────┐
│      │ Float64 │
├──────┼─────────┤
│  add │   74221 │
│  mul │   71442 │
│  div │       1 │
│ sqrt │       1 │
└──────┴─────────┘

# second expression with only matrix-vector products => fewer ops
julia> @count_ops A*(B*c)
Flop Counter: 9210 flop
┌─────┬─────────┐
│     │ Float64 │
├─────┼─────────┤
│ add │    4570 │
│ mul │    4640 │
└─────┴─────────┘

I could add such “fake” floats to GFlops.jl if it seems to be of interest in more general cases.

ffevotte · March 13, 2021, 10:50pm

One case where bad things could happen is when there are large cancellations in the A+B sum.

A (very) simplified example could look like:

julia> N = 3
3

# Identity matrix
julia> A = diagm(fill(1.0,   N))
3×3 Array{Float64,2}:
 1.0  0.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0

# Nearly (but not exactly) -A
julia> B = diagm(fill(eps(), N)) .- A
3×3 Array{Float64,2}:
 -1.0   0.0   0.0
  0.0  -1.0   0.0
  0.0   0.0  -1.0

in this particular case, two mathematically equivalent expressions yield results differing in their second decimal digit:

# Test on some random vector
julia> x = rand(N);

julia> (A+B)*x
3-element Array{Float64,1}:
 2.5187054247781594e-17
 1.2659280856994421e-16
 1.856277798773992e-18

julia> A*x + B*x
3-element Array{Float64,1}:
 2.7755575615628914e-17
 1.1102230246251565e-16
 1.734723475976807e-18

PeterSimon · March 13, 2021, 11:13pm

There is a Julia function named fastmatmul to optimally associate the matrices and perform the multiplications in a cascade of matrix products at this site. It uses dynamic programming as documented in the accompanying book Julia 1.0 Programming Cookbook and is free to use (MIT license).

DNF · March 13, 2021, 11:58pm

I think it was GitHub - AustinPrivett/MatrixChainMultiply.jl: Find the fastest way to multiply a chain of matrices and do it.

mcabbott · March 14, 2021, 12:48am

There is also a PR to add rules for up to 4 matrices, where it’s simple enough to solve by brute force: https://github.com/JuliaLang/julia/pull/37898 . But it won’t automatically re-organise the brackets in this example.

Also worth mentioning, TensorOperations.jl has a @tensoropt macro to think about such things, docs: https://jutho.github.io/TensorOperations.jl/stable/indexnotation/#Contraction-order-and-@tensoropt-macro .

jamblejoe · March 14, 2021, 10:14am

I did some further quick benchmarks.

julia> A, B = rand(2000,2000), rand(2000,2000)
julia> C, D = rand(2000,2000), rand(2000,2000)
julia> x,y = rand(2000), rand(2000)

julia> @btime $y'*($A*$B - $C*$D)*$x
  195.672 ms (7 allocations: 91.57 MiB)
223144.76575300365

julia> @btime @tensoropt $y'*($A*$B - $C*$D)*$x
  198.883 ms (7 allocations: 91.57 MiB)
223144.76575300365

@tensoropt does not improve performance over Julias native y'*(A*B-C*D)*x. In the documentation it says: It will however not break apart expressions that have been explicitly grouped with parenthesis. Could this be a reason?

Topic		Replies	Views
Matrix-Vector multiplication complex/real Performance	0	374	March 14, 2021
Efficient approach to multiply three matrices (M1M2M3) and two vectors and a matrix (xMy) Performance linearalgebra	18	5613	July 10, 2023
Vector multiply vector vs for loop, type and size dependent performance New to Julia	13	1082	November 7, 2018
Possible performance improvement in matrix multiplication involving a transpose and a complex matrix Performance linearalgebra	3	839	December 10, 2021
Multiplication of vector of matrices and vector of vectors Performance linearalgebra , tullio	3	413	November 7, 2022

Vector - Matrix - Vector multiplication

Related topics