Possible performance improvement in matrix multiplication involving a transpose and a complex matrix

jishnub · December 9, 2021, 9:28am

julia> ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800);

julia> @time ZC' * M;
  9.169792 seconds (3 allocations: 46.143 MiB)

julia> @time permutedims(ZC) * M;
  0.877853 seconds (6 allocations: 115.357 MiB, 1.55% gc time)

julia> @which ZC' * M
*(A::AbstractMatrix, B::AbstractMatrix) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:151

julia> @which permutedims(ZC) * M
*(A::StridedMatrix{var"#s859"} where var"#s859"<:Union{Float32, Float64}, B::StridedMatrix{var"#s858"} where var"#s858"<:Union{Float32, Float64, ComplexF32, ComplexF64}) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:158

The multiplication involving the lazy transpose seems to hit a generic fallback if M is complex. This seems to be because

julia> ZC' isa StridedMatrix
false

Why though? Shouldn’t this be a StridedMatrix as well, given that the stride along each dimension is a constant?

Interestingly, despite this method being hit, the performance is almost identical if M is real:

julia> ZC = rand(1800, 1680); M = rand(1800, 1800);

julia> @time ZC' * M;
  0.261731 seconds (3 allocations: 23.071 MiB, 2.57% gc time)

julia> @time permutedims(ZC) * M;
  0.278280 seconds (4 allocations: 46.143 MiB)

julia> @which ZC' * M
*(A::AbstractMatrix, B::AbstractMatrix) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:151

julia> @which permutedims(ZC) * M
*(A::StridedMatrix{var"#s859"} where var"#s859"<:Union{Float32, Float64}, B::StridedMatrix{var"#s858"} where var"#s858"<:Union{Float32, Float64, ComplexF32, ComplexF64}) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:158

Looking a bit into it, it appears that the mul! for the complex matrix hits generic_matmatmul! which is slow, while it doesn’t for real matrices.

# For complex matrices
@inline function mul!(C::AbstractMatrix, adjA::Adjoint{<:Any,<:AbstractVecOrMat}, B::AbstractVecOrMat,
                 alpha::Number, beta::Number)
    A = adjA.parent
    return generic_matmatmul!(C, 'C', 'N', A, B, MulAddMul(alpha, beta))
end

# For real matrices
@inline function mul!(C::StridedMatrix{T}, adjA::Adjoint{<:Any,<:StridedVecOrMat{T}}, B::StridedVecOrMat{T},
                 alpha::Real, beta::Real) where {T<:BlasReal}
    A = adjA.parent
    return mul!(C, transpose(A), B, alpha, beta)
end

Perhaps such a method may be added for complex numbers as well?

jishnub · December 10, 2021, 6:52am

I’ve created an issue about this

stevengj · December 10, 2021, 1:10pm

TLDR: BLAS doesn’t provide a fast way to multiply a conjugate-transposed complex matrix by a real matrix, unless you explicitly materialize the former via Matrix(ZC') or similar.

Elrod · December 10, 2021, 3:56pm

Octavian does pretty well here, at least on computers with a lot of cores and L2 cache:

julia> using Octavian, BenchmarkTools, LinearAlgebra

julia> ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800);

julia> @benchmark matmul!($C1, $ZC', $M)
BenchmarkTools.Trial: 183 samples with 1 evaluation.
 Range (min … max):  26.761 ms … 40.952 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.079 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.392 ms ±  1.158 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █▅▄▂
  ▇▇█████▁▆▃▃▃▃▃▃▃▆██▄▅▄▃▁▃▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  26.8 ms         Histogram: frequency by time        30.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark $ZCt * $M
BenchmarkTools.Trial: 110 samples with 1 evaluation.
 Range (min … max):  39.913 ms … 59.297 ms  ┊ GC (min … max):  0.00% … 30.26%
 Time  (median):     40.437 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   45.758 ms ±  7.155 ms  ┊ GC (mean ± σ):  12.41% ± 13.10%

  █▂                                   ▂                ▃
  ███▄▄▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇██▆▆▁▁▁▄▁▁▁▁▁▁▁▁▄█▆▆▁▇ ▄
  39.9 ms      Histogram: log(frequency) by time      58.2 ms <

 Memory estimate: 92.29 MiB, allocs estimate: 4.

julia> @benchmark mul!($C1, $ZCt, $M)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.384 s …  4.396 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.390 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.390 s ± 8.446 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                      █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.38 s        Histogram: frequency by time         4.4 s <

 Memory estimate: 24.94 KiB, allocs estimate: 3.

julia> versioninfo()
Julia Version 1.8.0-DEV.1112
Commit 971d18b11d* (2021-12-08 18:29 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz

It has a really naive implementation for complex numbers that doesn’t use any fancy instructions or blocking, but even a naive implementation is better than nothing.

These results were on an 18 core machine (7980XE).

Topic		Replies	Views
Is it possible for complex matrix multiplication and adjoint speed to catch up with matlab? New to Julia question	38	1413	October 11, 2023
Puzzling performance when multiplying real and complex matrices General Usage	3	495	February 12, 2019
In-place multiplication is too much slower for ComplexF64 Performance complex-numbers	19	1232	February 10, 2022
Vector{Float} times Vector{Complex} slower than Vector{Complex} times Vector{Complex} Performance question , linearalgebra	6	563	July 26, 2019
Matrix-Vector multiplication complex/real Performance	0	379	March 14, 2021

Possible performance improvement in matrix multiplication involving a transpose and a complex matrix

Related topics