# Possible performance improvement in matrix multiplication involving a transpose and a complex matrix

``````julia
julia> ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800);

julia> @time ZC' * M;
9.169792 seconds (3 allocations: 46.143 MiB)

julia> @time permutedims(ZC) * M;
0.877853 seconds (6 allocations: 115.357 MiB, 1.55% gc time)

julia> @which ZC' * M

julia> @which permutedims(ZC) * M
*(A::StridedMatrix{var"#s859"} where var"#s859"<:Union{Float32, Float64}, B::StridedMatrix{var"#s858"} where var"#s858"<:Union{Float32, Float64, ComplexF32, ComplexF64}) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:158
``````

The multiplication involving the lazy transpose appears to hit a generic fallback when `M` is complex. This seems to be because

``````julia
julia> ZC' isa StridedMatrix
false
``````

Why though? Shouldn't this be a `StridedMatrix` as well, given that the stride along each dimension is constant?
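For what it's worth, the adjoint wrapper does report constant strides; `StridedMatrix` is just a fixed type union (`Matrix`, certain `SubArray`s and `ReshapedArray`s) rather than a trait, and `Adjoint` is not in it. A minimal sketch (calling `strides` on the `Adjoint` of a real matrix assumes a reasonably recent Julia version):

``````julia
using LinearAlgebra

ZC = rand(4, 3)

strides(ZC)    # (1, 4): column-major parent
strides(ZC')   # (4, 1): still constant strides, just in row-major order

ZC' isa StridedMatrix         # false: Adjoint is not part of the union
copy(ZC') isa StridedMatrix   # true: copying materializes a plain Matrix
``````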

Interestingly, despite this method being hit, the performance is almost identical when `M` is real:

``````julia
julia> ZC = rand(1800, 1680); M = rand(1800, 1800);

julia> @time ZC' * M;
0.261731 seconds (3 allocations: 23.071 MiB, 2.57% gc time)

julia> @time permutedims(ZC) * M;
0.278280 seconds (4 allocations: 46.143 MiB)

julia> @which ZC' * M

julia> @which permutedims(ZC) * M
*(A::StridedMatrix{var"#s859"} where var"#s859"<:Union{Float32, Float64}, B::StridedMatrix{var"#s858"} where var"#s858"<:Union{Float32, Float64, ComplexF32, ComplexF64}) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:158
``````

Looking into it a bit, it appears that `mul!` hits the slow `generic_matmatmul!` when the matrix is complex, but not when it is real.

``````julia
# From LinearAlgebra's matmul.jl (Julia 1.7); the first lines of the
# signatures were cut off above and are restored here approximately.

# For complex matrices: falls back to the slow generic kernel
@inline function mul!(C::AbstractMatrix, adjA::Adjoint{<:Any,<:AbstractVecOrMat}, B::AbstractVecOrMat,
                      alpha::Number, beta::Number)
    A = adjA.parent
    return generic_matmatmul!(C, 'C', 'N', A, B, MulAddMul(alpha, beta))
end

# For real matrices: conjugation is a no-op, so forward to the fast transpose path
@inline function mul!(C::StridedMatrix{T}, adjA::Adjoint{<:Any,<:StridedVecOrMat{T}}, B::StridedVecOrMat{T},
                      alpha::Real, beta::Real) where {T<:BlasReal}
    A = adjA.parent
    return mul!(C, transpose(A), B, alpha, beta)
end
``````

Perhaps such a method could be added for the complex case as well?
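To make the suggestion concrete, here is a sketch of the proposed dispatch as a standalone helper (the name `adjmul` is hypothetical, not a LinearAlgebra API): for a real parent, conjugation is a no-op, so the product can be routed to the fast strided real-times-complex method instead of `generic_matmatmul!`.

``````julia
using LinearAlgebra

# Hypothetical helper, not part of LinearAlgebra: for a real parent,
# A' == transpose(A), so materializing the transpose lets the product
# hit the fast StridedMatrix{Float64} * StridedMatrix{ComplexF64} method.
adjmul(adjA::Adjoint{<:Real,<:StridedMatrix}, B::StridedMatrix{<:Union{ComplexF32,ComplexF64}}) =
    permutedims(parent(adjA)) * B
``````

On the sizes above this should land close to the `permutedims(ZC) * M` timing rather than the ~9 s generic path.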


TLDR: BLAS doesn't provide a fast way to multiply a lazily conjugate-transposed real matrix by a complex matrix, unless you explicitly materialize the former via `Matrix(ZC')` or similar.
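So until something changes in LinearAlgebra, a minimal sketch of the workaround is to eagerly materialize the adjoint before multiplying:

``````julia
using LinearAlgebra

ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800)

# Lazy adjoint: not a StridedMatrix, so this hits generic_matmatmul! (slow)
# ZC' * M

# Materialized adjoint: a plain Matrix{Float64}, so this hits the fast
# mixed real/complex strided method
C = Matrix(ZC') * M
``````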


Octavian does pretty well here, at least on computers with a lot of cores and L2 cache:

``````julia
julia> using Octavian, BenchmarkTools, LinearAlgebra

julia> ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800);

julia> @benchmark matmul!($C1, $ZC', $M)
BenchmarkTools.Trial: 183 samples with 1 evaluation.
Range (min β¦ max):  26.761 ms β¦ 40.952 ms  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     27.079 ms              β GC (median):    0.00%
Time  (mean Β± Ο):   27.392 ms Β±  1.158 ms  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
26.8 ms         Histogram: frequency by time        30.2 ms <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark $ZCt * $M
BenchmarkTools.Trial: 110 samples with 1 evaluation.
Range (min β¦ max):  39.913 ms β¦ 59.297 ms  β GC (min β¦ max):  0.00% β¦ 30.26%
Time  (median):     40.437 ms              β GC (median):     0.00%
Time  (mean Β± Ο):   45.758 ms Β±  7.155 ms  β GC (mean Β± Ο):  12.41% Β± 13.10%

ββ                                   β                β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
39.9 ms      Histogram: log(frequency) by time      58.2 ms <

Memory estimate: 92.29 MiB, allocs estimate: 4.

julia> @benchmark mul!($C1, $ZCt, $M)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min β¦ max):  4.384 s β¦  4.396 s  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     4.390 s             β GC (median):    0.00%
Time  (mean Β± Ο):   4.390 s Β± 8.446 ms  β GC (mean Β± Ο):  0.00% Β± 0.00%

4.38 s        Histogram: frequency by time         4.4 s <

Memory estimate: 24.94 KiB, allocs estimate: 3.

julia> versioninfo()
Julia Version 1.8.0-DEV.1112
Commit 971d18b11d* (2021-12-08 18:29 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
``````

It has a really naive implementation for complex numbers that doesn't use any fancy instructions or blocking, but even a naive implementation is better than nothing.
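For reference, a minimal usage sketch of the call benchmarked above (assumes the Octavian package is installed; `matmul!` writes into a preallocated output matrix):

``````julia
using Octavian, LinearAlgebra

ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800)

# Output has size (1680, 1800): ZC' is 1680×1800 and M is 1800×1800
C = Matrix{ComplexF64}(undef, 1680, 1800)

matmul!(C, ZC', M)  # in-place, multithreaded multiply
C ≈ ZC' * M
``````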

These results were on an 18-core machine (7980XE).
