# Possible performance improvement in matrix multiplication involving a transpose and a complex matrix

``````julia
julia> ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800);

julia> @time ZC' * M;
9.169792 seconds (3 allocations: 46.143 MiB)

julia> @time permutedims(ZC) * M;
0.877853 seconds (6 allocations: 115.357 MiB, 1.55% gc time)

julia> @which ZC' * M

julia> @which permutedims(ZC) * M
*(A::StridedMatrix{var"#s859"} where var"#s859"<:Union{Float32, Float64}, B::StridedMatrix{var"#s858"} where var"#s858"<:Union{Float32, Float64, ComplexF32, ComplexF64}) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:158
``````

The multiplication involving the lazy transpose appears to hit a generic fallback when `M` is complex. This seems to be because

``````julia
julia> ZC' isa StridedMatrix
false
``````

Why though? Shouldn't this be a `StridedMatrix` as well, given that the stride along each dimension is constant?
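For what it's worth, the adjoint wrapper does report constant strides; `StridedMatrix` is just a fixed type union (`Matrix`, certain `SubArray`s and `ReshapedArray`s) rather than a trait, and `Adjoint` is not in it. A minimal sketch (calling `strides` on the `Adjoint` of a real matrix assumes a reasonably recent Julia version):

``````julia
using LinearAlgebra

ZC = rand(4, 3)

strides(ZC)    # (1, 4): column-major parent
strides(ZC')   # (4, 1): still constant strides, just in row-major order

ZC' isa StridedMatrix         # false: Adjoint is not part of the union
copy(ZC') isa StridedMatrix   # true: copying materializes a plain Matrix
``````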

Interestingly, despite this method being hit, the performance is almost identical when `M` is real:

``````julia
julia> ZC = rand(1800, 1680); M = rand(1800, 1800);

julia> @time ZC' * M;
0.261731 seconds (3 allocations: 23.071 MiB, 2.57% gc time)

julia> @time permutedims(ZC) * M;
0.278280 seconds (4 allocations: 46.143 MiB)

julia> @which ZC' * M

julia> @which permutedims(ZC) * M
*(A::StridedMatrix{var"#s859"} where var"#s859"<:Union{Float32, Float64}, B::StridedMatrix{var"#s858"} where var"#s858"<:Union{Float32, Float64, ComplexF32, ComplexF64}) in LinearAlgebra at /home/jishnu/Downloads/julia/julia-1.7.0/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:158
``````

Looking into it a bit, it appears that `mul!` hits the slow `generic_matmatmul!` when the matrix is complex, but not when it is real.

``````julia
# From LinearAlgebra's matmul.jl (Julia 1.7); the first lines of the
# signatures were cut off above and are restored here approximately.

# For complex matrices: falls back to the slow generic kernel
@inline function mul!(C::AbstractMatrix, adjA::Adjoint{<:Any,<:AbstractVecOrMat}, B::AbstractVecOrMat,
                      alpha::Number, beta::Number)
    A = adjA.parent
    return generic_matmatmul!(C, 'C', 'N', A, B, MulAddMul(alpha, beta))
end

# For real matrices: conjugation is a no-op, so forward to the fast transpose path
@inline function mul!(C::StridedMatrix{T}, adjA::Adjoint{<:Any,<:StridedVecOrMat{T}}, B::StridedVecOrMat{T},
                      alpha::Real, beta::Real) where {T<:BlasReal}
    A = adjA.parent
    return mul!(C, transpose(A), B, alpha, beta)
end
``````

Perhaps such a method could be added for the complex case as well?
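To make the suggestion concrete, here is a sketch of the proposed dispatch as a standalone helper (the name `adjmul` is hypothetical, not a LinearAlgebra API): for a real parent, conjugation is a no-op, so the product can be routed to the fast strided real-times-complex method instead of `generic_matmatmul!`.

``````julia
using LinearAlgebra

# Hypothetical helper, not part of LinearAlgebra: for a real parent,
# A' == transpose(A), so materializing the transpose lets the product
# hit the fast StridedMatrix{Float64} * StridedMatrix{ComplexF64} method.
adjmul(adjA::Adjoint{<:Real,<:StridedMatrix}, B::StridedMatrix{<:Union{ComplexF32,ComplexF64}}) =
    permutedims(parent(adjA)) * B
``````

On the sizes above this should land close to the `permutedims(ZC) * M` timing rather than the ~9 s generic path.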


TLDR: BLAS doesn't provide a fast way to multiply a lazily conjugate-transposed real matrix by a complex matrix, unless you explicitly materialize the former via `Matrix(ZC')` or similar.
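So until something changes in LinearAlgebra, a minimal sketch of the workaround is to eagerly materialize the adjoint before multiplying:

``````julia
using LinearAlgebra

ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800)

# Lazy adjoint: not a StridedMatrix, so this hits generic_matmatmul! (slow)
# ZC' * M

# Materialized adjoint: a plain Matrix{Float64}, so this hits the fast
# mixed real/complex strided method
C = Matrix(ZC') * M
``````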


Octavian does pretty well here, at least on computers with a lot of cores and L2 cache:

``````julia
julia> using Octavian, BenchmarkTools, LinearAlgebra

julia> ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800);

julia> @benchmark matmul!($C1, $ZC', $M)
BenchmarkTools.Trial: 183 samples with 1 evaluation.
Range (min β¦ max):  26.761 ms β¦ 40.952 ms  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     27.079 ms              β GC (median):    0.00%
Time  (mean Β± Ο):   27.392 ms Β±  1.158 ms  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
26.8 ms         Histogram: frequency by time        30.2 ms <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark $ZCt * $M
BenchmarkTools.Trial: 110 samples with 1 evaluation.
Range (min β¦ max):  39.913 ms β¦ 59.297 ms  β GC (min β¦ max):  0.00% β¦ 30.26%
Time  (median):     40.437 ms              β GC (median):     0.00%
Time  (mean Β± Ο):   45.758 ms Β±  7.155 ms  β GC (mean Β± Ο):  12.41% Β± 13.10%

ββ                                   β                β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
39.9 ms      Histogram: log(frequency) by time      58.2 ms <

Memory estimate: 92.29 MiB, allocs estimate: 4.

julia> @benchmark mul!($C1, $ZCt, $M)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min β¦ max):  4.384 s β¦  4.396 s  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     4.390 s             β GC (median):    0.00%
Time  (mean Β± Ο):   4.390 s Β± 8.446 ms  β GC (mean Β± Ο):  0.00% Β± 0.00%

4.38 s        Histogram: frequency by time         4.4 s <

Memory estimate: 24.94 KiB, allocs estimate: 3.

julia> versioninfo()
Julia Version 1.8.0-DEV.1112
Commit 971d18b11d* (2021-12-08 18:29 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
``````

It has a really naive implementation for complex numbers that doesn't use any fancy instructions or blocking, but even a naive implementation is better than nothing.
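For reference, a minimal usage sketch of the call benchmarked above (assumes the Octavian package is installed; `matmul!` writes into a preallocated output matrix):

``````julia
using Octavian, LinearAlgebra

ZC = rand(1800, 1680); M = rand(ComplexF64, 1800, 1800)

# Output has size (1680, 1800): ZC' is 1680×1800 and M is 1800×1800
C = Matrix{ComplexF64}(undef, 1680, 1800)

matmul!(C, ZC', M)  # in-place, multithreaded multiply
C ≈ ZC' * M
``````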

These results were on an 18-core machine (7980XE).
