Is it possible for complex matrix multiplication with an adjoint to catch up with MATLAB's speed?

I need to perform a large number of complex matrix multiplications in my programs, but they are very slow, and after adding Distributed parallelism the multiplication becomes more than 10 times slower still. Overall it is much slower than MATLAB, especially when an adjoint is involved.

# Julia 1.9.2
using BenchmarkTools
H = randn(Complex{Float64}, 1000, 1000)
realH = real(H)
Omega = randn(size(H, 2), 10)
Y = Matrix{Complex}(undef, 1000, 10);
Y1 = similar(Y)
@btime for j = 1:1000
    Y = H * Omega
end  # Julia: 0.6 s    MATLAB: 0.2 s

@btime for j = 1:1000
    Y = H' * Omega
end  # Julia: 7.3 s    MATLAB: 0.35 s !!!

@btime for j = 1:1000
    Y = realH' * Omega
end  # Julia: 0.31 s   MATLAB: 0.1 s

using MKL

@btime for j = 1:1000
    Y = H * Omega
end  # Julia: 0.6 s => 0.2 s    MATLAB: 0.2 s

@btime for j = 1:1000
    Y = H' * Omega
end  # Julia: 7.3 s => 6.6 s    MATLAB: 0.35 s


MKL can speed up the multiplication, but the adjoint is still a problem. I searched the community and tried other methods to accelerate the matrix product. I may not understand the underlying principles very well, so I am not sure I am doing it right, but the effect of these methods is not obvious.

### Try other methods ###

# mul!
using LinearAlgebra   # needed for mul!
@btime for j = 1:1000
    mul!(Y1, H, Omega)
end  # 6.62 s

@btime for j = 1:1000
    mul!(Y1, H', Omega)
end  # 10.8 s

# LoopVectorization @turbo
using LoopVectorization
function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end

@btime for j = 1:1000
    mygemmavx!(Y1, H, Omega)
end  # 9.6 s

@btime for j = 1:1000
    mygemmavx!(Y1, H', Omega)
end  # 2.7 s

versioninfo()

Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores
Environment:
  JULIA_NUM_THREADS1 = 1
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
  JULIA_PYTHONCALL_EXE = @PyCall
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

MKL should help a decent amount. You also want Y=Matrix{Complex{Float64}} or else you have an abstract type.
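For instance, with the 1000×10 size from the original post:

Y = Matrix{ComplexF64}(undef, 1000, 10)  # ComplexF64 is an alias for Complex{Float64}, a concrete element type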

1 Like

This might be an artifact of your benchmarking method.
It looks like you are using @time and operating in the global scope. For better accuracy, please put everything you want to benchmark into a function and use @btime or @benchmark from BenchmarkTools.jl.

Here are general performance tips for further reference.
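A minimal sketch of that setup (adjoint_mul is just an illustrative name):

using BenchmarkTools

adjoint_mul(H, Omega) = H' * Omega     # put the work inside a function

H = randn(ComplexF64, 1000, 1000)
Omega = randn(1000, 10)

@btime adjoint_mul($H, $Omega);        # $-interpolation avoids untyped-global overhead
@benchmark adjoint_mul($H, $Omega)     # full timing statistics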

3 Likes

Thank you. MKL can speed up * because it uses all of the threads. It does indeed speed up the multiplication, but there is still a problem: it cannot speed up the transposition of complex matrices.

using MKL
@time for j = 1:1000
    Y = H' * Omega
end  # Julia: 8 s => 7.4 s    MATLAB: 0.35 s

Please go through the performance tips and place the code within a function to start with. Secondly, the first run in Julia involves compilation, which can be slow; the performance from the second run onward should be comparable to MATLAB's, since both end up calling MKL internally.

2 Likes

Thank you for reminding me; I have modified it.

MKL can speed up the multiplication, but complex matrix transposition is still a problem.

This is still not very good benchmarking practice. Please do

@btime $(H')*$Omega  

Note I am interpolating the variables into this expression to not measure overhead from accessing untyped globals.

Also a note: ' takes the adjoint (conjugate transpose) of a complex matrix, not the plain transpose. Is this consistent with your MATLAB version?
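For illustration (in MATLAB, ' is likewise the conjugate transpose, while .' is the plain transpose):

z = [1 + 2im  3 - 1im]
z'            # adjoint: 2×1, entries 1 - 2im and 3 + 1im (conjugated)
transpose(z)  # transpose: 2×1, entries 1 + 2im and 3 - 1im (not conjugated)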

1 Like

So on my machine, using OpenBLAS (8 BLAS threads), I get the following.
Real case:

julia> Y = zeros(100,10);H=randn(100,100);Omega= rand(100,10);
julia> @btime mul!($Y, $(H), $(Omega));
  15.626 μs (0 allocations: 0 bytes)
julia> @btime mul!($Y, $(H'), $(Omega));
  15.494 μs (0 allocations: 0 bytes)

Complex case:

julia> Y = zeros(ComplexF64,100,10);H=randn(ComplexF64, 100,100);Omega= rand(100,10);
julia> @btime mul!($Y, $(H), $(Omega));
  30.296 μs (0 allocations: 0 bytes)
julia> @btime mul!($Y, $(H'), $(Omega));
  156.296 μs (0 allocations: 0 bytes)

So I agree that the complex adjoint case is about 5x slower than the plain complex multiplication, while the real transpose case is just as fast as the plain real multiplication.

3 Likes

There does seem to be a slowdown here, which arises from the adjoint being materialized:

julia> @btime $H * $Omega;
  1.069 ms (2 allocations: 156.30 KiB)

julia> @btime $H' * $Omega;
  5.382 ms (4 allocations: 15.41 MiB)

julia> @btime collect($H');
  4.038 ms (2 allocations: 15.26 MiB)
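If the same H is reused across many products, one possible workaround (a sketch; Hadj is just an illustrative name) is to pay for that copy only once:

Hadj = collect(H')   # materialize the adjoint once (the ~15 MiB copy)
Y = Hadj * Omega     # later products reuse the materialized matrix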
1 Like

Interesting!

I tried to follow the functions that each call goes through, and it seems that H' * Omega ends up in a generic matmul implementation in Julia, whereas H * Omega is dispatched to BLAS.

julia> H = randn(ComplexF64, 100, 100); Omega = rand(100, 10); Y = zeros(ComplexF64, 100, 10);

mul!(Y, H, Omega)
  -> mul!(Y, H, Omega, true, false)
  -> LinearAlgebra.gemm_wrapper!(Y, 'N', 'N', H, Omega, LinearAlgebra.MulAddMul(true, false))
  -> probably BLAS.gemm!, maybe generic_matmatmul!

For the adjoint:

mul!(Y, H', Omega)
  -> mul!(Y, H', Omega, true, false)
  -> LinearAlgebra.generic_matmatmul!(Y, 'C', 'N', H, Omega, LinearAlgebra.MulAddMul(true, false))
1 Like

For me the adjoint version seems slightly faster, though only if the adjoint is created outside of the timing; when the adjoint is created inside the timing, the two are quite similar. This is with MKL, though the timings without MKL were all about half the speed, so the relative timings are the same.

julia> @btime $A * $B;
  965.559 μs (2 allocations: 156.30 KiB)

julia> @btime $(A') * $B;
  941.128 μs (2 allocations: 156.30 KiB)

julia> @btime ($A)' * $B;
  969.349 μs (2 allocations: 156.30 KiB)

julia> typeof(A), typeof(B)
(Matrix{ComplexF64}, Matrix{ComplexF64})

What’s your versioninfo? This seems not to match what I see on nightly (or 1.9/1.10), where the adjoint is collected.

This is not an issue, as generic_matmatmul! calls BLAS.gemm! internally in compatible cases.

Ah, but in the OP's example, Omega (or B in your case) is a real matrix!
That hits a different code path than when both arguments are complex: the complex A of size n by n is reinterpreted as a real matrix of size 2n by n, which lets the product use the fast real BLAS routines; this reinterpretation isn't possible when both A and B are complex.

If A (or H) is wrapped in an adjoint and B is real, the call ends up in a slow pure-Julia implementation of matrix multiplication instead of BLAS, as @abraemer has already stated. If B is complex, though, the complex BLAS routines are used.
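A rough sketch of that reinterpretation idea (only an illustration, not the actual LinearAlgebra internals):

using LinearAlgebra

H = randn(ComplexF64, 1000, 1000)
Omega = randn(1000, 10)

Hre = reinterpret(Float64, H)        # view the 1000×1000 complex H as a 2000×1000 real matrix
Yre = Hre * Omega                    # one real matrix product: 2000×10 real result
Y   = reinterpret(ComplexF64, Yre)   # view that result as the 1000×10 complex product

Y ≈ H * Omega                        # true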

4 Likes

No, I don't think so. At least not on Julia 1.9.3, which I am using.

Here is the definition of generic_matmatmul!. It only handles special cases for 2x2 and 3x3 matrices, or when a coefficient is zero; in the general case it just calls _generic_matmatmul!, which implements the matrix-matrix product in pure Julia. I don't see a way to end up in BLAS once generic_matmatmul! has been reached.

1 Like

Sorry, I should have said that there’s a BLAS.gemm! branch in gemm_wrapper! that I would expect to be used for floating-point types. The generic_matmatmul! branch is a fallback for other cases. Are you certain that BLAS isn’t being called here?

Edit: I see the difference: H' * Omega doesn’t call mul! with H', but in fact materializes the lazy adjoint. This ensures that the BLAS method is called instead of hitting the generic fallback, as would be the case if mul! is called directly.

1 Like
versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores
Environment:
  JULIA_NUM_THREADS1 = 1
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
  JULIA_PYTHONCALL_EXE = @PyCall
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

I did the same test as you. On my PC, it’s nearly 10x slower, probably an issue with my threads.

julia> @btime $(H)*$Omega;
  416.800 μs (2 allocations: 156.30 KiB)

julia> @btime $H' * $Omega;
  5.399 ms (4 allocations: 15.41 MiB)

julia> @btime collect($H');
  3.737 ms (2 allocations: 15.26 MiB)
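If it is a thread issue, the BLAS thread count can be checked and changed like this (the 8 is just an example value):

using LinearAlgebra
BLAS.get_num_threads()   # how many threads the loaded BLAS library is using
BLAS.set_num_threads(8)  # example value; match it to your physical core count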

I did another test later: the mygemmavx! function from LoopVectorization's standard example can be about 2x faster for the complex adjoint case (it may be avoiding the cost of the adjoint), but when it is used for the plain multiplication without the adjoint, it becomes very slow.

julia> @btime $H' * $Omega;
   5.399 ms (4 allocations: 15.41 MiB)
julia> @btime collect($(H'));
   3.919 ms (2 allocations: 15.26 MiB)

using LoopVectorization
function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end

julia> @btime mygemmavx!($Y1,$(H'),$Omega)
   2.187 ms (79 allocations: 4.58 KiB)

julia> @btime mygemmavx!($Y1,$(H),$Omega)
   8.192 ms (79 allocations: 4.58 KiB)

That is unfortunate. Poking around in the debugger, it looks like * will use the BLAS gemm path for adjointed/transposed matrices if both matrices are real or both are complex, but not if H' is complex and Omega is real. By way of confirmation, the following code

@btime for j = 1:1000
    Y = H' * Omega
end  # Julia: 7.3 s    MATLAB: 0.35 s !!!

using LinearAlgebra.BLAS
@btime for j = 1:1000
    Y = gemm('C', 'N', 1.0 + 0.0im, H, convert(typeof(H), Omega))
end

returns

  14.256 s (5000 allocations: 15.05 GiB)
  3.177 s (4000 allocations: 305.27 MiB)

on my old and slow desktop.
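Another possible workaround, sketched below (adjoint_times_real is just an illustrative name, not a library function): since Omega is real, H' * Omega can be split into two real gemm calls, at the cost of copying out the real and imaginary parts of H.

using LinearAlgebra

# H' * Omega == real(H)' * Omega - im * (imag(H)' * Omega) when Omega is real,
# so each product stays in fast real BLAS.
adjoint_times_real(H, Omega) = real(H)' * Omega .- im .* (imag(H)' * Omega)

adjoint_times_real(H, Omega) ≈ H' * Omega   # true

If H is reused, real(H) and imag(H) could of course be computed once and kept around.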

4 Likes