Is it possible for complex matrix multiplication with an adjoint to catch up with MATLAB's speed?

I need to perform a large number of complex matrix multiplications in my programs, but they are very slow, and after adding Distributed parallelism the multiplication becomes more than 10 times slower still. Overall it is much slower than MATLAB, especially when an adjoint is involved.

# Julia 1.9.2
using BenchmarkTools
H = randn(Complex{Float64}, 1000, 1000)
realH = real(H)
Omega = randn(size(H, 2), 10)
Y = Matrix{Complex}(undef, 1000, 10);
Y1 = similar(Y)
@btime for j = 1:1000
    Y = H * Omega
end  # Julia: 0.6 s    MATLAB: 0.2 s

@btime for j = 1:1000
    Y = H' * Omega
end  # Julia: 7.3 s    MATLAB: 0.35 s !!!

@btime for j = 1:1000
    Y = realH' * Omega
end  # Julia: 0.31 s   MATLAB: 0.1 s

using MKL

@btime for j = 1:1000
    Y = H * Omega
end  # Julia: 0.6 s => 0.2 s    MATLAB: 0.2 s

@btime for j = 1:1000
    Y = H' * Omega
end  # Julia: 7.3 s => 6.6 s    MATLAB: 0.35 s


MKL can speed up the multiplication, but the adjoint is still a problem. I searched the community and tried other methods to accelerate the matrix product. I may not understand the underlying principles very well, so I am not sure I am doing it right, but the effect of these methods is not obvious.

### Try other methods ###

# mul!
using LinearAlgebra   # needed for mul!
@btime for j = 1:1000
    mul!(Y1, H, Omega)
end  # 6.62 s

@btime for j = 1:1000
    mul!(Y1, H', Omega)
end  # 10.8 s

# LoopVectorization @turbo
using LoopVectorization
function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end

@btime for j = 1:1000
    mygemmavx!(Y1, H, Omega)
end  # 9.6 s

@btime for j = 1:1000
    mygemmavx!(Y1, H', Omega)
end  # 2.7 s

versioninfo()

Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores
Environment:
  JULIA_NUM_THREADS1 = 1
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
  JULIA_PYTHONCALL_EXE = @PyCall
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

MKL should help a decent amount. You also want Y=Matrix{Complex{Float64}} or else you have an abstract type.
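For instance, with the 1000×10 size from the original post:

Y = Matrix{ComplexF64}(undef, 1000, 10)  # ComplexF64 is an alias for Complex{Float64}, a concrete element type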

1 Like

This might be an artifact of your benchmarking method.
It looks like you are using @time and operating in the global scope. For better accuracy, please put everything you want to benchmark into a function and use @btime or @benchmark from BenchmarkTools.jl.

Here are general performance tips for further reference.
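A minimal sketch of that setup (adjoint_mul is just an illustrative name):

using BenchmarkTools

adjoint_mul(H, Omega) = H' * Omega     # put the work inside a function

H = randn(ComplexF64, 1000, 1000)
Omega = randn(1000, 10)

@btime adjoint_mul($H, $Omega);        # $-interpolation avoids untyped-global overhead
@benchmark adjoint_mul($H, $Omega)     # full timing statistics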

3 Likes

Thank you. MKL can speed up * because it uses all of the threads. It does indeed speed up the multiplication, but there is still a problem: it cannot speed up the transposition of complex matrices.

using MKL
@time for j = 1:1000
    Y = H' * Omega
end  # Julia: 8 s => 7.4 s    MATLAB: 0.35 s

Please go through the performance tips and place the code within a function to start with. Secondly, the first run in Julia involves compilation, which can be slow; the performance from the second run onward should be comparable to MATLAB's, since both end up calling MKL internally.

2 Likes

Thank you for reminding me; I have modified it.

MKL can speed up the multiplication, but complex matrix transposition is still a problem.

This is still not very good benchmarking practice. Please do

@btime $(H')*$Omega  

Note I am interpolating the variables into this expression to not measure overhead from accessing untyped globals.

Also a note: ' takes the adjoint (conjugate transpose) of a complex matrix, not the plain transpose. Is this consistent with your MATLAB version?
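For illustration (in MATLAB, ' is likewise the conjugate transpose, while .' is the plain transpose):

z = [1 + 2im  3 - 1im]
z'            # adjoint: 2×1, entries 1 - 2im and 3 + 1im (conjugated)
transpose(z)  # transpose: 2×1, entries 1 + 2im and 3 - 1im (not conjugated)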

1 Like

So on my machine, using OpenBLAS (8 BLAS threads), I get the following.
Real case:

julia> Y = zeros(100,10);H=randn(100,100);Omega= rand(100,10);
julia> @btime mul!($Y, $(H), $(Omega));
  15.626 μs (0 allocations: 0 bytes)
julia> @btime mul!($Y, $(H'), $(Omega));
  15.494 μs (0 allocations: 0 bytes)

Complex case:

julia> Y = zeros(ComplexF64,100,10);H=randn(ComplexF64, 100,100);Omega= rand(100,10);
julia> @btime mul!($Y, $(H), $(Omega));
  30.296 μs (0 allocations: 0 bytes)
julia> @btime mul!($Y, $(H'), $(Omega));
  156.296 μs (0 allocations: 0 bytes)

So I agree that the complex adjoint case is about 5x slower than the plain complex multiplication, while the real transpose case is just as fast as the plain real multiplication.

3 Likes

There does seem to be a slowdown here, which arises from the adjoint being materialized:

julia> @btime $H * $Omega;
  1.069 ms (2 allocations: 156.30 KiB)

julia> @btime $H' * $Omega;
  5.382 ms (4 allocations: 15.41 MiB)

julia> @btime collect($H');
  4.038 ms (2 allocations: 15.26 MiB)
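If the same H is reused across many products, one possible workaround (a sketch; Hadj is just an illustrative name) is to pay for that copy only once:

Hadj = collect(H')   # materialize the adjoint once (the ~15 MiB copy)
Y = Hadj * Omega     # later products reuse the materialized matrix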
1 Like

Interesting!

I tried to follow the functions that each call goes through, and it seems that H' * Omega ends up in a generic matmul implementation in Julia, whereas H * Omega is dispatched to BLAS.

julia> H = randn(ComplexF64, 100, 100); Omega = rand(100, 10); Y = zeros(ComplexF64, 100, 10);

mul!(Y, H, Omega)
  -> mul!(Y, H, Omega, true, false)
  -> LinearAlgebra.gemm_wrapper!(Y, 'N', 'N', H, Omega, LinearAlgebra.MulAddMul(true, false))
  -> probably BLAS.gemm!, maybe generic_matmatmul!

For the adjoint:

mul!(Y, H', Omega)
  -> mul!(Y, H', Omega, true, false)
  -> LinearAlgebra.generic_matmatmul!(Y, 'C', 'N', H, Omega, LinearAlgebra.MulAddMul(true, false))
1 Like

For me the adjoint version seems slightly faster, though only if the adjoint is created outside of the timing; when the adjoint is created inside the timing, the two are quite similar. This is with MKL, though the timings without MKL were all about half the speed, so the relative timings are the same.

julia> @btime $A * $B;
  965.559 μs (2 allocations: 156.30 KiB)

julia> @btime $(A') * $B;
  941.128 μs (2 allocations: 156.30 KiB)

julia> @btime ($A)' * $B;
  969.349 μs (2 allocations: 156.30 KiB)

julia> typeof(A), typeof(B)
(Matrix{ComplexF64}, Matrix{ComplexF64})

What’s your versioninfo? This seems not to match what I see on nightly (or 1.9/1.10), where the adjoint is collected.

This is not an issue, as generic_matmatmul! calls BLAS.gemm! internally in compatible cases.

Ah, but in the OP's example, Omega (or B in your case) is a real matrix!
That hits a different code path than when both arguments are complex: the complex A of size n by n is reinterpreted as a real matrix of size 2n by n, which lets the product use the fast real BLAS routines; this reinterpretation isn't possible when both A and B are complex.

If A (or H) is wrapped in an adjoint and B is real, the call ends up in a slow pure-Julia implementation of matrix multiplication instead of BLAS, as @abraemer has already stated. If B is complex, though, the complex BLAS routines are used.
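A rough sketch of that reinterpretation idea (only an illustration, not the actual LinearAlgebra internals):

using LinearAlgebra

H = randn(ComplexF64, 1000, 1000)
Omega = randn(1000, 10)

Hre = reinterpret(Float64, H)        # view the 1000×1000 complex H as a 2000×1000 real matrix
Yre = Hre * Omega                    # one real matrix product: 2000×10 real result
Y   = reinterpret(ComplexF64, Yre)   # view that result as the 1000×10 complex product

Y ≈ H * Omega                        # true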

4 Likes

No, I don't think so. At least not on Julia 1.9.3, which I am using.

Here is the definition of generic_matmatmul!. It only handles special cases for 2x2 and 3x3 matrices, or when a coefficient is zero; in the general case it just calls _generic_matmatmul!, which implements the matrix-matrix product in pure Julia. I don't see a way to end up in BLAS once generic_matmatmul! has been reached.

1 Like

Sorry, I should have said that there’s a BLAS.gemm! branch in gemm_wrapper! that I would expect to be used for floating-point types. The generic_matmatmul! branch is a fallback for other cases. Are you certain that BLAS isn’t being called here?

Edit: I see the difference: H' * Omega doesn’t call mul! with H', but in fact materializes the lazy adjoint. This ensures that the BLAS method is called instead of hitting the generic fallback, as would be the case if mul! is called directly.

1 Like
versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores
Environment:
  JULIA_NUM_THREADS1 = 1
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
  JULIA_PYTHONCALL_EXE = @PyCall
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

I did the same test as you. On my PC, it’s nearly 10x slower, probably an issue with my threads.

julia> @btime $(H)*$Omega;
  416.800 μs (2 allocations: 156.30 KiB)

julia> @btime $H' * $Omega;
  5.399 ms (4 allocations: 15.41 MiB)

julia> @btime collect($H');
  3.737 ms (2 allocations: 15.26 MiB)
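If it is a thread issue, the BLAS thread count can be checked and changed like this (the 8 is just an example value):

using LinearAlgebra
BLAS.get_num_threads()   # how many threads the loaded BLAS library is using
BLAS.set_num_threads(8)  # example value; match it to your physical core count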

I did another test later: the mygemmavx! function from LoopVectorization's standard example can be about 2x faster for the complex adjoint case (it may be avoiding the cost of the adjoint), but when it is used for the plain multiplication without the adjoint, it becomes very slow.

julia> @btime $H' * $Omega;
   5.399 ms (4 allocations: 15.41 MiB)
julia> @btime collect($(H'));
   3.919 ms (2 allocations: 15.26 MiB)

using LoopVectorization
function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end

julia> @btime mygemmavx!($Y1,$(H'),$Omega)
   2.187 ms (79 allocations: 4.58 KiB)

julia> @btime mygemmavx!($Y1,$(H),$Omega)
   8.192 ms (79 allocations: 4.58 KiB)

That is unfortunate. Poking around in the debugger, it looks like * will use the BLAS gemm path for adjointed/transposed matrices if both matrices are real or both are complex, but not if H' is complex and Omega is real. By way of confirmation, the following code

@btime for j = 1:1000
    Y = H' * Omega
end  # Julia: 7.3 s    MATLAB: 0.35 s !!!

using LinearAlgebra.BLAS
@btime for j = 1:1000
    Y = gemm('C', 'N', 1.0 + 0.0im, H, convert(typeof(H), Omega))
end

returns

  14.256 s (5000 allocations: 15.05 GiB)
  3.177 s (4000 allocations: 305.27 MiB)

on my old and slow desktop.
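Another possible workaround, sketched below (adjoint_times_real is just an illustrative name, not a library function): since Omega is real, H' * Omega can be split into two real gemm calls, at the cost of copying out the real and imaginary parts of H.

using LinearAlgebra

# H' * Omega == real(H)' * Omega - im * (imag(H)' * Omega) when Omega is real,
# so each product stays in fast real BLAS.
adjoint_times_real(H, Omega) = real(H)' * Omega .- im .* (imag(H)' * Omega)

adjoint_times_real(H, Omega) ≈ H' * Omega   # true

If H is reused, real(H) and imag(H) could of course be computed once and kept around.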

4 Likes