I need to perform a large number of complex matrix multiplications in my programs, but they are very slow. With Distributed parallelism, matrix multiplication becomes more than 10× slower, and overall it is much slower than MATLAB. **The adjoint case is especially bad.**

```julia
# julia 1.9.2
using BenchmarkTools
H = randn(Complex{Float64}, 1000, 1000)
realH = real(H)
Omega = randn(size(H, 2), 10)
Y = Matrix{Complex}(undef, 1000, 10)  # NB: Complex is abstract; Matrix{ComplexF64} would be concrete
Y1 = similar(Y)
@btime for j = 1:1000
    Y = H * Omega
end  # julia: 0.6 s   matlab: 0.2 s
@btime for j = 1:1000
    Y = H' * Omega
end  # julia: 7.3 s   matlab: 0.35 s  !!!
@btime for j = 1:1000
    Y = realH' * Omega
end  # julia: 0.31 s  matlab: 0.1 s
using MKL
@btime for j = 1:1000
    Y = H * Omega
end  # julia: 0.6 s => 0.2 s   matlab: 0.2 s
@btime for j = 1:1000
    Y = H' * Omega
end  # julia: 7.3 s => 6.6 s   matlab: 0.35 s
```

MKL speeds up the plain multiplication, but the adjoint is still a problem. I searched the community and tried other methods to accelerate matrix multiplication. Perhaps I don't understand the underlying principles well, so I'm not sure I'm using them correctly, but their effect was not significant.
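One thing worth checking (this is my assumption about the cause, not something verified on the machine above): `Omega` here is real while `H` is complex, and `H' * Omega` with mixed element types may fall back to Julia's generic matmul instead of a BLAS kernel. Giving both operands the same element type once, or materializing the adjoint into a plain matrix, should keep the product on the fast `zgemm` path:

```julia
using LinearAlgebra

H     = randn(ComplexF64, 1000, 1000)
Omega = randn(size(H, 2), 10)          # real, as in the original benchmark

# Option 1: match element types so BLAS zgemm can handle the adjoint
# via its conjugate-transpose flag.
cOmega = ComplexF64.(Omega)
Y1 = H' * cOmega

# Option 2: materialize the conjugate transpose once, outside any hot loop.
Ht = copy(H')                           # a plain Matrix{ComplexF64}, no lazy Adjoint wrapper
Y2 = Ht * Omega

Y1 ≈ Y2                                 # both compute the same product
```

If the slowdown really comes from the mixed-type fallback, either variant should land in the sub-second range of the non-adjoint case.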

```julia
### other methods I tried ###
using LinearAlgebra  # for mul!
# in-place mul!
@btime for j = 1:1000
    mul!(Y1, H, Omega)
end  # 6.62 s
@btime for j = 1:1000
    mul!(Y1, H', Omega)
end  # 10.8 s
# LoopVectorization @turbo
using LoopVectorization
function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end
@btime for j = 1:1000
    mygemmavx!(Y1, H, Omega)
end  # 9.6 s
@btime for j = 1:1000
    mygemmavx!(Y1, H', Omega)
end  # 2.7 s
```
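To isolate whether the slow adjoint path is in the BLAS library itself or in Julia's dispatch, one can also call the BLAS kernel directly. This is a sketch under the assumption that both operands have matching `ComplexF64` element types, which is why `Omega` is converted first (`gemm!` would otherwise throw a method error):

```julia
using LinearAlgebra

H      = randn(ComplexF64, 1000, 1000)
cOmega = ComplexF64.(randn(1000, 10))   # gemm! requires matching element types
Y      = Matrix{ComplexF64}(undef, 1000, 10)

# 'C' asks zgemm for the conjugate transpose of H; no lazy wrapper is involved.
BLAS.gemm!('C', 'N', one(ComplexF64), H, cOmega, zero(ComplexF64), Y)

Y ≈ H' * cOmega                          # same result as the high-level expression
```

If the direct `gemm!` call is fast, the problem is in which method `H' * Omega` dispatches to, not in OpenBLAS/MKL.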

`versioninfo()`:

```
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
Threads: 1 on 20 virtual cores
Environment:
JULIA_NUM_THREADS1 = 1
JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
JULIA_PYTHONCALL_EXE = @PyCall
JULIA_EDITOR = code
JULIA_NUM_THREADS =
```
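A side note on the output above: the environment sets `JULIA_NUM_THREADS1 = 1` (note the stray `1` in the variable name) while `JULIA_NUM_THREADS` itself is empty, so Julia starts with a single thread. BLAS threading is separate from Julia's threads, but it's worth confirming which BLAS is actually loaded and how many threads it uses. A small diagnostic sketch (not a fix; the thread count of 8 is just an example value):

```julia
using LinearAlgebra

# Which BLAS is loaded: OpenBLAS by default, MKL after `using MKL`.
println(BLAS.get_config())

# BLAS worker threads are independent of Julia's Threads.nthreads().
println(BLAS.get_num_threads())
BLAS.set_num_threads(8)   # example; tune for the i7-12700's core count
```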