Is it possible for complex matrix multiplication and adjoint speed to catch up with matlab?

Thank you very much. This indeed solves my problem, and the speed is completely acceptable. I hope that in the future a dispatch check can be added so that your code is used in Julia's complex matrix multiplication.

Transposition is not a problem in Julia. In fact it (or the adjoint, as it’s called for complex matrices) is as fast as it possibly can be, with you getting back a type that didn’t do anything with the actual data, e.g.:

2×1 adjoint(::Matrix{ComplexF64}) with eltype ComplexF64

So it’s the same speed independent of the size of the matrix. I don’t know (but would like to) whether that’s the case with Matlab too; its implementation details are hidden from me. Since the adjoint/transpose isn’t doing “anything”, you are shifting the problem to the multiplication, i.e. to a different complex-matrix multiplication method (multiple dispatch is your friend and relies on the changed type information). That one should be as fast as Matlab, or arguably could be.
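
To see this concretely, here is a minimal sketch with a made-up matrix (not your data):

using LinearAlgebra, BenchmarkTools

H = rand(ComplexF64, 1000, 1000)
A = H'                # lazy wrapper: no element of H is touched or copied
A isa Adjoint         # true: only the type changed, the data is shared with H
@btime ($H)';         # constant time, independent of the matrix size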

@btime for j=1:1000
    Y=H'*Omega
end  #julia:7.3s  matlab:0.35s !!!

Any time you see a divergence from Matlab (or any language: Fortran, C, C++, Rust, Zig) at least this large, it should tell you that you are doing something wrong. Julia should always match or be faster, or at least be within 2x; in some exceptional cases it might still be slower, and I think I know of one such case.

Also (while that’s not the reason for the gap), the (for) loop seems pointless here. It may be required for timing in Matlab, I’m not sure, but in Julia avoiding it is one of the points of @btime over @time (you are just making your benchmarking 1000x slower, at least potentially(?); here I see only about 4.5x slower, but still 1000x more allocations).
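
E.g. a loop-free measurement (a sketch, using your variable names):

using BenchmarkTools

@btime $H' * $Omega;   # @btime repeats and averages the expression itself; no for loop needed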

About “MKL can speed up multiplication”: it did for you, and it can, but I’m not sure why exactly. It’s multithreaded, but so is Julia’s default OpenBLAS. You could also use LoopVectorization.jl (or something based on it, rather than a naive implementation) or BLIS.jl. Either might be as fast as or faster than MKL.
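
If you want to try MKL, loading MKL.jl is enough to swap the BLAS backend (a sketch, assuming the package is installed):

using MKL             # switches the default BLAS from OpenBLAS to MKL via libblastrampoline
using LinearAlgebra

BLAS.get_config()     # should now list MKL among the loaded BLAS libraries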

That’s ok and doesn’t require you to put your code in a function.

This gives a very different picture (and 1 extra allocation):

julia> @time H'*Omega;
  0.025457 seconds (5 allocations: 15.411 MiB, 5.45% gc time)

I did try putting it into a function:

julia> test() = H'*Omega

julia> @time test()
  0.012793 seconds (5 allocations: 15.411 MiB)

Note, it’s still not good since you’re accessing global variables, and you get 1 extra allocation for some reason, as opposed to:

julia> test(H, Omega) = H'*Omega

julia> @time test(H, Omega);  # Beware first run is much slower with 742 allocations, because of compilation
  0.011499 seconds (4 allocations: 15.411 MiB)

I also tried with an anonymous function (timing seems suspect, why doesn't this work?):
julia> @time ()->(H')*Omega
  0.000071 seconds (25 allocations: 1.675 KiB)

julia> @time (H, Omega)->H'*Omega  # not accessing globals directly
  0.000055 seconds (25 allocations: 1.700 KiB)
#33 (generic function with 1 method)

And here clearly wrong (way too good to be true, code must be optimized out):
julia> @btime ()->H'*Omega
  1.948 ns (0 allocations: 0 bytes)
#25 (generic function with 1 method)

julia> @btime ()->$H'*$Omega
  3.731 ns (0 allocations: 0 bytes)
#27 (generic function with 1 method)

julia> @btime ()->($H')*$Omega
  3.732 ns (0 allocations: 0 bytes)
#29 (generic function with 1 method)

Thank you. I know that “@btime” conflicts with the “for” loop; the loop is there for comparison with Matlab.

% Matlab code
  tic;for i=1:1000;Y=H'*Omega;end;toc

This user found a good way to solve it, and the speed is close to that of other languages (Matlab).

requesting new method for * for: (real array)*(complex array) · Issue #20060 · JuliaLang/julia · GitHub discusses this.


There is a performance issue, and as seen by the above it may be a missing method. Either way it is just a starting point of investigation.

MKL is fairly optimised on Intel architectures, with a slight speedup over OpenBLAS at times.

The methods above all construct a function but don’t call it.
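
To actually benchmark the multiplication, the constructed function has to be called, e.g. (a sketch):

using BenchmarkTools

f = (H, Omega) -> H' * Omega   # construct the function once...
@btime $f($H, $Omega);         # ...and call it inside the benchmark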

This is surely true in the sense that no copying is done, but, for instance, we do know that for ordinary arrays, accessing columns first is optimal for the cache. The asymmetry might lead to unexpected slowdowns. The only way to know is to test.
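
For instance, on an ordinary (column-major) Matrix the effect of access order is easy to demonstrate (a sketch with made-up helper functions):

using BenchmarkTools

A = rand(2000, 2000)

# inner loop over the first index walks down each column: contiguous, cache-friendly
function sum_down_columns(A)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

# inner loop over the second index jumps between columns: strided, cache-unfriendly
function sum_across_rows(A)
    s = zero(eltype(A))
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

@btime sum_down_columns($A);   # typically several times faster than the row-order version
@btime sum_across_rows($A);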


Just for laughs, consider trying the calculation in Tullio.

I’m on my phone so I can’t easily give an example, but the docs should make it straightforward to figure out.

Trying gemm seemed like a good test, but if you want to use it as a work-around, it seems nicer to use * but give it types for which it will use the transposed calls to gemm:

@btime for j=1:1000
  H1, Omega1 = promote(H, Omega)
  Y = H1' * Omega1
end

This has essentially the same performance as the call to gemm on my machine.
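
For reference, the direct gemm call could be written like this (a sketch; BLAS needs both arguments in the same complex element type, hence the conversion):

using LinearAlgebra

Omega_c = complex(Omega)               # promote the real matrix to ComplexF64
Y = BLAS.gemm('C', 'N', H, Omega_c)    # 'C' means use the conjugate transpose of H
Y ≈ H' * Omega                         # true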

On my computer now so:

using LoopVectorization, Tullio

function adjmul(H,Omega)
    @tullio M[i,j] := conj(H[k,i]) * Omega[k,j]
    return M
end
H=randn(Complex{Float64},1000,1000)
realH=real(H)
Omega=randn(size(H,2),10)
Y=Matrix{Complex}(undef,1000,10);
Y1=similar(Y)

julia> @btime adjmul(H,Omega)
  1.033 ms (51 allocations: 158.83 KiB)

For what it’s worth.


Thank you very much. I did an experiment and it does increase the speed, but it is still slower than other languages.

@btime for j=1:1000
           Y =  adjmul(H,Omega)
         end #julia:7.3s=>6.3s  matlab 0.35s

So it should probably be fixed by adding a better dispatch as discussed in the github issue linked above, but in the meantime you could try to explicitly convert Omega to complex to get a better dispatch. On my machine that + MKL makes it the same speed as Matlab.
Julia:

julia> @time for j=1:1000
           Y = H' * Omega
       end
 11.166064 seconds (5.00 k allocations: 15.050 GiB, 9.83% gc time)

julia> @time for j=1:1000
           Y = H' * complex(Omega)
       end
  1.206935 seconds (5.00 k allocations: 305.283 MiB, 0.39% gc time)

Matlab:

>> tic;for i=1:1000;Y=H'*Omega;end;toc
Elapsed time is 1.173394 seconds.

Thank you, it is true that converting (complex matrix)' * (real matrix) into (complex matrix)' * (complex matrix) speeds things up. This also suggests that the dispatch mechanism did not consider this situation.

In Matlab, you should be using the timeit function for more robust performance measurements. It is similar to Julia’s @btime.

I am a bit confused by your benchmark results.

Disclaimer, I have never used MATLAB, so it’s entirely possible I am doing something wrong, but this is what I get

H = randn(1000,1000,'like',1i);
Omega = randn(1000,10, 'double');

timeit(@() adjmul(H,Omega))
% ~ 2 seconds, i.e. about 2ms per

function [] = adjmul(H,Omega)
  for i = 1:1000
      ctranspose(H)*Omega;
  end
end

which is much closer to the naive Julia version

julia> @benchmark $H'*$Omega samples=1000 evals=1 seconds=20
BenchmarkTools.Trial: 1000 samples with 1 evaluation.
 Range (min … max):  3.588 ms …   5.725 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.727 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.971 ms ± 398.095 μs  ┊ GC (mean ± σ):  6.15% ± 7.69%

    ▂▄█▅▃▁
  ▄▆██████▇▅▃▃▂▂▂▂▂▁▂▁▂▂▁▁▂▁▂▂▁▂▁▁▁▂▃▃▃▃▃▃▃▃▄▄▄▄▅▆▅▄▅▄▄▄▂▃▃▂▂ ▃
  3.59 ms         Histogram: frequency by time        4.75 ms <

 Memory estimate: 15.41 MiB, allocs estimate: 4.

than what you showed.
The Tullio version is in fact faster by a factor of two.

julia> @benchmark adjmul($H,$Omega) samples=1000 evals=1 seconds=20
BenchmarkTools.Trial: 1000 samples with 1 evaluation.
 Range (min … max):  970.667 μs …   4.508 ms  ┊ GC (min … max): 0.00% … 72.25%
 Time  (median):       1.001 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.048 ms ± 207.799 μs  ┊ GC (mean ± σ):  0.53% ±  3.16%

  ▅█▆▅▄▂▁
  ███████▅▆▇▄▅▁▄▁▄▅▁▁▄▁▁▁▁▁▁▁▄▁▄▁▄▁▄▁▄▄▁▄▄▁▅▁▄▄▅▅▅▆▆▄▅▅▅▄▁▄▄▁▁▄ █
  971 μs        Histogram: log(frequency) by time       1.91 ms <

 Memory estimate: 158.83 KiB, allocs estimate: 51.

This is on an Apple M1 running native Julia 1.9.3, and x64 MATLAB R2023b.

Note: Please get in the habit of interpolating variables in benchmark expressions. It doesn’t matter much here, but it can have a tremendous impact.
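
I.e. prefer the second form below (a sketch):

using BenchmarkTools

@btime H' * Omega;     # H and Omega are untyped globals: the measurement includes dispatch overhead
@btime $H' * $Omega;   # interpolated with $: benchmarked as if they were local variables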


Excuse me, what is the meaning of the above expression?

It’s MATLAB for declaring a function without outputs. I believe…

I did not use “timeit” in Matlab but used tic/toc instead; that was my mistake, though it had little impact on the results of this experiment. The results of the user above were exactly the same as mine. I used Julia 1.9.2 and Matlab 2022b on 64-bit Windows. He solved the problem by promoting the real matrix to a complex matrix. After testing Tullio, it was indeed 2 times faster than the original, but still not as good as that solution, which gave nearly a 10x improvement and is almost the same speed as Matlab.


I think the case was considered, but I don’t understand why it works the way it does. The code in question is

function (*)(A::AdjOrTransStridedMat{<:BlasComplex}, B::StridedMaybeAdjOrTransMat{<:BlasReal})
    TS = promote_type(eltype(A), eltype(B))
    mul!(similar(B, TS, (size(A, 1), size(B, 2))),
         copymutable_oftype(A, TS), # remove AdjOrTrans to use reinterpret trick below
         wrapperop(B)(convert(AbstractArray{real(TS)}, _unwrap(B))))
end

The call to copymutable_oftype is what is causing the big allocations. The comment suggests this case was handled knowingly and that, for some reason, it was seen as preferable to get rid of the adjoint. I’m not sure what the “reinterpret trick” is. But this does seem to result in * doing the extra allocation and then eventually resolving to

 gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))

which does not seem like the best BLAS call if one matrix is transposed. If both matrices are real you end up with

 gemm_wrapper!(C, 'T', 'N', A, B, MulAddMul(alpha, beta))

which is how I would have expected a (conjugate) transposed matrix to be handled in both the real and complex cases. The only guess I have is that it simplifies some already complicated dispatch, but if that’s it, it seems a little mysterious that the real and complex cases for transposed/conjugate-transposed matrices are handled so differently.

The trick is to interpret an n × m complex Matrix as a 2n × m array of reals, and perform a real * real matrix multiplication. This trick only works if the first dimension is contiguous, which is why the array is being copied. I think the justification must have been that copying the array is O(n^2) whereas the actual matrix multiplication is approximately O(n^3), so the copying step shouldn’t dominate for large n. However, it clearly seems to dominate for mid-sized arrays (which one may commonly encounter). The performance of the copying will be improved somewhat by

but the alternate approach of real -> complex conversion followed by a complex * complex matrix multiplication may still be faster, and should be considered here.

The main reason the copying is slow is that copying an adjoint to a Matrix is not cache-friendly, whereas converting a real matrix to a complex one is cache-friendly and fast.
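
A minimal sketch of the reinterpret trick (not the actual library code), for a plain complex matrix times a real one:

using LinearAlgebra

A = rand(ComplexF64, 5, 4)    # complex, first dimension contiguous
B = rand(4, 3)                # real

# View the interleaved (re, im) storage of A as a 2n×m real matrix, do one real
# matrix multiplication, then view the 2n×k real result as an n×k complex matrix.
Creal = reinterpret(Float64, A) * B
C = copy(reinterpret(ComplexF64, Creal))

C ≈ A * B                     # true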


Thanks. That’s a good explanation. I assumed I knew approximately what would happen when I saw the gemm_wrapper! calls and didn’t dig deeply enough to see the calls to reinterpret. There’s a lot more going on there than was immediately apparent to me. So I guess there are cases where this is a performance win, especially without the adjoint. Getting some blocking into the copy does seem like it would be a nice improvement.