In-place mul! much slower than * for real-complex matrix multiplication

In some use cases of Luna.jl, the runtime is dominated by large matrix-matrix multiplications inside Hankel.jl. I see a dramatic performance drop when using complex-valued matrices, which I’ve traced to the mul! operation. Somehow this is much slower than even the (massively allocating) equivalent *:

using BenchmarkTools
import LinearAlgebra: mul!

N1 = 256
N2 = 1024

A = rand(Float64, N1, N1)
B = rand(ComplexF64, N1, N2)
out = similar(B)

@btime A*B;
@btime mul!(out, A, B);

Results on my M1 Pro MacBook Pro are

  998.542 μs (12 allocations: 10.00 MiB)
  46.952 ms (0 allocations: 0 bytes)

Looking at the profile, it seems that mul! doesn’t call BLAS and instead falls back to _generic_matmatmul_nonadjtrans!.

If I make A complex, the difference disappears completely:

A_complex = rand(ComplexF64, N1, N1)
@btime A_complex*B;
@btime mul!(out, A_complex, B);

  954.250 μs (3 allocations: 4.00 MiB)
  944.333 μs (0 allocations: 0 bytes)

My understanding is that * internally calls mul! after allocating an appropriate output array. Where could this difference come from? Am I missing something obvious?

version info below:

julia> versioninfo()
Julia Version 1.12.3
Commit 966d0af0fdf (2025-12-15 11:20 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m1)
  GC: Built with stock GC
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1
  JULIA_NUM_THREADS = 1

I can’t swear that no one has ever added it to a BLAS library, but conventionally there are no BLAS routines for multiplying matrices of mixed numeric types. So the options for mul! are to copy A into a Matrix{ComplexF64} and then call BLAS or to fall back to some generic routine. Presumably the expectation is that mul! will not allocate a matrix, so it does the latter.
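A sketch of the first option (hypothetical variable names, not code from the thread): promote A to a complex matrix once, after which mul! dispatches to BLAS and the multiplication itself is allocation-free:

```julia
using LinearAlgebra

N1, N2 = 256, 1024
A = rand(Float64, N1, N1)
B = rand(ComplexF64, N1, N2)
out = similar(B)

# One-time promotion of A to complex: mul! now sees two complex strided
# matrices and dispatches to BLAS (zgemm), with no allocations per call.
Ac = Matrix{ComplexF64}(A)
mul!(out, Ac, B)
```

Whether the one-time copy pays off depends on how often A changes relative to how often the product is computed.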

Some real * complex operations can be done by BLAS by calling reinterpret. I think Julia has a (possibly imperfect) set of fast paths for such cases. The fast case here does something else, simply two real operations (at least on Julia 1.11):

julia> @btime A*B;
  1.397 ms (12 allocations: 10.00 MiB)

julia> @btime mul!(out, A, B);
  47.326 ms (0 allocations: 0 bytes)

julia> @which A*B
*(A::StridedMatrix{var"#s4937"} where var"#s4937"<:Union{Float32, Float64}, B::StridedMatrix{var"#s4936"} where var"#s4936"<:Union{ComplexF64, ComplexF32})
     @ LinearAlgebra /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:150

julia> @less A*B
function (*)(A::StridedMatrix{<:BlasReal}, B::StridedMatrix{<:BlasComplex})
    temp = real(B)
    R = A * temp
    temp .= imag.(B)
    I = A * temp
    Complex.(R, I)
end

Edit: if we reverse the matrices, then we hit the reinterpret case:

julia> N1 = 512; N2 = 512;  # now square matrices

julia> A = rand(Float64, N1, N1);

julia> B = rand(ComplexF64, N1, N2);

julia> out = similar(B);

julia> @btime mul!(out, B, A);  # reinterpret(B, Float64, ...)
  1.554 ms (0 allocations: 0 bytes)

julia> @btime mul!(out, A, B);  # slow case, as above
  93.720 ms (0 allocations: 0 bytes)

julia> @btime A * B;  # fast case above
  2.419 ms (12 allocations: 10.00 MiB)

julia> @btime B * A;  # also reinterpret
  1.578 ms (3 allocations: 4.00 MiB)

But then the appropriate fallback, once the size exceeds some threshold, should be to allocate a temporary matrix, use it, and free it immediately (definitely not wait for the garbage collector to reclaim the backing memory).
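Such a fallback could look something like the following sketch (mul_via_temp! and the threshold are hypothetical, not part of LinearAlgebra; note that plain Julia can only let the temporary become garbage rather than free it eagerly):

```julia
using LinearAlgebra

# Hypothetical fallback: above some size threshold, pay for a temporary
# complex copy of A so the product runs through BLAS; below it, use the
# generic path. The temporary is unreachable as soon as we return.
function mul_via_temp!(out::Matrix{ComplexF64},
                       A::Matrix{Float64},
                       B::Matrix{ComplexF64};
                       threshold::Int = 64)
    if length(A) < threshold^2
        return mul!(out, A, B)      # small problem: generic path is fine
    end
    tmp = Matrix{ComplexF64}(A)     # temporary complex copy of A
    return mul!(out, tmp, B)        # complex-complex BLAS gemm
end
```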

Nice. I have seen this trick before with direct calls to BLAS, but I did not know that mul!(out, B, A) did this and had never thought to check whether Julia did it. However, it doesn’t seem to work on Julia 1.12.3, where it resolves to the same slow generic multiplication. In Julia 1.11.8 it does work for me, eventually resolving to the call

BLAS.gemm!(tA, tB, alpha, reinterpret(T, A), B, beta, reinterpret(T, C))

where in this case T is Float64 and A is the complex matrix. However, I don’t think this works with the order of A and B reversed, which might explain the difference between mul!(out, B, A) and mul!(out, A, B): BLAS assumes that stride(A, 1) == 1 (and likewise for all matrices), whereas reinterpret gives the real and imaginary parts as submatrices with stride 2 in the first index.
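The stride argument can be checked directly (a sketch with square matrices as above): reinterpreting the complex left factor as real doubles the first dimension while keeping unit stride, so a real gemm produces the interleaved result, but no such contiguous real view exists when the complex matrix is on the right:

```julia
using LinearAlgebra

B = rand(ComplexF64, 512, 512)
A = rand(Float64, 512, 512)

# Viewing B as real interleaves Re/Im down each column: size (1024, 512),
# still unit stride in the first dimension, so real BLAS gemm applies,
# and the result is exactly the reinterpreted complex product.
Br = reinterpret(Float64, B)
@assert stride(Br, 1) == 1
@assert Br * A ≈ reinterpret(Float64, B * A)
```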

I’m not quite sure where this breaks in Julia 1.12.3…

Thanks @mstewart and @mcabbott, this is exactly my problem! I had missed the specialised method splitting it into real and imaginary (which will hopefully teach me to use @which next time, rather than trying to work out the method calls myself).

Since my matrices are not square, I’ve resorted to converting A to be complex whenever B is complex. Thankfully A is constant, so this doesn’t add significant runtime.

Perhaps. It would be faster. But I do sometimes use mul! when I have several large dense matrices and my memory can’t accommodate more. The docstring for mul! doesn’t seem to promise anything regarding allocations, but it’s very GEMM-like in what it does, so doing a big allocation is something that would at least surprise me.

There was an issue for this, Poor performance for product Matrix{ComplexF64} with Matrix{Float64} · Issue #1519 · JuliaLang/LinearAlgebra.jl · GitHub, fixed in 1.12.4, although that isn’t released yet.


Thanks. I need to get my eyes checked. I looked and didn’t see that. I was getting ready to file an issue.
