In some use cases of Luna.jl, the runtime is dominated by large matrix-matrix multiplications inside Hankel.jl. I see a dramatic performance drop when multiplying a real matrix by a complex-valued one, which I’ve traced to the mul! operation. Somehow this is much slower than even the equivalent (and heavily allocating) *:
using BenchmarkTools
import LinearAlgebra: mul!
N1 = 256
N2 = 1024
A = rand(Float64, N1, N1)
B = rand(ComplexF64, N1, N2)
out = similar(B)
@btime A*B;
@btime mul!(out, A, B);
Results on my M1 Pro MacBook Pro are:
998.542 μs (12 allocations: 10.00 MiB)
46.952 ms (0 allocations: 0 bytes)
Looking at a profile, it seems that mul! doesn’t call BLAS here and instead falls back to _generic_matmatmul_nonadjtrans!.
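For reference, a minimal way to reproduce that observation with the arrays from above (the loop count is arbitrary, just to collect enough samples; @which needs a REPL or InteractiveUtils):

using Profile
@which mul!(out, A, B)   # shows which mul! method dispatch selects for the mixed Float64/ComplexF64 case
Profile.clear()
@profile for _ in 1:20
    mul!(out, A, B)
end
Profile.print()          # hot frames end up in the generic fallback rather than a BLAS gemm wrapper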
If I make A complex, the difference disappears completely:
A_complex = rand(ComplexF64, N1, N1)
@btime A_complex*B;
@btime mul!(out, A_complex, B);
954.250 μs (3 allocations: 4.00 MiB)
944.333 μs (0 allocations: 0 bytes)
My understanding is that * internally calls mul! after allocating an appropriate output array. Where could this difference come from? Am I missing something obvious?
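For what it’s worth, the two workarounds I can think of both hit BLAS at the cost of extra memory (A_c, Br, Bi, outr and outi are just names for this sketch): either promote A once and reuse it, or split B and do two real dgemms, which should need roughly half the arithmetic of a full complex gemm.

A_c = ComplexF64.(A)          # one-off promotion; complex gemm path, same as the benchmark above
mul!(out, A_c, B)

Br, Bi = real(B), imag(B)     # or preallocate these once and update them in place
outr, outi = similar(Br), similar(Bi)
mul!(outr, A, Br)             # real dgemm
mul!(outi, A, Bi)             # real dgemm
out .= complex.(outr, outi)

But I’d still like to understand why the mixed-type three-argument mul! doesn’t take a fast path itself.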
version info below:
julia> versioninfo()
Julia Version 1.12.3
Commit 966d0af0fdf (2025-12-15 11:20 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m1)
  GC: Built with stock GC
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1
  JULIA_NUM_THREADS = 1