Slow matrix multiplication in CUBLAS.gemm_strided_batched with ComplexF64

I’ve been developing some GPU code that does a lot of matrix multiplications with complex numbers. I figured I’d use the CUBLAS wrappers available in the CUDA.jl package. However, I’ve found that using the ComplexF64 type with the CUBLAS.gemm_strided_batched! function is significantly slower than I was expecting (around 8-10 times slower than ComplexF32). I’m aware that GPUs are typically optimized for single precision, but after experimenting with some code I’m wondering if something else is going on in this case. Here is an example of what I’m talking about. On my machine, when I run this code I get:

a = CUDA.rand(ComplexF64, 10, 10, 10000)
b = CUDA.rand(ComplexF32, 10, 10, 10000)

test1 = CUDA.@time CUBLAS.gemm_strided_batched('N', 'N', a, a);
test2 = CUDA.@time CUBLAS.gemm_strided_batched('N', 'N', b, b);

  0.041176 seconds (24 CPU allocations: 528 bytes) (1 GPU allocation: 15.259 MiB, 0.05% memmgmt time)
  0.003239 seconds (24 CPU allocations: 528 bytes) (1 GPU allocation: 7.629 MiB, 0.57% memmgmt time)

The ComplexF64 call runs about 12 times slower than the ComplexF32 one, which is a bigger gap than I’d expect. However, here’s where things get more interesting: if I write my own kernel to do a simple batched matrix multiplication, I can get better speeds than the CUBLAS function for ComplexF64.

C = CUDA.zeros(eltype(a), size(a))
thread_num = 64

CUDA.@time begin
    @cuda threads = thread_num blocks = ceil(Int32, size(a, 3) / thread_num) kernel_matmul_batched!(a, a, C)
end

and I get:

0.028469 seconds (31 CPU allocations: 1024 bytes)
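For reference, kernel_matmul_batched! is just a naive per-batch multiplication. A minimal sketch of that kind of kernel (one thread per batch index, plain triple loop; this is a simplified illustration rather than my exact code, and assumes `using CUDA` as above) would be something like:

function kernel_matmul_batched!(A, B, C)
    # One thread handles one batch index k and computes
    # C[:, :, k] = A[:, :, k] * B[:, :, k] with a plain triple loop.
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if k <= size(C, 3)
        for j in 1:size(C, 2), i in 1:size(C, 1)
            acc = zero(eltype(C))
            for l in 1:size(A, 2)
                acc += A[i, l, k] * B[l, j, k]
            end
            C[i, j, k] = acc
        end
    end
    return nothing
end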

This is significantly faster than the CUBLAS implementation, which doesn’t make a whole lot of sense to me, since I’m not very good at writing efficient GPU kernels and NVIDIA presumably is. Maybe I’ve just picked an odd data size that happens to line up to create this weird scenario, but if anybody has any insight into what exactly is going on, that would be great.
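(For what it’s worth, one sanity check to rule out the custom kernel simply computing something different would be comparing its output against CUBLAS, along these lines:

C_ref = CUBLAS.gemm_strided_batched('N', 'N', a, a)
Array(C) ≈ Array(C_ref)   # expect true if the kernel does the same multiplication

so the timing difference shouldn’t just be the kernel doing less work.)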