Slow matrix multiplication in CUBLAS.gemm_strided_batched with ComplexF64

I’ve been developing some GPU code that does a lot of matrix multiplications with complex numbers. I figured I’d use the CUBLAS wrappers available in the CUDA.jl package. However, I’ve found that using the ComplexF64 type with the CUBLAS.gemm_strided_batched! function is significantly slower than I was expecting (around 8-10 times slower than ComplexF32). I’m aware that GPUs are typically optimized for single precision, but after experimenting with some code I’m wondering if something else is going on in this case. Here is an example of what I’m talking about. On my machine, when I run this code I get:

a = CUDA.rand(ComplexF64, 10, 10, 10000)
b = CUDA.rand(ComplexF32, 10, 10, 10000)

test1 = CUDA.@time CUBLAS.gemm_strided_batched('N', 'N', a, a);
test2 = CUDA.@time CUBLAS.gemm_strided_batched('N', 'N', b, b);

  0.041176 seconds (24 CPU allocations: 528 bytes) (1 GPU allocation: 15.259 MiB, 0.05% memmgmt time)
  0.003239 seconds (24 CPU allocations: 528 bytes) (1 GPU allocation: 7.629 MiB, 0.57% memmgmt time)

The ComplexF64 call runs about 12 times slower than the ComplexF32 one, which is a bigger gap than I’d expect. However, here’s where things get more interesting: if I write my own kernel to do a simple batched matrix multiplication, I can get better speeds than the CUBLAS function for ComplexF64.

C = CUDA.zeros(eltype(a), size(a))
thread_num = 64

CUDA.@time begin
    @cuda threads = thread_num blocks = ceil(Int32, size(a, 3) / thread_num) kernel_matmul_batched!(a, a, C)
end

and I get:

0.028469 seconds (31 CPU allocations: 1024 bytes)
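For reference, kernel_matmul_batched! is just a naive per-batch multiplication. A minimal sketch of that kind of kernel (one thread per batch index, plain triple loop; this is a simplified illustration rather than my exact code, and assumes `using CUDA` as above) would be something like:

function kernel_matmul_batched!(A, B, C)
    # One thread handles one batch index k and computes
    # C[:, :, k] = A[:, :, k] * B[:, :, k] with a plain triple loop.
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if k <= size(C, 3)
        for j in 1:size(C, 2), i in 1:size(C, 1)
            acc = zero(eltype(C))
            for l in 1:size(A, 2)
                acc += A[i, l, k] * B[l, j, k]
            end
            C[i, j, k] = acc
        end
    end
    return nothing
end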

This is significantly faster than the CUBLAS implementation, which doesn’t make a whole lot of sense to me, since I’m not very good at writing efficient GPU kernels and NVIDIA presumably is. Maybe I’ve just picked an odd data size that happens to line up to create this weird scenario, but if anybody has any insight into what exactly is going on, that would be great.
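(For what it’s worth, one sanity check to rule out the custom kernel simply computing something different would be comparing its output against CUBLAS, along these lines:

C_ref = CUBLAS.gemm_strided_batched('N', 'N', a, a)
Array(C) ≈ Array(C_ref)   # expect true if the kernel does the same multiplication

so the timing difference shouldn’t just be the kernel doing less work.)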