FFT'ing lots of short vectors on GPU

I am wondering if there is a way to Fourier-transform lots of short vectors on GPU.

I have more than 10,000 vectors, all of the same length, around 2^12. Each vector on its own is too short to get a significant performance boost from performing its FFT on the GPU. For example, running the following code

using BenchmarkTools, CUDA, FFTW

n = 12
N = 2^n

v = rand(Complex{Float64}, N)
w = copy(v)
F = plan_fft(v)
@btime $w .= $F * $v

vg = cu(v)
wg = copy(vg)
Fg = plan_fft(vg)
@btime $wg .= $Fg * $vg

produces

  24.818 µs (2 allocations: 64.05 KiB)
  9.964 µs (20 allocations: 960 bytes)

so the GPU is only 2–3 times faster. In comparison, for a vector of length 2^20, which is much longer than the 2^12 tested above, I get

  39.214 ms (2 allocations: 16.00 MiB)
  15.603 µs (20 allocations: 960 bytes)

so the GPU is three orders of magnitude faster than the CPU.

The above examples demonstrate that to get a significant performance boost on the GPU, the amount of data needs to be sufficiently large. In my case, each vector is short, but there are lots of them, so the total amount of data to FFT seems sufficiently large. I am wondering if there is a clever way to take advantage of this situation to perform the FFTs fast on the GPU.
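For reference, the obvious per-vector loop does not help: each transform is still short, and every iteration pays its own kernel-launch overhead. A minimal sketch of that baseline, assuming the vectors are stored as the columns of a matrix (the names K, Vg, Wg are mine, not from the benchmark above):

using CUDA, FFTW

N, K = 2^12, 10_000
Vg = cu(rand(Complex{Float64}, N, K))  # columns are the vectors; note cu converts to ComplexF32
Wg = similar(Vg)

Fg = plan_fft(Vg[:, 1])  # plan for a single length-N vector

# One transform per column: correct, but each iteration launches its own
# kernels, so the short-vector problem described above remains.
for j in 1:K
    Wg[:, j] = Fg * Vg[:, j]
end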

A side comment / question: Don’t you want to use CUDA.CUFFT (instead of FFTW) for FFTs on GPU? Or is it perhaps automatically used for CuArrays?

Actually I’m not sure exactly how it works, but if you import both CUDA and FFTW, the AbstractFFTs functions implemented in CUDA.CUFFT are exposed somehow:

julia> using CUDA, FFTW

julia> methods(plan_fft)
# 8 methods for generic function "plan_fft":
[1] plan_fft(X::CuArray{T, N}, region) where {T<:Union{ComplexF32, ComplexF64}, N} in CUDA.CUFFT at /home/gridsan/WO28768/.julia/packages/CUDA/sCev8/lib/cufft/fft.jl:296
...
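For what it’s worth, one quick way to confirm which backend gets picked is to inspect the plan types. A small sketch (the exact type names printed will vary across FFTW.jl and CUDA.jl versions):

using CUDA, FFTW

v  = rand(Complex{Float64}, 2^12)
vg = cu(v)               # cu converts to ComplexF32 by default

F  = plan_fft(v)         # dispatches to an FFTW plan for a plain Array
Fg = plan_fft(vg)        # dispatches to a CUDA.CUFFT plan for a CuArray

typeof(F), typeof(Fg)    # the module in each type name reveals the backend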

One potential method I hoped would be possible was to create a 2¹²-by-10,000 matrix V whose columns are the vectors I want to FFT. Because F = plan_fft(...) creates an operator that can be represented by a 2¹²-by-2¹² matrix, I hoped F * V would work, but it didn’t.

I suspect that, in principle, F * V could be implemented so that it runs fast on the GPU for a GPU matrix V, but I’m not sure. If this is possible and would be useful to other people, I may create a GitHub issue requesting this feature.
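That said, if I read the AbstractFFTs interface correctly, the supported way to express this is not to apply a length-2^12 plan to a matrix, but to create the plan on the matrix itself with the region argument set to 1. For a CuArray, CUDA.CUFFT should then execute it as a single batched transform instead of 10,000 separate ones. A sketch, with hypothetical names as before:

using CUDA, FFTW
using LinearAlgebra: mul!

N, K = 2^12, 10_000
Vg = cu(rand(Complex{Float64}, N, K))  # columns are the vectors
Wg = similar(Vg)

# Plan along dimension 1 only: one batched FFT over all K columns, so the
# kernel-launch overhead is paid once instead of K times.
Fg = plan_fft(Vg, 1)

mul!(Wg, Fg, Vg)   # in-place application; Wg .= Fg * Vg also works

With the whole 2^12-by-10,000 batch submitted at once, the total amount of data is comparable to the 2^20 case above, so this seems like the natural candidate for recovering the large-transform speedup.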