I am wondering if there is a way to Fourier-transform lots of short vectors on GPU.

I have more than 10,000 vectors. All vectors have the same length, around `2^12`. Each vector on its own is not long enough to get a significant performance boost from performing its FFT on the GPU. For example, running the following code

```
using CUDA, FFTW
n = 12
N = 2^n
v = rand(Complex{Float64}, N)
w = copy(v)
F = plan_fft(v)
@btime $w .= $F * $v
vg = cu(v)
wg = copy(vg)
Fg = plan_fft(vg)
@btime $wg .= $Fg * $vg
```

produces

```
24.818 µs (2 allocations: 64.05 KiB)
9.964 µs (20 allocations: 960 bytes)
```

so the GPU is only 2–3 times faster. In comparison, for a vector of length `2^20`, which is much longer than the `2^12` tested above, I get

```
39.214 ms (2 allocations: 16.00 MiB)
15.603 µs (20 allocations: 960 bytes)
```

so the GPU is three orders of magnitude faster than the CPU.

The above examples demonstrate that in order to get a significant performance boost on the GPU, the amount of data needs to be sufficiently large. In my case, each vector is short, but there are lots of them, so the total amount of data to transform seems sufficiently large. Is there a clever way to take advantage of this structure and perform the FFTs quickly on the GPU?
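One idea I can think of is stacking the vectors as the columns of a single matrix and building one plan over the first dimension, so that all columns are transformed in a single batched call (FFTW and CUFFT both accept a region argument through the `plan_fft(A, dims)` interface). A minimal sketch of what I mean, where the names `N`, `M`, `A`, etc. are my own and the sizes match my situation:

```julia
using CUDA, FFTW

N = 2^12        # length of each vector
M = 10_000      # number of vectors

# Stack the vectors as columns of one matrix.
A = rand(Complex{Float64}, N, M)

# CPU version: one plan along dims = 1 transforms every column at once.
F = plan_fft(A, 1)
B = F * A

# GPU version: the same interface dispatches to CUFFT's batched transform.
# (CuArray keeps Float64; cu() would silently convert to ComplexF32.)
if CUDA.functional()
    Ag = CuArray(A)
    Fg = plan_fft(Ag, 1)
    Bg = Fg * Ag
end
```

I have not verified whether a single batched plan like this actually recovers the GPU speedup seen for one long vector, or whether the per-column transforms are still too small to saturate the device.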