FFT'ing lots of short vectors on GPU

I am wondering if there is a way to Fourier-transform lots of short vectors on GPU.

I have more than 10,000 vectors, all of the same length, around 2^12. Each vector on its own is too short to get a significant performance boost from performing its FFT on the GPU. For example, running the following code

using BenchmarkTools, CUDA, FFTW

n = 12
N = 2^n

v = rand(Complex{Float64}, N)
w = copy(v)
F = plan_fft(v)
@btime $w .= $F * $v

vg = cu(v)
wg = copy(vg)
Fg = plan_fft(vg)
@btime $wg .= $Fg * $vg

produces

  24.818 µs (2 allocations: 64.05 KiB)
  9.964 µs (20 allocations: 960 bytes)

so the GPU is only 2–3 times faster. In comparison, for a vector of length 2^20, which is much longer than the 2^12 tested above, I get

  39.214 ms (2 allocations: 16.00 MiB)
  15.603 µs (20 allocations: 960 bytes)

so the GPU is three orders of magnitude faster than the CPU.

The above examples demonstrate that to get a significant performance boost on the GPU, the amount of data needs to be sufficiently large. In my case, each vector is short, but there are lots of them, so the total amount of data to FFT seems sufficiently large. I am wondering if there is a clever way to take advantage of this situation to perform the FFTs fast on the GPU.
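For reference, the obvious per-vector loop does not help: each transform is still short, and every iteration pays its own kernel-launch overhead. A minimal sketch of that baseline, assuming the vectors are stored as the columns of a matrix (the names K, Vg, Wg are mine, not from the benchmark above):

using CUDA, FFTW

N, K = 2^12, 10_000
Vg = cu(rand(Complex{Float64}, N, K))  # columns are the vectors; note cu converts to ComplexF32
Wg = similar(Vg)

Fg = plan_fft(Vg[:, 1])  # plan for a single length-N vector

# One transform per column: correct, but each iteration launches its own
# kernels, so the short-vector problem described above remains.
for j in 1:K
    Wg[:, j] = Fg * Vg[:, j]
end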

A side comment / question: Don’t you want to use CUDA.CUFFT (instead of FFTW) for FFTs on GPU? Or is it perhaps automatically used for CuArrays?

Actually I’m not sure exactly how it works, but if you import both CUDA and FFTW, the AbstractFFTs functions implemented in CUDA.CUFFT are exposed somehow:

julia> using CUDA, FFTW

julia> methods(plan_fft)
# 8 methods for generic function "plan_fft":
[1] plan_fft(X::CuArray{T, N}, region) where {T<:Union{ComplexF32, ComplexF64}, N} in CUDA.CUFFT at /home/gridsan/WO28768/.julia/packages/CUDA/sCev8/lib/cufft/fft.jl:296
...
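For what it’s worth, one quick way to confirm which backend gets picked is to inspect the plan types. A small sketch (the exact type names printed will vary across FFTW.jl and CUDA.jl versions):

using CUDA, FFTW

v  = rand(Complex{Float64}, 2^12)
vg = cu(v)               # cu converts to ComplexF32 by default

F  = plan_fft(v)         # dispatches to an FFTW plan for a plain Array
Fg = plan_fft(vg)        # dispatches to a CUDA.CUFFT plan for a CuArray

typeof(F), typeof(Fg)    # the module in each type name reveals the backend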

One potential method I hoped would be possible was to create a 2¹²-by-10,000 matrix V whose columns are the vectors I want to FFT. Because F = plan_fft(...) creates an operator that can be represented by a 2¹²-by-2¹² matrix, I hoped F * V would work, but it didn’t.

I suspect that, in principle, F * V could be implemented so that it runs fast on the GPU for a GPU matrix V, but I’m not sure. If this is possible and would be useful to other people, I may create a GitHub issue requesting this feature.
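That said, if I read the AbstractFFTs interface correctly, the supported way to express this is not to apply a length-2^12 plan to a matrix, but to create the plan on the matrix itself with the region argument set to 1. For a CuArray, CUDA.CUFFT should then execute it as a single batched transform instead of 10,000 separate ones. A sketch, with hypothetical names as before:

using CUDA, FFTW
using LinearAlgebra: mul!

N, K = 2^12, 10_000
Vg = cu(rand(Complex{Float64}, N, K))  # columns are the vectors
Wg = similar(Vg)

# Plan along dimension 1 only: one batched FFT over all K columns, so the
# kernel-launch overhead is paid once instead of K times.
Fg = plan_fft(Vg, 1)

mul!(Wg, Fg, Vg)   # in-place application; Wg .= Fg * Vg also works

With the whole 2^12-by-10,000 batch submitted at once, the total amount of data is comparable to the 2^20 case above, so this seems like the natural candidate for recovering the large-transform speedup.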