Unreasonably fast FFT on CUDA

jabru · January 29, 2024, 2:12pm

Hey there,

so I am currently working on an algorithm that will likely strongly depend on the FFT very significantly. Therefore I am considering to do the FFT in FFTW on Cuda to speed up the algorithm. It is a 3d FFT with about 353 x 353 x 353 points in the grid.
To benchmark the behaviour, I wrote the following code

using BenchmarkTools
function try_FFT_on_cuda()
    values = rand(353, 353, 353)
    value_complex = ComplexF32.(values)
    cvalues = similar(cu(value_complex), ComplexF32)
    copyto!(cvalues, values)
    cy = similar(cvalues)
    cF = plan_fft!(cvalues)
    @btime a = ($cF*$cy)
    return nothing
end

try_FFT_on_cuda()

The output is the following

 3.384 μs (0 allocations: 0 bytes)

only measuring the line

@btime a = ($cF*$cy)

seems reasonable, because in my algorithm will do the same FFT over and over again in place. Nontheless, I compared this two an implementaion on a 32 core cpu and received values on the order of 70 ms. So the CUDA implementation seems to be 10000x faster. As a result I have trouble believing what I am seeing there. Are these values realistic?

My Graphics card is an NVIDIA RTX A4000

Obviously if these numbers are correct that would be awsome, but it seems very fast.

Thank you in advance!

Oscar_Smith · January 29, 2024, 2:18pm

I think you might only be measuring the time to submit the kernel, not receive results back.

jabru · January 29, 2024, 2:31pm

I see, and what would I have to do in order to get the time required to get the results back?

barucden · January 29, 2024, 2:39pm

Adding CUDA.@sync after @btime should do it.

roflmaostc · January 29, 2024, 3:42pm

@btime CUDA.@sync $p * $x

Palli · January 29, 2024, 5:06pm

I’m not sure if this helps you but it’s pretty intriguing to me, FFT has 2% utilization, at least when doing a convolution with it, but 8x speedup possible:

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

2311.05908.pdf

[…] FlashFFTConv speeds up exact FFT convolutions by up to 7.93× over
PyTorch and achieves up to 4.4× speedup end-to-end.

[Monarch mixer is also a pretty intriguing new type of neural network to replace Transformers, and I point to the timestamp on the second paper (above) related to it.]

[I know convolutions are 2D, usually, e.g. there I believe, but I don’t know if you’re doing 3D convolutions, and this would also apply, or if this helps for 1D.]

jabru · January 29, 2024, 6:09pm

Thank you very much all!

This discussion was most helpful!

roflmaostc · January 29, 2024, 7:00pm

Usually 3D convolutions are much faster with FFTs, if the kernel is not like (2,2,2).

Also just dumping my numbers.

julia> using FFTW, CUDA, BenchmarkTools

julia> function try_FFT_on_cuda()
           values = rand(353, 353, 353)
           value_complex = ComplexF32.(values)
           cvalues = similar(cu(value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues)
           @btime CUDA.@sync a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cuda (generic function with 1 method)

julia> try_FFT_on_cuda()
  20.319 ms (6 allocations: 304 bytes)

julia> function try_FFT_on_cpu()
           values = rand(353, 353, 353)
           value_complex = ComplexF32.(values)
           cvalues = similar((value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues, flags=FFTW.MEASURE)
           @btime a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cpu (generic function with 1 method)

julia> try_FFT_on_cpu()
  100.503 ms (314 allocations: 27.09 KiB)



# power of 2
julia> function try_FFT_on_cpu()
           values = rand(256, 256, 256)
           value_complex = ComplexF32.(values)
           cvalues = similar((value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues, flags=FFTW.MEASURE)
           @btime a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cpu (generic function with 1 method)

julia> function try_FFT_on_cuda()
           values = rand(256, 256, 256)
           value_complex = ComplexF32.(values)
           cvalues = similar(cu(value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues)
           @btime CUDA.@sync a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cuda (generic function with 1 method)

julia> try_FFT_on_cpu()
  18.883 ms (314 allocations: 27.09 KiB)

julia> try_FFT_on_cuda()
  2.483 ms (6 allocations: 304 bytes)

jabru · February 3, 2024, 3:56pm

Thank you everyone for the great help!

It makes a lot more sense now!

Topic		Replies	Views
Why is CUDA.FFT slow only when performed over the second dimension of a 3D array? GPU cuda , fft	0	74	January 29, 2025
Why fft with MEASURE plan 10x slower than calling fft directly with CUDA.CUFFT? Performance gpu , cuda	7	171	September 22, 2024
FFT is too fast compared to assignment in CUDA GPU fftw , cuda	3	2628	October 28, 2021
CuPy CuFFT ~2x faster than CUDA.jl CuFFT GPU performance , cuda , fft	15	2827	February 27, 2023
FFTW scales pretty well (some @btime benchmarks) Performance fftw , gpu , parallel , multithreading	1	1713	February 4, 2025

Unreasonably fast FFT on CUDA

Related topics