Unreasonably fast FFT on CUDA

Hey there,

so I am currently working on an algorithm that will likely strongly depend on the FFT very significantly. Therefore I am considering to do the FFT in FFTW on Cuda to speed up the algorithm. It is a 3d FFT with about 353 x 353 x 353 points in the grid.
To benchmark the behaviour, I wrote the following code

using BenchmarkTools
function try_FFT_on_cuda()
    values = rand(353, 353, 353)
    value_complex = ComplexF32.(values)
    cvalues = similar(cu(value_complex), ComplexF32)
    copyto!(cvalues, values)
    cy = similar(cvalues)
    cF = plan_fft!(cvalues)
    @btime a = ($cF*$cy)
    return nothing
end

try_FFT_on_cuda()

The output is the following

 3.384 μs (0 allocations: 0 bytes)

only measuring the line

@btime a = ($cF*$cy)

seems reasonable, because in my algorithm will do the same FFT over and over again in place. Nontheless, I compared this two an implementaion on a 32 core cpu and received values on the order of 70 ms. So the CUDA implementation seems to be 10000x faster. As a result I have trouble believing what I am seeing there. Are these values realistic?

My Graphics card is an NVIDIA RTX A4000

Obviously if these numbers are correct that would be awsome, but it seems very fast.

Thank you in advance!

1 Like

I think you might only be measuring the time to submit the kernel, not receive results back.

3 Likes

I see, and what would I have to do in order to get the time required to get the results back?

Adding CUDA.@sync after @btime should do it.

2 Likes

@btime CUDA.@sync $p * $x

3 Likes

I’m not sure if this helps you but it’s pretty intriguing to me, FFT has 2% utilization, at least when doing a convolution with it, but 8x speedup possible:

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

2311.05908.pdf

[…] FlashFFTConv speeds up exact FFT convolutions by up to 7.93× over
PyTorch and achieves up to 4.4× speedup end-to-end.

[Monarch mixer is also a pretty intriguing new type of neural network to replace Transformers, and I point to the timestamp on the second paper (above) related to it.]

[I know convolutions are 2D, usually, e.g. there I believe, but I don’t know if you’re doing 3D convolutions, and this would also apply, or if this helps for 1D.]

1 Like

Thank you very much all!

This discussion was most helpful!

1 Like

Usually 3D convolutions are much faster with FFTs, if the kernel is not like (2,2,2).

Also just dumping my numbers.

julia> using FFTW, CUDA, BenchmarkTools

julia> function try_FFT_on_cuda()
           values = rand(353, 353, 353)
           value_complex = ComplexF32.(values)
           cvalues = similar(cu(value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues)
           @btime CUDA.@sync a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cuda (generic function with 1 method)

julia> try_FFT_on_cuda()
  20.319 ms (6 allocations: 304 bytes)

julia> function try_FFT_on_cpu()
           values = rand(353, 353, 353)
           value_complex = ComplexF32.(values)
           cvalues = similar((value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues, flags=FFTW.MEASURE)
           @btime a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cpu (generic function with 1 method)

julia> try_FFT_on_cpu()
  100.503 ms (314 allocations: 27.09 KiB)



# power of 2
julia> function try_FFT_on_cpu()
           values = rand(256, 256, 256)
           value_complex = ComplexF32.(values)
           cvalues = similar((value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues, flags=FFTW.MEASURE)
           @btime a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cpu (generic function with 1 method)

julia> function try_FFT_on_cuda()
           values = rand(256, 256, 256)
           value_complex = ComplexF32.(values)
           cvalues = similar(cu(value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues)
           @btime CUDA.@sync a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cuda (generic function with 1 method)

julia> try_FFT_on_cpu()
  18.883 ms (314 allocations: 27.09 KiB)

julia> try_FFT_on_cuda()
  2.483 ms (6 allocations: 304 bytes)

1 Like

Thank you everyone for the great help!

It makes a lot more sense now!