so I am currently working on an algorithm that will likely strongly depend on the FFT very significantly. Therefore I am considering to do the FFT in FFTW on Cuda to speed up the algorithm. It is a 3d FFT with about 353 x 353 x 353 points in the grid.
To benchmark the behaviour, I wrote the following code

using BenchmarkTools
function try_FFT_on_cuda()
values = rand(353, 353, 353)
value_complex = ComplexF32.(values)
cvalues = similar(cu(value_complex), ComplexF32)
copyto!(cvalues, values)
cy = similar(cvalues)
cF = plan_fft!(cvalues)
@btime a = ($cF*$cy)
return nothing
end
try_FFT_on_cuda()

The output is the following

3.384 μs (0 allocations: 0 bytes)

only measuring the line

@btime a = ($cF*$cy)

seems reasonable, because in my algorithm will do the same FFT over and over again in place. Nontheless, I compared this two an implementaion on a 32 core cpu and received values on the order of 70 ms. So the CUDA implementation seems to be 10000x faster. As a result I have trouble believing what I am seeing there. Are these values realistic?

My Graphics card is an NVIDIA RTX A4000

Obviously if these numbers are correct that would be awsome, but it seems very fast.

I’m not sure if this helps you but it’s pretty intriguing to me, FFT has 2% utilization, at least when doing a convolution with it, but 8x speedup possible:

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

[…] FlashFFTConv speeds up exact FFT convolutions by up to 7.93× over
PyTorch and achieves up to 4.4× speedup end-to-end.

[Monarch mixer is also a pretty intriguing new type of neural network to replace Transformers, and I point to the timestamp on the second paper (above) related to it.]

[I know convolutions are 2D, usually, e.g. there I believe, but I don’t know if you’re doing 3D convolutions, and this would also apply, or if this helps for 1D.]