Unreasonably fast FFT on CUDA

Palli · January 29, 2024, 5:06pm

I’m not sure if this helps you, but I find it pretty intriguing: FFT reportedly reaches only about 2% utilization, at least when used for a convolution, yet a speedup of up to 8x is possible:

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

2311.05908.pdf

[…] FlashFFTConv speeds up exact FFT convolutions by up to 7.93× over
PyTorch and achieves up to 4.4× speedup end-to-end.

[Monarch Mixer is also a pretty intriguing new type of neural network meant to replace Transformers, and I point to the timestamp for the second paper (above), which is related to it.]

[I know convolutions are usually 2D, e.g. there, I believe, but I don’t know if you’re doing 3D convolutions and whether this would also apply, or if this helps for 1D.]
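
In case it helps to see what "FFT convolution" means in the 1D case, here is a minimal sketch in plain Python. It is not FlashFFTConv and not GPU code; the function names (`fft`, `fft_convolve`, `direct_convolve`) are my own for illustration, and it assumes non-empty real-valued inputs. It only shows the identity the paper builds on: zero-pad both sequences, transform, multiply pointwise, transform back.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT (unnormalized).

    Assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def fft_convolve(a, b):
    """Linear convolution of two non-empty real sequences via FFT."""
    out_len = len(a) + len(b) - 1
    # Pad to a power of two >= out_len so the circular convolution
    # computed by the FFT equals the linear convolution.
    n = 1
    while n < out_len:
        n *= 2
    fa = fft(list(a) + [0.0] * (n - len(a)))
    fb = fft(list(b) + [0.0] * (n - len(b)))
    prod = [p * q for p, q in zip(fa, fb)]  # pointwise product
    y = fft(prod, inverse=True)
    # The inverse transform above is unnormalized, so divide by n.
    return [v.real / n for v in y[:out_len]]


def direct_convolve(a, b):
    """Reference O(len(a) * len(b)) convolution for comparison."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out
```

Both functions should agree up to floating-point rounding, e.g. `fft_convolve([1, 2, 3], [0, 1, 0.5])` versus `direct_convolve([1, 2, 3], [0, 1, 0.5])`. The FFT route costs O(n log n) instead of O(n²), which is why it matters for long sequences.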
