I’m not sure if this helps you, but I find it pretty intriguing: the FFT apparently gets only 2% hardware utilization, at least when it’s used to do a convolution, yet a speedup of up to 8x is possible:
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
[…] FlashFFTConv speeds up exact FFT convolutions by up to 7.93× over
PyTorch and achieves up to 4.4× speedup end-to-end.
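In case it's useful, here is a minimal sketch of what an "exact FFT convolution" computes, written in plain NumPy. This is only the textbook baseline, not FlashFFTConv's tensor-core implementation, and `fft_conv1d` is a name I made up for illustration:

```python
import numpy as np

def fft_conv1d(signal, kernel):
    """Full linear convolution of two 1D real arrays via the FFT (hypothetical helper)."""
    n = len(signal) + len(kernel) - 1      # length of the full linear convolution
    n_fft = 1 << (n - 1).bit_length()      # pad to a power of two >= n to avoid wrap-around
    # Pointwise multiplication in the frequency domain equals convolution in time.
    spectrum = np.fft.rfft(signal, n_fft) * np.fft.rfft(kernel, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:n]

# Sanity check against NumPy's direct convolution.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
k = rng.standard_normal(128)
assert np.allclose(fft_conv1d(x, k), np.convolve(x, k))
```

The result matches the direct method up to floating-point error, which is what "exact" means here as opposed to an approximation. The paper's speedup is about how the transforms are mapped onto the hardware, which this sketch doesn't attempt.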
[Monarch Mixer is also a pretty intriguing new type of neural network meant to replace Transformers; the timestamp I point to on the second paper (above) is related to it.]
[I know convolutions are usually 2D, which I believe is the case there too, but I don’t know whether you’re doing 3D convolutions and whether this would also apply to them, or whether it only helps for 1D.]