Why is CUDA.FFT slow only when performed over the second dimension of a 3D array?

Consider the following test on the CPU:

using CUDA, FFTW, BenchmarkTools

x = randn(ComplexF32, 2^7, 2^7, 2^7)

for dim ∈ 1:ndims(x)
    @info "FFT along dimension $dim"
    display(@benchmark CUDA.@sync fft($x, $dim))
end
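
For reference, a quick check of the memory layout involved (plain Julia, nothing GPU-specific): arrays are column-major, so only dimension 1 is unit-stride, and the distance between consecutive points grows with the dimension:

@show strides(x)                         # (1, 128, 16384), strides in elements
@show strides(x) .* sizeof(ComplexF32)   # (8, 1024, 131072), bytes between consecutive points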

I would expect the fastest case to be the one with dim=1, because then the points involved in each transform are contiguous in memory. The CPU results are consistent with this intuition:

[ Info: FFT along dimension 1
BenchmarkTools.Trial: 929 samples with 1 evaluation per sample.
 Range (min … max):  2.109 ms … 9.301 ms  ┊ GC (min … max):  0.00% …  9.20%
 Time  (median):     5.512 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   5.386 ms ± 2.190 ms  ┊ GC (mean ± σ):  13.02% ± 20.68%

   █                                                         
  ▄█▅▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▃▅▆▆▄▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁▂▁▁▃▆▇▅▅▃▅▅▇▆▃▄▂▃ ▃
  2.11 ms        Histogram: frequency by time       8.22 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 8.
[ Info: FFT along dimension 2
BenchmarkTools.Trial: 739 samples with 1 evaluation per sample.
 Range (min … max):  3.519 ms … 12.589 ms  ┊ GC (min … max):  0.00% … 63.60%
 Time  (median):     7.176 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.769 ms ±  2.003 ms  ┊ GC (mean ± σ):  12.64% ± 17.06%

   ▅█                        ▁             ▄        ▄         
  ▂███▅▃▂▂▂▂▁▁▁▁▁▁▁▂▁▂▁▁▁▂▃▅██▇▅▄▃▃▃▂▃▁▁▁▁▃██▇▅▄▄▃▄▆██▄▅▅▄▄▃ ▃
  3.52 ms        Histogram: frequency by time        9.52 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 8.
[ Info: FFT along dimension 3
BenchmarkTools.Trial: 563 samples with 1 evaluation per sample.
 Range (min … max):  5.128 ms … 16.052 ms  ┊ GC (min … max):  0.00% … 52.46%
 Time  (median):     9.486 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.879 ms ±  2.213 ms  ┊ GC (mean ± σ):  10.83% ± 14.65%

     ▇█▆                      ▄▂ ▂            ▆▆▂ ▁ ▁▁▃▁ ▁    
  ▅▇▇███▇▅▂▂▁▂▁▁▁▁▁▁▁▁▁▁▂▁▄▄▄▇████▆▄▄▄▃▃▂▂▄▅▄▆█████▇██████▇▆ ▄
  5.13 ms        Histogram: frequency by time        11.7 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 8.

When running the same thing but with x on the GPU (x = CUDA.randn(ComplexF32, 2^7, 2^7, 2^7)), I get:

[ Info: FFT along dimension 1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  20.250 μs …  2.765 ms  ┊ GC (min … max):  0.00% … 95.44%
 Time  (median):     24.590 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   35.452 μs ± 67.646 μs  ┊ GC (mean ± σ):  25.01% ± 13.52%

  █▆▃                                                         ▁
  ████▅▃▁▁▄▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆▇▇▇▇▇▇▇ █
  20.2 μs      Histogram: log(frequency) by time       356 μs <

 Memory estimate: 1.12 KiB, allocs estimate: 30.
[ Info: FFT along dimension 2
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  399.892 μs …   3.124 ms  ┊ GC (min … max): 0.00% … 69.76%
 Time  (median):     406.122 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   436.497 μs ± 159.030 μs  ┊ GC (mean ± σ):  4.26% ±  7.97%

  █▅▃▁                                                          ▁
  █████▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆▇▆▅▅▆▅▆▇▇▇▇▇ █
  400 μs        Histogram: log(frequency) by time       1.33 ms <

 Memory estimate: 64.89 KiB, allocs estimate: 3221.
[ Info: FFT along dimension 3
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  21.680 μs … 504.712 μs  ┊ GC (min … max):  0.00% … 82.88%
 Time  (median):     25.130 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   36.442 μs ±  61.370 μs  ┊ GC (mean ± σ):  25.38% ± 13.70%

  █▄▂                                                          ▁
  ███▇▄▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▇▇▇▇▇▇▇ █
  21.7 μs       Histogram: log(frequency) by time       388 μs <

 Memory estimate: 1.12 KiB, allocs estimate: 30.

For me, there are two odd behaviors here:

  • The slowdown is huge when dim=2: the median jumps from ~25 μs (dim=1) to ~406 μs, and the allocation count from 30 to 3221 (see the plan-based sketch below).
  • The slowdown is negligible when dim=3, even though, by the contiguity argument above, that should be the least favorable case.
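
Since the dim=2 case is also the only one with thousands of allocations, the first thing I would separate out is plan creation. A minimal sketch (assuming the AbstractFFTs-style plan_fft that CUDA.jl provides for CuArray), which builds the CUFFT plan once and times only its application:

using CUDA, FFTW, BenchmarkTools

x = CUDA.randn(ComplexF32, 2^7, 2^7, 2^7)

for dim ∈ 1:ndims(x)
    p = plan_fft(x, dim)                     # plan (and its work area) created outside the benchmark
    @info "planned FFT along dimension $dim"
    display(@benchmark CUDA.@sync $p * $x)   # p * x applies the precomputed plan
end

If the dim=2 gap persists with a precomputed plan, the overhead would be in the transform itself rather than in planning.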

Could anyone help me understand this difference?
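
For comparison, this is the permute-transform-permute variant I would also time against the direct fft(x, 2) call. A sketch only: it computes the same result by moving dimension 2 to the front (where the transform is fast, per the numbers above) and back:

xp = permutedims(x, (2, 1, 3))        # dimension 2 becomes the contiguous dimension
yp = fft(xp, 1)                       # contiguous transform
y  = permutedims(yp, (2, 1, 3))       # same result as fft(x, 2)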
