Why is CUDA.FFT slow only when performed over the second dimension of a 3D array?

Consider the following test on the CPU:

using CUDA, FFTW, BenchmarkTools

x = randn(ComplexF32, 2^7, 2^7, 2^7)

for dim ∈ 1:ndims(x)
    @info "FFT along dimension $dim"
    display(@benchmark CUDA.@sync fft($x, $dim))
end
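
For reference, a quick check of the memory layout involved (plain Julia, nothing GPU-specific): arrays are column-major, so only dimension 1 is unit-stride, and the distance between consecutive points grows with the dimension:

@show strides(x)                         # (1, 128, 16384), strides in elements
@show strides(x) .* sizeof(ComplexF32)   # (8, 1024, 131072), bytes between consecutive points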

I would expect the fastest case to be the one with dim=1, because then the points involved in each transform are contiguous in memory. The CPU results are consistent with this intuition:

[ Info: FFT along dimension 1
BenchmarkTools.Trial: 929 samples with 1 evaluation per sample.
 Range (min … max):  2.109 ms … 9.301 ms  ┊ GC (min … max):  0.00% …  9.20%
 Time  (median):     5.512 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   5.386 ms ± 2.190 ms  ┊ GC (mean ± σ):  13.02% ± 20.68%

   █                                                         
  ▄█▅▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▃▅▆▆▄▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁▂▁▁▃▆▇▅▅▃▅▅▇▆▃▄▂▃ ▃
  2.11 ms        Histogram: frequency by time       8.22 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 8.
[ Info: FFT along dimension 2
BenchmarkTools.Trial: 739 samples with 1 evaluation per sample.
 Range (min … max):  3.519 ms … 12.589 ms  ┊ GC (min … max):  0.00% … 63.60%
 Time  (median):     7.176 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.769 ms ±  2.003 ms  ┊ GC (mean ± σ):  12.64% ± 17.06%

   ▅█                        ▁             ▄        ▄         
  ▂███▅▃▂▂▂▂▁▁▁▁▁▁▁▂▁▂▁▁▁▂▃▅██▇▅▄▃▃▃▂▃▁▁▁▁▃██▇▅▄▄▃▄▆██▄▅▅▄▄▃ ▃
  3.52 ms        Histogram: frequency by time        9.52 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 8.
[ Info: FFT along dimension 3
BenchmarkTools.Trial: 563 samples with 1 evaluation per sample.
 Range (min … max):  5.128 ms … 16.052 ms  ┊ GC (min … max):  0.00% … 52.46%
 Time  (median):     9.486 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.879 ms ±  2.213 ms  ┊ GC (mean ± σ):  10.83% ± 14.65%

     ▇█▆                      ▄▂ ▂            ▆▆▂ ▁ ▁▁▃▁ ▁    
  ▅▇▇███▇▅▂▂▁▂▁▁▁▁▁▁▁▁▁▁▂▁▄▄▄▇████▆▄▄▄▃▃▂▂▄▅▄▆█████▇██████▇▆ ▄
  5.13 ms        Histogram: frequency by time        11.7 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 8.

When running the same thing but with x on the GPU (x = CUDA.randn(ComplexF32, 2^7, 2^7, 2^7)), I get:

[ Info: FFT along dimension 1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  20.250 μs …  2.765 ms  ┊ GC (min … max):  0.00% … 95.44%
 Time  (median):     24.590 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   35.452 μs ± 67.646 μs  ┊ GC (mean ± σ):  25.01% ± 13.52%

  █▆▃                                                         ▁
  ████▅▃▁▁▄▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆▇▇▇▇▇▇▇ █
  20.2 μs      Histogram: log(frequency) by time       356 μs <

 Memory estimate: 1.12 KiB, allocs estimate: 30.
[ Info: FFT along dimension 2
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  399.892 μs …   3.124 ms  ┊ GC (min … max): 0.00% … 69.76%
 Time  (median):     406.122 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   436.497 μs ± 159.030 μs  ┊ GC (mean ± σ):  4.26% ±  7.97%

  █▅▃▁                                                          ▁
  █████▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆▇▆▅▅▆▅▆▇▇▇▇▇ █
  400 μs        Histogram: log(frequency) by time       1.33 ms <

 Memory estimate: 64.89 KiB, allocs estimate: 3221.
[ Info: FFT along dimension 3
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  21.680 μs … 504.712 μs  ┊ GC (min … max):  0.00% … 82.88%
 Time  (median):     25.130 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   36.442 μs ±  61.370 μs  ┊ GC (mean ± σ):  25.38% ± 13.70%

  █▄▂                                                          ▁
  ███▇▄▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▇▇▇▇▇▇▇ █
  21.7 μs       Histogram: log(frequency) by time       388 μs <

 Memory estimate: 1.12 KiB, allocs estimate: 30.

For me, there are two odd behaviors here:

  • The slowdown is huge when dim=2: the median jumps from ~25 μs (dim=1) to ~406 μs, and the allocation count from 30 to 3221 (see the plan-based sketch below).
  • The slowdown is negligible when dim=3, even though, by the contiguity argument above, that should be the least favorable case.
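
Since the dim=2 case is also the only one with thousands of allocations, the first thing I would separate out is plan creation. A minimal sketch (assuming the AbstractFFTs-style plan_fft that CUDA.jl provides for CuArray), which builds the CUFFT plan once and times only its application:

using CUDA, FFTW, BenchmarkTools

x = CUDA.randn(ComplexF32, 2^7, 2^7, 2^7)

for dim ∈ 1:ndims(x)
    p = plan_fft(x, dim)                     # plan (and its work area) created outside the benchmark
    @info "planned FFT along dimension $dim"
    display(@benchmark CUDA.@sync $p * $x)   # p * x applies the precomputed plan
end

If the dim=2 gap persists with a precomputed plan, the overhead would be in the transform itself rather than in planning.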

Could anyone help me understand this difference?
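
For comparison, this is the permute-transform-permute variant I would also time against the direct fft(x, 2) call. A sketch only: it computes the same result by moving dimension 2 to the front (where the transform is fast, per the numbers above) and back:

xp = permutedims(x, (2, 1, 3))        # dimension 2 becomes the contiguous dimension
yp = fft(xp, 1)                       # contiguous transform
y  = permutedims(yp, (2, 1, 3))       # same result as fft(x, 2)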
