Consider the following test on the CPU:
using CUDA, FFTW, BenchmarkTools
x = randn(ComplexF32, 2^7, 2^7, 2^7)
for dim ∈ 1:ndims(x)
    @info "FFT along dimension $dim"
    display(@benchmark CUDA.@sync fft($x, $dim))
end
I would expect the fastest case to be the one with dim=1, because then the points involved in the FFT are contiguous in memory. The results are consistent with this intuition:
[ Info: FFT along dimension 1
BenchmarkTools.Trial: 929 samples with 1 evaluation per sample.
 Range (min … max):  2.109 ms … 9.301 ms  ┊ GC (min … max):  0.00% … 9.20%
 Time  (median):     5.512 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   5.386 ms ± 2.190 ms  ┊ GC (mean ± σ):  13.02% ± 20.68%
 2.11 ms  Histogram: frequency by time  8.22 ms <
Memory estimate: 16.00 MiB, allocs estimate: 8.
[ Info: FFT along dimension 2
BenchmarkTools.Trial: 739 samples with 1 evaluation per sample.
 Range (min … max):  3.519 ms … 12.589 ms  ┊ GC (min … max):  0.00% … 63.60%
 Time  (median):     7.176 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.769 ms ±  2.003 ms  ┊ GC (mean ± σ):  12.64% ± 17.06%
 3.52 ms  Histogram: frequency by time  9.52 ms <
Memory estimate: 16.00 MiB, allocs estimate: 8.
[ Info: FFT along dimension 3
BenchmarkTools.Trial: 563 samples with 1 evaluation per sample.
 Range (min … max):  5.128 ms … 16.052 ms  ┊ GC (min … max):  0.00% … 52.46%
 Time  (median):     9.486 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.879 ms ±  2.213 ms  ┊ GC (mean ± σ):  10.83% ± 14.65%
 5.13 ms  Histogram: frequency by time  11.7 ms <
Memory estimate: 16.00 MiB, allocs estimate: 8.
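To make the contiguity argument concrete, here is a minimal sketch (not part of the benchmark above) that prints the memory stride along each dimension. It assumes a plain column-major `Array` with the same element type and size as `x`; the variable name `s` is only illustrative.

```julia
# Illustrative only: an array with the same element type and size as x.
# Julia arrays are column-major, so dimension 1 varies fastest in memory.
x = zeros(ComplexF32, 2^7, 2^7, 2^7)

# strides(x)[d] is the distance, in elements, between neighbours along dimension d.
s = strides(x)

for dim in 1:ndims(x)
    nbytes = s[dim] * sizeof(eltype(x))
    println("dim = $dim: stride = $(s[dim]) elements ($nbytes bytes)")
end
```

The strides are 1, 128 and 16384 elements (8, 1024 and 131072 bytes), so consecutive points of a transform along dim=1 are adjacent in memory, while those along dim=2 and dim=3 are progressively further apart. This matches the ordering of the CPU timings.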
When running the same thing but with x on the GPU (x = CUDA.randn(ComplexF32, 2^7, 2^7, 2^7)), I get
[ Info: FFT along dimension 1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  20.250 μs …  2.765 ms  ┊ GC (min … max):  0.00% … 95.44%
 Time  (median):     24.590 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   35.452 μs ± 67.646 μs  ┊ GC (mean ± σ):  25.01% ± 13.52%
 20.2 μs  Histogram: log(frequency) by time  356 μs <
Memory estimate: 1.12 KiB, allocs estimate: 30.
[ Info: FFT along dimension 2
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  399.892 μs …   3.124 ms  ┊ GC (min … max): 0.00% … 69.76%
 Time  (median):     406.122 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   436.497 μs ± 159.030 μs  ┊ GC (mean ± σ):  4.26% ±  7.97%
 400 μs  Histogram: log(frequency) by time  1.33 ms <
Memory estimate: 64.89 KiB, allocs estimate: 3221.
[ Info: FFT along dimension 3
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  21.680 μs … 504.712 μs  ┊ GC (min … max):  0.00% … 82.88%
 Time  (median):     25.130 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   36.442 μs ±  61.370 μs  ┊ GC (mean ± σ):  25.38% ± 13.70%
 21.7 μs  Histogram: log(frequency) by time  388 μs <
Memory estimate: 1.12 KiB, allocs estimate: 30.
For me, there are two odd behaviors here:
- The slowdown is HUGE when dim=2.
- The slowdown is negligible when dim=3.
Could anyone help me understand this difference?