FFT is too fast compared to assignment in CUDA

FFT is a fast algorithm, but on CUDA its cost appears comparable to that of a simple element-wise assignment. I am wondering whether this is expected.

Here are the contents of a performance test script named test_fft_vs_assign.jl for a fairly large number of sampling points (N = 2^20):

using CUDA
using FFTW
using BenchmarkTools

n = 20
N = 2^n

x = rand(ComplexF64, N)
y = similar(x)
F = plan_fft!(x)

cx = similar(cu(x), ComplexF64)  # make sure eltype(cx) == ComplexF64
copyto!(cx, x)  # copy contents of x to cx
cy = similar(cx)
cF = plan_fft!(cx)

print("CPU assignment:")
@btime $y .= $x

print("CPU assignment + in-place planned FFT:")
@btime ($y .= $x; $F * $y)  # assignment first to keep input same

print("GPU assignment:")
@btime $cy .= $cx

print("GPU assignment + in-place planned FFT:")
@btime ($cy .= $cx; $cF * $cy)

and here is the result:

julia> VERSION

(@v1.6) pkg> st CUDA
      Status `~/.julia/environments/v1.6/Project.toml`
  [052768ef] CUDA v3.4.2

(@v1.6) pkg> st FFTW
      Status `~/.julia/environments/v1.6/Project.toml`
  [7a1cc6ca] FFTW v1.4.5

julia> include("test_fft_vs_assign.jl")
CPU assignment:  2.442 ms (0 allocations: 0 bytes)
CPU assignment + in-place planned FFT:  16.181 ms (0 allocations: 0 bytes)  # FFT alone: 16.181 − 2.442 = 13.739 ms
GPU assignment:  3.903 μs (27 allocations: 1.47 KiB)
GPU assignment + in-place planned FFT:  8.951 μs (28 allocations: 1.50 KiB)  # FFT alone: 8.951 − 3.903 = 5.048 µs

So, in this specific experiment,

  • on CPU: FFT of a vector is slower than element-wise assignment by a factor of 13.739 ms / 2.442 ms ≈ 5.6.
  • on GPU: FFT of a vector is slower than element-wise assignment by a factor of 5.048 µs / 3.903 µs ≈ 1.3.

This means that FFT is nearly as cheap as element-wise assignment on GPU.

I am implementing an algorithm in which the FFT operations are known to be the most time-consuming part, so I have been trying to minimize the number of FFTs in my implementation. However, the algorithm also uses many basic element-wise operations (assignment, addition, multiplication, …) between CUDA vectors. Because those basic operations appear to be nearly as expensive as an FFT, it seems that I don't have to be too concerned about reducing the number of FFTs when implementing the algorithm on the GPU.

Of course, the result would be different for different values of N, but I am wondering if the above conclusion is more or less correct when dealing with FFTs on the GPU.

A 1M-element FFT is not much work for a modern GPU. The Stack Overflow question "How bad is it to launch many small kernels in CUDA?" suggests you might just be seeing a kernel-launch overhead floor. You might try larger problem sizes for a better comparison. I'd also suggest computing the expected copy performance from your device's global memory bandwidth.
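As a rough sketch of that bandwidth estimate (the 900 GB/s figure is an assumption, not a measurement; substitute your own device's global memory bandwidth):

```julia
# Estimated time for cy .= cx with N = 2^20 ComplexF64 elements.
# A device-to-device copy reads each element once and writes it once.
N = 2^20
bytes_moved = 2 * N * sizeof(ComplexF64)  # 33_554_432 bytes of traffic
bandwidth = 900e9                         # assumed 900 GB/s; check your GPU's spec
t_copy = bytes_moved / bandwidth          # ≈ 3.7e-5 s, i.e. roughly 37 µs
```

If the measured copy time is far above this estimate, the benchmark is probably not measuring what you think it is.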

Try again with synchronization on the CUDA side to make sure you're capturing the full execution time (see the Profiling section of the CUDA.jl documentation). It's possible that only the asynchronous launch time is being measured, as @maedoc mentioned.
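For reference, a minimal sketch of the synchronized version of the benchmark, reusing the setup from test_fft_vs_assign.jl (CUDA.@sync blocks until all queued device work has finished):

```julia
using CUDA
using FFTW
using BenchmarkTools

N = 2^20
cx = CuArray(rand(ComplexF64, N))  # ComplexF64 GPU vector
cy = similar(cx)
cF = plan_fft!(cx)

# Without CUDA.@sync, @btime only times the asynchronous kernel launch,
# not the actual execution on the device.
@btime CUDA.@sync ($cy .= $cx)               # full device-side copy time
@btime CUDA.@sync ($cy .= $cx; $cF * $cy)    # copy + in-place planned FFT
```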


Thanks for the tip about CUDA.@sync. On my card, with the sync, the memcpy takes about 100 µs, while the FFT takes 924 µs, for 1M elements. This makes much more sense.
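As a quick sanity check on that figure, the implied copy bandwidth (assuming ComplexF64 data and N = 2^20, as in the original script):

```julia
# A ~100 µs device-to-device copy of 2^20 ComplexF64 elements implies:
N = 2^20
bytes_moved = 2 * N * sizeof(ComplexF64)  # read + write, 33_554_432 bytes
t_copy = 100e-6                           # measured ≈ 100 µs with CUDA.@sync
bw = bytes_moved / t_copy                 # ≈ 3.4e11 B/s, i.e. roughly 335 GB/s
```

That is a plausible fraction of a modern GPU's peak memory bandwidth, which suggests the synchronized numbers are now measuring real execution time.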