FFT is a pretty fast algorithm, but its performance on CUDA seems even comparable to simple element-wise assignment. I am wondering if this is something expected.
Here is the contents of a performance test code named test_fft_vs_assign.jl
for a fairly large number of sampling points (N = 2^20
):
using CUDA
using FFTW
using BenchmarkTools
n = 20
N = 2^n
x = rand(ComplexF64, N)
y = similar(x)
F = plan_fft!(x)
cx = similar(cu(x), ComplexF64) # make sure eltype(cx) = ComplexF64
copyto!(cx, x) # copy contents of x to cx
cy = similar(cx)
cF = plan_fft!(cx)
print("CPU assignment:")
@btime $y .= $x
print("CPU assignment + in-place planned FFT:")
@btime ($y .= $x; $F * $y) # assignment first to keep input same
print("GPU assignment:")
@btime $cy .= $cx
print("GPU assignment + in-place planned FFT:")
@btime ($cy .= $cx; $cF * $cy)
and here is the result:
julia> VERSION
v"1.6.2-pre.0"
(@v1.6) pkg> st CUDA
Status `~/.julia/environments/v1.6/Project.toml`
[052768ef] CUDA v3.4.2
(@v1.6) pkg> st FFTW
Status `~/.julia/environments/v1.6/Project.toml`
[7a1cc6ca] FFTW v1.4.5
julia> include("test_fft_vs_assign.jl")
CPU assignment: 2.442 ms (0 allocations: 0 bytes)
CPU assignment + in-place planned FFT: 16.181 ms (0 allocations: 0 bytes) # FFT alone: 16.181 − 2.442 = 13.739 ms
GPU assignment: 3.903 μs (27 allocations: 1.47 KiB)
GPU assignment + in-place planned FFT: 8.951 μs (28 allocations: 1.50 KiB) # FFT alone: 8.951 − 3.903 = 5.048 µs
So, in this specific experiment,
- on CPU: FFT of a vector is slower than element-wise assignment by a factor of 13.739 ms / 2.442 ms ≈ 5.6.
- on GPU: FFT of a vector is slower than element-wise assignment by a factor of 5.048 µs / 3.903 µs ≈ 1.3.
This means that FFT is nearly as cheap as element-wise assignment on GPU.
I am implementing an algorithm in which FFT operations are known to be the most time-consuming part. So, I have been trying to minimize the number of FFTs in my implementation of the algorithm. But the algorithm uses lots of basic element-wise operations (assignment, addition, multiplication, …) as well between CUDA vectors. Because those basic operations are nearly as expensive as FFT, it seems that I don’t have to be too concerned about reducing the number of FFTs in implementing the algorithm on GPU.
Of course, the result would be different for different values of N
, but I am wondering if the above conclusion is more or less correct in dealing with FFT on GPU.