Why is fft with a MEASURE plan 10x slower than calling fft directly with CUDA.CUFFT?

using CUDA
using CUDA.CUFFT
using BenchmarkTools: @benchmark
import FFTW

a_gpu = CUDA.rand(Float32, 128, 128, 128)
@benchmark CUDA.@sync fft($a_gpu)

which gives:

BenchmarkTools.Trial: 1744 samples with 1 evaluation.
 Range (min … max):  2.390 ms … 116.107 ms  ┊ GC (min … max): 0.00% … 96.93%
 Time  (median):     2.596 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.842 ms ±   2.762 ms  ┊ GC (mean ± σ):  3.00% ±  3.53%

    ▄▆█▅                       ▁ ▄▁                            
  █▇████▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆▅▇▇█████▅▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▅▅▆▄▄▅▅ █
  2.39 ms      Histogram: log(frequency) by time      4.97 ms <

 Memory estimate: 5.03 KiB, allocs estimate: 162.

Meanwhile, the following code:

fp = plan_fft(a_gpu, flags=FFTW.MEASURE)
@benchmark CUDA.@sync $fp * $a_gpu

gives:

BenchmarkTools.Trial: 202 samples with 1 evaluation.
 Range (min … max):  22.257 ms … 32.898 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.642 ms              ┊ GC (median):    4.01%
 Time  (mean ± σ):   24.843 ms ±  1.587 ms  ┊ GC (mean ± σ):  4.25% ± 3.75%

        ▁     ▂▄      █                                        
  ▂▂▁▃▆██▅█▄▂▅██▇▄▃▁▄▇█▇▅▂▁▂▁▁▁▁▁▁▁▁▃▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂ ▃
  22.3 ms         Histogram: frequency by time        32.6 ms <

 Memory estimate: 40.00 MiB, allocs estimate: 11.

Am I doing something wrong?

Output of CUDA.versioninfo():

CUDA runtime 12.6, artifact installation
CUDA driver 12.6
NVIDIA driver 535.183.1

CUDA libraries: 
- CUBLAS: 12.6.1
- CURAND: 10.3.7
- CUFFT: 11.2.6
- CUSOLVER: 11.6.4
- CUSPARSE: 12.5.3
- CUPTI: 2024.3.1 (API 24.0.0)
- NVML: 12.0.0+535.183.1

Julia packages: 
- CUDA: 5.5.0
- CUDA_Driver_jll: 0.10.2+0
- CUDA_Runtime_jll: 0.15.2+0

Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7

1 device:
  0: NVIDIA T400 4GB (sm_75, 347.000 MiB / 4.000 GiB available)

CUDA FFTs don’t use FFTW.


I think what happens with the FFTW.MEASURE flag is that it somehow makes an FFTW plan instead of a CUFFT plan. So for me it errors when applied to a_gpu:

julia> using FFTW, CUDA, CUDA.CUFFT

julia> a_gpu = CUDA.zeros(2,2)
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.0  0.0
 0.0  0.0

julia> fp = plan_fft(a_gpu, flags=FFTW.MEASURE)
FFTW forward plan for 2×2 array of ComplexF32
(dft-rank>=2/1
  (dft-direct-2-x2 "n1fv_2_sse2")
  (dft-direct-2-x2 "n1fv_2_sse2"))

julia> fp * a_gpu
ERROR: MethodError: *(::FFTW.cFFTWPlan{ComplexF32, -1, false, 2, UnitRange{Int64}}, ::CuArray{ComplexF32, 2, CUDA.DeviceMemory}) is ambiguous.

Candidates:
  *(p::FFTW.cFFTWPlan{T, K, false}, x::StridedArray{T, N}) where {T, K, N}
    @ FFTW ~/.julia/packages/FFTW/6nZei/src/fft.jl:826
  *(p::AbstractFFTs.Plan{T}, x::CuArray) where T
    @ CUDA.CUFFT ~/.julia/packages/CUDA/Tl08O/lib/cufft/fft.jl:11

Possible fix, define
  *(::FFTW.cFFTWPlan{T, K, false}, ::CuArray{T, N}) where {T, K, N}

Stacktrace:
 [1] *(p::FFTW.cFFTWPlan{ComplexF32, -1, false, 2, UnitRange{Int64}}, x::CuArray{Float32, 2, CUDA.DeviceMemory})
   @ CUDA.CUFFT ~/.julia/packages/CUDA/Tl08O/lib/cufft/fft.jl:11
 [2] top-level scope
   @ REPL[19]:1

julia> fp * Array(a_gpu)
2×2 Matrix{ComplexF32}:
 0.0+0.0im  0.0+0.0im
 0.0+0.0im  0.0+0.0im

julia> plan_fft(a_gpu)
CUFFT complex forward plan for 2×2 CuArray of ComplexF32

(@CUDA) pkg> st
Status `~/.julia/environments/CUDA/Project.toml`
⌃ [052768ef] CUDA v5.4.3
  [7a1cc6ca] FFTW v1.8.0

I think in your case using FFTW exports plan_fft, which shadows CUDA.CUFFT.plan_fft.
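
Dropping the flags keyword gives you the CUFFT plan again; a minimal sketch (not benchmarked here):

using CUDA, CUDA.CUFFT

a_gpu = CUDA.rand(ComplexF32, 128, 128, 128)

fp = plan_fft(a_gpu)      # prints as "CUFFT complex forward plan for ... CuArray of ComplexF32"
CUDA.@sync fp * a_gpu     # runs entirely on the GPU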

I don’t quite understand. If CUFFT does not support plan_fft with FFTW.MEASURE, why does it export this function?
I also tested FFTW on the CPU; its performance is:

BenchmarkTools.Trial: 645 samples with 1 evaluation.
 Range (min … max):  7.331 ms …  10.558 ms  ┊ GC (min … max): 0.00% … 3.88%
 Time  (median):     7.584 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.747 ms ± 453.441 μs  ┊ GC (mean ± σ):  0.92% ± 2.11%

   ▂▆█▄▂                                                       
  ▃█████▆▆▇▇▆▄▃▃▄▅▄▆▆▅▄▃▂▂▃▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃ ▃
  7.33 ms         Histogram: frequency by time        9.89 ms <

 Memory estimate: 8.13 MiB, allocs estimate: 2.

which is also much faster than whatever CUFFT.plan_fft is doing.

So my question is really: what is the best practice for performing FFTs with CUFFT when performance is the first priority? I need to perform similar FFTs many times, so ideally I want to use an fft plan and mul!, just as with FFTW.
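
Concretely, the pattern I have in mind is the one I use with FFTW on the CPU, just with a CUFFT plan (a sketch of the intended usage; whether mul! works with CUFFT plans is part of what I am asking):

using CUDA, CUDA.CUFFT
using LinearAlgebra: mul!

a_gpu = CUDA.rand(ComplexF32, 128, 128, 128)
out   = similar(a_gpu)

fp = plan_fft(a_gpu)        # build the plan once ...
for _ in 1:100              # ... and reuse it many times
    mul!(out, fp, a_gpu)    # apply into a preallocated output buffer
end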

This functionality could be available via cuFFT, if it is wrapped correctly.

This issue indicates that plan_fft is available: GitHub - JuliaGPU/CUDA.jl: CUDA programming in Julia.

You could create an issue like “Usage of plan_fft not documented”.

"""

Can you wrap this or that CUDA API?

If a certain API isn’t wrapped with some high-level functionality, you can always use the underlying C APIs which are always available as unexported methods. For example, you can access the CUDA driver library as cu prefixed, unexported functions like CUDA.cuDriverGetVersion. Similarly, vendor libraries like CUBLAS are available through their exported submodule handles, e.g., CUBLAS.cublasGetVersion_v2.

Any help on designing or implementing high-level wrappers for this low-level functionality is greatly appreciated, so please consider contributing your uses of these APIs on the respective repositories.
"""
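
For example, the two functions named in that quote can be called directly; a minimal sketch, assuming the usual Ref-based signatures of the auto-generated wrappers:

using CUDA

ver = Ref{Cint}()
CUDA.cuDriverGetVersion(ver)   # low-level driver API: cuDriverGetVersion(int *)
@show ver[]

blasver = Ref{Cint}()
CUDA.CUBLAS.cublasGetVersion_v2(CUDA.CUBLAS.handle(), blasver)
@show blasver[]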

The OP’s confusion is indeed someone else’s fault.

The main documentation for plan_fft lives in AbstractFFTs, but it includes details about flags that are only appropriate for FFTW and faithful mimics. We did not add specific documentation for CUFFT.plan_fft, but that would be otiose if the generic docs were proper.

The presence of a keyword argument changes the dispatch in a confusing way. Ideally CUFFT.plan_fft should be invoked here and throw an error, but enforcing that seems like a surprising burden for library authors.

Edit to clarify: If CUDA and AbstractFFTs are loaded (but not FFTW), then attempting to provide a flags keyword does generate the appropriate method error. But if FFTW is loaded, then the keyword forces construction of an FFTW (CPU) plan, and the CuArray is silently converted to an ordinary Array by the product method and passed to FFTW. Hence the slowness in the OP. Would this be prevented by trapping on unrecognized keywords in the CUFFT method?
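
A quick way to see which backend a plan ended up on is to inspect its type; for example (matching the REPL output shown earlier in this thread):

using FFTW, CUDA, CUDA.CUFFT

a_gpu = CUDA.rand(Float32, 128, 128, 128)

typeof(plan_fft(a_gpu))                      # a CUFFT plan: the transform stays on the GPU
typeof(plan_fft(a_gpu, flags=FFTW.MEASURE))  # an FFTW.cFFTWPlan: the work happens on the CPU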
