CuPy CuFFT ~2x faster than CUDA.jl CuFFT

I am working on a simulation whose bottleneck is many FFT-based convolutions performed on the GPU. I wanted to see how FFTs from CUDA.jl compare with one of the bigger Python GPU libraries, CuPy. I was surprised to see that CUDA.jl’s FFTs were slower than CuPy’s for moderately sized arrays. Here is the Julia code I was benchmarking:

using CUDA 
using CUDA.CUFFT 
using BenchmarkTools

A = CUDA.rand(Float32, 200,200)

function fft_func(A)
    return fft(A) 
end

@benchmark @CUDA.sync fft_func(A)

Here is the Python code I used. For profiling CuPy, I used their recommended method:

import cupyx.scipy.fft as cufft
import cupy as cp 
from cupyx.profiler import benchmark

A = cp.random.random((200,200)).astype(cp.float32)

def fft_func(A):
    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=10000))

The Julia code gives me this result:

[screenshot of the Julia @benchmark output]

The Python code gives me this:

[screenshot of the CuPy benchmark output]

It’s a similar story for other FFTs like rfft and ifft, and for FFTs with plans.
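For reference, this is roughly what I mean by an FFT with a plan (a sketch along the lines of my benchmark, using the AbstractFFTs plan interface that CUDA.CUFFT implements; details may differ from my exact script):

using CUDA
using CUDA.CUFFT
using BenchmarkTools

B = CUDA.rand(ComplexF32, 200, 200)

# build the cuFFT plan once, then time only its application
p = plan_fft(B)

@benchmark @CUDA.sync $p * $B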

Is there anything I could do to improve the performance of the Julia FFTs to be more competitive with CuPy? If there isn’t, I would still be interested in understanding why there can be such a difference in performance if, presumably, both libraries are calling out to the same cuFFT library.

Thank you all for your time!

2 Likes

I ran your code and got the same result, CUDA.jl being ~2x slower. Would be nice to get this sorted out.

That’s not great. Sadly I’m not too familiar with FFT APIs, so I don’t have any initial thoughts on what could be wrong. I’d start by checking the API calls we perform and seeing if there’s anything unexpected in there (if possible, comparing against the API calls Python makes). Since cuFFT doesn’t have a logger, you can do this by devving CUDA.jl, opening libcufft.jl, and changing every ccall to @debug_ccall. That will generate a trace of API calls in your terminal. Since these API calls are supposed to be similar to what you would do with FFTW, maybe some people here will notice what could be wrong.
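Concretely, something like this (the file location is from memory and may differ between CUDA.jl versions):

using Pkg
Pkg.develop("CUDA")   # check out a local, editable copy (by default under ~/.julia/dev/CUDA)

# In the checked-out sources, open the cuFFT wrapper file (libcufft.jl; its exact
# location varies between versions) and replace each `ccall` with `@debug_ccall`.
# After restarting Julia, every cuFFT API call made by fft() is printed to the
# terminal and can be compared against the calls CuPy makes.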

Alternatively, if you are familiar with CUDA, running both under NSight Systems might also reveal something (e.g., it might be that the CUFFT kernels are just as fast, but that we have some inefficiency in our wrappers causing what I presume would need to be a 100us slowdown).
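For example, a rough sketch of how that could be set up (depending on the CUDA.jl version, CUDA.@profile either runs an integrated profiler or just delimits the region captured by the external profiler):

# launch under NSight Systems, e.g.:  nsys profile --trace=cuda julia fft_profile.jl
using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)
fft(A)                    # warm up: plan creation and compilation

CUDA.@profile begin       # mark the region of interest for the profiler
    for _ in 1:100
        fft(A)
    end
    CUDA.synchronize()
end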

If you don’t manage to resolve this, please make sure to open an issue on the CUDA.jl repo. Having a look with NSight is something I could do.

1 Like

Since you are using real numbers as your input, I wonder if CuPy is automatically applying an RFFT routine instead of a complex FFT. That would mean it only has to calculate half as many Fourier coefficients (due to Hermitian symmetry), which fits the observed speed-up.
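To illustrate what I mean by “half as many coefficients” (a quick sketch):

using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)

size(fft(A))    # (200, 200): the full complex spectrum
size(rfft(A))   # (101, 200): only the non-redundant half along the first dimension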

That doesn’t seem to be the case (explicitly invoking rfftn yields a further factor-of-2 speed-up).

However, with complex64 input in Python (equivalent to Julia’s ComplexF32), the speed-up over Julia is even larger!

julia> using CUDA

julia> using CUDA.CUFFT

julia> using BenchmarkTools

julia> A = CUDA.rand(Float32, 200,200)
200×200 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 

julia> B = CUDA.rand(ComplexF32, 200,200)
200×200 CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer}:

julia> function fft_func(A)
           return fft(A) 
       end
fft_func (generic function with 1 method)

julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   98.100 μs …  64.329 ms  ┊ GC (min … max): 0.00% … 10.07%
 Time  (median):     101.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   108.737 μs ± 642.544 μs  ┊ GC (mean ± σ):  0.60% ±  0.10%

        ▂▅█▇▆█▇▆▅▄▃▂▂▁▁                                          
  ▁▁▂▃▅▇████████████████▇▇▆▅▅▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  98.1 μs          Histogram: frequency by time          112 μs <

 Memory estimate: 4.00 KiB, allocs estimate: 72.

julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  85.190 μs …  3.431 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.794 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.531 μs ± 36.505 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▃▆█▇▄▃                                                 
  ▁▁▂▅█████████▆▄▃▃▂▂▂▂▄▄▆▆▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  85.2 μs         Histogram: frequency by time         100 μs <

 Memory estimate: 640 bytes, allocs estimate: 14.

import cupyx.scipy.fft as cufft
import cupy as cp  
from cupyx.profiler import benchmark

A = cp.ones((200,200)).astype(cp.float32)
B = A.astype(cp.complex64)

def fft_func(A):
    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=1000))
print(benchmark(fft_func, (B,), n_repeat=1000))

print(cp.all(cp.isclose(fft_func(B), fft_func(A))))



╰─➤  python cupy_bench.py    
fft_func            :    CPU:   42.536 us   +/- 1.695 (min:   40.738 / max:   58.292) us     GPU-0:   45.728 us   +/- 1.902 (min:   43.200 / max:   62.496) us
fft_func            :    CPU:   21.589 us   +/- 1.234 (min:   20.202 / max:   43.247) us     GPU-0:   25.755 us   +/-20.762 (min:    9.760 / max:  633.856) us
True
1 Like

The fact that complex arrays lead to an additional speed-up makes me think that allocations are influencing the measurement…
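One way to take allocations out of the measurement would be to apply a precomputed plan into a preallocated output buffer, along these lines (a sketch, assuming mul! is supported for CuFFT plans as it is for FFTW plans):

using CUDA
using CUDA.CUFFT
using BenchmarkTools
using LinearAlgebra: mul!

B = CUDA.rand(ComplexF32, 200, 200)
p = plan_fft(B)     # plan is created once, outside the benchmark
C = similar(B)      # preallocated output buffer

@benchmark @CUDA.sync mul!($C, $p, $B)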

1 Like

Yeah, the arrays are pretty small, and FFTW, for example, always copies a real input array into a complex array, which means an extra allocation.
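A quick sanity check for such an extra copy could be CUDA.@time, which reports GPU allocations (a sketch):

using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)
B = CUDA.rand(ComplexF32, 200, 200)

fft(A); fft(B)      # warm up: compilation and plan creation

CUDA.@time fft(A)   # real input: an extra GPU allocation if it is first copied to a complex array
CUDA.@time fft(B)   # complex input: only the output allocation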

For 4000×4000, Julia seems to be slightly faster for the complex array:

╰─➤ python cupy_bench.py
fft_func           :   CPU:  49.242 us  +/- 3.073 (min:  45.263 / max:  94.069) us    GPU-0: 3974.743 us  +/-65.493 (min: 3913.984 / max: 5259.232) us
fft_func           :   CPU:  49.422 us  +/- 4.618 (min:  44.617 / max: 108.146) us    GPU-0: 3973.748 us  +/-42.647 (min: 3918.848 / max: 4378.624) us
True


julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 831 samples with 1 evaluation.
 Range (min … max): 4.996 ms … 36.608 ms ┊ GC (min … max): 0.00% … 2.28%
 Time (median):    5.639 ms             ┊ GC (median):   0.00%
 Time (mean ± σ):  5.990 ms ±  2.002 ms ┊ GC (mean ± σ): 1.39% ± 3.18%
 ▄▂ ▆█▅
 ███████▁▅▅▁▁▅▅▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▅▄▅▇▇▅▅▅▄▅ ▇
 5 ms        Histogram: log(frequency) by time    13.4 ms <
 Memory estimate: 7.08 KiB, allocs estimate: 124.
julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 1038 samples with 1 evaluation.
 Range (min … max): 3.685 ms … 30.347 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    4.678 ms             ┊ GC (median):   0.00%
 Time (mean ± σ):  4.797 ms ±  1.765 ms ┊ GC (mean ± σ): 0.84% ± 2.47%
 ▅▆▆▄▅▆█▆▂
 █████████▅▄▅▅▄▅▅▆▆▄▅▅▅▅▄▁▅▁▄▅▁▅▁▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▄▁▄▅▁▄▅▄▁▁▄ █
 3.69 ms     Histogram: log(frequency) by time    13.7 ms <
 Memory estimate: 3.59 KiB, allocs estimate: 64.

For 1000×1000, Julia is again a little slower:

╰─➤ python cupy_bench.py
fft_func           :   CPU:  42.674 us  +/- 2.080 (min:  40.792 / max:  52.721) us    GPU-0: 172.429 us  +/- 1.735 (min: 168.960 / max: 180.224) us
fft_func           :   CPU:  42.063 us  +/- 0.844 (min:  40.640 / max:  45.095) us    GPU-0: 171.843 us  +/- 1.971 (min: 168.960 / max: 181.056) us
True

julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 214.694 μs …  27.955 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    294.872 μs              ┊ GC (median):   0.00%
 Time (mean ± σ):  341.049 μs ± 643.172 μs ┊ GC (mean ± σ): 0.93% ± 1.17%
               ▆█▅▂
 ▂▂▂▂▂▂▂▂▂▂▂▂▃█████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▂▂▂ ▃
 215 μs          Histogram: frequency by time         521 μs <
 Memory estimate: 4.00 KiB, allocs estimate: 72.
julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 190.666 μs …  27.762 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    273.990 μs              ┊ GC (median):   0.00%
 Time (mean ± σ):  344.191 μs ± 855.073 μs ┊ GC (mean ± σ): 0.54% ± 0.79%
       ▇█▂
 ▂▂▂▃▃▆███▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▂ ▂
 191 μs          Histogram: frequency by time         890 μs <
 Memory estimate: 640 bytes, allocs estimate: 14.

So I’m not sure what’s going on, but I think the benchmarks indicate that one would need to dig deeper into CUDA.jl, memory allocation, and cuFFT…

2 Likes

Thank you all for your insight! I’ll start by looking into the memory allocation that happens during the FFTs and report back. Hopefully, I can get that done sometime this week.

2 Likes

I apologize if I’m missing something obvious, but in your 4000×4000 benchmark doesn’t the complex array still lose to CuPy? The median and mean times are ~4.7 ms vs the CuPy average of ~4 ms.

Looking at the min, CUDA.jl seems a little better.

But yes, there does seem to be something odd.

I apologize for the delay. I’ve tried looking through the code that CuPy and CUDA.jl use to access cuFFT, but I don’t see any obvious reason why there would be this performance difference. I’ll open an issue on the CUDA.jl repo.

3 Likes