CuPy CuFFT ~2x faster than CUDA.jl CuFFT

I am working on a simulation whose bottleneck is a large number of FFT-based convolutions performed on the GPU. I wanted to see how FFTs from CUDA.jl compare with one of the bigger Python GPU libraries, CuPy. I was surprised to see that CUDA.jl's FFTs were slower than CuPy's for moderately sized arrays. Here is the Julia code I was benchmarking:

using CUDA 
using CUDA.CUFFT 
using BenchmarkTools

A = CUDA.rand(Float32, 200,200)

function fft_func(A)
    return fft(A) 
end

@benchmark @CUDA.sync fft_func(A)

Here is the Python code I used. For profiling CuPy, I used their recommended method.

import cupyx.scipy.fft as cufft
import cupy as cp 
from cupyx.profiler import benchmark

A = cp.random.random((200,200)).astype(cp.float32)

def fft_func(A):

    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=10000))

The Julia code gives me this result:

(screenshot of the Julia BenchmarkTools output)

The Python code gives me this:

(screenshot of the CuPy benchmark output)

It’s a similar story for other FFTs like rfft and ifft, and for FFTs with plans.

Is there anything I could do to improve the performance of the Julia FFTs to be more competitive with CuPy? If there isn’t, I would still be interested in understanding why there can be such a difference in performance when, presumably, both libraries are calling out to the same CUDA cuFFT library.

Thank you all for your time!


I ran your code and got the same result: CUDA.jl is ~2x slower. It would be nice to get this sorted out.

That’s not great. Sadly I’m not too familiar with FFT APIs, so I don’t have any initial thoughts on what could be wrong. I’d start by checking the API calls we perform and seeing if there’s anything unexpected in there (if possible, comparing against the API calls Python makes). Since CUFFT doesn’t have a logger, you can do this by devving CUDA.jl, opening libcufft.jl and changing every ccall to @debug_ccall. That will generate a trace of API calls in your terminal. Since these API calls are supposed to be similar to what you would do with FFTW, maybe some people here will notice what could be wrong.
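
In case anyone wants to try that tracing approach, here is a rough sketch of the workflow; the file path and macro name follow the CUDA.jl layout at the time and may differ in other versions:

using Pkg
Pkg.develop("CUDA")   # check out CUDA.jl locally (typically ~/.julia/dev/CUDA)

# Then edit ~/.julia/dev/CUDA/lib/cufft/libcufft.jl and replace the wrappers'
# `ccall(...)` invocations with `@debug_ccall(...)`; after restarting Julia,
# every CUFFT API call is printed to the terminal as it happens.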

Alternatively, if you are familiar with CUDA, running both under Nsight Systems might also reveal something (e.g., it might be that the CUFFT kernels are just as fast, but that we have some inefficiency in our wrappers causing what I presume would need to be a ~100 µs slowdown).
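
For the Nsight route, a minimal sketch of what that could look like, assuming the external-profiler behaviour of CUDA.@profile in the CUDA.jl versions of that time; the script would be launched under the profiler with something like `nsys profile julia script.jl`:

using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)
fft(A)                   # warm up: plan creation and compilation

CUDA.@profile begin      # mark the region of interest for the external profiler
    CUDA.@sync fft(A)
end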

If you don’t manage to resolve this, please make sure to open an issue on the CUDA.jl repo. Having a look with Nsight is something I could do.


Since you are using real numbers as your input, I wonder if CuPy is automatically applying an RFFT routine instead of a complex FFT. That means it would only have to calculate half as many Fourier coefficients (due to Hermitian symmetry), which fits with the observed speed-up.
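
For reference, the symmetry argument shows up directly in the output sizes (a quick sketch with CUDA.jl): rfft only returns the non-redundant half of the spectrum along the first dimension.

using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)
size(fft(A))    # (200, 200) -- full complex spectrum
size(rfft(A))   # (101, 200) -- only 200 ÷ 2 + 1 coefficients along dim 1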

That doesn’t seem to be the case (explicitly invoking rfftn yields another speed-up of a factor of 2).

However, with complex64 in Python (the equivalent of ComplexF32), CuPy’s advantage over CUDA.jl is even larger!

julia> using CUDA

julia> using CUDA.CUFFT

julia> using BenchmarkTools

julia> A = CUDA.rand(Float32, 200,200)
200×200 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 

julia> B = CUDA.rand(ComplexF32, 200,200)
200×200 CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer}:

julia> function fft_func(A)
           return fft(A) 
       end
fft_func (generic function with 1 method)

julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   98.100 μs …  64.329 ms  ┊ GC (min … max): 0.00% … 10.07%
 Time  (median):     101.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   108.737 μs ± 642.544 μs  ┊ GC (mean ± σ):  0.60% ±  0.10%

        ▂▅█▇▆█▇▆▅▄▃▂▂▁▁                                          
  ▁▁▂▃▅▇████████████████▇▇▆▅▅▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  98.1 μs          Histogram: frequency by time          112 μs <

 Memory estimate: 4.00 KiB, allocs estimate: 72.

julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  85.190 μs …  3.431 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.794 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.531 μs ± 36.505 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▃▆█▇▄▃                                                 
  ▁▁▂▅█████████▆▄▃▃▂▂▂▂▄▄▆▆▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  85.2 μs         Histogram: frequency by time         100 μs <

 Memory estimate: 640 bytes, allocs estimate: 14.
And the corresponding Python code:

import cupyx.scipy.fft as cufft
import cupy as cp  
from cupyx.profiler import benchmark

A = cp.ones((200,200)).astype(cp.float32)
B = A.astype(cp.complex64)

def fft_func(A):
    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=1000))
print(benchmark(fft_func, (B,), n_repeat=1000))

print(cp.all(cp.isclose(fft_func(B), fft_func(A))))



╰─➤  python cupy_bench.py    
fft_func            :    CPU:   42.536 us   +/- 1.695 (min:   40.738 / max:   58.292) us     GPU-0:   45.728 us   +/- 1.902 (min:   43.200 / max:   62.496) us
fft_func            :    CPU:   21.589 us   +/- 1.234 (min:   20.202 / max:   43.247) us     GPU-0:   25.755 us   +/-20.762 (min:    9.760 / max:  633.856) us
True

The fact that complex arrays lead to an additional speed-up makes me think that allocations are influencing the measurement…
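
A quick way to check that (a sketch): CUDA.@time reports CPU and GPU allocations separately, so an extra real-to-complex copy should show up as an additional GPU allocation for the real input.

using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)
B = CUDA.rand(ComplexF32, 200, 200)

CUDA.@time fft(A)   # real input: expect an extra GPU allocation for the complex copy
CUDA.@time fft(B)   # complex input: just the output buffer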


Yeah, the arrays are pretty small, and FFTW, for example, always copies a real input array to a complex array first, which is an extra allocation.
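
As an aside, here is a sketch of one way to avoid that per-call copy (and the output allocation) in a convolution loop: convert once and reuse an in-place plan.

using CUDA
using CUDA.CUFFT

A  = CUDA.rand(Float32, 200, 200)
Ac = ComplexF32.(A)    # one-off conversion to a complex buffer
p  = plan_fft!(Ac)     # in-place plan, reusable across iterations

p * Ac                 # transforms Ac in place, no new output allocation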

For 4000×4000, Julia seems to be slightly faster for the complex array:

╰─➤ python cupy_bench.py
fft_func           :   CPU:  49.242 us  +/- 3.073 (min:  45.263 / max:  94.069) us    GPU-0: 3974.743 us  +/-65.493 (min: 3913.984 / max: 5259.232) us
fft_func           :   CPU:  49.422 us  +/- 4.618 (min:  44.617 / max: 108.146) us    GPU-0: 3973.748 us  +/-42.647 (min: 3918.848 / max: 4378.624) us
True


julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 831 samples with 1 evaluation.
 Range (min … max): 4.996 ms … 36.608 ms ┊ GC (min … max): 0.00% … 2.28%
 Time (median):    5.639 ms             ┊ GC (median):   0.00%
 Time (mean ± σ):  5.990 ms ±  2.002 ms ┊ GC (mean ± σ): 1.39% ± 3.18%
 ▄▂ ▆█▅
 ███████▁▅▅▁▁▅▅▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▅▄▅▇▇▅▅▅▄▅ ▇
 5 ms        Histogram: log(frequency) by time    13.4 ms <
 Memory estimate: 7.08 KiB, allocs estimate: 124.
julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 1038 samples with 1 evaluation.
 Range (min … max): 3.685 ms … 30.347 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    4.678 ms             ┊ GC (median):   0.00%
 Time (mean ± σ):  4.797 ms ±  1.765 ms ┊ GC (mean ± σ): 0.84% ± 2.47%
 ▅▆▆▄▅▆█▆▂
 █████████▅▄▅▅▄▅▅▆▆▄▅▅▅▅▄▁▅▁▄▅▁▅▁▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▄▁▄▅▁▄▅▄▁▁▄ █
 3.69 ms     Histogram: log(frequency) by time    13.7 ms <
 Memory estimate: 3.59 KiB, allocs estimate: 64.

For 1000×1000, Julia is again a little slower:

─➤ python cupy_bench.py
fft_func           :   CPU:  42.674 us  +/- 2.080 (min:  40.792 / max:  52.721) us    GPU-0: 172.429 us  +/- 1.735 (min: 168.960 / max: 180.224) us
fft_func           :   CPU:  42.063 us  +/- 0.844 (min:  40.640 / max:  45.095) us    GPU-0: 171.843 us  +/- 1.971 (min: 168.960 / max: 181.056) us
True

julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 214.694 μs …  27.955 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    294.872 μs              ┊ GC (median):   0.00%
 Time (mean ± σ):  341.049 μs ± 643.172 μs ┊ GC (mean ± σ): 0.93% ± 1.17%
               ▆█▅▂
 ▂▂▂▂▂▂▂▂▂▂▂▂▃█████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▂▂▂ ▃
 215 μs          Histogram: frequency by time         521 μs <
 Memory estimate: 4.00 KiB, allocs estimate: 72.
julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 190.666 μs …  27.762 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    273.990 μs              ┊ GC (median):   0.00%
 Time (mean ± σ):  344.191 μs ± 855.073 μs ┊ GC (mean ± σ): 0.54% ± 0.79%
       ▇█▂
 ▂▂▂▃▃▆███▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▂ ▂
 191 μs          Histogram: frequency by time         890 μs <
 Memory estimate: 640 bytes, allocs estimate: 14.

So I’m not sure what’s going on, but I think these benchmarks indicate that one would need to dig deeper into CUDA.jl, its memory allocation, and the cuFFT wrappers…


Thank you all for your insight! I’ll start by looking into the memory allocation that happens during the FFTs and report back. Hopefully, I can get that done sometime this week.


I apologize if I’m missing something obvious, but in your 4000×4000 benchmark doesn’t the complex array still lose to CuPy? The median and mean times are ~4.7 ms vs the CuPy average of ~4 ms.

Looking at the min, CUDA.jl seems a little better.

But yes, there seems to be something odd.

I apologize for the delay. I’ve looked through the code that CuPy and CUDA.jl use to access cuFFT, but I don’t see any obvious reason for this performance difference. I’ll open an issue on the CUDA.jl repo.


Seems to be solved by @maleadt!


By the way, did anyone ever compare plain fft against the planned version (plan_fft)?

The difference is huge!

Is such a big difference expected, @maleadt?

julia> @benchmark begin 
                  @CUDA.sync p = plan_fft($A)
                  @CUDA.sync p * $A
              end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.596 μs … 622.565 μs  ┊ GC (min … max):  0.00% … 61.35%
 Time  (median):     42.259 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   58.095 μs ±  88.510 μs  ┊ GC (mean ± σ):  19.94% ± 11.75%

  █                                                          ▂ ▁
  █▇▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█ █
  40.6 μs       Histogram: log(frequency) by time       559 μs <

 Memory estimate: 10.05 KiB, allocs estimate: 183.


julia> @benchmark @CUDA.sync begin 
           fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  190.988 μs … 506.458 μs  ┊ GC (min … max):  0.00% … 74.59%
 Time  (median):     384.980 μs               ┊ GC (median):    88.92%
 Time  (mean ± σ):   385.639 μs ±   5.342 μs  ┊ GC (mean ± σ):  88.91% ±  0.97%

                                                         █▅▄     
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃███▅▃▂ ▂
  191 μs           Histogram: frequency by time          400 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.


julia> p = plan_fft(A);

julia> @benchmark @CUDA.sync begin 
           $p * $A
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.695 μs … 407.052 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.029 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.347 μs ±   4.893 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▂▃▅█▇▆▄▆▅▅▅▄▄▃▄▂▁▁                                    
  ▁▁▁▂▃▄▅▇██████████████████▇█▇▇▇▇▆▆▄▄▄▃▃▃▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁ ▄
  23.7 μs         Histogram: frequency by time         31.2 μs <

 Memory estimate: 4.72 KiB, allocs estimate: 84.

julia> @benchmark @CUDA.sync plan_fft($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.848 μs … 518.240 μs  ┊ GC (min … max):  0.00% … 82.55%
 Time  (median):     15.769 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   26.950 μs ±  60.849 μs  ┊ GC (mean ± σ):  33.93% ± 13.97%

  █▃                                                         ▂ ▁
  ██▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▃█ █
  14.8 μs       Histogram: log(frequency) by time       365 μs <

 Memory estimate: 5.33 KiB, allocs estimate: 99.


I’ve got no clue what’s going on, but look at the order of my commands below. After calling plan_fft several times without changing anything else, the fft benchmark suddenly gets faster by a factor of ~5–10. Is that some compiler optimization going on?

During the slow runs I observe ~10% volatile GPU utilization (in nvidia-smi); during the faster runs, around 50%.
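
If anyone wants to rule the GC and the CUDA memory pool in or out between such runs, here is a sketch of what one could check in between (output format depends on the CUDA.jl version):

using CUDA

GC.gc(true)            # force a full Julia GC pass
CUDA.reclaim()         # return cached GPU memory from CUDA.jl's pool to the driver
CUDA.memory_status()   # print how much GPU memory is currently in use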

(@v1.8) pkg> activate Documents/julia_playground/
  Activating project at `~/Documents/julia_playground`



julia> using CUDA, FFTW

julia> using BenchmarkTools


julia> A = CUDA.rand(Float32, 200,200);

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  161.011 μs … 844.039 μs  ┊ GC (min … max):  0.00% … 90.94%
 Time  (median):     260.528 μs               ┊ GC (median):    84.54%
 Time  (mean ± σ):   261.357 μs ±  11.484 μs  ┊ GC (mean ± σ):  84.45% ±  1.44%

                                                        ▁█▃      
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃████▄▂▂ ▂
  161 μs           Histogram: frequency by time          270 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> p = plan_fft(A);

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  201.568 μs … 968.392 μs  ┊ GC (min … max):  0.00% … 92.38%
 Time  (median):     284.763 μs               ┊ GC (median):    85.11%
 Time  (mean ± σ):   284.128 μs ±  13.882 μs  ┊ GC (mean ± σ):  85.18% ±  1.33%

                                                ▁▆▁▃  ▇█▄▄       
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂████▇▆████▇▃▃▂▂ ▃
  202 μs           Histogram: frequency by time          298 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> p = plan_fft(A);

julia> p = plan_fft(A)
CUFFT complex forward plan for 200×200 CuArray of ComplexF32

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   28.673 μs …   1.308 ms  ┊ GC (min … max):  0.00% … 93.62%
 Time  (median):     224.756 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   156.725 μs ± 127.727 μs  ┊ GC (mean ± σ):  77.48% ± 42.86%

  █                                                         ▃    
  █▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▆▃ ▂
  28.7 μs          Histogram: frequency by time          292 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   28.513 μs … 961.339 μs  ┊ GC (min … max):  0.00% … 92.67%
 Time  (median):     126.466 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   156.494 μs ± 126.913 μs  ┊ GC (mean ± σ):  77.59% ± 42.92%

  █                                                          ▃   
  █▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅█▄ ▂
  28.5 μs          Histogram: frequency by time          289 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.



It looks like garbage collection went from 85% to 0% in your benchmarks…

Do you think so?

Quoting the benchmark output:

Range (min … max):  201.568 μs … 968.392 μs  ┊ GC (min … max):  0.00% … 92.38%

In the min runs, the GC time was always 0.00%.
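
For reference, a sketch of one BenchmarkTools option that can help separate GC noise from the FFT time itself: gcsample=true runs the garbage collector before every sample.

using CUDA
using CUDA.CUFFT
using BenchmarkTools

A = CUDA.rand(Float32, 200, 200)

# run GC.gc() before each sample so CuArrays left over from earlier samples
# are not collected in the middle of a timed region
@benchmark (CUDA.@sync fft($A)) gcsample=true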