I am working on a simulation whose bottleneck is a large number of FFT-based convolutions performed on the GPU. I wanted to see how the FFTs from CUDA.jl compare with one of the bigger Python GPU libraries, CuPy. I was surprised to see that CUDA.jl's FFTs were slower than CuPy's for moderately sized arrays. Here is the Julia code I was benchmarking:
```julia
using CUDA
using CUDA.CUFFT
using BenchmarkTools

A = CUDA.rand(Float32, 200, 200)

function fft_func(A)
    return fft(A)
end

@benchmark @CUDA.sync fft_func(A)
```
Here is the Python code I used. For profiling CuPy, I used their recommended method:
```python
import cupyx.scipy.fft as cufft
import cupy as cp
from cupyx.profiler import benchmark

A = cp.random.random((200, 200)).astype(cp.float32)

def fft_func(A):
    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=10000))
```
It's a similar story for other FFTs such as rfft and ifft, and for FFTs with plans.
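For concreteness, the planned variant looks roughly like the sketch below (a minimal illustration rather than my exact benchmark; the `out` buffer and the `mul!` call are just one way to apply a plan without timing the allocation of the result):

```julia
using CUDA, FFTW, BenchmarkTools
using LinearAlgebra: mul!

A   = CUDA.rand(ComplexF32, 200, 200)  # complex input so the forward plan applies directly
p   = plan_fft(A)                      # the cuFFT plan is created once, outside the timed region
out = similar(A)                       # preallocated output buffer

# Only the execution of the planned transform is timed here.
@benchmark @CUDA.sync mul!($out, $p, $A)
```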
Is there anything I could do to improve the performance of the Julia FFTs to be more competitive with CuPy? If there isn't, I would still be interested in understanding why there can be such a difference in performance if, presumably, both libraries are calling out to the same underlying cuFFT library.
That's not great. Sadly I'm not too familiar with FFT APIs, so I don't have any initial thoughts on what could be wrong. I'd start by checking the API calls we perform and seeing if there's anything unexpected in there (if possible, comparing them to the API calls Python makes). Since cuFFT doesn't have a logger, you can do this by devving CUDA.jl, opening `libcufft.jl`, and changing every `ccall` to `@debug_ccall`. That will generate a trace of API calls in your terminal. Since these API calls are supposed to be similar to what you would do with FFTW, maybe some people here will notice what could be wrong.
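Concretely, that workflow would be something like the sketch below (the path details are from memory and may differ between CUDA.jl versions):

```julia
# Get a local, editable checkout of CUDA.jl that Julia loads instead of the
# registered release; by default it ends up under ~/.julia/dev/CUDA.
using Pkg
Pkg.develop("CUDA")

# Then open the cuFFT wrapper file mentioned above (libcufft.jl, under the
# lib/cufft directory of that checkout), replace each `ccall` with
# `@debug_ccall`, restart Julia, and rerun the benchmark: every cuFFT API
# call will then be traced in the terminal.
```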
Alternatively, if you are familiar with CUDA, running both under NSight Systems might also reveal something (e.g., it might be that the cuFFT kernels are just as fast, but that we have some inefficiency in our wrappers causing what I presume would have to be a slowdown on the order of 100 µs).
If you don’t manage to resolve this, please make sure to open an issue on the CUDA.jl repo. Having a look with NSight is something I could do.
Since you are using real numbers as your input, I wonder if CuPy is automatically applying an RFFT routine instead of a complex FFT. In that case it would only have to calculate half as many Fourier coefficients (due to Hermitian symmetry), which would fit the observed speed-up.
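To make that concrete, here is a quick size comparison for the 200×200 input from the first post (my own sketch, just to show how much smaller the real-to-complex output is):

```julia
using CUDA, FFTW  # the fft/rfft generics dispatch to cuFFT for CuArrays

A = CUDA.rand(Float32, 200, 200)

size(fft(A))   # (200, 200): the full complex spectrum
size(rfft(A))  # (101, 200): only N÷2 + 1 coefficients along the first dimension,
               # the rest being redundant by Hermitian symmetry for real inputs
```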
Yeah, the arrays are pretty small, and FFTW, for example, always copies a real input array to a complex array, which costs an extra allocation.
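A quick way to see that extra copy on the GPU side (a sketch for the 200×200 case; `CUDA.@time` reports GPU allocations along with the runtime):

```julia
using CUDA, FFTW

A = CUDA.rand(Float32, 200, 200)     # real input: converted to a complex array first
B = CUDA.rand(ComplexF32, 200, 200)  # already complex: no conversion copy needed

CUDA.@time fft(A);  # expect an extra GPU allocation here for the real-to-complex copy
CUDA.@time fft(B);
```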
For 4000×4000, Julia seems to be slightly faster for the complex array:
```
╰─➤ python cupy_bench.py
fft_func : CPU: 49.242 us +/- 3.073 (min: 45.263 / max: 94.069) us  GPU-0: 3974.743 us +/- 65.493 (min: 3913.984 / max: 5259.232) us
fft_func : CPU: 49.422 us +/- 4.618 (min: 44.617 / max: 108.146) us  GPU-0: 3973.748 us +/- 42.647 (min: 3918.848 / max: 4378.624) us
True
```
```julia
julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 831 samples with 1 evaluation.
 Range (min … max):  4.996 ms … 36.608 ms  ┊ GC (min … max): 0.00% … 2.28%
 Time  (median):     5.639 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.990 ms ±  2.002 ms  ┊ GC (mean ± σ):  1.39% ± 3.18%

  ▄▂ ▆█▅
  ███████▁▅▅▁▁▅▅▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▅▄▅▇▇▅▅▅▄▅ ▇
  5 ms         Histogram: log(frequency) by time      13.4 ms <

 Memory estimate: 7.08 KiB, allocs estimate: 124.

julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 1038 samples with 1 evaluation.
 Range (min … max):  3.685 ms … 30.347 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.678 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.797 ms ±  1.765 ms  ┊ GC (mean ± σ):  0.84% ± 2.47%

  ▅▆▆▄▅▆█▆▂
  █████████▅▄▅▅▄▅▅▆▆▄▅▅▅▅▄▁▅▁▄▅▁▅▁▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▄▁▄▅▁▄▅▄▁▁▄ █
  3.69 ms      Histogram: log(frequency) by time     13.7 ms <

 Memory estimate: 3.59 KiB, allocs estimate: 64.
```
For 1000×1000, Julia is again a little slower:
```
─➤ python cupy_bench.py
fft_func : CPU: 42.674 us +/- 2.080 (min: 40.792 / max: 52.721) us  GPU-0: 172.429 us +/- 1.735 (min: 168.960 / max: 180.224) us
fft_func : CPU: 42.063 us +/- 0.844 (min: 40.640 / max: 45.095) us  GPU-0: 171.843 us +/- 1.971 (min: 168.960 / max: 181.056) us
True
```
```julia
julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  214.694 μs …  27.955 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     294.872 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   341.049 μs ± 643.172 μs  ┊ GC (mean ± σ):  0.93% ± 1.17%

              ▆█▅▂
  ▂▂▂▂▂▂▂▂▂▂▂▂▃█████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▂▂▂ ▃
  215 μs          Histogram: frequency by time          521 μs <

 Memory estimate: 4.00 KiB, allocs estimate: 72.

julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  190.666 μs …  27.762 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     273.990 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   344.191 μs ± 855.073 μs  ┊ GC (mean ± σ):  0.54% ± 0.79%

       ▇█▂
  ▂▂▂▃▃▆███▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▂ ▂
  191 μs          Histogram: frequency by time          890 μs <

 Memory estimate: 640 bytes, allocs estimate: 14.
```
So I'm not sure what's going on, but I think these benchmarks indicate that one would need to dig deeper into CUDA, memory allocation, and cuFFT…
Thank you all for your insight! I'll start by looking into the memory allocation that happens during the FFTs and report back. Hopefully I can get that done sometime this week.
I apologize if I'm missing something obvious, but in your 4000×4000 benchmark doesn't the complex array still lose to CuPy? The median and mean times are ~4.7 ms, versus the CuPy average of ~4 ms.
I apologize for the delay. I've tried looking through the code that CuPy and CUDA.jl use to access cuFFT, but I don't see any obvious reason why there would be such a performance difference. I'll open an issue on the CUDA.jl repo.
I've got no clue what's going on, but look at the order of my commands. After planning several times without any other change, performance suddenly increases by a factor of ~5–10. Is that some compiler optimization going on?

During the slow runs I observe ~10% volatile GPU utilization; during the faster runs, around 50%.
```julia
(@v1.8) pkg> activate Documents/julia_playground/
  Activating project at `~/Documents/julia_playground`

julia> using CUDA, FFTW

julia> using BenchmarkTools

julia> A = CUDA.rand(Float32, 200, 200);

julia> @benchmark CUDA.@sync begin
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  161.011 μs … 844.039 μs  ┊ GC (min … max):  0.00% … 90.94%
 Time  (median):     260.528 μs               ┊ GC (median):    84.54%
 Time  (mean ± σ):   261.357 μs ±  11.484 μs  ┊ GC (mean ± σ):  84.45% ±  1.44%

                                                         ▁█▃
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃████▄▂▂ ▂
  161 μs          Histogram: frequency by time          270 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> p = plan_fft(A);

julia> @benchmark CUDA.@sync begin
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  201.568 μs … 968.392 μs  ┊ GC (min … max):  0.00% … 92.38%
 Time  (median):     284.763 μs               ┊ GC (median):    85.11%
 Time  (mean ± σ):   284.128 μs ±  13.882 μs  ┊ GC (mean ± σ):  85.18% ±  1.33%

                                              ▁▆▁▃  ▇█▄▄
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂████▇▆████▇▃▃▂▂ ▃
  202 μs          Histogram: frequency by time          298 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> p = plan_fft(A);

julia> p = plan_fft(A)
CUFFT complex forward plan for 200×200 CuArray of ComplexF32

julia> @benchmark CUDA.@sync begin
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   28.673 μs …   1.308 ms  ┊ GC (min … max):  0.00% … 93.62%
 Time  (median):     224.756 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   156.725 μs ± 127.727 μs  ┊ GC (mean ± σ):  77.48% ± 42.86%

  █                                                          ▃
  █▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▆▃ ▂
  28.7 μs         Histogram: frequency by time          292 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> @benchmark CUDA.@sync begin
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   28.513 μs … 961.339 μs  ┊ GC (min … max):  0.00% … 92.67%
 Time  (median):     126.466 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   156.494 μs ± 126.913 μs  ┊ GC (mean ± σ):  77.59% ± 42.92%

  █                                                          ▃
  █▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅█▄ ▂
  28.5 μs         Histogram: frequency by time          289 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.
```
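One caveat about the session above: `fft($A)` never uses the plan `p`; `fft` constructs its own plan internally on each call. To benchmark the planned path itself, the plan has to be applied explicitly, and since `p` is a complex-input plan (see the plan display above), the real array needs converting once outside the timed region. A rough sketch:

```julia
# Convert the real input once, outside the benchmark, to match the complex plan.
Ac = ComplexF32.(A)

# Time only the application of the existing plan.
@benchmark CUDA.@sync $p * $Ac
```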