CuPy CuFFT ~2x faster than CUDA.jl CuFFT

I am working on a simulation whose bottleneck is a large number of FFT-based convolutions performed on the GPU. I wanted to see how the FFTs from CUDA.jl compare with those of one of the bigger Python GPU libraries, CuPy. I was surprised to see that the CUDA.jl FFTs were slower than CuPy's for moderately sized arrays. Here is the Julia code I was benchmarking:

using CUDA 
using CUDA.CUFFT 
using BenchmarkTools

A = CUDA.rand(Float32, 200,200)

function fft_func(A)
    return fft(A) 
end

@benchmark @CUDA.sync fft_func(A)

Here is the Python code I used. For profiling CuPy, I used their recommended method (cupyx.profiler.benchmark).

import cupyx.scipy.fft as cufft
import cupy as cp 
from cupyx.profiler import benchmark

A = cp.random.random((200,200)).astype(cp.float32)

def fft_func(A):
    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=10000))

The Julia and Python results were posted as screenshots (not reproduced here); the CUDA.jl timings came out roughly 2x slower than CuPy's.

It's a similar story for other FFTs like rfft and ifft, and for FFTs with precomputed plans.

Is there anything I could do to improve the performance of the Julia FFTs to be more competitive with CuPy? If there isn't, I would still be interested in understanding why there can be such a difference in performance when, presumably, both libraries are calling out to NVIDIA's cuFFT.

Thank you all for your time!


I ran your code and got the same result: CUDA.jl is ~2x slower. It would be nice to get this sorted out.

That’s not great. Sadly I’m not too familiar with FFT APIs, so I don’t have any initial thoughts on what could be wrong. I’d start with checking the API calls we perform and see if there’s anything unexpected in there (if possible, comparing to the API calls Python makes). Since CUFFT doesn’t have a logger, you can do this by devving CUDA.jl, opening libcufft.jl and changing every ccall to @debug_ccall. That will generate a trace of API calls in your terminal. Since these API calls are supposed to be similar to what you would do with FFTW, maybe some people here will notice what could be wrong.

Alternatively, if you are familiar with CUDA, running both under NSight Systems might also reveal something (e.g., it might be that the CUFFT kernels are just as fast, but that we have some inefficiency in our wrappers causing what I presume would need to be a 100us slowdown).
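
For what it's worth, a rough sketch of what that could look like from the Julia side (a sketch assuming the process is launched under NSight Systems, e.g. nsys profile julia script.jl; depending on the CUDA.jl version, CUDA.@profile either marks the region for an external profiler or runs an integrated one):

using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)
fft(A)  # warm up: plan creation and compilation should not end up in the trace

CUDA.@profile begin
    for _ in 1:100
        fft(A)
    end
end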

If you don’t manage to resolve this, please make sure to open an issue on the CUDA.jl repo. Having a look with NSight is something I could do.


Since you are using real numbers as your input, I wonder if CuPy is automatically applying an RFFT routine instead of a complex FFT. In that case it would only have to calculate half as many Fourier coefficients (due to Hermitian symmetry), which fits with the observed speed-up.
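
(For reference, the size difference is easy to see in Julia; a minimal sketch with the same 200×200 array as above.)

using CUDA
using CUDA.CUFFT

A = CUDA.rand(Float32, 200, 200)
size(fft(A))    # (200, 200): the full complex-to-complex spectrum
size(rfft(A))   # (101, 200): only N÷2+1 coefficients along the first dimension, by Hermitian symmetry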

That doesn't seem to be the case (explicitly invoking rfftn yields a further speed-up of about 2x).

However, with a complex64 input in Python (equivalent to ComplexF32), the gap compared to Julia is even larger!

julia> using CUDA

julia> using CUDA.CUFFT

julia> using BenchmarkTools

julia> A = CUDA.rand(Float32, 200,200)
200×200 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 

julia> B = CUDA.rand(ComplexF32, 200,200)
200×200 CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer}:

julia> function fft_func(A)
           return fft(A) 
       end
fft_func (generic function with 1 method)

julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   98.100 μs …  64.329 ms  ┊ GC (min … max): 0.00% … 10.07%
 Time  (median):     101.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   108.737 μs ± 642.544 μs  ┊ GC (mean ± σ):  0.60% ±  0.10%

        ▂▅█▇▆█▇▆▅▄▃▂▂▁▁                                          
  ▁▁▂▃▅▇████████████████▇▇▆▅▅▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  98.1 μs          Histogram: frequency by time          112 μs <

 Memory estimate: 4.00 KiB, allocs estimate: 72.

julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  85.190 μs …  3.431 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.794 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.531 μs ± 36.505 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▃▆█▇▄▃                                                 
  ▁▁▂▅█████████▆▄▃▃▂▂▂▂▄▄▆▆▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  85.2 μs         Histogram: frequency by time         100 μs <

 Memory estimate: 640 bytes, allocs estimate: 14.

The corresponding Python script and its output:

import cupyx.scipy.fft as cufft
import cupy as cp  
from cupyx.profiler import benchmark

A = cp.ones((200,200)).astype(cp.float32)
B = A.astype(cp.complex64)

def fft_func(A):
    return cufft.fftn(A)

print(benchmark(fft_func, (A,), n_repeat=1000))
print(benchmark(fft_func, (B,), n_repeat=1000))

print(cp.all(cp.isclose(fft_func(B), fft_func(A))))



╰─➤  python cupy_bench.py    
fft_func            :    CPU:   42.536 us   +/- 1.695 (min:   40.738 / max:   58.292) us     GPU-0:   45.728 us   +/- 1.902 (min:   43.200 / max:   62.496) us
fft_func            :    CPU:   21.589 us   +/- 1.234 (min:   20.202 / max:   43.247) us     GPU-0:   25.755 us   +/-20.762 (min:    9.760 / max:  633.856) us
True

The fact that complex arrays lead to an additional speed up makes me think that allocations are influencing the measurement…


Yeah, the sizes of the arrays are pretty small, and FFTW, for example, always copies a real input array into a complex array first, which costs an extra allocation.
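
One way to probe that (just a sketch; I haven't checked whether CUDA.jl's real-input path actually works like this) would be to time the real-to-complex conversion on its own:

using CUDA, CUDA.CUFFT, BenchmarkTools

A = CUDA.rand(Float32, 200, 200)

@benchmark CUDA.@sync complex.($A)       # cost of materializing a ComplexF32 copy
@benchmark CUDA.@sync fft(complex.($A))  # explicit conversion followed by a complex-to-complex FFT
@benchmark CUDA.@sync fft($A)            # whatever the real-input path does internally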

For 4000×4000, Julia seems to be slightly faster for the complex array:

╰─➤ python cupy_bench.py
fft_func           :   CPU:  49.242 us  +/- 3.073 (min:  45.263 / max:  94.069) us    GPU-0: 3974.743 us  +/-65.493 (min: 3913.984 / max: 5259.232) us
fft_func           :   CPU:  49.422 us  +/- 4.618 (min:  44.617 / max: 108.146) us    GPU-0: 3973.748 us  +/-42.647 (min: 3918.848 / max: 4378.624) us
True


julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 831 samples with 1 evaluation.
 Range (min … max): 4.996 ms … 36.608 ms ┊ GC (min … max): 0.00% … 2.28%
 Time (median):    5.639 ms             ┊ GC (median):   0.00%
 Time (mean ± σ):  5.990 ms ±  2.002 ms ┊ GC (mean ± σ): 1.39% ± 3.18%
 ▄▂ ▆█▅
 ███████▁▅▅▁▁▅▅▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▅▄▅▇▇▅▅▅▄▅ ▇
 5 ms        Histogram: log(frequency) by time    13.4 ms <
 Memory estimate: 7.08 KiB, allocs estimate: 124.
julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 1038 samples with 1 evaluation.
 Range (min … max): 3.685 ms … 30.347 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    4.678 ms             ┊ GC (median):   0.00%
 Time (mean ± σ):  4.797 ms ±  1.765 ms ┊ GC (mean ± σ): 0.84% ± 2.47%
 ▅▆▆▄▅▆█▆▂
 █████████▅▄▅▅▄▅▅▆▆▄▅▅▅▅▄▁▅▁▄▅▁▅▁▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▄▁▄▅▁▄▅▄▁▁▄ █
 3.69 ms     Histogram: log(frequency) by time    13.7 ms <
 Memory estimate: 3.59 KiB, allocs estimate: 64.

For 1000×1000, Julia is again a little slower:

─➤ python cupy_bench.py
fft_func           :   CPU:  42.674 us  +/- 2.080 (min:  40.792 / max:  52.721) us    GPU-0: 172.429 us  +/- 1.735 (min: 168.960 / max: 180.224) us
fft_func           :   CPU:  42.063 us  +/- 0.844 (min:  40.640 / max:  45.095) us    GPU-0: 171.843 us  +/- 1.971 (min: 168.960 / max: 181.056) us
True

julia> @benchmark @CUDA.sync fft_func(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 214.694 μs …  27.955 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    294.872 μs              ┊ GC (median):   0.00%
 Time (mean ± σ):  341.049 μs ± 643.172 μs ┊ GC (mean ± σ): 0.93% ± 1.17%
               ▆█▅▂
 ▂▂▂▂▂▂▂▂▂▂▂▂▃█████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▂▂▂ ▃
 215 μs          Histogram: frequency by time         521 μs <
 Memory estimate: 4.00 KiB, allocs estimate: 72.
julia> @benchmark @CUDA.sync fft_func(B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 190.666 μs …  27.762 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median):    273.990 μs              ┊ GC (median):   0.00%
 Time (mean ± σ):  344.191 μs ± 855.073 μs ┊ GC (mean ± σ): 0.54% ± 0.79%
       ▇█▂
 ▂▂▂▃▃▆███▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▂ ▂
 191 μs          Histogram: frequency by time         890 μs <
 Memory estimate: 640 bytes, allocs estimate: 14.

So I'm not sure what's going on, but I think these benchmarks indicate that one would need to dig deeper into CUDA.jl, memory allocation, and cuFFT…
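
As a small first step in that direction, CUDA.@time reports GPU-side allocations alongside the elapsed time; a sketch (array size picked arbitrarily):

using CUDA, CUDA.CUFFT

A = CUDA.rand(Float32, 1000, 1000)
fft(A)             # warm up so plan creation and compilation are not measured

CUDA.@time fft(A)  # prints elapsed time plus CPU and GPU allocation counts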


Thank you all for your insight! I'll start by looking into the memory allocation that happens during the FFTs and report back. Hopefully I can get that done sometime this week.


I apologize if I'm missing something obvious, but in your 4000×4000 benchmark doesn't the complex array still lose to CuPy? The median and mean times are ~4.7 ms vs. the CuPy average of ~4 ms.

Looking at the min, CUDA.jl seems a little better.

But yes, there does seem to be something odd.

I apologize for the delay. I've looked through the code that CuPy and CUDA.jl use to access cuFFT, but I don't see any obvious reason for the performance difference. I'll open an issue on the CUDA.jl repo.


Seems to be solved by @maleadt!


BTW, did anyone ever compare fft with the planned version?

The difference is huge!

Is such a big difference expected, @maleadt?

julia> @benchmark begin 
                  @CUDA.sync p = plan_fft($A)
                  @CUDA.sync p * $A
              end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.596 μs … 622.565 μs  ┊ GC (min … max):  0.00% … 61.35%
 Time  (median):     42.259 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   58.095 μs ±  88.510 μs  ┊ GC (mean ± σ):  19.94% ± 11.75%

  █                                                          ▂ ▁
  █▇▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█ █
  40.6 μs       Histogram: log(frequency) by time       559 μs <

 Memory estimate: 10.05 KiB, allocs estimate: 183.


julia> @benchmark @CUDA.sync begin 
           fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  190.988 μs … 506.458 μs  ┊ GC (min … max):  0.00% … 74.59%
 Time  (median):     384.980 μs               ┊ GC (median):    88.92%
 Time  (mean ± σ):   385.639 μs ±   5.342 μs  ┊ GC (mean ± σ):  88.91% ±  0.97%

                                                         █▅▄     
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃███▅▃▂ ▂
  191 μs           Histogram: frequency by time          400 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> p = plan_fft(A);

julia> @benchmark @CUDA.sync begin 
           $p * $A
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.695 μs … 407.052 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.029 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.347 μs ±   4.893 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▂▃▅█▇▆▄▆▅▅▅▄▄▃▄▂▁▁                                    
  ▁▁▁▂▃▄▅▇██████████████████▇█▇▇▇▇▆▆▄▄▄▃▃▃▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁ ▄
  23.7 μs         Histogram: frequency by time         31.2 μs <

 Memory estimate: 4.72 KiB, allocs estimate: 84.

julia> @benchmark @CUDA.sync plan_fft($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.848 μs … 518.240 μs  ┊ GC (min … max):  0.00% … 82.55%
 Time  (median):     15.769 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   26.950 μs ±  60.849 μs  ┊ GC (mean ± σ):  33.93% ± 13.97%

  █▃                                                         ▂ ▁
  ██▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▃█ █
  14.8 μs       Histogram: log(frequency) by time       365 μs <

 Memory estimate: 5.33 KiB, allocs estimate: 99.


I've got no clue what's going on, but look at the order of my commands. After calling plan_fft several times without changing anything else, the fft benchmark suddenly gets faster by a factor of ~5-10. Is that some compiler optimization going on?

During the poorly performing runs I observe ~10% volatile GPU utilization; during the better runs, around 50%.

(@v1.8) pkg> activate Documents/julia_playground/
  Activating project at `~/Documents/julia_playground`



julia> using CUDA, FFTW

julia> using BenchmarkTools


julia> A = CUDA.rand(Float32, 200,200);

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  161.011 μs … 844.039 μs  ┊ GC (min … max):  0.00% … 90.94%
 Time  (median):     260.528 μs               ┊ GC (median):    84.54%
 Time  (mean ± σ):   261.357 μs ±  11.484 μs  ┊ GC (mean ± σ):  84.45% ±  1.44%

                                                        ▁█▃      
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃████▄▂▂ ▂
  161 μs           Histogram: frequency by time          270 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> p = plan_fft(A);

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  201.568 μs … 968.392 μs  ┊ GC (min … max):  0.00% … 92.38%
 Time  (median):     284.763 μs               ┊ GC (median):    85.11%
 Time  (mean ± σ):   284.128 μs ±  13.882 μs  ┊ GC (mean ± σ):  85.18% ±  1.33%

                                                ▁▆▁▃  ▇█▄▄       
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂████▇▆████▇▃▃▂▂ ▃
  202 μs           Histogram: frequency by time          298 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> p = plan_fft(A);

julia> p = plan_fft(A)
CUFFT complex forward plan for 200×200 CuArray of ComplexF32

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   28.673 μs …   1.308 ms  ┊ GC (min … max):  0.00% … 93.62%
 Time  (median):     224.756 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   156.725 μs ± 127.727 μs  ┊ GC (mean ± σ):  77.48% ± 42.86%

  █                                                         ▃    
  █▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▆▃ ▂
  28.7 μs          Histogram: frequency by time          292 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.

julia> @benchmark CUDA.@sync begin 
           @CUDA.sync fft($A)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   28.513 μs … 961.339 μs  ┊ GC (min … max):  0.00% … 92.67%
 Time  (median):     126.466 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   156.494 μs ± 126.913 μs  ┊ GC (mean ± σ):  77.59% ± 42.92%

  █                                                          ▃   
  █▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅█▄ ▂
  28.5 μs          Histogram: frequency by time          289 μs <

 Memory estimate: 5.48 KiB, allocs estimate: 103.



It looks like garbage collection went from 85% to 0% in your benchmarks…

Do you think so?

Quoting the benchmark output:

Range (min … max):  201.568 μs … 968.392 μs  ┊ GC (min … max):  0.00% … 92.38%

In the min runs, the GC time was always 0.00%.
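
For completeness, here is a sketch of one way to reduce the influence of GC and of output allocations on such measurements (assuming plan_fft! is available for complex CuArrays):

using CUDA, CUDA.CUFFT, BenchmarkTools

A = CUDA.rand(ComplexF32, 200, 200)
p = plan_fft!(A)   # in-place plan: no output array is allocated per transform

GC.gc()            # collect Julia-side garbage before timing
CUDA.reclaim()     # release cached GPU memory back to the driver

@benchmark CUDA.@sync ($p * $A)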