using CUDA
using CUDA.CUFFT
using BenchmarkTools: @benchmark
import FFTW
a_gpu = CUDA.rand(Float32, 128, 128, 128)
@benchmark CUDA.@sync fft($a_gpu)
This gives:
BenchmarkTools.Trial: 1744 samples with 1 evaluation.
Range (min … max): 2.390 ms … 116.107 ms ┊ GC (min … max): 0.00% … 96.93%
Time (median): 2.596 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.842 ms ± 2.762 ms ┊ GC (mean ± σ): 3.00% ± 3.53%
▄▆█▅ ▁ ▄▁
█▇████▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆▅▇▇█████▅▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▅▅▆▄▄▅▅ █
2.39 ms Histogram: log(frequency) by time 4.97 ms <
Memory estimate: 5.03 KiB, allocs estimate: 162.
Meanwhile, the following code, which uses a precomputed plan:
fp = plan_fft(a_gpu, flags=FFTW.MEASURE)
@benchmark CUDA.@sync $fp * $a_gpu
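For reference, here is the variant without the FFTW-specific `flags` keyword (a sketch only; I have not checked whether CUFFT's `plan_fft` accepts, ignores, or rejects that keyword, and `FFTW.MEASURE` is an FFTW planner flag with no CUFFT counterpart):

using CUDA
using CUDA.CUFFT
using BenchmarkTools: @benchmark

a_gpu = CUDA.rand(Float32, 128, 128, 128)

# Plan created with no FFTW flags; CUFFT handles its own tuning internally.
fp = plan_fft(a_gpu)

# Interpolate both the plan and the array so BenchmarkTools does not
# treat them as globals, and synchronize so GPU work is fully timed.
@benchmark CUDA.@sync $fp * $a_gpu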
gives:
BenchmarkTools.Trial: 202 samples with 1 evaluation.
Range (min … max): 22.257 ms … 32.898 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 24.642 ms ┊ GC (median): 4.01%
Time (mean ± σ): 24.843 ms ± 1.587 ms ┊ GC (mean ± σ): 4.25% ± 3.75%
▁ ▂▄ █
▂▂▁▃▆██▅█▄▂▅██▇▄▃▁▄▇█▇▅▂▁▂▁▁▁▁▁▁▁▁▃▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂ ▃
22.3 ms Histogram: frequency by time 32.6 ms <
Memory estimate: 40.00 MiB, allocs estimate: 11.
Am I doing something wrong? I expected the precomputed plan to be at least as fast as the unplanned fft call, but it is roughly 10× slower by median time (24.6 ms vs 2.6 ms) and allocates 40 MiB per call.
Output of CUDA.versioninfo():
CUDA runtime 12.6, artifact installation
CUDA driver 12.6
NVIDIA driver 535.183.1
CUDA libraries:
- CUBLAS: 12.6.1
- CURAND: 10.3.7
- CUFFT: 11.2.6
- CUSOLVER: 11.6.4
- CUSPARSE: 12.5.3
- CUPTI: 2024.3.1 (API 24.0.0)
- NVML: 12.0.0+535.183.1
Julia packages:
- CUDA: 5.5.0
- CUDA_Driver_jll: 0.10.2+0
- CUDA_Runtime_jll: 0.15.2+0
Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7
1 device:
0: NVIDIA T400 4GB (sm_75, 347.000 MiB / 4.000 GiB available)