Performance issue on PencilFFTs with CuArray
CuArray Performance · Issue #56 · jipolanco/PencilFFTs.jl (github.com)
I am posting here to find more tests.
Code example:
CUFFT:
using CUDA
using FFTW
using BenchmarkTools
println("Dimension 8192,32,32")
println("start fft benchmark")
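# The in-place plan and the random input are built in `setup`, so only the
# transform is timed; CUDA.@sync waits for the GPU work to finish.
b = @benchmark (CUDA.@sync op*data) setup=(op=plan_fft!(CuArray{ComplexF64}(undef,8192,32,32)); data=CUDA.rand(ComplexF64,8192,32,32))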
println("complete fft benchmark")
io = IOBuffer()
show(io, "text/plain", b)
s = String(take!(io))
println("FFT benchmark results")
println(s)
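For scale, here is a minimal sketch of the memory footprint of the array allocated above. It only does arithmetic on the dimensions from the benchmark; the variable names are mine and not part of any package.

```julia
# Array shape used in the benchmark above
dims = (8192, 32, 32)

# Total number of ComplexF64 elements
nelems = prod(dims)

# Each ComplexF64 takes 16 bytes (two Float64 values)
nbytes = nelems * sizeof(ComplexF64)

println("elements: ", nelems)
println("size in MiB: ", nbytes / 2^20)
```

So each transform works on 8388608 complex values, or 128 MiB of data.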
PencilFFTs:
using MPI
using PencilFFTs
using PencilArrays
using BenchmarkTools
using Random
using CUDA
MPI.Init(threadlevel=:funneled)
comm = MPI.COMM_WORLD
dims = (8192, 32, 32)
rank=MPI.Comm_rank(comm)
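# Assign one GPU to each rank, round-robin over the visible devices
device!(rank % length(devices()))
# Delay each rank by `rank` seconds, which staggers the per-rank output below
sleep(1*rank)
print("rank:",rank,"GPU:",device(),"\n")
# Decompose the global array across the ranks, storing the data in CuArrays
pen = Pencil(CuArray, dims, comm)
# In-place complex-to-complex FFT, applied along every dimension
transform = Transforms.FFT!()
plan = PencilFFTPlan(pen, transform)
# For an in-place plan this returns a ManyPencilArray; first(u) is the input view
u = allocate_input(plan)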
if rank == 0
println("has-cuda:",MPI.has_cuda())
print("data size:",dims,"\n")
print("Start data allocationg\n")
end
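# Fill the input view with random data
randn!(first(u))
# Time the distributed in-place FFT; the barrier in `teardown` realigns the ranks after each sample
b = @benchmark $plan*$u evals=1 samples=100 seconds=30 teardown=(MPI.Barrier(comm))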
if rank == 0
io = IOBuffer()
show(io, "text/plain", b)
s = String(take!(io))
println(s)
end
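The GPU selection in the script above, `device!(rank % length(devices()))`, is a round-robin mapping from MPI rank to GPU. Here is a minimal sketch of that mapping in plain Julia, without MPI or CUDA. `gpu_index` is a hypothetical helper I made up for illustration; it is not part of MPI.jl or CUDA.jl.

```julia
# Hypothetical helper (not from MPI.jl or CUDA.jl): zero-based GPU index
# for a given MPI rank, mirroring `rank % length(devices())` in the script.
gpu_index(rank::Integer, ngpus::Integer) = rank % ngpus

# With 4 ranks on a node with 4 GPUs, every rank gets its own device
for rank in 0:3
    println("rank ", rank, " -> GPU ", gpu_index(rank, 4))
end

# With more ranks than GPUs, the mapping wraps around and GPUs are shared
println("rank 5 on a 4-GPU node -> GPU ", gpu_index(5, 4))
```

This assumes every rank sees the same set of GPUs, which holds when all ranks run on the same node.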
Results:
For CUFFT with a single GPU:
Dimension 8192,32,32
start fft benchmark
complete fft benchmark
FFT benchmark results
BenchmarkTools.Trial: 1521 samples with 1 evaluation.
Range (min … max): 2.044 ms … 51.419 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.411 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.737 ms ± 2.174 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▆▃
█████▇▄▁▄▃▃▄▃▄▅▃▁▁▁▁▃▁▃▄▁▁▁▃▁▃▃▁▄▁▃▁▁▁▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▃▁▃▃▅ █
2.04 ms Histogram: log(frequency) by time 14.5 ms <
Memory estimate: 2.94 KiB, allocs estimate: 49.
For PencilFFTs with a single GPU:
rank:0GPU:CuDevice(0)
has-cuda:true
data size:(8192, 32, 32)
Start data allocationg
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 93.965 ms … 300.333 ms ┊ GC (min … max): 0.00% … 4.89%
Time (median): 173.977 ms ┊ GC (median): 0.00%
Time (mean ± σ): 181.007 ms ± 39.066 ms ┊ GC (mean ± σ): 0.08% ± 0.49%
█▆ ▂▂ ▃
▄▁▁▁▁▁▁▁▁▁▁▅▇▄▁██▇▇▇▄███▄██▇▅█▅▅▁▅▅▁▁█▅▄▇▅▅▁▄▁▄▅▁▁▄▁▁▄▄▁▁▁▁▁▅ ▄
94 ms Histogram: frequency by time 288 ms <
Memory estimate: 17.64 KiB, allocs estimate: 343.
For PencilFFTs with 4 GPUs on the same node:
rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
has-cuda:true
data size:(8192, 32, 32)
Start data allocationg
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 16.761 ms … 28.305 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 26.601 ms ┊ GC (median): 0.00%
Time (mean ± σ): 26.547 ms ± 1.360 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄█ ▃ ▃ ▄ ▃▂
▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▃▁▄▇▇████▇█▆███▃██ ▃
16.8 ms Histogram: frequency by time 28.2 ms <
Memory estimate: 13.38 KiB, allocs estimate: 294.
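To put the three medians above side by side, here is a minimal sketch that computes the ratios. The numbers are copied from the BenchmarkTools output above and the variable names are mine. The comparison is only indicative, since the CUFFT benchmark wraps the call in `CUDA.@sync` while the PencilFFTs benchmark does not.

```julia
# Median times in milliseconds, copied from the benchmark output above
cufft_1gpu  = 2.411
pencil_1gpu = 173.977
pencil_4gpu = 26.601

# How many times slower PencilFFTs is than plain CUFFT on one GPU
slowdown_1gpu = pencil_1gpu / cufft_1gpu

# Same comparison with PencilFFTs running on 4 GPUs
slowdown_4gpu = pencil_4gpu / cufft_1gpu

# Speedup of PencilFFTs when going from 1 GPU to 4 GPUs
speedup_1_to_4 = pencil_1gpu / pencil_4gpu

println("PencilFFTs (1 GPU) vs CUFFT:  ", round(slowdown_1gpu; digits=1), "x slower")
println("PencilFFTs (4 GPUs) vs CUFFT: ", round(slowdown_4gpu; digits=1), "x slower")
println("PencilFFTs 1 -> 4 GPUs:       ", round(speedup_1_to_4; digits=1), "x faster")
```

On these medians PencilFFTs is roughly 72 times slower than CUFFT on one GPU and roughly 11 times slower on 4 GPUs. Going from 1 to 4 GPUs speeds PencilFFTs up by about 6.5 times, which is more than the number of GPUs added.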