Performance of PencilFFTs with CuArray

I am seeing a performance issue when using PencilFFTs with CuArray. The related issue is:
CuArray Performance · Issue #56 · jipolanco/PencilFFTs.jl (github.com)
I am also posting here in the hope of gathering more tests.

Code example:
CUFFT:

using CUDA
using FFTW
using BenchmarkTools

println("Dimension 8192,32,32")
println("start fft benchmark")
b=@benchmark (CUDA.@sync op*data) setup=(op=plan_fft!(CuArray{ComplexF64}(undef,8192,32,32));data=CUDA.rand(ComplexF64,8192,32,32))
println("complete fft benchmark")
io = IOBuffer()
show(io, "text/plain", b)
s = String(take!(io))
println("FFT benchmark results")
println(s)

PencilFFTs:

using MPI
using PencilFFTs
using PencilArrays
using BenchmarkTools
using Random
using CUDA

MPI.Init(threadlevel=:funneled)
comm = MPI.COMM_WORLD
dims = (8192, 32, 32)

rank=MPI.Comm_rank(comm)
device!(rank % length(devices()))
sleep(1*rank)
print("rank:",rank,"GPU:",device(),"\n")

pen = Pencil(CuArray,dims, comm)
transform=Transforms.FFT!()

plan = PencilFFTPlan(pen, transform)
u = allocate_input(plan)
if rank == 0
    println("has-cuda:",MPI.has_cuda())
    print("data size:",dims,"\n")
    print("Start data allocationg\n")
end
randn!(first(u))

b = @benchmark $plan*$u evals=1 samples=100 seconds=30 teardown=(MPI.Barrier(comm))

if rank == 0
    io = IOBuffer()
    show(io, "text/plain", b)
    s = String(take!(io))
    println(s)
end

Results:
For CUFFT on a single GPU:

Dimension 8192,32,32
start fft benchmark
complete fft benchmark
FFT benchmark results
BenchmarkTools.Trial: 1521 samples with 1 evaluation.
 Range (min … max):  2.044 ms … 51.419 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.411 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.737 ms ±  2.174 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▃                                                        
  █████▇▄▁▄▃▃▄▃▄▅▃▁▁▁▁▃▁▃▄▁▁▁▃▁▃▃▁▄▁▃▁▁▁▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▃▁▃▃▅ █
  2.04 ms      Histogram: log(frequency) by time     14.5 ms <

 Memory estimate: 2.94 KiB, allocs estimate: 49.

For PencilFFTs on a single GPU:

rank:0GPU:CuDevice(0)
has-cuda:true
data size:(8192, 32, 32)
Start data allocationg
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):   93.965 ms … 300.333 ms  ┊ GC (min … max): 0.00% … 4.89%
 Time  (median):     173.977 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.007 ms ±  39.066 ms  ┊ GC (mean ± σ):  0.08% ± 0.49%

                 █▆     ▂▂  ▃                                    
  ▄▁▁▁▁▁▁▁▁▁▁▅▇▄▁██▇▇▇▄███▄██▇▅█▅▅▁▅▅▁▁█▅▄▇▅▅▁▄▁▄▅▁▁▄▁▁▄▄▁▁▁▁▁▅ ▄
  94 ms            Histogram: frequency by time          288 ms <

 Memory estimate: 17.64 KiB, allocs estimate: 343.

For PencilFFTs on 4 GPUs in the same node:

rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
has-cuda:true
data size:(8192, 32, 32)
Start data allocationg
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  16.761 ms … 28.305 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.601 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.547 ms ±  1.360 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                 ▁▄█ ▃ ▃ ▄ ▃▂  
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▃▁▄▇▇████▇█▆███▃██ ▃
  16.8 ms         Histogram: frequency by time        28.2 ms <

 Memory estimate: 13.38 KiB, allocs estimate: 294.
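
To put the three medians above side by side, here is a small sketch in plain Julia. The numbers are copied from the benchmark output above and nothing is re-measured; the variable names are illustrative only. It suggests that PencilFFTs is roughly 72x slower than CUFFT on one GPU and roughly 11x slower on four GPUs, for an array of 128 MiB.

```julia
# Median times in milliseconds, copied from the benchmark output above.
median_ms = (
    cufft_1gpu  = 2.411,    # CUFFT, single GPU
    pencil_1gpu = 173.977,  # PencilFFTs, single GPU
    pencil_4gpu = 26.601,   # PencilFFTs, 4 GPUs on one node
)

# Size in MiB of one ComplexF64 array with the benchmark dimensions.
dims = (8192, 32, 32)
array_mib = prod(dims) * sizeof(ComplexF64) / 2^20

# Slowdown of each PencilFFTs run relative to single-GPU CUFFT.
slowdown_1gpu = median_ms.pencil_1gpu / median_ms.cufft_1gpu
slowdown_4gpu = median_ms.pencil_4gpu / median_ms.cufft_1gpu

# Speed-up of PencilFFTs when going from 1 GPU to 4 GPUs.
speedup_4gpu = median_ms.pencil_1gpu / median_ms.pencil_4gpu

println("array size:                  ", array_mib, " MiB")
println("PencilFFTs (1 GPU)  / CUFFT: ", round(slowdown_1gpu, digits=1), "x")
println("PencilFFTs (4 GPUs) / CUFFT: ", round(slowdown_4gpu, digits=1), "x")
println("PencilFFTs 1 GPU -> 4 GPUs:  ", round(speedup_4gpu, digits=1), "x")
```

These are ratios of medians only; the single-GPU PencilFFTs run has a wide spread (94 ms to 300 ms), so the exact factors should be read loosely.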

I commented on the related PencilFFTs.jl issue. Please avoid duplicating the discussion, and post any new information on the linked issue rather than here.

As I mentioned in the issue, there is still room for optimisations regarding GPU arrays in PencilFFTs, but I think it will be very hard to match the performance of native 3D FFTs implemented in cuFFT for single GPUs.
