Hi, I’m totally new to GPU computing and really enjoying how easy the Julia GPU libraries are to use, but I have a question about whether my benchmark code is correct, or whether I’m leaving performance on the table.
The computation I’m thinking of moving to the GPU is a series of alternating 2D FFTs and inverse FFTs, with some pointwise multiplication sandwiched in between:
vec_out = c3 .* (FFT * (c2 .* (FFT \ (c1 .* (FFT * vec_in))))) # etc...
In reality there are O(1000) such steps chained together. The runtime should be dominated by the FFTs, and they can’t run in parallel because each step depends on the result of the previous one, so I’m trying to benchmark 1000 sequential FFTs on the GPU.
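For concreteness, one block of this chain can be sketched on the CPU roughly as follows. This is only an illustration under assumptions: it uses FFTW.jl, complex 512x512 inputs, and random placeholder arrays for c1, c2 and c3, and the function name `chain` is made up for the example.

```julia
using FFTW
using LinearAlgebra

n = 512
vec_in = rand(ComplexF32, n, n)
# Placeholder coefficient arrays (assumption: the real ones come from the problem).
c1, c2, c3 = (rand(ComplexF32, n, n) for _ in 1:3)

# Forward 2D FFT plan; `FFT \ x` applies the (normalized) inverse transform.
FFT = plan_fft(vec_in)

# One block of the chain: forward FFT, inverse FFT, forward FFT,
# with a pointwise multiplication after each transform.
function chain(FFT, c1, c2, c3, vec_in)
    tmp = c1 .* (FFT * vec_in)
    tmp = c2 .* (FFT \ tmp)
    return c3 .* (FFT * tmp)
end

vec_out = chain(FFT, c1, c2, c3, vec_in)
```

Each transform needs the output of the previous one, which is why the steps cannot be overlapped.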
My benchmark code is below, comparing a GPU and a CPU version. The 2D arrays are 512x512, which is exactly the size in my real problem. On an Nvidia K80, for example, I get 50 ms for the GPU and 150 ms for the CPU (FFTW w/ MKL).
A factor of 3 is definitely great, but I was wondering whether I’m leaving anything on the table, since other GPU FFT benchmarks I’ve seen (e.g. ArrayFire) suggest that factors of >10 are possible.
using AbstractFFTs
using FFTW  # provides the CPU implementation of plan_rfft
using LinearAlgebra
using CuArrays
n = 512
##
src_gpu = CuArray(rand(Float32,n,n));
dst_gpu = CuArray{ComplexF32}(undef,n÷2+1,n)
p_gpu = plan_rfft(src_gpu)
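function bench_gpu(p,src,dst)
    # @sync blocks until all queued GPU work has finished, so the timing
    # covers the FFTs themselves and not just the kernel launches
    CuArrays.@sync for i=1:1000
        mul!(dst, p, src)
    end
end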
##
src_cpu = rand(Float32,n,n);
dst_cpu = Array{Complex{Float32}}(undef,n÷2+1,n);
p_cpu = plan_rfft(src_cpu)
function bench_cpu(p,src,dst)
    for i=1:1000
        mul!(dst, p, src)
    end
end
##
using BenchmarkTools
@btime bench_gpu($p_gpu,$src_gpu,$dst_gpu)
@btime bench_cpu($p_cpu,$src_cpu,$dst_cpu)