Is this a fair GPU benchmark?

Hi, I’m totally new to GPU computing and really enjoying how easy the Julia GPU libraries are to use, but I have a question about whether my benchmark code is correct or whether I’m leaving something on the table.

The computation I’m thinking of moving to the GPU is a series of alternating 2D FFTs and inverse FFTs, with pointwise multiplications sandwiched in between.

vec_out = c3 .* (FFT * (c2 .* (FFT \ (c1 .* (FFT * vec_in))))) # etc...

In reality there are O(1000) such steps chained together. The runtime should be dominated by the FFTs, and they can’t run in parallel because each step depends on the result of the previous one, so I’m trying to benchmark 1000 sequential FFTs on the GPU.
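For concreteness, here is a minimal sketch of one such step written with precomputed plans and preallocated buffers. It runs on the CPU and assumes FFTW.jl as the backend purely so it can be tried without a GPU; the helper name `chain!`, the buffer `tmp`, and the random `c1`, `c2`, `c3` are made up for illustration and are not from my real code.

```julia
using FFTW            # assumption: CPU backend, only so the sketch runs without a GPU
using LinearAlgebra   # provides mul!

n = 512
vec_in = rand(ComplexF32, n, n)
c1 = rand(ComplexF32, n, n)   # placeholder pointwise factors
c2 = rand(ComplexF32, n, n)
c3 = rand(ComplexF32, n, n)

F    = plan_fft(vec_in)    # forward 2D FFT plan, built once
Finv = plan_ifft(vec_in)   # inverse plan, including the normalization

# Hypothetical helper: FFT, multiply, inverse FFT, multiply, FFT, multiply,
# reusing `tmp` and `out` so the step itself allocates no new arrays.
function chain!(out, tmp, F, Finv, c1, c2, c3, x)
    mul!(tmp, F, x)        # tmp = FFT * x
    tmp .*= c1
    mul!(out, Finv, tmp)   # out = FFT \ tmp
    out .*= c2
    mul!(tmp, F, out)      # tmp = FFT * out
    out .= c3 .* tmp
    return out
end

tmp = similar(vec_in)
vec_out = similar(vec_in)
chain!(vec_out, tmp, F, Finv, c1, c2, c3, vec_in)
```

The same structure should in principle carry over to GPU arrays, but I have only written it here for plain `Array`s.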

My code is below; it compares a GPU and a CPU version. The 2D arrays are 512x512, which is exactly the size in my real problem. On e.g. an Nvidia K80 I get 50 ms for the GPU and 150 ms for the CPU (FFTW w/ MKL).

A factor of 3 is definitely great, but I was wondering whether I’m leaving anything on the table, since other GPU FFT benchmarks I’ve seen (e.g. ArrayFire) suggest that factors of >10 are possible?

using AbstractFFTs
using FFTW            # provides plan_rfft for CPU arrays
using LinearAlgebra
using CuArrays

n = 512


src_gpu = CuArray(rand(Float32,n,n));
dst_gpu = CuArray(Array{Complex{Float32}}(undef,n÷2+1,n))
p_gpu = plan_rfft(src_gpu)

function bench_gpu(p,src,dst)
    # @sync waits for the queued GPU work to finish, so the timing covers all 1000 FFTs
    CuArrays.@sync for i=1:1000
        mul!(dst, p, src)
    end
end


src_cpu = rand(Float32,n,n);
dst_cpu = Array{Complex{Float32}}(undef,n÷2+1,n);
p_cpu = plan_rfft(src_cpu)

function bench_cpu(p,src,dst)
    for i=1:1000
        mul!(dst, p, src)
    end
end


using BenchmarkTools

@btime bench_gpu($p_gpu,$src_gpu,$dst_gpu)
@btime bench_cpu($p_cpu,$src_cpu,$dst_cpu)

(slide 19) is consistent with a speedup of around 3x. I’m always very much confused by CPU vs GPU benchmarks, as they seem to vary enormously. For a given algorithm, at least the following seem to be very important: problem size, precision, hardware, library version.
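To see how much problem size and precision alone move the numbers, a rough CPU-only sweep along these lines might help. This is just a sketch assuming FFTW.jl; the helper `time_rfft`, the sizes, and the repetition count are arbitrary choices for illustration, and `@elapsed` is much cruder than `@btime`.

```julia
using FFTW            # assumption: CPU backend for this sweep
using LinearAlgebra   # provides mul!

# Hypothetical helper: average seconds per planned n-by-n real FFT for element type T
function time_rfft(T, n; reps = 100)
    src = rand(T, n, n)
    dst = Array{Complex{T}}(undef, n ÷ 2 + 1, n)
    p = plan_rfft(src)
    mul!(dst, p, src)              # warm-up call, excluded from the timing
    t = @elapsed for _ in 1:reps
        mul!(dst, p, src)
    end
    return t / reps
end

for T in (Float32, Float64), n in (128, 512, 1024)
    ms = round(1e3 * time_rfft(T, n), digits = 3)
    println(T, " ", n, "x", n, ": ", ms, " ms per FFT")
end
```

Running the same sweep on the GPU (with a synchronization around the timed loop) would show whether the 3x holds across sizes or only at 512x512.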