Is this a fair GPU benchmark?

marius311 · May 31, 2019, 10:09pm

Hi, I’m totally new to GPU computing and really enjoying how easy the Julia GPU libraries are to use, but I have a question about whether my benchmark code is correct, or whether I’m leaving something on the table.

The computation I’m thinking of moving to the GPU is a series of alternating 2D FFTs and inverse FFTs, with some pointwise multiplication sandwiched in between.

vec_out = c3 .* (FFT * (c2 .* (FFT \ (c1 .* (FFT * vec_in))))) # etc...

In reality there are O(1000) such steps chained together. The runtime should be dominated by the FFTs, and since each step depends on the result of the previous one they can’t be run in parallel, so I’m trying to benchmark 1000 sequential FFTs on the GPU.
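For reference, one step of the chain above can be written with preallocated buffers so that the FFTs and the pointwise multiplications all run in place. This is only a minimal CPU sketch under stated assumptions: it uses FFTW.jl as the backend, the weights `c1`, `c2`, `c3` are random placeholders, and the names `chain!`, `freq`, and `tmp` are made up for illustration. Whether every call carries over unchanged to `CuArray` inputs is an assumption that would need checking.

```julia
using FFTW            # CPU backend for the AbstractFFTs planning interface
using LinearAlgebra   # mul! and ldiv!

n = 512
vec_in = rand(Float32, n, n)

# Hypothetical pointwise weights: c1 and c3 act in Fourier space, c2 in real space.
c1 = rand(Float32, n ÷ 2 + 1, n)
c2 = rand(Float32, n, n)
c3 = rand(Float32, n ÷ 2 + 1, n)

p    = plan_rfft(vec_in)                               # plays the role of FFT
freq = Array{Complex{Float32}}(undef, n ÷ 2 + 1, n)    # Fourier-space work buffer
tmp  = similar(vec_in)                                 # real-space work buffer

# One step of the chain: c3 .* (FFT * (c2 .* (FFT \ (c1 .* (FFT * vec_in)))))
function chain!(freq, tmp, p, c1, c2, c3, vec_in)
    mul!(freq, p, vec_in)   # freq = FFT * vec_in
    freq .*= c1
    ldiv!(tmp, p, freq)     # tmp = FFT \ freq; may use freq as scratch, since
                            # multi-dimensional complex-to-real FFTW transforms
                            # do not preserve their input
    tmp .*= c2
    mul!(freq, p, tmp)      # freq = FFT * tmp
    freq .*= c3
    return freq
end

vec_out = chain!(freq, tmp, p, c1, c2, c3, vec_in)
```

The first `ldiv!` call builds and caches the inverse plan; after that the step should reuse the same two buffers, which keeps the measurement focused on the transforms rather than on allocation.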

My code is below, comparing a GPU and a CPU version. The 2D arrays are 512x512, which is exactly the size in my real problem. On e.g. an Nvidia K80 I get 50 ms for the GPU and 150 ms for the CPU (FFTW w/ MKL), in each case for the full loop of 1000 FFTs.
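To put those timings in per-transform terms, the arithmetic can be sketched as follows. The only inputs are the numbers quoted above; the flop count uses the conventional 2.5 N log2(N) estimate for a real-input FFT (the convention FFTW's own benchmarks use), which is a nominal figure rather than a measured operation count, and the byte count is just one read of the input plus one write of the output, so it is a lower bound on actual memory traffic.

```julia
n     = 512
nffts = 1000
t_gpu = 50e-3     # seconds for 1000 transforms on the GPU (from the post)
t_cpu = 150e-3    # seconds for 1000 transforms on the CPU (from the post)

per_fft_gpu = t_gpu / nffts    # about 50 µs per transform
per_fft_cpu = t_cpu / nffts    # about 150 µs per transform

N     = n^2                    # 262144 points
flops = 2.5 * N * log2(N)      # nominal flops for one real-input FFT

gflops_gpu = flops / per_fft_gpu / 1e9
gflops_cpu = flops / per_fft_cpu / 1e9

# Minimum memory traffic: read Float32 input, write Complex{Float32} output.
bytes    = n * n * 4 + (n ÷ 2 + 1) * n * 8
gbps_gpu = bytes / per_fft_gpu / 1e9

println("GPU: ", round(gflops_gpu, digits = 1), " nominal GFLOP/s, at least ",
        round(gbps_gpu, digits = 1), " GB/s")
println("CPU: ", round(gflops_cpu, digits = 1), " nominal GFLOP/s")
println("speedup: ", round(t_cpu / t_gpu, digits = 2))
```

Comparing these rates against the hardware's peak numbers gives a rough sense of how much headroom might remain at this problem size.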

A factor of 3 is definitely great, but I was wondering if there’s anything I’m leaving on the table, since other GPU FFT benchmarks I’ve seen (e.g. ArrayFire) suggest that factors of >10 are possible?

using AbstractFFTs
using FFTW            # CPU backend; plan_rfft on a plain Array needs it
using LinearAlgebra
using CuArrays

n = 512

##

src_gpu = CuArray(rand(Float32,n,n));
dst_gpu = CuArray(Array{Complex{Float32}}(undef,n÷2+1,n))
p_gpu = plan_rfft(src_gpu)

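function bench_gpu(p,src,dst)
    # GPU calls return before the work is done; @sync waits for
    # all 1000 transforms to finish so the timing covers them
    CuArrays.@sync for i=1:1000
        mul!(dst, p, src)
    end
end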

##

src_cpu = rand(Float32,n,n);
dst_cpu = Array{Complex{Float32}}(undef,n÷2+1,n);
p_cpu = plan_rfft(src_cpu)

function bench_cpu(p,src,dst)
    for i=1:1000
        mul!(dst, p, src)
    end
end

##

using BenchmarkTools

@btime bench_gpu($p_gpu,$src_gpu,$dst_gpu)
@btime bench_cpu($p_cpu,$src_cpu,$dst_cpu)

antoine-levitt · June 1, 2019, 2:31pm

http://users.umiacs.umd.edu/~ramani/cmsc828e_gpusci/DeSpain_FFT_Presentation.pdf (slide 19) is consistent with a speedup of around 3x. I’m always very much confused by CPU vs GPU benchmarks, as they seem to vary enormously. For a given algorithm, at least the following seem to be very important: problem size, precision, hardware, library version.
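Two of the factors listed above, problem size and precision, are easy to scan on the CPU side. The following is a rough sketch, assuming FFTW.jl and BenchmarkTools.jl are installed; the helper name `rfft_time` and the particular sizes are arbitrary choices for illustration, and it says nothing about hardware or library version.

```julia
using FFTW
using LinearAlgebra
using BenchmarkTools

# Time a single planned, out-of-place 2D real FFT of size n x n
# with element type T, returning the minimum time in seconds.
function rfft_time(::Type{T}, n) where {T<:Union{Float32,Float64}}
    src = rand(T, n, n)
    dst = Array{Complex{T}}(undef, n ÷ 2 + 1, n)
    p = plan_rfft(src)
    return @belapsed mul!($dst, $p, $src)
end

for T in (Float32, Float64), n in (128, 512, 2048)
    t = rfft_time(T, n)
    println(rpad(T, 8), " n = ", lpad(n, 4), ": ",
            round(t * 1e6, digits = 1), " µs per rfft")
end
```

Running the same scan with GPU arrays (with the timed call synchronized) would show how the CPU/GPU ratio moves with size and precision, rather than relying on a single data point.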
