FFTW scales pretty well (some @btime benchmarks)

I was misled into thinking that FFTW scales poorly in parallel after reading this Discourse thread, so I wanted to post some numbers for anyone trying to do fast FFTs in the future. My use case requires repeatedly performing moderate-size 2D FFTs for what’s basically parameter estimation plus image convolution. I ran the benchmarks on a 40-core 2.4 GHz Skylake node.

For an in-place 1024x1024 complex double-precision FFT, here’s the @btime output for 1/8/40 threads:

  19.280 ms (0 allocations: 0 bytes)
  4.149 ms (0 allocations: 0 bytes)
  1.642 ms (0 allocations: 0 bytes)

For 2048x2048, the timings for 1/8/40 threads are

  307.795 ms (0 allocations: 0 bytes)
  98.475 ms (0 allocations: 0 bytes)
  26.190 ms (0 allocations: 0 bytes)
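
To put the scaling in numbers, here is a small sketch that turns the timings above into speedup and parallel efficiency. The timings are copied from the @btime output in this post; the helper names `speedups` and `efficiencies` are mine and are only for illustration.

```julia
# Timings in milliseconds, copied from the @btime output above.
const threads = [1, 8, 40]
const times_ms = Dict(1024 => [19.280, 4.149, 1.642],
                      2048 => [307.795, 98.475, 26.190])

# Speedup relative to the first (single-thread) entry.
speedups(t) = t[1] ./ t

# Parallel efficiency: speedup divided by thread count (1.0 = perfectly linear).
efficiencies(t, nthreads) = speedups(t) ./ nthreads

for n in (1024, 2048)
    s = speedups(times_ms[n])
    e = efficiencies(times_ms[n], threads)
    for i in eachindex(threads)
        println(n, "x", n, ", ", threads[i], " threads: speedup ",
                round(s[i]; digits=2), ", efficiency ", round(e[i]; digits=2))
    end
end
```

Both sizes come out at roughly 11.7x on 40 threads, while the 8-thread speedup is about 4.6x for 1024x1024 and about 3.1x for 2048x2048. That is far from linear, but clearly not "scales poorly" either.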

Finally, Julia has the amazing CuArrays. For comparison, here is the performance of the same double-precision FFTs on a Tesla P100. For 1024x1024 2D FFTs,

  335.463 μs (6 allocations: 176 bytes)

For 2048x2048,

  1.185 ms (6 allocations: 176 bytes)

GPU performance is pretty awe-inspiring, assuming I’ve done the benchmark correctly.
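
As a rough sanity check on the comparison, the sketch below computes how much faster the P100 is than the 40-thread CPU run, using only the timings quoted in this post. The function name `gpu_advantage` is hypothetical. Keep in mind that the GPU benchmark creates its input on the device, so host-to-device transfer time is not included in these numbers.

```julia
# Timings in milliseconds, copied from the @btime output in this post.
const cpu40_ms = Dict(1024 => 1.642, 2048 => 26.190)    # 40 FFTW threads
const gpu_ms   = Dict(1024 => 0.335463, 2048 => 1.185)  # Tesla P100

# Ratio of CPU time to GPU time for an n x n transform.
gpu_advantage(n) = cpu40_ms[n] / gpu_ms[n]

for n in (1024, 2048)
    println(n, "x", n, ": GPU is about ",
            round(gpu_advantage(n); digits=1), "x faster than 40 CPU threads")
end
```

With these numbers the GPU lead grows from about 4.9x at 1024x1024 to about 22.1x at 2048x2048.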

Benchmark Code (CPU)

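using BenchmarkTools
using FFTW

const nx = 1024  # side length: benchmark a 1024 x 1024 2D FFT

# Set the thread count before planning: a plan uses the thread count
# that was in effect when it was created.
# $p interpolates the non-const global plan into the benchmark, and
# setup draws a fresh input array for every sample.

FFTW.set_num_threads(1)
p = plan_fft!( randn(Complex{Float64},nx,nx) )  # in-place plan
@btime $p * x setup=(x=randn(Complex{Float64},nx,nx));

FFTW.set_num_threads(8)
p = plan_fft!( randn(Complex{Float64},nx,nx) )
@btime $p * x setup=(x=randn(Complex{Float64},nx,nx));

FFTW.set_num_threads(40)
p = plan_fft!( randn(Complex{Float64},nx,nx) )
@btime $p * x setup=(x=randn(Complex{Float64},nx,nx));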
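
If you want to try other thread counts, the three repeated stanzas can be folded into a loop. This is only a sketch of the same benchmark, not a different method: the thread counts are the ones used in this post, `timings` is a name I made up to collect the results, and `@belapsed` (from BenchmarkTools) returns the minimum time in seconds instead of printing it.

```julia
using BenchmarkTools
using FFTW

const nx = 1024                      # side length of the 2D FFT
const timings = Dict{Int,Float64}()  # thread count => minimum time in ms

for nt in (1, 8, 40)
    FFTW.set_num_threads(nt)         # must happen before the plan is created
    p = plan_fft!(randn(ComplexF64, nx, nx))
    # evals=1 so each in-place transform runs on a fresh array from setup
    t = @belapsed $p * x setup=(x = randn(ComplexF64, nx, nx)) evals=1
    timings[nt] = 1e3 * t
    println(nt, " threads: ", round(timings[nt]; digits=3), " ms")
end
```

Change `nx` to 2048 to reproduce the larger case.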

Benchmark Code (GPU)

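using BenchmarkTools
using FFTW
using CuArrays  # note: CuArrays.jl has since been merged into CUDA.jl
using Random

const nx = 1024  # side length: benchmark a 1024 x 1024 2D FFT

# Build the input on the device, then plan an in-place FFT for it.
xc = CuArray{ComplexF64}(CuArrays.randn(Float64, nx, nx))
p = plan_fft!( xc )

# CuArrays.@sync waits for the GPU to finish, so the timing covers the
# whole transform and not just the kernel launch. $p interpolates the
# non-const global plan into the benchmark.
@btime CuArrays.@sync($p * x) setup=(
    x=CuArray{ComplexF64}(CuArrays.randn(Float64, nx, nx)));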