FFTW scales pretty well (some @btime benchmarks)

xzackli · September 3, 2019, 6:27pm

After reading this Discourse thread, I was misled into thinking that FFTW scales poorly in parallel, so I wanted to post some numbers for anyone trying to do fast FFTs in the future. My use case requires repeatedly performing moderate-size 2D FFTs for what’s basically parameter estimation + image convolution. I ran the benchmarks on a 40-core 2.4 GHz Skylake node, using in-place complex double-precision transforms.

For 1024x1024, here’s the output for 1/8/40 threads:

  19.280 ms (0 allocations: 0 bytes)
  4.149 ms (0 allocations: 0 bytes)
  1.642 ms (0 allocations: 0 bytes)

For 2048x2048, the timings for 1/8/40 threads are

  307.795 ms (0 allocations: 0 bytes)
  98.475 ms (0 allocations: 0 bytes)
  26.190 ms (0 allocations: 0 bytes)
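
To put the scaling into numbers, here is a small sketch that turns the timings above into speedup (single-thread time divided by n-thread time) and parallel efficiency (speedup divided by thread count). The helper names `speedup` and `efficiency` are my own and not part of FFTW or BenchmarkTools; the timings are copied from the output above.

```julia
# Timings in milliseconds for 1/8/40 threads, copied from the results above.
const THREAD_COUNTS = (1, 8, 40)
const TIMINGS_MS = [
    "1024x1024" => (19.280, 4.149, 1.642),
    "2048x2048" => (307.795, 98.475, 26.190),
]

# Hypothetical helpers: speedup relative to one thread, and efficiency per thread.
speedup(t1, tn) = t1 / tn
efficiency(t1, tn, nthreads) = speedup(t1, tn) / nthreads

for (label, ts) in TIMINGS_MS
    for (nthreads, t) in zip(THREAD_COUNTS, ts)
        s = round(speedup(ts[1], t); digits=2)
        e = round(100 * efficiency(ts[1], t, nthreads); digits=1)
        println(label, ": ", nthreads, " threads, speedup ", s, "x, efficiency ", e, "%")
    end
end
```

By this measure the 40-thread runs are roughly 11.7x faster than the single-thread runs at both sizes, which is about 29% parallel efficiency.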

Finally, Julia has the amazing CuArrays, so we can compare against double-precision performance on a Tesla P100. For 1024x1024 2D FFTs,

  335.463 μs (6 allocations: 176 bytes)

For 2048x2048,

  1.185 ms (6 allocations: 176 bytes)

GPU performance is pretty awe-inspiring, assuming I’ve done the benchmark correctly.
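
For a rough sense of scale, the sketch below compares the GPU timings with the 40-thread CPU timings and converts them to a nominal throughput. It assumes the usual FFT benchmarking convention of counting 5 N log2(N) floating-point operations for a complex transform of N points; that is a nominal figure, not the real operation count, and the helper name `nominal_gflops` is my own.

```julia
# Nominal GFLOPS for a complex nx x ny FFT, using the 5 N log2(N) convention.
# This is a conventional estimate, not a measured operation count.
function nominal_gflops(nx, ny, seconds)
    n = nx * ny
    return 5 * n * log2(n) / seconds / 1e9
end

# Timings in seconds, copied from the results above (40 CPU threads vs. Tesla P100).
cpu_1024, gpu_1024 = 1.642e-3, 335.463e-6
cpu_2048, gpu_2048 = 26.190e-3, 1.185e-3

println("1024x1024: GPU is ", round(cpu_1024 / gpu_1024; digits=1), "x faster than 40 threads")
println("2048x2048: GPU is ", round(cpu_2048 / gpu_2048; digits=1), "x faster than 40 threads")
println("1024x1024 CPU: ", round(nominal_gflops(1024, 1024, cpu_1024); digits=1), " GFLOPS (nominal)")
println("1024x1024 GPU: ", round(nominal_gflops(1024, 1024, gpu_1024); digits=1), " GFLOPS (nominal)")
```

With these numbers the GPU comes out roughly 4.9x faster than 40 CPU threads at 1024x1024 and roughly 22x faster at 2048x2048.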

Benchmark Code (CPU)

using BenchmarkTools
using FFTW

const nx = 1024  # do 1024 x 1024 2D FFT

# Set the thread count before planning: a plan uses the thread count in effect when it is created.
FFTW.set_num_threads(1)
p = plan_fft!( randn(Complex{Float64},nx,nx) )  # in-place plan
# $p interpolates the non-const global so @btime does not include dynamic dispatch
@btime $p*x setup=(x=randn(Complex{Float64},nx,nx));

FFTW.set_num_threads(8)
p = plan_fft!( randn(Complex{Float64},nx,nx) )
@btime $p*x setup=(x=randn(Complex{Float64},nx,nx));

FFTW.set_num_threads(40)
p = plan_fft!( randn(Complex{Float64},nx,nx) )
@btime $p*x setup=(x=randn(Complex{Float64},nx,nx));
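
The same measurement can also be written as a loop over thread counts. This is only a sketch: `fft_timings` is a hypothetical helper of mine, and it uses `@belapsed` (which returns the minimum time in seconds) rather than `@btime`, with `evals=1` so that every evaluation of the in-place transform gets a fresh input from `setup`.

```julia
using BenchmarkTools
using FFTW

# Hypothetical helper (not from the original post): time an in-place n x n
# complex FFT for each requested FFTW thread count.
function fft_timings(n::Integer, thread_counts)
    timings = Dict{Int,Float64}()
    for nt in thread_counts
        # Set the thread count first; the plan created next uses it.
        FFTW.set_num_threads(nt)
        p = plan_fft!(randn(ComplexF64, n, n))
        # $p and $n interpolate the local variables into the benchmark.
        timings[nt] = @belapsed $p * x setup=(x = randn(ComplexF64, $n, $n)) evals=1
    end
    return timings  # thread count => minimum time in seconds
end

for (nt, t) in sort(collect(fft_timings(1024, (1, 8, 40))))
    println(nt, " threads: ", round(1e3 * t; digits=3), " ms")
end
```

The thread counts here match my 40-core node; pick values that suit your own machine.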

Benchmark Code (GPU)

using BenchmarkTools
using FFTW
using CuArrays
using Random

const nx = 1024  # do 1024 x 1024 2D FFT
xc = CuArray{ComplexF64}(CuArrays.randn(Float64, nx, nx))
p = plan_fft!( xc )
@btime CuArrays.@sync(p * x) setup=(
    x=CuArray{ComplexF64}(CuArrays.randn(Float64, nx, nx)));

stevengj · February 4, 2025, 2:30pm

CuArrays have their own FFT, they aren’t using FFTW. (So you only need using AbstractFFTs, maybe?)
