I was misled into thinking that FFTW scales poorly in parallel after reading this Discourse thread, so I wanted to post some numbers for anyone trying to do fast FFTs in the future. My use case requires repeatedly performing moderate-size 2D FFTs for what is basically parameter estimation plus image convolution. I ran the benchmarks on a 40-core 2.4 GHz Skylake node.
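For context, the workload is essentially FFT-based convolution with plans built once and reused. Here's a minimal sketch of that pattern (fftconv, img, and kern are hypothetical names for illustration, not my actual code):
using FFTW
const n = 1024
P    = plan_fft(randn(ComplexF64, n, n))   # forward plan, built once
Pinv = plan_ifft(randn(ComplexF64, n, n))  # inverse plan, built once
# Convolution theorem: conv(img, kern) = ifft(fft(img) .* fft(kern)),
# assuming kern is already zero-padded to the same size as img.
fftconv(img, kern) = Pinv * ((P * img) .* (P * kern))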
For 1024x1024, here’s the output for 1/8/40 threads:
1 thread:   19.280 ms (0 allocations: 0 bytes)
8 threads:   4.149 ms (0 allocations: 0 bytes)
40 threads:  1.642 ms (0 allocations: 0 bytes)
For 2048x2048, the timings for 1/8/40 threads are
1 thread:   307.795 ms (0 allocations: 0 bytes)
8 threads:   98.475 ms (0 allocations: 0 bytes)
40 threads:  26.190 ms (0 allocations: 0 bytes)
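In both cases 40 threads buy roughly a 12x speedup over a single thread (19.3/1.64 and 308/26.2 are both about 11.7), which is far better scaling than the thread had led me to expect.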
Finally, Julia has the amazing CuArrays package, so we can compare against double-precision performance on a Tesla P100. For 1024x1024 2D FFTs,
335.463 μs (6 allocations: 176 bytes)
For 2048x2048,
1.185 ms (6 allocations: 176 bytes)
GPU performance is pretty awe-inspiring, assuming I’ve done the benchmark correctly.
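For scale, that's about 5x faster than 40 FFTW threads at 1024x1024 (335 μs vs 1.64 ms) and about 22x faster at 2048x2048 (1.19 ms vs 26.2 ms).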
Benchmark Code (CPU)
using BenchmarkTools
using FFTW

const nx = 1024 # do 1024 x 1024 2D FFT

# Single-threaded baseline: build an in-place plan once, then time applying it
# to a fresh random array each run. The plan is interpolated ($p) so the
# global-variable lookup is not part of the measurement.
FFTW.set_num_threads(1)
p = plan_fft!( randn(Complex{Float64},nx,nx) )
@btime $p*x setup=(x=randn(Complex{Float64},nx,nx));

# Same benchmark with 8 threads; the plan must be rebuilt after changing the
# thread count.
FFTW.set_num_threads(8)
p = plan_fft!( randn(Complex{Float64},nx,nx) )
@btime $p*x setup=(x=randn(Complex{Float64},nx,nx));

# And with all 40 cores.
FFTW.set_num_threads(40)
p = plan_fft!( randn(Complex{Float64},nx,nx) )
@btime $p*x setup=(x=randn(Complex{Float64},nx,nx));
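One caveat on the CPU numbers: plan_fft! defaults to the cheap FFTW.ESTIMATE planner. Letting FFTW actually measure candidate algorithms at planning time can give faster transforms; a sketch of that variant (which I haven't benchmarked here):
# Slower planning, potentially faster transforms. Note that FFTW.MEASURE
# overwrites the array handed to the planner.
p = plan_fft!( randn(Complex{Float64},nx,nx); flags=FFTW.MEASURE )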
Benchmark Code (GPU)
using BenchmarkTools
using FFTW # re-exports the generic plan_fft! API
using CuArrays

const nx = 1024 # do 1024 x 1024 2D FFT

# Build an in-place CUFFT plan once, then time applying it. CuArrays.@sync
# blocks until the GPU finishes, so we time the whole transform rather than
# just the asynchronous kernel launch.
xc = CuArray{ComplexF64}(CuArrays.randn(Float64, nx, nx))
p = plan_fft!( xc )
@btime CuArrays.@sync($p * x) setup=(x=CuArray{ComplexF64}(CuArrays.randn(Float64, nx, nx)));
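As a sanity check on the GPU numbers, the CUFFT result should match FFTW on the CPU up to floating-point error. A minimal sketch (not part of the timings above):
x = randn(ComplexF64, nx, nx)
@assert fft(x) ≈ Array(fft(CuArray(x)))  # CPU FFTW vs GPU CUFFT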