R2r is surprisingly slower than complex fft

const m=60

x=randn(m,m,m)
y=deepcopy(x)
p = FFTW.plan_r2r(y,FFTW.R2HC,flags=FFTW.MEASURE)
@btime mul!($y,$p,$x);

gives the result

572.302 μs (138 allocations: 10.16 KiB)

while

const m=60

x=randn(ComplexF64,m,m,m)
y=deepcopy(x)
p = plan_fft(y,flags=FFTW.MEASURE)
@btime mul!($y,$p,$x);

gives

551.774 μs (138 allocations: 10.16 KiB)

Have I missed some crucial point?

Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

And any chance StructArrays can be used in FFTW?

Have you tried calling fft on the array directly, to see if you have the same relative performance?

const m=60

x=randn(m,m,m)
@btime FFTW.r2r!($x,FFTW.R2HC);

gives

2.510 ms (55 allocations: 4.16 KiB)

while

const m=60

x=randn(ComplexF64,m,m,m)
@btime fft!($x);

gives

2.868 ms (31 allocations: 2.72 KiB)

Is it normal?

I honstestly have no idea. @stevengj could probably understand what goes on. (I hope it is okay pinging you like this)

No. The complex (and r2c) FFT algorithms are much more optimized in FFTW than the r2r plans, e.g. the former use SIMD and the latter do not. So even though a complex FFT takes about twice as much arithmetic as an R2HC plan of the same size, the former’s greater optimization makes up for it.

As explained in the FFTW manual, we generally recommend the r2c interface if what you want is a real-input DFT, especially in the multidimensional case. The r2hc algorithms are included mainly as stepping stones to other transforms, and because it is occasionally convenient to have a real-input DFT where the output is exactly the same size.

3 Likes

This is too bad since r2r is useful for Chebyshev transforms. I remember for certain degrees this caused ApproxFun to be much slower than Chebfun (which just translated it to an FFT).

3 Likes