I noticed unexpectedly long FFT runtimes with the FFTW backend, especially on Windows. If I switch to the MKL backend, the execution time is back to normal (on both AMD and Intel systems).
To figure out whether FFTW itself is the culprit, I compiled the libfftw3-3 and libfftw3f-3 libraries from scratch using MinGW and ran the benchmark that ships with the FFTW package. I have listed its results below, right after the Julia results.
Comment on the MWE: precompilation is not the issue, and using BenchmarkTools with @btime does not lead to a significant improvement:
System 1: Windows 11, Ryzen 7 Pro 4750U: 8 Cores @1.7 GHz base, 32 GiB RAM DDR4 3200 (dual channel)
With FFTW.set_provider!("fftw"):
julia> using FFTW
julia> x = rand(128,128,1001);
julia> fft(x); @time fft(x);
5.863041 seconds (21.23 k allocations: 501.661 MiB, 1.88% gc time, 0.26% compilation time)
julia> FFTW.set_num_threads(8)
julia> fft(x); @time fft(x);
0.960498 seconds (38 allocations: 500.502 MiB, 2.41% gc time)
And with FFTW.set_provider!("mkl"):
julia> fft(x); @time fft(x);
0.524924 seconds (38 allocations: 500.502 MiB, 3.46% gc time)
julia> FFTW.set_num_threads(8)
julia> fft(x); @time fft(x);
0.220142 seconds (38 allocations: 500.502 MiB, 10.94% gc time)
For reference: The FFTW bench yields
.\bench.exe -r 10 -onthreads=1 -oestimate ocf128x128x128
Problem: ocf128x128x128, setup: 64.60 us, time: 144.60 ms, ``mflops'': 1522.8724
.\bench.exe -r 10 -onthreads=8 -oestimate ocf128x128x128
Problem: ocf128x128x128, setup: 196.50 us, time: 24.01 ms, ``mflops'': 9172.4793
Additionally, I have observed even stranger behavior on another system.
System 2: Windows Server 2019 Datacenter, Xeon W-3223: 8 Cores (with AVX512) @3.5 GHz base, 128 GiB RAM DDR4 2667 (four channels in use)
With FFTW.set_provider!("fftw"):
julia> using FFTW
julia> x = rand(128,128,1001);
julia> fft(x); @time fft(x);
7.730859 seconds (21.23 k allocations: 501.661 MiB, 1.77% gc time, 0.21% compilation time)
julia> FFTW.set_num_threads(8);
julia> fft(x); @time fft(x);
0.452253 seconds (38 allocations: 500.502 MiB, 12.92% gc time)
How can using eight threads instead of one reduce the runtime by a factor of ~17? There must be something weird going on...
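To make the factor explicit, here is a small sketch that computes the speedup and the parallel efficiency from the two System 2 timings quoted above. The variable names are only illustrative. An efficiency above 1 would mean superlinear scaling, which points at the single-threaded run being abnormally slow rather than the multi-threaded run being abnormally fast.

```julia
# Timings in seconds, copied from the @time output above
# (System 2, "fftw" provider). Variable names are illustrative.
t1 = 7.730859        # 1 thread
t8 = 0.452253        # 8 threads
nthreads = 8

speedup = t1 / t8                  # how many times faster the 8-thread run is
efficiency = speedup / nthreads    # 1.0 would be ideal linear scaling

println("speedup    = ", round(speedup; digits=2))
println("efficiency = ", round(efficiency; digits=2))
```

With these numbers the speedup comes out at roughly 17.1, so the efficiency is a bit above 2, more than twice what ideal linear scaling on eight cores would give.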
And with FFTW.set_provider!("mkl"):
julia> fft(x); @time fft(x);
0.608177 seconds (38 allocations: 500.502 MiB, 9.33% gc time)
julia> FFTW.set_num_threads(8);
julia> fft(x); @time fft(x);
0.248317 seconds (38 allocations: 500.502 MiB, 28.68% gc time)
For reference: The FFTW bench yields
.\bench.exe -r 10 -onthreads=1 -oestimate ocf128x128x1001
Problem: ocf128x128x1001, setup: 88.90 us, time: 190.21 ms, ``mflops'': 10332.496
.\bench.exe -r 10 -onthreads=8 -oestimate ocf128x128x1001
Problem: ocf128x128x1001, setup: 327.10 us, time: 41.65 ms, ``mflops'': 47188.614
Can anyone reproduce this behavior?
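For anyone who wants to try, here is a minimal sketch of a reproduction script that is closer to what the bench measures. It assumes FFTW.jl and BenchmarkTools.jl are installed, uses complex input to mirror the bench's "ocf" (out-of-place, complex, forward) problem, creates the plan with FFTW.ESTIMATE (matching -oestimate) outside the timed region, and writes into a preallocated output. The thread counts 1 and 8 are taken from my runs above; adjust them to your machine. Note that the two arrays together need about 500 MiB.

```julia
using FFTW, LinearAlgebra, BenchmarkTools

x = rand(ComplexF64, 128, 128, 1001);   # complex input, like the bench's "ocf" problem
y = similar(x);                         # preallocated output (out-of-place transform)

for nthreads in (1, 8)                  # assumption: thread counts from the runs above
    FFTW.set_num_threads(nthreads)      # must be set before the plan is created
    p = plan_fft(x; flags=FFTW.ESTIMATE)    # planning is excluded from the timing
    t = @belapsed mul!($y, $p, $x)          # minimum time of the transform alone
    println(nthreads, " thread(s): ", round(t * 1e3; digits=2), " ms")
end
```

If the times printed here are close to the bench results while plain fft(x) stays slow, the overhead would sit in the conversion, allocation, or planning rather than in the transform itself. If they are just as slow, that would point at the FFTW build that Julia uses.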
Note: I compiled the FFTW libraries from the source code of FFTW 3.3.10 on WSL (Ubuntu 20.04) with
./configure --host=x86_64-w64-mingw32 --enable-shared --disable-fortran --disable-mpi --disable-doc --enable-threads --with-combined-threads --enable-sse2 --enable-avx2 --enable-avx512 --with-our-malloc
make -j
./configure --host=x86_64-w64-mingw32 --enable-shared --disable-fortran --disable-mpi --disable-doc --enable-threads --with-combined-threads --enable-sse2 --enable-avx2 --enable-avx512 --with-our-malloc --enable-single
make -j
EDIT: The issue is not limited to Windows! Interestingly, on System 1 running Fedora 35 I observed the same superlinear scaling as on System 2.
So it seems like Windows is not the issue here but the CPU in use...
On Skylake CPUs the performance seems to be much better in the single-threaded scenario on both Windows and Linux. Still, it remains much worse than the FFTW benchmark would suggest!