FFTW Performance

I noticed surprisingly long FFT runtimes with the FFTW backend, especially on Windows. If I switch to the MKL backend, the execution time is back to normal (on both AMD and Intel systems).
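For reference, the backend is switched with FFTW.set_provider!, which stores a preference and only takes effect after restarting Julia:

julia> using FFTW

julia> FFTW.set_provider!("mkl")   # or "fftw"; restart Julia afterwards for the change to apply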

To figure out whether FFTW itself is the culprit, I compiled the libfftw3-3 and libfftw3f-3 libraries from scratch using MinGW and ran the benchmark that ships with the FFTW package. The results are listed below, right after the Julia results.

A comment on the MWE: precompilation is not the issue; using BenchmarkTools with @btime does not yield significantly better timings.
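For illustration, the kind of check I mean (a sketch; same array as below, no timings claimed here):

julia> using FFTW, BenchmarkTools

julia> x = rand(128,128,1001);

julia> @btime fft($x);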

System 1: Windows 11, Ryzen 7 Pro 4750U: 8 Cores @1.7 GHz base, 32 GiB RAM DDR4 3200 (dual channel)

With FFTW.set_provider!("fftw"):

julia> using FFTW

julia> x = rand(128,128,1001);

julia> fft(x); @time fft(x);
  5.863041 seconds (21.23 k allocations: 501.661 MiB, 1.88% gc time, 0.26% compilation time)

julia> FFTW.set_num_threads(8)

julia> fft(x); @time fft(x);
  0.960498 seconds (38 allocations: 500.502 MiB, 2.41% gc time)

And with FFTW.set_provider!("mkl"):

julia> fft(x); @time fft(x);
  0.524924 seconds (38 allocations: 500.502 MiB, 3.46% gc time)

julia> FFTW.set_num_threads(8)

julia> fft(x); @time fft(x);
  0.220142 seconds (38 allocations: 500.502 MiB, 10.94% gc time)

For reference: The FFTW bench yields

.\bench.exe -r 10 -onthreads=1 -oestimate ocf128x128x128
Problem: ocf128x128x128, setup: 64.60 us, time: 144.60 ms, ``mflops'': 1522.8724

.\bench.exe -r 10 -onthreads=8 -oestimate ocf128x128x128
Problem: ocf128x128x128, setup: 196.50 us, time: 24.01 ms, ``mflops'': 9172.4793

Additionally, I have observed even stranger behavior on another system (System 2: Windows Server 2019 Datacenter, Xeon W-3223: 8 Cores (with AVX512) @3.5 GHz base, 128 GiB RAM DDR4 2667, four channels in use).

With FFTW.set_provider!("fftw"):

julia> using FFTW

julia> x = rand(128,128,1001);

julia> fft(x); @time fft(x);
  7.730859 seconds (21.23 k allocations: 501.661 MiB, 1.77% gc time, 0.21% compilation time)

julia> FFTW.set_num_threads(8);

julia> fft(x); @time fft(x);
  0.452253 seconds (38 allocations: 500.502 MiB, 12.92% gc time)

How can using eight threads instead of one yield a runtime decrease by a factor of ~17? There must be something odd going on…

And with FFTW.set_provider!("mkl"):

julia> fft(x); @time fft(x);
  0.608177 seconds (38 allocations: 500.502 MiB, 9.33% gc time)

julia> FFTW.set_num_threads(8);

julia> fft(x); @time fft(x);
  0.248317 seconds (38 allocations: 500.502 MiB, 28.68% gc time)

For reference: The FFTW bench yields

.\bench.exe -r 10 -onthreads=1 -oestimate ocf128x128x1001
Problem: ocf128x128x1001, setup: 88.90 us, time: 190.21 ms, ``mflops'': 10332.496

.\bench.exe -r 10 -onthreads=8 -oestimate ocf128x128x1001
Problem: ocf128x128x1001, setup: 327.10 us, time: 41.65 ms, ``mflops'': 47188.614

Can anyone reproduce this behavior?

Note: I compiled the FFTW libraries from the source code of FFTW 3.3.10 on WSL (Ubuntu 20.04) with

./configure --host=x86_64-w64-mingw32 --enable-shared --disable-fortran --disable-mpi --disable-doc --enable-threads --with-combined-threads --enable-sse2 --enable-avx2 --enable-avx512 --with-our-malloc
make -j
./configure --host=x86_64-w64-mingw32 --enable-shared --disable-fortran --disable-mpi --disable-doc --enable-threads --with-combined-threads --enable-sse2 --enable-avx2 --enable-avx512 --with-our-malloc  --enable-single
make -j

EDIT: The issue is not limited to Windows! Interestingly, I have observed the same superlinear scaling on System 1 running Fedora 35 as on System 2.
So it seems that Windows is not the issue here, but rather the CPU in use…
On Skylake CPUs the single-threaded performance seems to be much better on both Windows and Linux. Still, it remains much worse than the FFTW benchmark would suggest!

Have you tried pre-creating the plan object? That is always the fastest way to use FFTW.

For example:

using FFTW, LinearAlgebra, BenchmarkTools
x = rand(ComplexF64, 128,128,1001);
y = similar(x);
p = plan_fft(x, flags=FFTW.PATIENT)
@btime mul!($y, $p, $x);

Note that you can use a pre-allocated output (mul!) only for complex inputs: fft is complex, so if you give it real inputs it will first convert the input to a complex array under the hood, making a copy.

If you want to take full advantage of real inputs, you can use plan_rfft, which saves a factor of ≈ 2 (and only computes ≈ half of the outputs, since the other half are redundant complex conjugates).
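A minimal sketch of the real-input variant, assuming the same 128x128x1001 shape (the output keeps only the 128÷2+1 = 65 non-redundant entries along the first dimension):

using FFTW, LinearAlgebra, BenchmarkTools
xr = rand(128,128,1001);                          # real input
yr = Array{ComplexF64}(undef, 65, 128, 1001);     # non-redundant half of the spectrum
pr = plan_rfft(xr, flags=FFTW.PATIENT)
@btime mul!($yr, $pr, $xr);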

3 Likes

Thanks a lot for the suggestions; they have already helped me on many occasions! :slight_smile:
However, the issue I have observed seems to have another cause. Only once the runtime is down to "acceptable" levels does preallocating or using rfft make a significant difference in runtime.

Using the flags FFTW.PATIENT or FFTW.MEASURE does help a lot; however, the same does not hold for the FFTW benchmark, where these flags barely reduce the runtime on my machines. This makes me wonder what is going on when Julia calls the non-MKL FFTW libraries.

The FFTW benchmark program bench defaults to MEASURE. Also, the benchmark program always precomputes the plan and preallocates the output.
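For a like-for-like comparison in Julia, a sketch of roughly what bench measures by default (pre-planned with MEASURE, pre-allocated output), assuming the same array size:

using FFTW, LinearAlgebra, BenchmarkTools
x = rand(ComplexF64, 128, 128, 1001);
y = similar(x);
p = plan_fft(x, flags=FFTW.MEASURE)
@btime mul!($y, $p, $x);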

1 Like

I noticed that the FFTW bench defaults to MEASURE (hence the long planning duration); that's why I used the flag -oestimate to ensure that the comparison is fair. For the huge FFTs considered here, the planning and allocation stages should take much less time than the execution itself. The fact that a pre-planned in-place FFT does not really improve the runtime supports this assumption (IMHO).

As a side note: calling FFTW via pyFFTW from Python leads to significantly shorter runtimes than with FFTW.jl.

I have used the following code snippets for the runtime comparison:

Julia

using BenchmarkTools
using FFTW

function fft_test(x,n,flags)
    FFTW.set_num_threads(n)
    p = plan_fft!(x;flags)
    @btime $p*$x
end

function main()
    x0 = rand(ComplexF64,128,128,1001)
    x = copy(x0)

    fft_test(x,1,FFTW.ESTIMATE)
    fft_test(x,8,FFTW.ESTIMATE)
    fft_test(x,1,FFTW.MEASURE)
    fft_test(x,8,FFTW.MEASURE)

    nothing
end

Results in

julia> main()
5.578 s (0 allocations: 0 bytes)
984.713 ms (0 allocations: 0 bytes)
267.310 ms (0 allocations: 0 bytes)
68.117 ms (0 allocations: 0 bytes)

Python

import pyfftw
from pyfftw.interfaces.numpy_fft import *
import numpy as np
from time import time

def fft_test(x0,n,flags):
    x = x0.copy()    
    p = pyfftw.builders.fftn(x,overwrite_input=True,threads=n,planner_effort=flags)
    t = time()
    p()
    print("--- %s seconds ---" % (time() - t))


xr,xi = np.random.rand(2,128,128,1001)
x0 = xr + 1j * xi

fft_test(x0,1,'FFTW_ESTIMATE')
fft_test(x0,8,'FFTW_ESTIMATE')
fft_test(x0,1,'FFTW_MEASURE')
fft_test(x0,8,'FFTW_MEASURE')

Results in

--- 0.248687744140625 seconds ---
--- 0.09710454940795898 seconds ---
--- 0.23300909996032715 seconds ---
--- 0.09301376342773438 seconds ---

If you are using Julia multi-threading, then FFTW.jl uses Julia threads and Julia's scheduler instead of FFTW's own threading; maybe that is the difference?

One interesting question is why ESTIMATE vs. MEASURE makes a large difference for the Julia version but not for the Python version. Is the Python version really using the flag?
It would be best to look at a C implementation, which would be the ground truth.

1 Like

Thanks for the suggestion; however, I think I have been able to verify that the choice of scheduler is not the culprit. I dev'ed FFTW.jl and commented out the fftw_init_threads() call in line 32 of FFTW.jl. This way, the threading in the FFTs should be organized by the library itself, shouldn't it? Nevertheless, the issue persisted and the performance did not improve at all. The single-threaded FFT of the example array x used above still takes about six seconds. I even made sure to use the modified FFTW package by running it in a clean Julia environment.

Good point, I also wrote a C example to verify that using ESTIMATE is not the reason for the bad performance:

#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#include<complex.h>
#include "fftw3.h"

int main(void) {
    int N0=128;
    int N1=128;
    int N2=1001;
    fftw_complex *in, *out;
    fftw_plan my_plan;
    printf("Allocating memory...\n");
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*N0*N1*N2);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*N0*N1*N2);

    for (int i = 0; i<N0*N1*N2; i++){
        in[i] = (fftw_complex) (rand()+rand()*I);
    }

    printf("Creating plan...\n");
    fftw_init_threads();               /* must be called before fftw_plan_with_nthreads */
    fftw_plan_with_nthreads(8);
    my_plan = fftw_plan_dft_3d(N0, N1, N2, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    printf("Executing plan...\n");
    clock_t begin = clock();
    fftw_execute(my_plan);
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Execution took %f s\n", time_spent);
    printf("Cleaning up...\n");
    fftw_destroy_plan(my_plan);
    fftw_free(in);
    fftw_free(out);
    printf("Done!\n");
    return 0;
}

I built this executable on Windows with GCC 11, linking to the FFTW libraries provided by Julia:

gcc .\test_fftw.c -L...\.julia\artifacts\b7dd1809d0626eac3bf6f97ba8ccfbb6cc63c509\lib -lfftw3 -lm -o test_fftw.exe

The actual execution takes about 250 ms using a single thread and about 100 ms using eight threads, much less than when the same task is done from Julia.

No, you still need the underlying fftw_init_threads ccall to use threads. If you want FFTW to always use its own threads, you can instead comment out the lines that customize the threading backend. Or you can simply start Julia with JULIA_NUM_THREADS=1.
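For example (a sketch; Linux/WSL shell syntax shown, adjust the environment variable accordingly on Windows):

JULIA_NUM_THREADS=1 julia

julia> using FFTW

julia> FFTW.set_num_threads(8)   # with a single Julia thread, these should be FFTW's own threads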

Could it be an alignment issue? If you use fftw_malloc(n) to allocate memory, then on some systems it is 32-byte aligned to improve SIMD performance, but Julia's arrays are only 16-byte aligned by default, IIRC (and maybe not even that on Windows?).

You can try allocating an aligned array with:

using FFTW
fftw_malloc(n) = ccall((:fftw_malloc,FFTW.libfftw3[]), Ptr{Cvoid}, (Int,), n)
fftw_array(T, dims...) = unsafe_wrap(Array, Ptr{T}(fftw_malloc(prod(dims) * sizeof(T))), dims)
x = fftw_array(ComplexF64, 128,128,1024)

(Note that this leaks memory unless you call fftw_free on pointer(x) when you are done, but it should be fine for benchmarking)

PS. Note also that 1001 is a pretty bad size for FFTs because it has a large prime factor. If you care about performance, choose only sizes with small prime factors if possible.
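For example, Base's nextprod gives the next length whose prime factors all come from a given set:

julia> nextprod([2, 3, 5, 7], 1001)   # smallest 7-smooth length ≥ 1001
1008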

2 Likes

Oh, I think I see another problem. Julia's arrays are column-major. So the analogue of a 128x128x1001 array in C or in NumPy is a 1001x128x128 array in Julia. This could make a big difference for SIMD utilization because 1001 is odd.
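A quick sketch one could use to test this layout hypothesis (same sizes as above; no timings claimed):

using FFTW, BenchmarkTools
x  = rand(ComplexF64, 128, 128, 1001);
xp = permutedims(x, (3, 1, 2));   # 1001×128×128, the column-major analogue of the C/NumPy layout
@btime fft($x);
@btime fft($xp);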

4 Likes

That's the reason! I did not think of the memory layout of the array. Anyway, thanks a lot for helping me figure out the culprit. By the way: under these suboptimal conditions, using far more threads than CPU cores improves the performance quite a lot on my system; using 128 instead of 8 threads reduces the runtime by a factor of three. Unfortunately, this gain is only possible with the FFTW-internal threading: starting Julia with 16 threads and running a 16-thread fft segfaults.

Nevertheless, I still wonder why using MKL avoids this problem.

1 Like

I am interested in the documentation regarding plan objects. Where do I find it?

In case you are referring to the flags usable with the plan objects, you will find additional information here: FFTW Docs - 4.3.2 Planner Flags
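As a minimal example, the flags (and an optional planning time limit) are passed as keyword arguments when creating a plan for some array x:

julia> p = plan_fft(x, flags=FFTW.MEASURE, timelimit=10);   # spend at most ~10 s on planning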

Hi! I'm a little perplexed: I have performed some analogous tests in MATLAB (which should be column-major as well)*, and have actually seen very little difference between the 128x128x1001 and the 1001x128x128 cases:

>> x = rand(128,128,1001);
>> f = @() fft(x);
>> timeit(f)              

ans =

    0.0659

>> y = rand(1001,128,128);
>> g = @() fft(y); 
>> timeit(g)              

ans =

    0.0839

I tried several times, with different realizations of x and y, and the times always stay within a 64–71 ms range for the 128x128x1001 case and an 82–91 ms range for the other.

So apparently there is indeed a difference between transforming 128x128 vectors of length 1001 and 128x1001 vectors of length 128, but:

  1. it is a slight difference overall, nothing comparable to the problems discussed above… are we sure the column- vs. row-major mismatch can account for the problem? It seems quite doubtful to me at this point.

  2. the actual times I get match quite suspiciously the "optimal-option" results reported for Julia and Python (which would differ in being column- and row-major, respectively), i.e. the 68.117 ms and 93.013 ms written above by @jonas-kr (I am running on a quad-core i5 machine, and I believe MATLAB by default requests all cores, with hyperthreading, so I find the correspondence with the 8-thread tests reasonable)

So… does MATLAB automatically choose the optimal setup and match the ≈ 68 ms result? Why can't FFTW.jl do the same?

*It certainly feeds columns to FFTW, as explained here: Fast Fourier transform - MATLAB fft

You are running with real inputs, not complex ones as in the previous tests.

Yes, sorry, my bad: I looked at the very first listing, just saw rand(128,128,1001), and did not realize that we had meanwhile switched to rand(ComplexF64,128,128,1001).

Anyway, it changes almost nothing (and that's because 1001 has large prime factors, I believe):

>> x = rand(128,128,1001) + 1i*rand(128,128,1001);
>> f = @() fft(x);
>> timeit(f)

ans =

    0.0725

>> timeit(f)

ans =

    0.0695

>> timeit(f)

ans =

    0.0698

>> timeit(f)

ans =

    0.0706

>> y = rand(1001,128,128) + 1i*rand(1001,128,128);
>> g = @() fft(y);
>> timeit(g) 

ans =

    0.0952

>> timeit(g) 

ans =

    0.0833

>> timeit(g) 

ans =

    0.0817

>> timeit(g) 

ans =

    0.0812

:slight_smile:

MATLAB uses a completely different storage format: it stores the real and imaginary parts in separate arrays. If I remember correctly, it used to copy these into a complex-array buffer before calling FFTW. I have no idea what it does these days, but in general it is more relaxed about allocating large temporary buffers, whereas in Julia we tend to prefer having the user do this explicitly.

2 Likes

Hi, I noticed the same behavior during my tests and concluded that the lower execution time in MATLAB is due to its use of MKL for the FFT. If we use MKL in Julia, the same fft of a 128x128x1001 array takes approximately as long as it does in MATLAB. I assume that the FFTW-compatible interface in MKL features improved heuristics for the planning stage, which is why the runtime of the MKL fft with the default ESTIMATE flag is not far off from the runtime of the FFTW fft with the MEASURE flag.

Please note: right now I only have a PC with an i7-6700K at hand, where the suboptimal default plan does not perform as poorly as on System 1 in the first post. Nevertheless, the difference between FFTW and MKL is clear.

  1. Julia with FFTW: More thorough planning (taking quite some time) reduces the runtime significantly
julia> using FFTW, BenchmarkTools

julia> x = rand(ComplexF64,128,128,1001);

julia> FFTW.set_num_threads(4)

julia> p = plan_fft!(x);

julia> @btime $p*$x;
  474.560 ms (82 allocations: 6.31 KiB)

julia> p = plan_fft!(x,flags=FFTW.MEASURE);

julia> @btime $p*$x;
  71.301 ms (82 allocations: 6.31 KiB)

Now we run FFTW.set_provider!("mkl") and restart Julia to switch to MKL.

  2. Julia with MKL: More thorough planning does not improve runtime
julia> using FFTW, BenchmarkTools

julia> x = rand(ComplexF64,128,128,1001);

julia> FFTW.set_num_threads(4)

julia> p = plan_fft!(x);

julia> @btime $p*$x;
  72.927 ms (0 allocations: 0 bytes)

julia> p = plan_fft!(x,flags=FFTW.MEASURE);

julia> @btime $p*$x;
  71.488 ms (0 allocations: 0 bytes)

Interestingly, the planning stage with MEASURE (same for PATIENT/EXHAUSTIVE) finishes almost immediately using MKL.
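One way to see this is to time the planning step itself in a fresh session (a sketch):

julia> @time plan_fft!(x, flags=FFTW.MEASURE);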

  3. MATLAB R2020b
>> x = rand(128,128,1001) + 1j*rand(128,128,1001);
>> f = @() fft(x);
>> timeit(f)

ans =

    0.0404

Hence, MATLAB still has the edge but the difference is much smaller.

Oh, that's very interesting; I didn't know MATLAB bundles MKL, but some googling confirmed that it indeed does.

Anyway, one last doubt/curiosity remains for me: as I read here, the default planning mode in FFTW should be "measure", so why does the Julia wrapper seem to default to "estimate"? Maybe that's the real point: couldn't it be that the MKL version enforces "measure" as the default?