I’m trying to do a parallel FFT on an 8-core machine (4 CPUs, 2 hyperthreads/CPU), and I’m seeing a little less than 2x speedup going from 1 core to 2, and then none whatsoever between 2 and 8. Is this expected? I don’t know much about parallel FFT algorithms; does this indicate that memory is a bottleneck? Anyway, I suppose this is really a libfftw question, but I figured I’d ask here first.
x = rand(2048,2048)
using TimeIt
FFTW.set_num_threads(1)
@timeit fft(x)
FFTW.set_num_threads(2)
@timeit fft(x)
FFTW.set_num_threads(8)
@timeit fft(x)
1 loops, best of 3: 386.88 ms per loop   # 1 thread
1 loops, best of 3: 248.82 ms per loop   # 2 threads
1 loops, best of 3: 254.14 ms per loop   # 8 threads
A lot of what you are measuring is overhead. If you care about performance, you should (a) use a precomputed plan, (b) use complex inputs for a complex FFT or use rfft (as otherwise the input array is converted to complex first), and (c) consider using a preallocated output array or doing an in-place transform.
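For instance, here is a minimal sketch of (a)–(c) for the 2048×2048 example above, using the Julia 0.5/0.6-era API seen elsewhere in this thread (the MEASURE flag and the sizes are just illustrative choices):

FFTW.set_num_threads(8)
x = Array{Float64}(2048, 2048)
p = plan_rfft(x, flags = FFTW.MEASURE)    # (a) plan once; MEASURE may scribble on x while planning
rand!(x)                                  # fill with data after planning
f = p * x                                 # (b) real-input transform, no conversion to complex

out = Array{Complex128}(1025, 2048)       # rfft output size is (div(2048, 2) + 1) × 2048
A_mul_B!(out, p, x)                       # (c) preallocated output, reusing the same plan

z = rand(Complex128, 2048, 2048)
p_inplace = plan_fft!(z, flags = FFTW.MEASURE)
p_inplace * z                             # or an in-place complex transform

The plan only has to be created once per array size and element type, so the (expensive) MEASURE planning is paid a single time and every later transform just applies it.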
I have code whose bottleneck is the computation of a Fourier transform, and I’d like to take advantage of parallelization; however, I experience the same difficulties as @marius311.
Consider the following code:
function fast_fourier(y)
    f = similar(y)
    p = plan_fft(y, flags = FFTW.MEASURE)
    A_mul_B!(f, p, y)
    return f
end
y = cis.(linspace(0, 100, 1001))
And these are benchmarks I performed on my system:
julia> versioninfo()
Julia Version 0.6.0-dev.1378
Commit 1e0e793 (2016-12-03 13:48 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> using BenchmarkTools
julia> include("/path/to/code.jl");
julia> @benchmark fft(y)
BenchmarkTools.Trial:
memory estimate: 19.11 kb
allocs estimate: 57
--------------
minimum time: 56.305 μs (0.00% GC)
median time: 58.056 μs (0.00% GC)
mean time: 58.763 μs (0.32% GC)
maximum time: 4.445 ms (42.22% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark fast_fourier(y)
BenchmarkTools.Trial:
memory estimate: 35.27 kb
allocs estimate: 64
--------------
minimum time: 59.347 μs (0.00% GC)
median time: 61.447 μs (0.00% GC)
mean time: 65.695 μs (3.24% GC)
maximum time: 2.669 ms (54.14% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> FFTW.set_num_threads(2)
julia> @benchmark fft(y)
BenchmarkTools.Trial:
memory estimate: 19.11 kb
allocs estimate: 57
--------------
minimum time: 106.047 μs (0.00% GC)
median time: 109.010 μs (0.00% GC)
mean time: 110.942 μs (0.18% GC)
maximum time: 7.053 ms (28.86% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark fast_fourier(y)
BenchmarkTools.Trial:
memory estimate: 35.27 kb
allocs estimate: 64
--------------
minimum time: 97.113 μs (0.00% GC)
median time: 101.668 μs (0.00% GC)
mean time: 110.067 μs (2.09% GC)
maximum time: 4.322 ms (35.61% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
There is no significant improvement over using fft directly, and no improvement when enabling multi-threaded FFTW. What are we missing? How do we do it right?
Edit: I’m not sure whether one also has to enable multi-threading in Julia by setting the JULIA_NUM_THREADS environment variable (given that FFTW.set_num_threads was already there in Julia 0.4, I’d say the answer is no); in any case, enabling it doesn’t change the results much.
When you @benchmark fast_fourier(y), you’re including planning the FFT in the benchmark; you should do that step separately and only benchmark the A_mul_B! part.
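Something along these lines, for example (a sketch; the $ interpolation is just so BenchmarkTools doesn’t treat f, p and y as untyped globals):

using BenchmarkTools
y = cis.(linspace(0, 100, 1001))
f = similar(y)
p = plan_fft(y, flags = FFTW.MEASURE)   # planning happens once, outside the benchmark
@benchmark A_mul_B!($f, $p, $y)         # times only the application of the plan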
That said, it seems like you might be suffering from the same issue (which for me is still unresolved), given that your fft alone is twice as slow with two threads.
Well, I suppose it depends on what you want to test. But the normal usage of plan_fft is to create the plan once at the beginning of your code and then use it throughout your computations, so that only applying the plan sits in the speed-critical section, and hence that’s what makes sense to benchmark.
Probably this is what’s not clear to me: is plan_fft needed once for every different input array, or once for every different shape of input array?
Admittedly, I’ve never used the FFTW library directly before, but the docstring is not super clear here: using the same symbol A both for the input of plan_fft and for the array on which one wants to compute the FFT suggests that one should plan an FFT for each array.
Right, yes, you only need to plan once per shape/datatype combination (not per array contents). And I agree, the docs could be clearer if one of those A’s were different, or if this were at least said more explicitly.
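Concretely, something like this (a sketch; the second array is just an arbitrary example with the same shape and element type):

y1 = cis.(linspace(0, 100, 1001))
y2 = cis.(linspace(0, 50, 1001))
p  = plan_fft(y1, flags = FFTW.MEASURE)   # planned once, for Complex{Float64} vectors of length 1001
f1 = p * y1
f2 = p * y2                               # same plan, different contents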
I wrote a short FFT benchmark script (see gist here) and ran it on the machine I was asking about above, “machine B”, as well as a different one where things seem better, “machine A”. Here’s a plot to summarize what I get:
Your machine B may run into some of the following issues:
The CPU has hyperthreading (2 threads per core). When you do the FFT with 2 threads, how can you be sure that two physical cores are used, and not two threads on the same core? Ideally, the threads should first be distributed over the physical cores, and only if these are exhausted, the 2nd thread on each core should be used. But I’m afraid I don’t know whether FFTW takes care to pin the threads to physical cores. If not, the behavior depends on the OS and maybe on what else is running on the machine.
Modern CPUs have “turbo-boost”, i.e. when just one thread is running, the CPU frequency can be (much) higher than when multiple threads are running (due to the thermal budget). So by using more threads/cores, the CPU might clock down, i.e. the performance won’t scale.
An FFT of size N has a runtime of O(N log N), i.e. the work per data element is only O(log N), which is not large. If an optimized implementation (like FFTW) is used, such an algorithm’s performance may not be limited by CPU resources but by memory bandwidth. Note that your machine A has two CPU sockets, i.e. when you use two threads it can probably use twice the memory bandwidth, because each CPU has its own memory controller.
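To put rough numbers on that for the 2048×2048 case at the top of the thread (a back-of-envelope sketch: 5·N·log2(N) is the conventional flop estimate for an FFT, and the measured times also include allocation and the real-to-complex conversion, so treat these as ballpark figures):

N = 2048 * 2048               # number of points
flops = 5 * N * log2(N)       # ≈ 4.6e8 floating-point operations for the whole transform
bytes = 2 * 16 * N            # ≈ 134 MB for one read + one write of a Complex128 array;
                              # a 2D FFT makes several passes over data that doesn't fit in cache
flops / 0.387 / 1e9           # ≈ 1.2 GFLOP/s with 1 thread
flops / 0.249 / 1e9           # ≈ 1.9 GFLOP/s with 2 (or 8) threads

So the arithmetic itself is modest compared with the memory traffic, which fits the picture of the transform being limited by bandwidth rather than by core count on a single-socket machine.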
For a proper scaling benchmark, it is often recommended to turn off turbo-boost and hyperthreading, though this may require changing settings in the BIOS. Modern Linux kernels may offer tools to change that behavior for a running system, but I’m not up-to-date on that.