Parallel FFT not that much faster

fftw

#1

I’m trying to do parallel FFT on an 8 core (4 CPUs, 2 hyperthreads / CPU) machine, and I’m basically seeing a little less than 2x speedup going from 1 core to 2, and then none whatsoever between 2 and 8. Is this expected? I don’t know much about parallel FFT algorithms, is this indicating memory is a bottleneck? Anyway, I suppose this could be a libfftw question really, but figured I’d ask here first.

x = rand(2048,2048)

using TimeIt

FFTW.set_num_threads(1)
@timeit fft(x)

FFTW.set_num_threads(2)
@timeit fft(x)

FFTW.set_num_threads(8)
@timeit fft(x)


1 loops, best of 3: 386.88 ms per loop
1 loops, best of 3: 248.82 ms per loop
1 loops, best of 3: 254.14 ms per loop


#2

A lot of what you are measuring is overhead. If you care about performance, you should (a) use a precomputed plan, (b) use complex inputs for a complex FFT or use rfft (as otherwise the input array is converted to complex first), and © consider using a preallocated output array or doing an in-place transform.


julia> x = rand(Complex{Float64}, 2048,2048);
julia> p1 = plan_fft(x, flags=FFTW.MEASURE);
julia> FFTW.set_num_threads(2)
julia> p2 = plan_fft(x, flags=FFTW.MEASURE);
julia> y = similar(x);

julia> @time A_mul_B!(y, p1, x);
  0.103225 seconds (4 allocations: 160 bytes)

julia> @time A_mul_B!(y, p2, x);
  0.058490 seconds (4 allocations: 160 bytes)

#3

Thanks for the quick reply. Even with your code I’m more or less seeing the same thing. For me:

julia> @time A_mul_B!(y, p1, x);
  0.068015 seconds (4 allocations: 160 bytes)

julia> @time A_mul_B!(y, p2, x);
  0.051447 seconds (4 allocations: 160 bytes)

julia> @time A_mul_B!(y, p8, x);  #p8 is an 8 core plan
  0.047542 seconds (4 allocations: 160 bytes)

Any ideas how I might further track down what’s going on?


#4

I have a code whose bottleneck is the computation of a Fourier transform and I’d like to take advantage of parallelization, however I experience the same difficulties as @marius311.

Consider the following code:

function fast_fourier(y)
    f = similar(y)
    p = plan_fft(y, flags = FFTW.MEASURE)
    A_mul_B!(f, p, y)
    return f
end
y = cis.(linspace(0, 100, 1001))

And these are benchmarks I performed on my system:

julia> versioninfo()
Julia Version 0.6.0-dev.1378
Commit 1e0e793 (2016-12-03 13:48 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

julia> using BenchmarkTools

julia> include("/path/to/code.jl");

julia> @benchmark fft(y)
BenchmarkTools.Trial: 
  memory estimate:  19.11 kb
  allocs estimate:  57
  --------------
  minimum time:     56.305 μs (0.00% GC)
  median time:      58.056 μs (0.00% GC)
  mean time:        58.763 μs (0.32% GC)
  maximum time:     4.445 ms (42.22% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark fast_fourier(y)
BenchmarkTools.Trial: 
  memory estimate:  35.27 kb
  allocs estimate:  64
  --------------
  minimum time:     59.347 μs (0.00% GC)
  median time:      61.447 μs (0.00% GC)
  mean time:        65.695 μs (3.24% GC)
  maximum time:     2.669 ms (54.14% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> FFTW.set_num_threads(2)

julia> @benchmark fft(y)
BenchmarkTools.Trial: 
  memory estimate:  19.11 kb
  allocs estimate:  57
  --------------
  minimum time:     106.047 μs (0.00% GC)
  median time:      109.010 μs (0.00% GC)
  mean time:        110.942 μs (0.18% GC)
  maximum time:     7.053 ms (28.86% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark fast_fourier(y)
BenchmarkTools.Trial: 
  memory estimate:  35.27 kb
  allocs estimate:  64
  --------------
  minimum time:     97.113 μs (0.00% GC)
  median time:      101.668 μs (0.00% GC)
  mean time:        110.067 μs (2.09% GC)
  maximum time:     4.322 ms (35.61% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

There is no significant improvement over using directly fft, and there is no improvement when enabling multi-threaded FFTW. What are we missing? How to do it right?

Edit: I’m not sure whether one has to enable multi-threading in Julia by setting JULIA_NUM_THREADS environment variable (but given the fact that FFTW.set_num_threads was there also in Julia 0.4 I’d say the answer is no), however enabling it doesn’t change much the results.


#5

When you @benchmark fast_fourier(y), you’re including planning the FFT in the benchmark, you should do that step separately and only benchmark the A_mul_B! part.

That said it seems like you might be suffering from the same issue (which for me is still unresolved) given that your fft alone is twice as slow with two cores.


#6

Why? Looking to how fft is defined, it internally performs plan_fft(y) * y, so including the planning looks fair to me.

Exactly, that’s the point.


#7

Well I suppose it depends what you want to test. But a normal usage of plan_fft is to do it once at the beginning of your code, then use the plan throughout your computations, so that only applying the plan is in a speed-critical section and hence that’s what makes sense to benchmark.


#8

Probably this is not clear to me: plan_fft is needed once for every different input array or once for every different shape of input arrays?

Admittedly, I’ve never used directly FFTW library before, but the docstring is not super clear here: using the same symbol A for the input of plan_fft and the array on which one wants to compute the FFT suggests that one should plan a FFT for each array.


#9

Right, yes, you only need to plan once per every shape/datatype combination (not per contents). And I agree, the docs could be clearer if one of those A's were different, or at least this said more explicitly.


#10

I wrote a short FFT benchmark script (see gist here) and ran it on the machine I was asking about above, “machine B”, as well as a different one where things seem better, “machine A”. Here’s a plot to summarize what I get:

So any ideas what’s going on with machine B, where I never even get a factor 2x improvement? Even Machine A I get 4x, but not 8x as perhaps expected?

In more detail,

Machine A

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753* (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           E5472  @ 3.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, penryn)
$ lscpu
Architecture:          x86_64
CPU(s):                8
Thread(s) per core:    1
Core(s) per socket:    4
CPU socket(s):         2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              6
CPU MHz:               2992.498
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              6144K

Machine B

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:              3
CPU MHz:               812.398
CPU max MHz:           3500.0000
CPU min MHz:           800.0000
BogoMIPS:              5183.90
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

#11

Your machine B may run into some of the following issues:

  1. The CPU has hyperthreading (2 threads per core). When you do the FFT with 2 threads, how can you be sure that two physical cores are used, and not two threads on the same core? Ideally, the threads should first be distributed over the physical cores, and only if these are exhausted, the 2nd thread on each core should be used. But I’m afraid I don’t know whether FFTW takes care to pin the threads to physical cores. If not, the behavior depends on the OS and maybe on what else is running on the machine.

  2. Modern CPUs have “turbo-boost”, i.e. when just one thread is running, the CPU frequency can be (much) higher than when multiple threads are running (due to the thermal budget). So by using more threads/cores, the CPU might clock down, i.e. the performance won’t scale.

  3. FFT of size N has a runtime of O(N log N), i.e. the work per data element is actually not large. If an optimized implementation (like FFTW) is used, such an algorithm’s performance may actually not be limited by CPU resources, but by memory bandwidth. Note that your machine A has two CPU sockets, i.e. when you use two threads it can probably use twice the memory bandwidth because each CPU has its own memory controller.

For a proper scaling benchmark, it is often recommended to turn off turbo-boost and hyperthreading, though this may require changing settings in the BIOS. Modern Linux kernels may offer tools to change that behavior for a running system, but I’m not up-to-date on that.