# Possible performance drop when using threads from more than one socket

Hello, this was originally raised in an issue; I hope posting it here will bring more attention.

Some dummy code for solving an ODE:

``````julia
# Julia 1.6.1
using FFTW, BenchmarkTools, LinearAlgebra, Printf, Polyester, Random

println("Julia num threads: $(Threads.nthreads()), Total Sys CPUs: $(Sys.CPU_THREADS)")
println("FFT provider: $(FFTW.get_provider()), BLAS: $(BLAS.vendor())")

# Let the MKL FFT / BLAS pools use the same number of threads as Julia.
FFTW.set_num_threads(Threads.nthreads())
BLAS.set_num_threads(Threads.nthreads())

# ODE-1: only element-wise (EW) ops.
function ode_1(du, u, p, t)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

# ODE-2: forward FFT + EW + inverse FFT.
function ode_2(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    ldiv!(v2, plan, v1)
end

# ODE-3: EW + dense matrix-vector product (BLAS).
function ode_3(du, u, p, t)
    _, _, _, K, w = p
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    mul!(w, K, vec(u))
end

begin
    N = 64
    n = (2N - 1) * N
    Random.seed!(42)
    u = rand(2N - 1, N)
    du = similar(u)
    v₁ = rand(ComplexF64, N, N)
    v₂ = rand(2N - 1, N)
    K = rand(n, n)
    w = zeros(n)
    plan = plan_rfft(du, 1; flags=FFTW.PATIENT)
    p = (v₁, v₂, plan, K, w)
end

println("ODE-1: only element-wise (EW) ops")
@btime ode_1($du, $u, $p, 1.0)

println("ODE-2: FFT + EW + FFT")
@btime ode_2($du, $u, $p, 1.0)

println("ODE-3: EW + BLAS")
@btime ode_3($du, $u, $p, 1.0)
``````

It scales almost perfectly on my local machine (2 × Intel(R) Xeon(R) Gold 6136, 2 × 12 CPUs, hyperthreading enabled), although when I check the CPU usage it is higher than it should be (almost doubled).

Results
``````
~/codes » julia16 --check-bounds=no -O3 -t 6 ex2.jl                                   pshi@discover
Julia num threads: 6, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
114.905 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
172.101 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
173.103 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 12 ex2.jl                                  pshi@discover
Julia num threads: 12, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
56.885 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
106.648 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
106.777 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 24 ex2.jl                                  pshi@discover
Julia num threads: 24, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
29.294 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
77.235 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.275 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 48 ex2.jl                                  pshi@discover
Julia num threads: 48, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
28.303 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
76.601 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.470 μs (2 allocations: 160 bytes)
``````

But there is a huge performance drop on a compute cluster node (2 × Intel(R) Xeon(R) Gold 5220, 2 × 18 CPUs, hyperthreading disabled) as soon as the threads span more than one socket:

Results
``````
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
42.415 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
96.472 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
96.324 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
40.662 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
92.047 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
92.357 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
39.156 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
143.665 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
148.203 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
32.706 μs (0 allocations: 0 bytes) # still scale well
ODE-2: FFT + EW + FFT
10.992 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
10.987 ms (2 allocations: 160 bytes) # oops!

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
25.268 μs (0 allocations: 0 bytes) # still scale well
ODE-2: FFT + EW + FFT
12.047 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
13.059 ms (2 allocations: 160 bytes) # oops!
``````

The reason I prefer Polyester over `Threads.@threads` is that this ODE function has to be called millions of times, and Polyester is allocation-friendly and simply performs better. But it seems to incur a large overhead when the ODE function also mixes in FFT/BLAS calls.
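For comparison, a Base-threads version of the element-wise kernel would look roughly like this (a minimal sketch; `ode_1_threads` is just an illustrative name, not part of my script):

``````julia
# Sketch: the same kernel with Threads.@threads instead of Polyester's @batch.
# Threads.@threads spawns fresh tasks on every call (and therefore allocates),
# while @batch reuses its worker tasks, which matters when the ODE function
# is called millions of times.
function ode_1_threads(du, u, p, t)
    Threads.@threads for i in eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end
``````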

I am wondering whether someone could reproduce this, and I would appreciate any advice on how to make it scale. Thank you in advance!

You can try pinning threads to run on only one socket.


Thanks for the suggestion. The results are:

Setting `JULIA_EXCLUSIVE=1`:
``````
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
47.296 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
303.403 μs (2 allocations: 160 bytes) # not as good as before
ODE-3: EW + BLAS
303.455 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
44.722 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
300.458 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
300.650 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
42.656 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
298.792 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
298.684 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
32.909 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
285.511 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
285.448 μs (2 allocations: 160 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
27.535 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
276.389 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
276.192 μs (2 allocations: 160 bytes)
``````
Running under `numactl --physcpubind=0-{N}`:
``````
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
47.114 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
19.486 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
18.747 ms (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
45.057 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
36.746 ms (11 allocations: 448 bytes)
ODE-3: EW + BLAS
26.864 ms (11 allocations: 448 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
42.985 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
18.836 ms (4 allocations: 224 bytes)
ODE-3: EW + BLAS
50.491 ms (15 allocations: 576 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
2.897 ms (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
18.438 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
21.103 ms (5 allocations: 256 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
25.671 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
12.090 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
12.056 ms (2 allocations: 160 bytes)
``````

Neither gives a satisfying outcome. Although the first one has less overhead, it reduces the MKL FFT/BLAS performance.

Commenting out the element-wise operation in the 2nd and 3rd functions, so that there is no switching between Julia threads and MKL:
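Concretely, the stripped-down variants look roughly like this (a sketch; the function names are only illustrative, the bodies are `ode_2`/`ode_3` with the `@batch` loop removed):

``````julia
# Sketch: ode_2 / ode_3 with the element-wise loop commented out,
# so each call is a pure MKL FFT or BLAS operation.
function ode_2_fft_only(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    ldiv!(v2, plan, v1)
end

function ode_3_blas_only(du, u, p, t)
    _, _, _, K, w = p
    mul!(w, K, vec(u))
end
``````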

No switching between Julia and MKL (no thread pinning)
``````
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
42.716 μs (0 allocations: 0 bytes)
Only FFT
28.782 μs (2 allocations: 160 bytes)
Only BLAS
29.253 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
39.942 μs (0 allocations: 0 bytes)
Only FFT
29.704 μs (2 allocations: 160 bytes)
Only BLAS
29.939 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
38.667 μs (0 allocations: 0 bytes)
Only FFT
29.379 μs (2 allocations: 160 bytes)
Only BLAS
29.735 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
32.423 μs (0 allocations: 0 bytes)
Only FFT
28.557 μs (2 allocations: 160 bytes)
Only BLAS
28.734 μs (2 allocations: 160 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
27.813 μs (0 allocations: 0 bytes)
Only FFT
31.272 μs (2 allocations: 160 bytes)
Only BLAS
30.160 μs (2 allocations: 160 bytes)
``````

It looks like the MKL FFT/BLAS is using the full node. That's a bit odd, since I have never set any MKL-related environment variable; the only thread-count calls are `FFTW.set_num_threads(Threads.nthreads())` and `BLAS.set_num_threads(Threads.nthreads())`. Nevertheless, I was hoping the 36-CPU benchmark would beat the 18-CPU one (~96.472 μs) because of the speed-up in the Julia part.
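If that is what is happening, one thing I could try next is capping the MKL pools to a single socket while keeping all 36 Julia threads for the `@batch` loop. A minimal sketch of that idea (untested; 18 is just the number of cores per socket on this node):

``````julia
# Sketch (untested idea): keep all Julia threads for the element-wise loop,
# but limit the MKL FFT/BLAS thread pools to one socket.
FFTW.set_num_threads(18)   # cores per socket on this node
BLAS.set_num_threads(18)
``````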

On my local machine, it doesn’t matter if I pin the threads.

I really appreciate it if you have more advice! Thank you so much!