Possible performance drop when threads span more than one socket

Hello, this was originally raised as an issue; I hope posting it here will bring more attention.

Some dummy code mimicking ODE right-hand sides:

# Julia 1.6.1
using FFTW, BenchmarkTools, LinearAlgebra, Printf, Polyester, Random

println("Julia num threads: $(Threads.nthreads()), Total Sys CPUs: $(Sys.CPU_THREADS)")
println("FFT provider: $(FFTW.get_provider()), BLAS: $(BLAS.vendor())")

function ode_1(du, u, p, t)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

function ode_2(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    ldiv!(v2, plan, v1)
end

function ode_3(du, u, p, t)
    _, _, _, K, w = p
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    mul!(w, K, vec(u))
end

begin
    N = 64
    n = (2N-1) * N
    Random.seed!(42)
    u = rand(2N-1, N)
    du = similar(u)
    v₁ = rand(ComplexF64, N, N)
    v₂ = rand(2N-1, N)
    K = rand(n, n)
    w = zeros(n)
    FFTW.set_num_threads(Threads.nthreads()) # use all CPUs
    plan = plan_rfft(du, 1; flags=FFTW.PATIENT)
    p = (v₁, v₂, plan, K, w)
    BLAS.set_num_threads(Threads.nthreads()) # use all CPUs
end

println("ODE-1: only element-wise (EW) ops")
@btime ode_1($du, $u, $p, 1.0)

println("ODE-2: FFT + EW + FFT")
@btime ode_2($du, $u, $p, 1.0)

println("ODE-3: EW + BLAS")
@btime ode_3($du, $u, $p, 1.0)

It scales almost perfectly on my local machine (2 × Intel(R) Xeon(R) Gold 6136, 2 × 12 cores, hyperthreading enabled), although the CPU usage is higher than it should be (almost doubled).

Results
~/codes » julia16 --check-bounds=no -O3 -t 6 ex2.jl
Julia num threads: 6, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  114.905 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  172.101 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  173.103 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 12 ex2.jl
Julia num threads: 12, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  56.885 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  106.648 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  106.777 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 24 ex2.jl
Julia num threads: 24, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  29.294 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  77.235 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  77.275 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 48 ex2.jl
Julia num threads: 48, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  28.303 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  76.601 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  77.470 μs (2 allocations: 160 bytes)

But there is a huge performance drop on a computer cluster (2 × Intel(R) Xeon(R) Gold 5220, 2 × 18 cores, hyperthreading disabled) once the threads span more than one socket:

Results
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  42.415 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  96.472 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  96.324 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  40.662 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  92.047 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  92.357 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  39.156 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  143.665 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  148.203 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  32.706 μs (0 allocations: 0 bytes) # still scales well
ODE-2: FFT + EW + FFT
  10.992 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
  10.987 ms (2 allocations: 160 bytes) # oops!

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  25.268 μs (0 allocations: 0 bytes) # still scales well
ODE-2: FFT + EW + FFT
  12.047 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
  13.059 ms (2 allocations: 160 bytes) # oops!

The reason I prefer Polyester over Threads is that this ODE function has to be called millions of times, and Polyester is allocation-friendly and performs better in that setting. But it seems to incur a large overhead when the ODE function also mixes in FFT/BLAS calls.
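For reference, this is what the Base.Threads counterpart of `ode_1` would look like (the name `ode_1_threads` is mine, just for comparison); in Julia 1.6, `Threads.@threads` spawns fresh tasks (and allocates) on every call, whereas `@batch` reuses Polyester's worker tasks:

```julia
using Polyester  # provides @batch

# Base.Threads version of the same kernel, shown only for comparison.
# Each call to this function creates new tasks and allocates, while the
# @batch version in ode_1 above runs allocation-free.
function ode_1_threads(du, u, p, t)
    Threads.@threads for i in eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end
```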

I am wondering whether someone could reproduce this, and I would appreciate any advice on how to make it scale. Thank you in advance!

You can try pinning threads to run on only one socket.
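For example (illustrative invocations, reusing your `ex2.jl` script and an 18-thread run; adjust the core range to your topology):

```shell
# Pin Julia's threads to consecutive cores (compact pinning):
JULIA_EXCLUSIVE=1 julia16 --check-bounds=no -O3 -t 18 ex2.jl

# Or restrict the whole process to the cores of one socket via numactl:
numactl --physcpubind=0-17 julia16 --check-bounds=no -O3 -t 18 ex2.jl
```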

Thanks for the suggestion. The results are:

set `JULIA_EXCLUSIVE=1`
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  47.296 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  303.403 μs (2 allocations: 160 bytes) # not as good as before
ODE-3: EW + BLAS
  303.455 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  44.722 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  300.458 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  300.650 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  42.656 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  298.792 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  298.684 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  32.909 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  285.511 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  285.448 μs (2 allocations: 160 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  27.535 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  276.389 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  276.192 μs (2 allocations: 160 bytes)
set `numactl --physcpubind=0-{N}`
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  47.114 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  19.486 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  18.747 ms (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  45.057 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  36.746 ms (11 allocations: 448 bytes)
ODE-3: EW + BLAS
  26.864 ms (11 allocations: 448 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  42.985 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  18.836 ms (4 allocations: 224 bytes)
ODE-3: EW + BLAS
  50.491 ms (15 allocations: 576 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  2.897 ms (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  18.438 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  21.103 ms (5 allocations: 256 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  25.671 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  12.090 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  12.056 ms (2 allocations: 160 bytes)

Neither gives a satisfying outcome. Although `JULIA_EXCLUSIVE=1` has less overhead, it degrades the MKL FFT/BLAS performance.

Commenting out the element-wise loop in the 2nd and 3rd functions, so there is no switching between Julia-threaded code and MKL:
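Concretely, the stripped-down variants look like this (the names `fft_only`/`blas_only` are mine; same `p` tuple as in the setup block above):

```julia
using FFTW, LinearAlgebra

# ode_2 with the @batch loop removed: each call is pure MKL FFT work.
function fft_only(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)   # forward real FFT along dim 1
    ldiv!(v2, plan, v1) # inverse transform
end

# ode_3 with the @batch loop removed: each call is a single BLAS gemv.
function blas_only(du, u, p, t)
    _, _, _, K, w = p
    mul!(w, K, vec(u))
end
```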

No switching between Julia and MKL (no pinning thread)
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  42.716 μs (0 allocations: 0 bytes)
Only FFT
  28.782 μs (2 allocations: 160 bytes)
Only BLAS
  29.253 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  39.942 μs (0 allocations: 0 bytes)
Only FFT
  29.704 μs (2 allocations: 160 bytes)
Only BLAS
  29.939 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  38.667 μs (0 allocations: 0 bytes)
Only FFT
  29.379 μs (2 allocations: 160 bytes)
Only BLAS
  29.735 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  32.423 μs (0 allocations: 0 bytes)
Only FFT
  28.557 μs (2 allocations: 160 bytes)
Only BLAS
  28.734 μs (2 allocations: 160 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  27.813 μs (0 allocations: 0 bytes)
Only FFT
  31.272 μs (2 allocations: 160 bytes)
Only BLAS
  30.160 μs (2 allocations: 160 bytes)

It looks like the MKL FFT/BLAS is using the full node. That's a bit odd, since I have never set any MKL-related environment variable; I only call `FFTW.set_num_threads(Threads.nthreads())` and `BLAS.set_num_threads(Threads.nthreads())`. Nevertheless, I was hoping the benchmark at 36 CPUs would beat the 18-CPU result (~96.472 μs) because of the speedup in the Julia part.

On my local machine, it doesn’t matter if I pin the threads.

I would really appreciate any further advice. Thank you so much!