Possible performance drop when using threads on more than one socket

Hello, this was originally raised in an issue; I hope posting here will bring it more attention.

Some dummy code for solving an ODE:

# Julia 1.6.1
using FFTW, BenchmarkTools, LinearAlgebra, Printf, Polyester, Random

println("Julia num threads: $(Threads.nthreads()), Total Sys CPUs: $(Sys.CPU_THREADS)")
println("FFT provider: $(FFTW.get_provider()), BLAS: $(BLAS.vendor())")

function ode_1(du, u, p, t)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

function ode_2(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    ldiv!(v2, plan, v1)
end

function ode_3(du, u, p, t)
    _, _, _, K, w = p
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    mul!(w, K, vec(u))
end

begin
    N = 64
    n = (2N-1) * N
    Random.seed!(42)
    u = rand(2N-1, N)
    du = similar(u)
    v₁ = rand(ComplexF64, N, N)
    v₂ = rand(2N-1, N)
    K = rand(n, n)
    w = zeros(n)
    FFTW.set_num_threads(Threads.nthreads()) # give FFTW all Julia threads
    plan = plan_rfft(du, 1; flags=FFTW.PATIENT)
    p = (v₁, v₂, plan, K, w)
    BLAS.set_num_threads(Threads.nthreads()) # give BLAS all Julia threads
end

println("ODE-1: only element-wise (EW) ops")
@btime ode_1($du, $u, $p, 1.0)

println("ODE-2: FFT + EW + FFT")
@btime ode_2($du, $u, $p, 1.0)

println("ODE-3: EW + BLAS")
@btime ode_3($du, $u, $p, 1.0)

It scales almost perfectly on my local machine (2 × Intel(R) Xeon(R) Gold 6136, 2 × 12 CPUs, hyperthreading enabled), although when I check the CPU usage it is higher than it should be (almost doubled).
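
To check how the different thread pools are configured (and whether the extra CPU usage comes from the library threads on top of the Julia threads), something like the following sketch can be used; BLAS.get_num_threads requires Julia ≥ 1.6, and the environment variables are only informative if they were set before launching Julia:

# Sketch: print the effective threading configuration to look for oversubscription.
using LinearAlgebra
println("Julia threads:   ", Threads.nthreads())
println("BLAS threads:    ", BLAS.get_num_threads())   # available since Julia 1.6
println("Sys.CPU_THREADS: ", Sys.CPU_THREADS)
for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS")
    println(var, " = ", get(ENV, var, "<unset>"))
end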

Results
~/codes » julia16 --check-bounds=no -O3 -t 6 ex2.jl
Julia num threads: 6, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  114.905 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  172.101 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  173.103 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 12 ex2.jl
Julia num threads: 12, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  56.885 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  106.648 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  106.777 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 24 ex2.jl
Julia num threads: 24, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  29.294 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  77.235 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  77.275 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 48 ex2.jl
Julia num threads: 48, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  28.303 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  76.601 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  77.470 μs (2 allocations: 160 bytes)

But there is a huge performance drop on a compute-cluster node (2 × Intel(R) Xeon(R) Gold 5220, 2 × 18 CPUs, hyperthreading disabled) as soon as the threads span more than one socket:

Results
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  42.415 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  96.472 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  96.324 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  40.662 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  92.047 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  92.357 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  39.156 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  143.665 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  148.203 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  32.706 μs (0 allocations: 0 bytes) # still scale well
ODE-2: FFT + EW + FFT
  10.992 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
  10.987 ms (2 allocations: 160 bytes) # oops!

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
  25.268 μs (0 allocations: 0 bytes) # still scale well
ODE-2: FFT + EW + FFT
  12.047 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
  13.059 ms (2 allocations: 160 bytes) # oops!

The reason I prefer Polyester over Base Threads is that this ODE function has to be called millions of times, and Polyester is allocation-free and does perform better here. But it seems to incur a large overhead when the ODE function also mixes in FFT/BLAS calls.
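
(For reference, the Base-threads version of the element-wise kernel that I compare against is just the same loop with Threads.@threads; shown here only as a sketch:)

function ode_1_base(du, u, p, t)
    # Same kernel as ode_1, but with Base threads instead of Polyester's @batch.
    Threads.@threads for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

@btime ode_1_base($du, $u, $p, 1.0)  # Threads.@threads spawns tasks, so it allocates on every call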

I am wondering if someone could reproduce this, and I would appreciate any advice on how to make it scale. Thank you in advance!

You can try pinning threads to run on only one socket.
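
For example, one way to do this from within Julia is ThreadPinning.jl (just a sketch, assuming the first socket holds CPU IDs 0–17); alternatively, use JULIA_EXCLUSIVE or numactl at launch:

# Sketch: pin the Julia threads to the first socket (assuming it owns CPU IDs 0-17).
using ThreadPinning
pinthreads(0:17)   # or pinthreads(:cores) for compact pinning to physical cores
threadinfo()       # shows which CPU each Julia thread ended up on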


Thanks for the suggestion. The results are:

set `JULIA_EXCLUSIVE=1`
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  47.296 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  303.403 μs (2 allocations: 160 bytes) # not as good as before
ODE-3: EW + BLAS
  303.455 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  44.722 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  300.458 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  300.650 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  42.656 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  298.792 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  298.684 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  32.909 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  285.511 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  285.448 μs (2 allocations: 160 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  27.535 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  276.389 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  276.192 μs (2 allocations: 160 bytes)

set `numactl --physcpubind=0-{N}`
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  47.114 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  19.486 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  18.747 ms (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  45.057 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  36.746 ms (11 allocations: 448 bytes)
ODE-3: EW + BLAS
  26.864 ms (11 allocations: 448 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  42.985 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  18.836 ms (4 allocations: 224 bytes)
ODE-3: EW + BLAS
  50.491 ms (15 allocations: 576 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  2.897 ms (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  18.438 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  21.103 ms (5 allocations: 256 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  25.671 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
  12.090 ms (2 allocations: 160 bytes)
ODE-3: EW + BLAS
  12.056 ms (2 allocations: 160 bytes)

Neither gives a satisfying outcome. Although the first one (JULIA_EXCLUSIVE=1) has less overhead, it degrades the MKL FFT/BLAS performance.

Commenting out the element-wise operation in the 2nd and 3rd functions, so there is no switching between the Julia threads and MKL:
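
That is, the benchmarked functions reduce to something like the following (names are only for illustration; everything else is as in the original p tuple):

# ode_2 / ode_3 with the @batch loop removed: FFT-only and BLAS-only versions.
function fft_only(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    ldiv!(v2, plan, v1)
end

function blas_only(du, u, p, t)
    _, _, _, K, w = p
    mul!(w, K, vec(u))
end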

No switching between Julia and MKL (no pinning thread)
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  42.716 μs (0 allocations: 0 bytes)
Only FFT
  28.782 μs (2 allocations: 160 bytes)
Only BLAS
  29.253 μs (2 allocations: 160 bytes)

19 CPU
Julia num threads: 19, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  39.942 μs (0 allocations: 0 bytes)
Only FFT
  29.704 μs (2 allocations: 160 bytes)
Only BLAS
  29.939 μs (2 allocations: 160 bytes)

20 CPU
Julia num threads: 20, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  38.667 μs (0 allocations: 0 bytes)
Only FFT
  29.379 μs (2 allocations: 160 bytes)
Only BLAS
  29.735 μs (2 allocations: 160 bytes)

27 CPU
Julia num threads: 27, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  32.423 μs (0 allocations: 0 bytes)
Only FFT
  28.557 μs (2 allocations: 160 bytes)
Only BLAS
  28.734 μs (2 allocations: 160 bytes)

36 CPU
Julia num threads: 36, Total Sys CPUs: 36
ODE-1: only element-wise (EW) ops
  27.813 μs (0 allocations: 0 bytes)
Only FFT
  31.272 μs (2 allocations: 160 bytes)
Only BLAS
  30.160 μs (2 allocations: 160 bytes)

It looks like the MKL FFT/BLAS is already using the full node. That is a bit odd, since I have never set any MKL-related environment variable; the only thing I do is call FFTW.set_num_threads(Threads.nthreads()) and BLAS.set_num_threads(Threads.nthreads()). Nevertheless, I was hoping the 36-CPU benchmark would beat the 18-CPU one (~96.472 μs) because of the speedup in the Julia part.
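
A possible experiment (a sketch only; MKL typically reads MKL_NUM_THREADS / OMP_NUM_THREADS at initialization, so those would have to be exported before launching Julia) is to cap the library thread pools explicitly instead of handing them all Julia threads:

# Sketch: limit FFT/BLAS to one socket's worth of cores while Polyester keeps all Julia threads.
nlib = min(Threads.nthreads(), 18)
FFTW.set_num_threads(nlib)
BLAS.set_num_threads(nlib)
# MKL_NUM_THREADS / OMP_NUM_THREADS (if relied upon) must be set in the shell before launch.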

On my local machine, it doesn’t matter whether I pin the threads.

I would really appreciate any further advice. Thank you so much!