Hello, this was originally raised in an issue; I hope posting here brings it more attention.
Some dummy code mimicking the right-hand side of an ODE:
# Julia 1.6.1
using FFTW, BenchmarkTools, LinearAlgebra, Printf, Polyester, Random
println("Julia num threads: $(Threads.nthreads()), Total Sys CPUs: $(Sys.CPU_THREADS)")
println("FFT provider: $(FFTW.get_provider()), BLAS: $(BLAS.vendor())")
function ode_1(du, u, p, t)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end
function ode_2(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    ldiv!(v2, plan, v1)
end
function ode_3(du, u, p, t)
    _, _, _, K, w = p
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    mul!(w, K, vec(u))
end
begin
    N = 64
    n = (2N-1) * N
    Random.seed!(42)
    u = rand(2N-1, N)
    du = similar(u)
    v₁ = rand(ComplexF64, N, N)
    v₂ = rand(2N-1, N)
    K = rand(n, n)
    w = zeros(n)
    FFTW.set_num_threads(Threads.nthreads()) # match the Julia thread count
    plan = plan_rfft(du, 1; flags=FFTW.PATIENT)
    p = (v₁, v₂, plan, K, w)
    BLAS.set_num_threads(Threads.nthreads()) # match the Julia thread count
end
println("ODE-1: only element-wise (EW) ops")
@btime ode_1($du, $u, $p, 1.0)
println("ODE-2: FFT + EW + FFT")
@btime ode_2($du, $u, $p, 1.0)
println("ODE-3: EW + BLAS")
@btime ode_3($du, $u, $p, 1.0)
It scales almost perfectly on my local machine (2 × Intel(R) Xeon(R) Gold 6136, 2 × 12 cores, hyperthreading enabled), although the observed CPU usage is higher than it should be (almost doubled; see the pinning sketch after these results).
Results
~/codes » julia16 --check-bounds=no -O3 -t 6 ex2.jl
Julia num threads: 6, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
114.905 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
172.101 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
173.103 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 12 ex2.jl
Julia num threads: 12, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
56.885 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
106.648 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
106.777 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 24 ex2.jl
Julia num threads: 24, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
29.294 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
77.235 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.275 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 48 ex2.jl
Julia num threads: 48, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
28.303 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
76.601 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.470 μs (2 allocations: 160 bytes)
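A side note on the doubled CPU usage: I suspect it is just FFTW/BLAS worker threads spinning on the hyperthread siblings. Here is a minimal sketch of how one might pin the Julia threads to distinct physical cores to check this, assuming ThreadPinning.jl works on this Julia version (I have not run this on the machines above):

using ThreadPinning
pinthreads(:cores)  # one Julia thread per physical core; HT siblings stay idle
threadinfo()        # print the resulting thread-to-core mapping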
But there is a huge performance drop on a computer cluster (2 × Intel(R) Xeon(R) Gold 5220, 2 × 18 cores, hyperthreading disabled) as soon as the threads span more than one socket (see the probing sketch after these results):
Results
18 CPUs
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
42.415 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
96.472 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
96.324 μs (2 allocations: 160 bytes)
19 CPUs
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
40.662 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
92.047 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
92.357 μs (2 allocations: 160 bytes)
20 CPUs
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
39.156 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
143.665 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
148.203 μs (2 allocations: 160 bytes)
27 CPUs
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
32.706 μs (0 allocations: 0 bytes) # still scales well
ODE-2: FFT + EW + FFT
10.992 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
10.987 ms (2 allocations: 160 bytes) # oops!
36 CPUs
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
25.268 μs (0 allocations: 0 bytes) # still scales well
ODE-2: FFT + EW + FFT
12.047 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
13.059 ms (2 allocations: 160 bytes) # oops!
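A sketch of how one could probe this, decoupling the library thread counts from the Polyester threads to test for oversubscription. The re-planning is there because FFTW fixes its thread count when the plan is created (FFTW.ESTIMATE just keeps the repeated planning cheap); this reuses the arrays from the setup block above:

for nlib in (1, 2, 4, Threads.nthreads())
    FFTW.set_num_threads(nlib)                      # takes effect at planning time
    BLAS.set_num_threads(nlib)
    plan_n = plan_rfft(du, 1; flags=FFTW.ESTIMATE)
    p_n = (v₁, v₂, plan_n, K, w)
    print("FFT/BLAS threads = $nlib: ")
    @btime ode_2($du, $u, $p_n, 1.0)
end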
The reason I prefer Polyester over Base Threads is that this ODE function has to be called millions of times, and @batch is allocation-free and performs noticeably better for the element-wise part. But it seems to incur a heavy overhead once the ODE function also mixes in FFT/BLAS calls.
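To illustrate the allocation point, here is a quick sketch of the Base-threads variant I compared against (ode_1_base is just my name for it, not part of the script above):

# Base-threads version of ode_1: Threads.@threads spawns task objects on
# every call, whereas @batch reuses Polyester's persistent worker threads.
function ode_1_base(du, u, p, t)
    Threads.@threads for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

@btime ode_1_base($du, $u, $p, 1.0)  # allocates on each call
@btime ode_1($du, $u, $p, 1.0)       # 0 allocations with @batch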
I am wondering if someone could reproduce this, and I would appreciate any advice on how to make it scale. Thank you in advance!