Hi everyone!
I’m investigating the parallel performance of my code and discovered some behavior I’m not sure I completely understand. I’m using Distributed with worker processes on a single machine.
In an MWE, I’m solving an ODE on a single core (by setting OMP_NUM_THREADS=1 in the shell before starting Julia, and calling BLAS.set_num_threads(1) on every process). K is a square sparse matrix loaded from K.jl, with dimensions of about 18000 × 18000 and roughly 0.04% nonzero entries; a hypothetical stand-in for K.jl is sketched right after the example. Here’s the full example:
```julia
using Distributed
using BenchmarkTools

NProcs = 20;
addprocs(NProcs)

# load libraries
@everywhere begin
    using DifferentialEquations
    using SparseArrays
    using LinearAlgebra
    BLAS.set_num_threads(1)
end

# variables
@everywhere begin
    t_0 = 0.0
    t_end = 1.0
    dt = 1e-2
    include("K.jl")
    dim = size(K, 2)
    u0 = zeros(dim)
    alg = ImplicitEuler()
    args = (; adaptive=false, dt=dt)  # fixed time step
end

# functions
@everywhere begin
    function deriv!(du, u, p, t)  # du = -K*u .+ 1
        mul!(du, K, -u)
        du[:] .+= ones(dim)
    end
    function jac!(J, u, p, t)     # J = -K
        mul!(J, K, -I)
    end
    function final(sol)
        return sol.t[end], sol.u[end]
    end
    function run()
        func = ODEFunction(deriv!, jac=jac!, jac_prototype=copy(K))
        prob = ODEProblem(func, u0, (t_0, t_end))
        sol = solve(prob, alg; args...)
        return final(sol)
    end
end

# run on control process
@btime run() seconds = 30
# 4.8 s, 2.3 GiB

# run on a single worker process
@btime remotecall_wait(run, 2) seconds = 30
# 5.3 s

# run on all processes simultaneously
@btime (@everywhere run()) seconds = 30
# 7.2 s
```
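Since K.jl itself isn’t part of this post, here’s a hypothetical stand-in with the same shape and density, in case anyone wants to run the example; the values, sparsity pattern, and diagonal shift are made up and not my real matrix:

```julia
# K.jl -- hypothetical stand-in; the real file is not shown in this post.
using SparseArrays, LinearAlgebra
# ~18000 x 18000 with ~0.04% nonzeros, matching the description above.
K = sprand(18_000, 18_000, 4e-4) + 2.0I  # diagonal shift, assumed for stability
```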
Naively, I would expect all three calls to take roughly the same time, since run only uses variables defined @everywhere, so no data should have to be transferred between the control and worker processes.
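To separate pure messaging and scheduling overhead from the solve itself, I suppose one could time a remote call that does no work at all; a sketch (noop is just a throwaway name):

```julia
# Sketch: the bare round-trip cost of a remote call with no work attached.
# If this is negligible, message passing alone can't explain the extra runtime.
@everywhere noop() = nothing
@btime remotecall_wait(noop, 2) seconds = 30
```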
Where is the increased runtime coming from? A single call to run allocates about 2.3 GiB, and the machine has more than enough RAM to accommodate all workers. Are the worker processes simply bottlenecked by the throughput of writing to RAM? And where does the extra 0.5 s come from when running on a single worker process?
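(While writing this up I also noticed that ones(dim) constructs a fresh vector on every call to deriv!, and -u allocates a temporary as well. If the RAM-throughput hypothesis is right, an allocation-free right-hand side might be worth testing; a sketch using the five-argument mul!:

```julia
# Sketch: a right-hand side that avoids per-call allocations.
@everywhere function deriv!(du, u, p, t)
    mul!(du, K, u, -1, 0)  # du = -K*u in place, no temporary from -u
    du .+= 1               # constant forcing without allocating ones(dim)
end
```

Presumably the bulk of the 2.3 GiB still comes from the implicit solver’s linear algebra, though.)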
Looking forward to your input!