Distributed Performance Degradation

Hi everyone!

While investigating the parallel performance of my code, I found some behavior I don’t fully understand. I’m using Distributed with worker processes on a single machine.

In this MWE, each process solves an ODE on a single core (I set OMP_NUM_THREADS=1 in the shell before starting Julia and call BLAS.set_num_threads(1)). K is a square sparse matrix loaded from K.jl, roughly 18000 × 18000 with about 0.04% nonzero entries. Here’s the full example:

using Distributed
using BenchmarkTools

NProcs = 20;
addprocs(NProcs)

# load libraries
@everywhere begin
    using DifferentialEquations
    using SparseArrays
    using LinearAlgebra
    BLAS.set_num_threads(1)
end

# variables
@everywhere begin
    t_0 = 0.0
    t_end = 1.0
    dt = 1e-2
    include("K.jl")
    dim = size(K, 2)
    u0 = zeros(dim)
    alg = ImplicitEuler()
    args = (; adaptive=false, dt=dt)
end

# functions
@everywhere begin
    function deriv!(du, u, p, t)
        mul!(du, K, -u)
        du[:] .+= ones(dim)
    end

    function jac!(J, u, p, t)
        mul!(J, K, -I)
    end

    function final(sol)
        return sol.t[end], sol.u[end]
    end

    function run()
        func = ODEFunction(deriv!, jac=jac!, jac_prototype=copy(K))
        prob = ODEProblem(func, u0, (t_0, t_end))
        sol = solve(prob, alg; args...)
        return final(sol)
    end
end


# run on control process
@btime run() seconds = 30
# 4.8 s, 2.3 GiB

# run on other process
@btime remotecall_wait(run, 2) seconds = 30
# 5.3 s

# run on all processes
@btime (@everywhere run()) seconds = 30
# 7.2 s

Naively, I would expect all three calls to take about the same time: run only uses variables defined with @everywhere, so no data should be transferred between the control and worker processes.

Where is the increased runtime coming from? One call to run allocates about 2.3 GiB, and the machine has more than enough RAM to accommodate all workers. Are the worker processes simply bottlenecked by memory bandwidth? And where does the extra 0.5 s come from when running on a single worker process?
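
One way to separate compute time from messaging overhead might be to time the call on the process that executes it, and to compare solo runs against simultaneous runs. The following is a minimal sketch, not the original benchmark: workload and timed_workload are hypothetical stand-ins for run, and only two workers are assumed to keep it cheap. Note that @everywhere also executes on the control process, so the third benchmark above runs 21 solves at once, not 20.

```julia
using Distributed

# Assumption: two workers are enough for the sketch; the script above uses 20.
nprocs() == 1 && addprocs(2)

@everywhere begin
    # Hypothetical stand-in for run(): an allocation-heavy loop.
    function workload(n = 200)
        acc = 0.0
        for _ in 1:n
            acc += sum(rand(100_000))  # allocates a fresh array each iteration
        end
        return acc
    end

    # Measure the elapsed time on the process that does the work,
    # so messaging between processes is not part of the measurement.
    timed_workload() = @elapsed workload()
end

# Warm up every process so compilation time is excluded.
@everywhere timed_workload()

# One worker at a time: each call finishes before the next one starts.
solo = [remotecall_fetch(timed_workload, w) for w in workers()]

# All workers at once: launch without blocking, then collect the timings.
futures = [remotecall(timed_workload, w) for w in workers()]
concurrent = fetch.(futures)

println("solo timings (s):       ", solo)
println("concurrent timings (s): ", concurrent)
```

If the concurrent timings grow with the number of workers while the solo timings stay flat, that would point toward a shared resource such as memory bandwidth or cache rather than toward Distributed itself.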

Looking forward to your input!

How many physical cores do you have?

What is the output of

versioninfo()

?
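
A short sketch of how that information could be collected in one go. Sys.CPU_THREADS reports logical CPU threads, not physical cores, so the physical core count would still have to come from an OS tool (for example lscpu on Linux, which is an assumption about the platform).

```julia
using InteractiveUtils  # provides versioninfo() outside the REPL
using LinearAlgebra

versioninfo()

# Logical CPU threads seen by Julia (may be twice the physical core count
# when simultaneous multithreading is enabled).
println("logical CPU threads: ", Sys.CPU_THREADS)

# Thread settings of the current process.
println("Julia threads:       ", Threads.nthreads())
println("BLAS threads:        ", BLAS.get_num_threads())
```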