Distributed Performance Degradation

Hi everyone!

I’m investigating the parallel performance of my code and discovered some behavior I’m not sure I completely understand. I’m using Distributed with worker processes on a single machine.

In a MWE, I’m solving an ODE on a single core per process (by setting OMP_NUM_THREADS=1 in the shell before starting Julia, and calling BLAS.set_num_threads(1)). K is a square sparse matrix of dimension ~18000, loaded from K.jl, with about 0.04% nonzero entries. Here’s the full example:

using Distributed
using BenchmarkTools

NProcs = 20;
addprocs(NProcs)

# load libraries
@everywhere begin
    using DifferentialEquations
    using SparseArrays
    using LinearAlgebra
    BLAS.set_num_threads(1)
end

# variables
@everywhere begin
    t_0 = 0.0
    t_end = 1.0
    dt = 1e-2
    include("K.jl")
    dim = size(K, 2)
    u0 = zeros(dim)
    alg = ImplicitEuler()
    args = (; adaptive=false, dt=dt)
end

# functions
@everywhere begin
    # right-hand side: du = -K*u .+ 1
    function deriv!(du, u, p, t)
        mul!(du, K, -u)
        du[:] .+= ones(dim)
    end

    # constant Jacobian of the right-hand side: J = -K
    function jac!(J, u, p, t)
        mul!(J, K, -I)
    end

    function final(sol)
        return sol.t[end], sol.u[end]
    end

    function run()
        func = ODEFunction(deriv!, jac=jac!, jac_prototype=copy(K))
        prob = ODEProblem(func, u0, (t_0, t_end))
        sol = solve(prob, alg; args...)
        return final(sol)
    end
end


# run on control process
@btime run() seconds = 30
# 4.8 s, 2.3 GiB

# run on other process
@btime remotecall_wait(run, 2) seconds = 30
# 5.3 s

# run on all processes
@btime (@everywhere run()) seconds = 30
# 7.2 s

Naively, I would expect all three calls to take roughly the same time: run only uses variables that were defined @everywhere, so no data should need to be transferred between the control process and the workers.
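
To sanity-check that assumption, one can confirm that K already exists on a worker before calling run there; a true result means nothing needs to be shipped (a quick sketch using Distributed’s remotecall_fetch):

# Should return true: K was defined on worker 2 by @everywhere,
# so calling run() there does not require sending K over.
remotecall_fetch(() -> isdefined(Main, :K), 2)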

Where is the increased runtime coming from? A single call to run allocates about 2.3 GiB, and the machine has more than enough RAM to accommodate all workers. Are the worker processes simply bottlenecked by memory bandwidth? And where does the extra 0.5 s come from when running on a single worker process?
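
For what it’s worth, most of those allocations come from the temporaries in deriv! (the -u and the ones(dim) created on every call). If GC pressure across workers turns out to matter, an allocation-free variant might look like this (an untested sketch; deriv_noalloc! is just an illustrative name):

# Same right-hand side du = -K*u .+ 1, but without per-call temporaries.
function deriv_noalloc!(du, u, p, t)
    mul!(du, K, u)   # du = K*u, computed in place
    du .= 1 .- du    # du = -K*u .+ 1, fused broadcast, no allocation
end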

Looking forward to your input!


How many physical cores do you have?

What is the output of versioninfo()?

This machine has 192 cores.

Here’s the output:

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 × Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake-avx512)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)
Environment:
  LD_LIBRARY_PATH = /usr/local/openssl/lib:
  JULIA_REVISE = manual

A single Xeon Platinum 8160 has 24 physical cores, so you can use up to 24 tasks. The maximum theoretical performance gain would be 24 * 2.1 / 3.7 ≈ 13.6, since the all-core base clock is 2.1 GHz while a single core can turbo up to 3.7 GHz (the clock frequency drops if you use all cores).

UPDATE: Perhaps the machine has more than one of these CPUs?

But the number 192 refers to hardware threads, not cores. With hyper-threading, each 8160 provides 48 threads, so the machine likely has four of these CPUs and you could use up to 96 physical cores.
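
If Hwloc.jl is installed, you can read the topology directly instead of guessing (a sketch):

using Hwloc
Hwloc.num_packages()        # number of CPU sockets
Hwloc.num_physical_cores()  # physical cores across all sockets
Hwloc.num_virtual_cores()   # hardware threads (the 192 above)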

You could also have a look at DistributedNext.jl (https://github.com/JuliaParallel/DistributedNext.jl), a bleeding-edge fork of Distributed.jl, to see if the results are the same.
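
Since it aims to be a drop-in replacement, trying it should mostly be a matter of swapping the import in the MWE (a sketch, assuming the package is installed and its API matches Distributed):

using DistributedNext   # instead of: using Distributed
addprocs(20)
# the rest of the MWE (@everywhere blocks, run, benchmarks) stays unchanged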

And can you share the file “K.jl”?