Distributed Performance Degradation

Hi everyone!

I’m investigating the parallel performance of my code and discovered some behavior I’m not sure I completely understand. I’m using Distributed with worker processes on a single machine.

In a MWE, I’m solving an ODE on a single core per process (by setting OMP_NUM_THREADS=1 in the shell before starting Julia, and calling BLAS.set_num_threads(1)). K is a square sparse matrix of dimension ~18000, loaded from K.jl, with about 0.04% nonzero entries. Here’s the full example:

using Distributed
using BenchmarkTools

NProcs = 20;
addprocs(NProcs)

# load libraries
@everywhere begin
    using DifferentialEquations
    using SparseArrays
    using LinearAlgebra
    BLAS.set_num_threads(1)
end

# variables
@everywhere begin
    t_0 = 0.0
    t_end = 1.0
    dt = 1e-2
    include("K.jl")
    dim = size(K, 2)
    u0 = zeros(dim)
    alg = ImplicitEuler()
    args = (; adaptive=false, dt=dt)
end

# functions
@everywhere begin
    # right-hand side: du = -K*u .+ 1
    function deriv!(du, u, p, t)
        mul!(du, K, -u)
        du[:] .+= ones(dim)
    end

    # constant Jacobian of the right-hand side: J = -K
    function jac!(J, u, p, t)
        mul!(J, K, -I)
    end

    function final(sol)
        return sol.t[end], sol.u[end]
    end

    function run()
        func = ODEFunction(deriv!, jac=jac!, jac_prototype=copy(K))
        prob = ODEProblem(func, u0, (t_0, t_end))
        sol = solve(prob, alg; args...)
        return final(sol)
    end
end


# run on control process
@btime run() seconds = 30
# 4.8 s, 2.3 GiB

# run on other process
@btime remotecall_wait(run, 2) seconds = 30
# 5.3 s

# run on all processes
@btime (@everywhere run()) seconds = 30
# 7.2 s

Naively, I would expect all three calls to take roughly the same time: run only uses variables that were defined @everywhere, so no data should need to be transferred between the control process and the workers.
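
To sanity-check that assumption, one can confirm that K already exists on a worker before calling run there; a true result means nothing needs to be shipped (a quick sketch using Distributed’s remotecall_fetch):

# Should return true: K was defined on worker 2 by @everywhere,
# so calling run() there does not require sending K over.
remotecall_fetch(() -> isdefined(Main, :K), 2)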

Where is the increased runtime coming from? A single call to run allocates about 2.3 GiB, and the machine has more than enough RAM to accommodate all workers. Are the worker processes simply bottlenecked by memory bandwidth? And where does the extra 0.5 s come from when running on a single worker process?
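
For what it’s worth, most of those allocations come from the temporaries in deriv! (the -u and the ones(dim) created on every call). If GC pressure across workers turns out to matter, an allocation-free variant might look like this (an untested sketch; deriv_noalloc! is just an illustrative name):

# Same right-hand side du = -K*u .+ 1, but without per-call temporaries.
function deriv_noalloc!(du, u, p, t)
    mul!(du, K, u)   # du = K*u, computed in place
    du .= 1 .- du    # du = -K*u .+ 1, fused broadcast, no allocation
end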

Looking forward to your input!


How many physical cores do you have?

What is the output of versioninfo()?

This machine has 192 cores.

Here’s the output:

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 × Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake-avx512)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)
Environment:
  LD_LIBRARY_PATH = /usr/local/openssl/lib:
  JULIA_REVISE = manual

A single Xeon Platinum 8160 has 24 physical cores, so you can use up to 24 tasks. The maximum theoretical performance gain would be 24 * 2.1 / 3.7 ≈ 13.6, since the all-core base clock is 2.1 GHz while a single core can turbo up to 3.7 GHz (the clock frequency drops if you use all cores).

UPDATE: Perhaps the machine has more than one of these CPUs?

But the number 192 refers to hardware threads, not cores. With hyper-threading, each 8160 provides 48 threads, so the machine likely has four of these CPUs and you could use up to 96 physical cores.
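
If Hwloc.jl is installed, you can read the topology directly instead of guessing (a sketch):

using Hwloc
Hwloc.num_packages()        # number of CPU sockets
Hwloc.num_physical_cores()  # physical cores across all sockets
Hwloc.num_virtual_cores()   # hardware threads (the 192 above)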

You could also have a look at DistributedNext.jl (https://github.com/JuliaParallel/DistributedNext.jl), a bleeding-edge fork of Distributed.jl, to see if the results are the same.
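
Since it aims to be a drop-in replacement, trying it should mostly be a matter of swapping the import in the MWE (a sketch, assuming the package is installed and its API matches Distributed):

using DistributedNext   # instead of: using Distributed
addprocs(20)
# the rest of the MWE (@everywhere blocks, run, benchmarks) stays unchanged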

And can you share the file “K.jl”?