DifferentialEquations.jl uses just one core

The documentation says that EnsembleThreads() uses multithreading:

  • EnsembleThreads() - The default. This uses multithreading. It’s local (single computer, shared memory) parallelism only. Fastest when the trajectories are quick.

But if I write this

import DifferentialEquations
f(du, u, p, t) = du .= 1.
g(du, u, p, t) = du .= 1e-3
prob = DifferentialEquations.SDEProblem(f, g, [0.], (0., 100.))
ensembleprob = DifferentialEquations.EnsembleProblem(prob)
sol = DifferentialEquations.solve(ensembleprob, DifferentialEquations.EnsembleThreads(), trajectories=2^8)

and execute it using julia bug2.jl --threads 4, gnome-system-monitor shows that it’s only using one core.
Any ideas how to fix this?

What if you try

julia --threads 4 bug2.jl 
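This follows the standard Unix convention: anything placed after the script name is handed to the script (in Julia, via the global ARGS), not parsed by the julia launcher itself. A generic illustration with a plain shell script:

```shell
# Write a tiny script that just prints the arguments it receives.
cat > /tmp/show_args.sh <<'EOF'
echo "script saw: $@"
EOF

# Arguments placed after the script name reach the script, not the interpreter:
sh /tmp/show_args.sh --threads 4
# → script saw: --threads 4
```

The same thing happens with julia bug2.jl --threads 4: the flag ends up in ARGS and Julia itself stays single-threaded.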

Possibly a stupid suggestion: to monitor the program I would open a terminal window and use ‘top -H’, or even better the htop utility:

https://htop.dev/

Not OP, but I do remember getting stung by something like this before and wanting to report it as a bug. Do you think that would be worth it, or is it by intentional design? ~~I’d think requiring a -- after the file name a reasonable requirement.~~ Damn, I don’t think that would be backwards compatible. It does need to be better documented, though.

Still the same problem:

import DifferentialEquations
f(du, u, p, t) = du .= 1.
g(du, u, p, t) = du .= 1e-3
prob = DifferentialEquations.SDEProblem(f, g, [0.], (0., 100.))
ensembleprob = DifferentialEquations.EnsembleProblem(prob)
@time sol = DifferentialEquations.solve(ensembleprob, DifferentialEquations.EnsembleThreads(), trajectories=2^8)
$ time julia --threads 4 bug2.jl
 24.129812 seconds (78.55 M allocations: 20.149 GiB, 6.46% gc time)

real    0m33,589s
user    0m48,884s
sys     0m2,072s

I know that top and htop exist, but that does not change the problem.

Update: The following starts using multiple cores, then drops down to one core when “starting main simulation” is printed.

using Distributed
addprocs(4)
import DifferentialEquations
f(du, u, p, t) = du .= 1.
g(du, u, p, t) = du .= 1e-3
prob = DifferentialEquations.SDEProblem(f, g, [0.], (0., 100.))
ensembleprob = DifferentialEquations.EnsembleProblem(prob)
println("starting main simulation")
flush(stdout)
DifferentialEquations.solve(ensembleprob, DifferentialEquations.EnsembleThreads(), trajectories=2^8)
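Note that addprocs adds worker *processes*, which only EnsembleDistributed() can use; EnsembleThreads() draws from the thread pool of the main process, which is fixed at startup. A quick sanity check (a sketch, assuming a default julia launch without --threads):

```julia
using Distributed

addprocs(4)
println(nworkers())           # 4 worker processes, usable by EnsembleDistributed()

# The thread count of the main process is unaffected by addprocs; it is
# fixed when julia starts (--threads N or JULIA_NUM_THREADS=N):
println(Threads.nthreads())   # 1 under a default launch
```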

Let’s perhaps go in the other direction - do you observe a loss in performance when you explicitly specify EnsembleSerial()? According to the DifferentialEquations.jl docs, EnsembleThreads() is the default anyway, so I’d expect it to be slower when explicitly running in serial (perhaps with a more complicated/slower to calculate differential equation, instead of just setting some memory to a constant value).

See Parallel Ensemble Simulations · DifferentialEquations.jl

I wouldn’t be concerned about it using only one core a priori, unless you have some task/OS-thread migration enabled to even the load between physical cores (that’s the default behavior on windows, I don’t know about your linux machine).


EnsembleSerial and EnsembleThreads are equally fast.

I would be concerned if it’s using only one core, because I’m missing out on quite a lot of performance if I run it on a 128 core machine.

Interesting. What if you use EnsembleDistributed() with addprocs(4) and julia script.jl (i.e. no explicit additional threads)? Plus a somewhat slower diffeq, since the overhead of interprocess communication is larger than for threads.

Also, what specs does your machine have? Can you post versioninfo()?

EnsembleDistributed does use multiple cores, but is slower than EnsembleThreads or EnsembleSerial. Both on my 4 core machine and on a 128 core machine with addprocs(128).

4-core machine:

julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)

128-core machine:

julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7542 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

Slower for the minimal “diffeq” that basically only writes a constant to memory? Yes, I don’t think that’s too surprising, since you basically only measure the overhead of serial vs. threaded vs. distributed communication. You really only get an advantage when your code is a bit more expensive than that overhead itself.

It doesn’t just write a constant to memory. It calculates

x = x + f * dt + g * dt * rand()

And communication is only needed whenever a trajectory is finished.
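For reference, the Euler–Maruyama step an SDE solver performs per trajectory looks roughly like the following (a sketch, not DiffEq’s actual internals; note the noise term scales with sqrt(dt) times a normal draw, not dt * rand()):

```julia
# Hand-rolled Euler–Maruyama for a scalar SDE dx = f dt + g dW,
# illustrating how little work each step of this example does.
function em_trajectory(f, g, x0, tspan, dt)
    x, t = x0, first(tspan)
    while t < last(tspan)
        x += f * dt + g * sqrt(dt) * randn()   # dW ~ Normal(0, dt)
        t += dt
    end
    return x
end

em_trajectory(1.0, 1e-3, 0.0, (0.0, 100.0), 0.01)   # drifts to ≈ 100
```

Each step is only a handful of floating-point operations, so the per-trajectory cost here is dominated by allocation and scheduling overhead rather than the maths.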

What code are you running, exactly? What you’ve shown so far really isn’t a very expensive function to calculate at all (even taking the perturbation by nature of an SDE into account). This is what I meant with “just writes to memory”, since presumably all other calculation is equivalent for all SDEs.

So I still think linux not migrating threads between cores is the cause of you not seeing activity on more than one core. I’m guessing by default julia doesn’t let itself be migrated between cores, since it’s an opt-in API, as far as I know. Maybe @ChrisRackauckas can shed some light on what’s happening with DiffEq and threading here though.

I ran this code twice, once with N=1, once with N=64, on a machine with 128 logical cores and 64 physical cores.
Both are approximately equally fast, and after printing starting main simulation, both of them used only one core as confirmed by htop.
This does not make any sense to me.

using Distributed
addprocs(64)

import DifferentialEquations

N = 64  # also ran with N = 1

@everywhere f(du, u, p, t) = du .= 1.
@everywhere g(du, u, p, t) = du .= 1e-3

prob = DifferentialEquations.SDEProblem(f, g, [0.], (0., 100.))
ensembleprob = DifferentialEquations.EnsembleProblem(prob)

println("starting main simulation")
flush(stdout)

@time DifferentialEquations.solve(ensembleprob, DifferentialEquations.EnsembleThreads(), trajectories=N)

Did you try setting your threads? What does versioninfo() show? Did you check Threads.nthreads()?

julia> versioninfo()
Julia Version 1.7.0-beta3.0
Commit e76c9dad42 (2021-07-07 08:12 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, znver3)
Environment:
  JULIA_EDITOR = "C:\Users\accou\AppData\Local\atom\app-1.58.0\atom.exe"  -a
  JULIA_NUM_THREADS = 32
  JULIA_PKG_SERVER = https://neuralsim.juliahub.com

If you set your threads then versioninfo() would tell you, and yours is blank.


Threads.nthreads() is 1 unless I specify --threads 64.
Even with --threads 64 it still runs on 1 core. Without --threads 64, the Distributed parts still run on multiple cores.

My versioninfo() does not change if I run it with --threads 64.

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7542 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

I didn’t install Julia, I just downloaded the binaries. So you should be able to reproduce it:

curl https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz -o julia.tar.gz
tar -zxvf  julia.tar.gz
julia-1.6.2/bin/julia bug.jl

You’re solving a problem so tiny that it’s faster to just solve it than to pay the thread startup cost. But:

import DifferentialEquations
f(du, u, p, t) = du .= 1.
g(du, u, p, t) = du .= 1e-3
prob = DifferentialEquations.SDEProblem(f, g, [0.], (0., 100.))
ensembleprob = DifferentialEquations.EnsembleProblem(prob)
@time DifferentialEquations.solve(ensembleprob, DifferentialEquations.EnsembleThreads(), trajectories=100000)

That shows all cores are used at max capacity just fine (if your threads are set up).
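One way to see the crossover is to time the same ensemble serially and threaded at a large trajectory count (a sketch; timings are machine-dependent, and the first solve per algorithm is a warm-up to exclude compilation):

```julia
import DifferentialEquations

f(du, u, p, t) = du .= 1.
g(du, u, p, t) = du .= 1e-3
prob = DifferentialEquations.SDEProblem(f, g, [0.], (0., 100.))
ensembleprob = DifferentialEquations.EnsembleProblem(prob)

for alg in (DifferentialEquations.EnsembleSerial(), DifferentialEquations.EnsembleThreads())
    DifferentialEquations.solve(ensembleprob, alg, trajectories=10)   # warm-up
    @time DifferentialEquations.solve(ensembleprob, alg, trajectories=100000)
end
```

At small trajectory counts the two timings should be close (overhead dominates); at large counts the threaded run should pull ahead, scaling with the thread count.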
