Parallel computing using Parallel Ensemble Simulations (DiffEq) does not seem to work

So i regularly run SDE simulations. To get a better idea of the systems behavior I would run several identical, and plot the various results. I figured this probably would be a good task for the Parallel Ensemble Simulations part of DifferentialEquations. However, I do not actually manage to get any speedup using parallelism. Also weird things happen.

Here is a simple example code of problems that appear. As seen towards the end, there is no speed up using these approach, also some weird errors happen. Comments in the code.

#We needs distributed to use several processors (?).
using Distributed

#Checks number of threads.
Threads.nthreads()
#(It is equal to 4)

#While following these processes in the terminal using "top", there is only ever one julia process, with maximum 100% CPU.
#Can we use more cores? "cat /proc/cpuinfo | grep processor | wc -l" in the terminal gives us "8". Lets try add another 6 to make it 7.
#(Not sure whenever it matters or not, but it feelt good to save 1 for other computer stuff)
#Or so was my initial though, but then someone said that since I already have number of threads in Juno equal to 4, I should not add more than another 4 workers.
#So I settled for 3 for a starter.
addprocs(3)

#Fetches DiffEq (and all workes needs to see it).
@everywhere using DifferentialEquations

#Prepares the test system (Lotka-Volterra). Also ensures that all workers see them
@everywhere begin
  function f(du,u,p,t)
    du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
    du[2] = -3 * u[2] + u[1]*u[2]
  end
  function g(du,u,p,t)
    du[1] = p[3]*u[1]
    du[2] = p[4]*u[2]
  end
  p = [1.5,1.0,0.1,0.1]
  prob = SDEProblem(f,g,[1.0,1.0],(0.0,5000.0),p)
end

#Prepares parallel ensamble simulations.
@everywhere begin
  prob_func(prob,i,repeat) = prob
  ensemble_prob = EnsembleProblem(prob,prob_func=prob_func)
end

#Meassures without using parallel ensemble simulations from DiffEq.
function solveN(n)
  for i = 1:n
    solve(prob,SRIW1())
  end
end
@time solveN(40)    #This takes about 10 seconds. While watching in top, there is only one active julia process.

#Checks the speed of vaiour solves using Ensemble:
@time solve(ensemble_prob,SRIW1(),trajectories=40)                           #This takes about 12 seconds. While watching in top, there is only one active julia process.
@time solve(ensemble_prob,SRIW1(),EnsembleSerial(),trajectories=40)          #This takes about 12 seconds. While watching in top, there is only one active julia process.
@time solve(ensemble_prob,SRIW1(),EnsembleThreads(),trajectories=40)         #This takes about 13 seconds. While watching in top, there is only one active julia process.
@time solve(ensemble_prob,SRIW1(),EnsembleDistributed(),trajectories=40)     #This takes about 27 seconds. Top gives 3-4 active proccess. When 3 they typically all have about the same %CPU (towards 100). When 4 three typically have very little (maybe 2 or 3%).
@time solve(ensemble_prob,SRIW1(),EnsembleSplitThreads(),trajectories=40)    #Running this will yield messages in the console "Worker 3 terminated." and "Worker 2 terminated.". It will end with a ProcessExitedException().

Finally, here is the full output of the ProcessExitedException() generated at the end.

ProcessExitedException()
in top-level scope at base/util.jl:156
in  at base/none
in #solve#365 at DiffEqBase/AfQA1/src/solve.jl:46 
in  at base/none
in #__solve#305 at DiffEqBase/AfQA1/src/ensemble/basic_ensemble_solve.jl:64
in macro expansion at base/util.jl:213 
in macro expansion at DiffEqBase/AfQA1/src/ensemble/basic_ensemble_solve.jl:70 
in solve_batch at DiffEqBase/AfQA1/src/ensemble/basic_ensemble_solve.jl:165
in #pmap at base/none 
in #pmap#213 at stdlib/v1.1/Distributed/src/pmap.jl:126
in #asyncmap at base/none 
in #asyncmap#680 at base/asyncmap.jl:81 
in #async_usemap at base/none 
in #async_usemap#681 at base/asyncmap.jl:154 
in maptwice at base/asyncmap.jl:178
in foreach at base/abstractarray.jl:1866
in  at base/asyncmap.jl:178

Interesting. Can you double check that standard pmap and @threads looping works on your setup? Julia 1.1?

The distributed result isn’t too surprising here, since it’s a little too small to overcome the overhead in this case. Threading not working for you is surprising.

On my setup it’s working. without exception. Julia 1.3

Yes everything is on Julia 1.1. I will make an attempt to update to a later version.

I’ve tried these example from the julia page on parallel computing, and it works without errors.

Threads.@threads for i = 1:10
     a[i] = Threads.threadid()
end
   
using Distributed

@everywhere include_string(Main, $(read("count_heads.jl", String)), "count_heads.jl")
a = @spawn count_heads(100000000)
b = @spawn count_heads(100000000)
fetch(a)+fetch(b)

nheads = @distributed (+) for i = 1:200000000
    Int(rand(Bool))
end

a = zeros(100000)
@distributed for i = 1:100000
    a[i] = i
end

using SharedArrays

a = SharedArray{Float64}(10)
@distributed for i = 1:10
    a[i] = i
end

a = randn(1000)
f(x) = 2x
@distributed (+) for i = 1:100000
    f(a[rand(1:end)])
end

using LinearAlgebra

M = Matrix{Float64}[rand(1000,1000) for i = 1:10];
pmap(svdvals, M);

@time map(svdvals,M);      #Takes about 2.5 seconds
@time pmap(svdvals, M);    #Takes about 2.5 seconds

addprocs(4)
@time pmap(svdvals, M);    #Takes about 3.5 seconds

It should work fine on Julia 1.1. And it’s just using a multithreaded loop under the hood there, so that’s really peculiar…

So I got some error when trying to update to Julia 1.3, and since it was a release candidate I didn’t feel like digging into that to much. Instead I updated to 1.2.0. However, there was no real change.

So you can confirm that the @threads loop is using multiple threads inside of your REPL, but in the same session it doesn’t use multiple threads when using EnsembleThreads?

I can confirm that

Threads.@threads for i = 1:10
     a[i] = Threads.threadid()
end

runs without error. How would I confirm whenever multiple threads are being used, I presume this would not show up as several entries when I use top?

Ok, so I have tried this:

using Base.Threads, BenchmarkTools
function test1!(y, x)
    @assert length(y) == length(x)
    for i = 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end
function testn!(y, x)
    @assert length(y) == length(x)
    @threads for i = 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end

@time test1!(rand(300000000),rand(300000000))
@time testn!(rand(300000000),rand(300000000))

I run the last two lines a few times. The first one takes about 5.5 seconds, and the second one about 3 seconds. Might this suggest that threading works?
(when I have a smaller number, like rand(1000), then the first one is faster)

However; now if I run everything earlier from the same session and then

@time solve(ensemble_prob,SRIW1(),EnsembleSerial(),trajectories=40)         
@time solve(ensemble_prob,SRIW1(),EnsembleThreads(),trajectories=40) 

then the first takes about 11.5s and the second 7.5s. I’m continuing to investigate.

So I think I can confirm threading working, see last post. But then I tried rerunning the first script, but increasing the load a bit (to see whenever the difference become more distinct).

I sat the tspan to (0,15000), increasing its length by a factor three. This causes the computer to freeze when trying to solve with EnsembleThreads() algorithm (the two before works). This have happened both times I’ve tried it (writing on my phone right now since my computer is unusable).

Top did show a single Julia process, which I think at times hit over 300% CPU. At the time of the current freeze it was only at 29%.

Are you running out of RAM?

Using top, and running it again: The KiB Swap, free fields goes to zero. I presume this confirms your theory.

Yup, remember each thread takes memory, so it’s like 4x or 16x of a single run. So if you run a very long simulation with a ton of saving, it’ll just overload your RAM.

I understand. It seems like things might be working then, however, under natural limitations.

A second question. You said that my test probably would not be able to test whenever multi processing worked properly, do you have a suggestion how I could make a modification to confirm whenever this works?