Multithreading not giving noticeable advantage when running parallel functions

Hello,

I have a research problem where I try to determine how a system’s behaviour depends on its parameter values. Essentially, I have a function determine_behaviour(p) which takes a set of parameter values and determines how the system behaves. Right now I simply want to scan parameter space and see how the behaviours depend on the parameters. This involves testing a lot of parameter values. However, all tests are independent, so I should be able to parallelise things. Once a set has been evaluated, I would write the result to disk, and that thread could start on a new set.
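
Schematically, the loop I have in mind is something like this (parameter_sets and the file names are just placeholders):

using Serialization

parameter_sets = [rand(6) for _ in 1:100]   # placeholder for the real parameter grid

Threads.@threads for i in eachindex(parameter_sets)
    result = determine_behaviour(parameter_sets[i])   # independent, expensive evaluation
    serialize("result_$i", result)                    # each iteration writes its own file
end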

I have my laptop (16 threads) and my desktop (34 threads), which I can ssh into. I also have access to the uni HPC, where I can start a lot of jobs, each on a 32-core machine. The plan is to push it all to the HPC, but first I want to get it working locally, to ensure that it works when I push it up there for real.

The problem is that, even though things should run fully in parallel, I see no significant speedup at all. This is some sample code:

Threads.nthreads() # This returns 8, confirming that I am running on 8 threads.

# Multithreading seems to work in this simple example
@time sleep(1)     # Takes about 1 second.
@time for i = 1:8
    sleep(1)
end                       # Takes about 8 seconds.
@time Threads.@threads for i = 1:8
    sleep(1)
end                       # Takes about 1 second.

# The part I'm actually interested in.
@time determine_behaviour([1.,1.,1.,0.1,3,0.1])     # Takes about 25 seconds.
@time for i = 1:8
    determine_behaviour([1.,1.,1.,0.1,3,0.1])            # Takes about 180 seconds.
end
@time Threads.@threads for i = 1:8
    determine_behaviour([1.,1.,1.,0.1,3,0.1])            # Takes about 140 seconds.
end

I have to admit that this is not something I know much about, but I could really use some advice. I understand that with 8 threads I cannot expect an 8x speedup, but surely I should see a lot more than this? Even if it is hard to figure out exactly what is going wrong, some pointers on how to tell when something is indeed going wrong would be great!

It would be better if you tried to reproduce the issue with a function that everyone can run.


It sounds like determine_behaviour writes a file to disk inside the parallel loop. Since you are calling it with the same parameters each time, you may have a bottleneck as all threads try to write to the same file?
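
If they really do all end up appending to one shared file, one way around the contention is to serialise the writes with a lock; a rough sketch (some_work and the file name are made up):

some_work(i) = i^2               # hypothetical stand-in for the expensive evaluation

lk = ReentrantLock()
Threads.@threads for i = 1:8
    result = some_work(i)
    lock(lk) do                  # only one thread writes at a time
        open("results.txt", "a") do io
            println(io, result)
        end
    end
end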

But, as leandromartinez98 pointed out: hard to say without seeing the code.

In this case I excluded the file writing (in an attempt to narrow things down a little). The best minimal example I have so far is this (the origin of everything is solving SDEs using DifferentialEquations.jl). It is basically the DifferentialEquations.jl SDE-solving example from the docs (Stochastic Differential Equations · DifferentialEquations.jl):

Threads.nthreads() # This returns 8, the number of threads.

# Creates an SDEProblem from DifferentialEquations.jl
using DifferentialEquations
α=1
β=1
u₀=1/2
f(u,p,t) = α*u
g(u,p,t) = β*u
dt = 1//2^(4)
tspan = (0.0,1.0)
prob = SDEProblem(f,g,u₀,(0.0,1.0))

# Tries solving the problem in three different ways.
@time solve(prob, adaptive = false, dt = 0.00001)  # This takes about 0.5 seconds.
@time for i = 1:8
    solve(prob, adaptive = false, dt = 0.00001)    # This takes about 4.0 - 4.5 seconds.
end
@time Threads.@threads for i = 1:8
    solve(prob, adaptive = false, dt = 0.00001)    # This takes about 1.7 - 2.0 seconds.
end

This example is not as clear-cut as my previous one, but the improvement still isn’t that great. Is it helpful? I can try to see if I can make it better tomorrow.

I tried following CPU activity using the Linux command top, and I can confirm that julia goes over 100% CPU (which should indicate multithreading?).
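
For reference, a quick way to check which thread each iteration actually lands on (just a diagnostic, not the real workload):

Threads.@threads for i = 1:8
    println("iteration $i runs on thread ", Threads.threadid())
end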

For this example, it seems a const declaration like

const α=1
const β=1
const u₀=1/2

is more important than parallel acceleration, as it offers about a 10x speedup on my laptop by avoiding dynamic dispatch.

Using Juno.@profiler, it’s easy to see that GC is the main bottleneck for the code with the const declarations. I know little about Julia’s GC, but I think this is why @threads seems inefficient.

Note: this is an analysis of the example above, not of your own code. If you don’t want to show the implementation, then you will have to analyse the bottleneck yourself.

The GC runs in serial, so when multithreading you want to make sure that you aren’t allocating a ton per thread. Without const here, the SDE has a type instability in the function (due to the globals), which will eat at the memory and cause an issue.
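
For what it’s worth, a minimal sketch of the alternative to const globals: pass the constants through the problem’s parameter slot p (the EM() algorithm choice here is just for illustration):

using StochasticDiffEq

f(u, p, t) = p[1] * u    # α = p[1], no global lookup
g(u, p, t) = p[2] * u    # β = p[2]
prob = SDEProblem(f, g, 1/2, (0.0, 1.0), [1.0, 1.0])
sol = solve(prob, EM(), dt = 0.001)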


I am both impressed and confused! I also get the 10x speedup if I add the const, which is great. Is there a good link explaining why this yields such an improvement in all cases? Maybe I have been solving differential equations unusually slowly in other instances as well…

But the multithreading still doesn’t seem to work; running with or without Threads.@threads gives mostly equivalent results. Meanwhile, I can open 4 different terminals and call julia my_script.jl 4 times, effectively running solve(prob, adaptive = false, dt = 0.00001) in parallel, and there is no slowdown.

I am happy to share the full script, but it is really large, and since the part that actually takes time is solving SDEs, I figured that if I cannot get multithreading to work in this simple example, I won’t be able to get it to work in my main program either?

Just follow the performance tips in the official documentation (Performance Tips · The Julia Language).

As mentioned,

For a task with GC as the bottleneck, multiple worker processes are better than multiple threads; you can try the Distributed standard library to get the acceleration you want.
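
A rough sketch of what that could look like (the function body here is just a stand-in for the real, expensive one):

using Distributed
addprocs(8)                                          # one worker process per core

@everywhere determine_behaviour(p) = sum(abs2, p)    # stand-in body

params = [rand(6) for _ in 1:64]
results = pmap(determine_behaviour, params)          # each call runs in its own process, with its own GC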

FWIW, did you try the ensemble interface? Parallel Ensemble Simulations · DifferentialEquations.jl EnsembleThreads() makes sure to do things in a way that works around known Base issues, and EnsembleDistributed() makes use of multiprocessing, so you can easily try that as well. I’d be curious whether you also have issues there.
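
Roughly like this, reusing the simple scalar SDE from above rather than your real model:

using StochasticDiffEq

f(u, p, t) = p[1] * u
g(u, p, t) = p[2] * u
prob = SDEProblem(f, g, 1/2, (0.0, 1.0), [1.0, 1.0])
ensemble_prob = EnsembleProblem(prob)
sols = solve(ensemble_prob, EM(), EnsembleThreads();
             trajectories = 8, dt = 0.001)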

This issue seems algorithm-specific:

Threads.nthreads() # 6

# Creates an SDEProblem from DifferentialEquations.jl
using StochasticDiffEq
const α=1
const β=1
const u₀=1/2
f(u,p,t) = α*u
g(u,p,t) = β*u
dt = 1//2^(4)
tspan = (0.0,1.0)
prob = SDEProblem(f,g,u₀,(0.0,1.0))

# Tries solving the problem in three different ways.
@time solve(prob, SRIW1(), adaptive = false, dt = 0.00001)  # 0.028668 seconds (300.06 k allocations: 50.359 MiB).
@time for i = 1:8
    solve(prob, adaptive = false, dt = 0.00001)    # 0.241266 seconds (2.40 M allocations: 402.876 MiB, 19.12% gc time)
end
@time Threads.@threads for i = 1:8
    solve(prob, adaptive = false, dt = 0.00001)    # 0.160430 seconds (2.42 M allocations: 403.909 MiB, 33.43% gc time)
end


@time for i = 1:100
    solve(prob, SRIW1(), adaptive = false, dt = 0.00001)    # 2.335679 seconds (40.01 M allocations: 1.491 GiB, 10.52% gc time)
end
@time Threads.@threads for i = 1:100
    solve(prob, SRIW1(), adaptive = false, dt = 0.00001)    # 0.276404 seconds (20.02 M allocations: 764.280 MiB, 15.93% gc time)
end

@time for i = 1:200
    solve(prob, SRIW1(), adaptive = false, dt = 0.00001)    # 2.335679 seconds (40.01 M allocations: 1.491 GiB, 10.52% gc time)
end
@time Threads.@threads for i = 1:200
    solve(prob, SRIW1(), adaptive = false, dt = 0.00001)    # 0.678999 seconds (40.03 M allocations: 1.492 GiB, 30.16% gc time)
end

It looks like a possible type-inference bug.

@N5N3 Seems like I will have some reading to do this afternoon.

@ChrisRackauckas Interesting that the solver affects it. I will try a few. About the ensemble interface: yes and no. For each parameter value I run about 8 or so SDEs, which I do as an Ensemble. When I planned for pushing to 32 cores, however, I changed that to EnsembleSerial() (afraid of what would happen if I tried multithreading within multithreading…), figuring it would be easier to parallelise the parameter-scanning loop. But it might be possible to be clever about it and move the parallel computing back to the ensemble. At some point I had similar problems when trying to run EnsembleThreads() with more than 6 threads, but I will have a second look at that and maybe create a bug report if I find something funny.

Nesting multithreading is fine in modern Julia.
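
For example, spawning tasks inside a threaded loop composes via the depth-first scheduler (inner_work is a hypothetical stand-in):

inner_work(i, j) = i * j    # stand-in for real work

Threads.@threads for i = 1:4
    @sync for j = 1:4
        Threads.@spawn inner_work(i, j)    # nested tasks run on the same thread pool
    end
end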

I’ll narrow down whatever this inference issue is. It’s a little bit tricky though since it seems like it might be something weird in Base itself.

1 Like

Also, testing on my machine, I do not get as large differences between algorithms; here are the run results for me:


@time Threads.@threads for i = 1:200
    solve(prob, SRIW1(), adaptive = false, dt = 0.00001)    # 0.860810 seconds (40.03 M allocations: 1.492 GiB, 25.69% gc time)
end

(I changed both loops to 200 iterations, for consistency)

There is still a noticeable difference from using the default algorithm, though:

@time for i = 1:200
    solve(prob, adaptive = false, dt = 0.00001)    # 6.775396 seconds (60.01 M allocations: 9.836 GiB, 23.31% gc time)
end
@time Threads.@threads for i = 1:200
    solve(prob, adaptive = false, dt = 0.00001)    # 4.012337 seconds (60.03 M allocations: 9.837 GiB, 63.35% gc time)
end

Note the difference between cores and hyperthreads. If you have a laptop with 8 cores and 16 hyperthreads, for example, then there are only 8 things that can be done at any one time. Under Linux it looks like you have 16 cores, but you don’t.
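
(As far as I know, Sys.CPU_THREADS reports logical CPUs, i.e. hyperthreads, not physical cores:)

Sys.CPU_THREADS    # e.g. returns 16 on an 8-core machine with hyperthreading enabled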

In my opinion you should set the Julia thread count to something like N physical cores + 2; the + 2 is to take advantage of some hyperthreading capacity without fully oversubscribing all your cores.
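
For example, in recent Julia versions the thread count can be set at startup with the --threads flag (or the JULIA_NUM_THREADS environment variable):

julia --threads 10 my_script.jl    # e.g. 8 physical cores + 2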

Shouldn’t the title say “Multithreading not …” instead of “Multithreading now …”?


Yes it should, thanks for the note. I’ve changed it now.

I have tried to compress my code into an example. I am happy to share the full code, but it is over 200 lines.

However, this is basically what it does. I declare a determine_behaviour() function, which takes a parameter set as input and classifies my system of interest according to how many fixed points it has, and how long it takes an SDE of the system to pass a threshold (for that parameter set):

using Catalyst     # Tool for creating biochemical reaction network models for DifferentialEquations.jl
using Polynomials
using StochasticDiffEq

# Classifies a parameter set depending on how the corresponding model behaves.
function determine_behaviour(p;m=4,l=2000)
    rts = find_zeros(p)
    return (rt_type=get_root_type(rts), passage_type=meassure_passage(rts,p,m,l))
end

# The model fixed points can be found by solving a polynomial equation.
function find_zeros(p)
    S,D,τ,v0,n = p
    coefficients = zeros(Int64(n)+2)
    coefficients[Int64(n)+2] = -S^n-D^n
    coefficients[Int64(n)+1] = v0*(S^n+D^n)+S^n
    coefficients[2] = -1
    coefficients[1] = v0
    return sort(real.(filter(x->(imag(x)==0)&&real(x)>0.,roots(Polynomial(coefficients)))))
end

# Classifies the type of root.
function get_root_type(rts)
    (length(rts) === 1) && return :single
    (length(rts) === 3) && return :triple
    return :weird
end

# The model is a biochemical reaction network (declared through Catalyst).
const model = @reaction_network begin
    v0+(S*σ)^n/((S*σ)^n+(D*A)^n+1), ∅ → σ
    1., σ → ∅
    (τ*σ,τ), ∅ ↔ A
end S D τ v0 n

# Measures how often the system passes a threshold when starting in the fixed point (given random fluctuations).
function meassure_passage(rts,p,m,l)
    (length(rts) > 3) && return :Problem
    passage_times = get_passage_times(p,rts[1:1],3*rts[1],m,l)
    return classify_passage_times(passage_times,l,m)
end

# Gets the times for the system, from a given initial condition, to pass a threshold.
function get_passage_times(p,u0,thres,m,l)
    prob = SDEProblem(model,u0,(0.,l),p,noise_scaling=(@variables η)[1])
    ensemble_prob = EnsembleProblem(prob,prob_func=(p,i,r)->p)
    sols = solve(ensemble_prob,ImplicitEM(),EnsembleSerial();trajectories=m,callback=terminate_sim(u0,thres))
    return last.(getfield.(sols[:],:t))
end

# Assigns a class to the solution vector.
function classify_passage_times(passage_times,l,m)
    nbr_not_terminated = count(passage_times .== l)
    (nbr_not_terminated === m) && return :no_passage
    (nbr_not_terminated === 0) && return :all_passage
    return :some_passage
end

# Terminates a simulation once a threshold has been reached.
function terminate_sim(u0,thres)
    condition = (u0[1]<thres) ? (u,t,integrator)->(u[1]>thres) : (u,t,integrator)->(u[1]<thres)
    affect!(integrator) = terminate!(integrator)
    return DiscreteCallback(condition,affect!,save_positions = (true,true))
end

This is the code where I try running it with, and without, multithreading:

# Measures the performance
@time determine_behaviour([1.,1.,1.,0.1,3,0.05])    # 4.282405 seconds (13.99 M allocations: 633.620 MiB, 3.79% gc time)
@time Threads.@threads for i = 1.:1.:8.
    determine_behaviour([i,1.,1.,0.1,3,0.05])
end                                                 # 41.056272 seconds (112.26 M allocations: 4.972 GiB, 3.38% gc time)
@time for i = 1.:1.:8.
    determine_behaviour([1.,1.,1.,0.1,3,0.05])      
end                                                 # 34.791720 seconds (111.91 M allocations: 4.950 GiB, 3.30% gc time)

Garbage collection time is only about 3%.

This is closer to reality, but takes slightly longer to run. I would actually loop over more parameter sets, and write to disk:

# Technically more like this, but this has slightly longer run time.
using Serialization
@time Threads.@threads for i = 1:1.:8.
    behaviours = Vector{NamedTuple{(:rt_type, :passage_type),Tuple{Symbol,Symbol}}}(undef,8)
    for j = 1:1:8
        b = determine_behaviour([i,j,1.,0.1,3,0.05])
        behaviours[j] = b
    end
    serialize("data_i_$i",behaviours)
end                                                 # 383.943875 seconds (920.99 M allocations: 42.692 GiB, 3.03% gc time)
@time for i = 1:1.:8.
    behaviours = Vector{NamedTuple{(:rt_type, :passage_type),Tuple{Symbol,Symbol}}}(undef,8)
    for j = 1:1:8
        b = determine_behaviour([i,j,1.,0.1,3,0.05])
        behaviours[j] = b
    end
    serialize("data_i_$i",behaviours)               # 320.130658 seconds (900.79 M allocations: 41.852 GiB, 3.02% gc time)
end   

Finally, here is a more minimal example, depending only on StochasticDiffEq:

using StochasticDiffEq

function f(du,u,p,t)
    σ,A = u
    S,D,τ,v0,n,η = p
    du[1] = v0+(S*σ)^n/((S*σ)^n+(D*A)^n+1) - σ
    du[2] = τ*(σ-A)
end
function g(du,u,p,t)
    σ,A = u
    S,D,τ,v0,n,η = p
    du[1,1] = η*sqrt(v0+(S*σ)^n/((S*σ)^n+(D*A)^n+1))
    du[1,2] = -η*sqrt(σ)
    du[1,3] = 0
    du[1,4] = 0
    du[2,1] = 0
    du[2,2] = 0
    du[2,3] = η*sqrt(τ*σ)
    du[2,4] = -η*sqrt(τ*A)
end

function run_sims(p)
    prob = SDEProblem(f,g,[1.,1.],(0.,50000.),p,noise_rate_prototype=zeros(2,4))
    ensemble_prob = EnsembleProblem(prob,prob_func=(p,i,r)->p)
    sols = solve(ensemble_prob,ImplicitEM(),EnsembleSerial();trajectories=8)
end

@time run_sims([1.,1.,1.,0.1,3,0.05])       # 2.346126 seconds (4.73 M allocations: 253.060 MiB, 11.12% gc time)
@time Threads.@threads for i = 1.:1.:8.
    run_sims([i,1.,1.,0.1,3,0.05])
end                                         # 22.173271 seconds (58.21 M allocations: 2.946 GiB, 1.01% gc time)
@time for i = 1.:1.:8.
    run_sims([i,1.,1.,0.1,3,0.05])
end                                         # 33.169600 seconds (58.22 M allocations: 2.947 GiB, 5.07% gc time)

This also fails to utilise multithreading.

(everything is run on an 8-core Linux machine, using 8 threads)

You’re using ImplicitEM; if you’re going to use an implicit method with multithreading, you need to do

using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)

otherwise you’ll oversubscribe your threads (BLAS spawns its own threads inside the implicit solver’s linear algebra calls).
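
In recent Julia versions you can check that the setting took effect:

using LinearAlgebra
LinearAlgebra.BLAS.get_num_threads()    # should return 1 after the call above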

That sounds interesting!

I have tried it, but there seems to be no substantial change:

# Modifies for ImplicitEM.
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)

# Measures the performance
@time determine_behaviour([1.,1.,1.,0.1,3,0.05])    # 5.041183 seconds (14.05 M allocations: 667.787 MiB, 4.01% gc time)
@time Threads.@threads for i = 1.:1.:8.
    determine_behaviour([i,1.,1.,0.1,3,0.05])
end                                                 # 55.370152 seconds (112.86 M allocations: 5.246 GiB, 2.99% gc time)
@time for i = 1.:1.:8.
    determine_behaviour([1.,1.,1.,0.1,3,0.05])      
end                                                 # 48.266905 seconds (112.37 M allocations: 5.217 GiB, 3.20% gc time)

Out of interest, I also tried two other solvers:

# With ISSEM
@time determine_behaviour([1.,1.,1.,0.1,3,0.05])    # 5.041183 seconds (14.05 M allocations: 667.787 MiB, 4.01% gc time)
@time Threads.@threads for i = 1.:1.:8.
    determine_behaviour([i,1.,1.,0.1,3,0.05])
end                                                 # 55.370152 seconds (112.86 M allocations: 5.246 GiB, 2.99% gc time)
@time for i = 1.:1.:8.
    determine_behaviour([1.,1.,1.,0.1,3,0.05])      
end   
# With ISSEulerHeun
@time determine_behaviour([1.,1.,1.,0.1,3,0.05])    # 5.799506 seconds (15.14 M allocations: 734.356 MiB, 3.65% gc time)
@time Threads.@threads for i = 1.:1.:8.
    determine_behaviour([i,1.,1.,0.1,3,0.05])
end                                                 # 57.124844 seconds (119.14 M allocations: 5.503 GiB, 2.65% gc time)
@time for i = 1.:1.:8.
    determine_behaviour([1.,1.,1.,0.1,3,0.05])      
end                                                 #  42.296906 seconds (113.79 M allocations: 5.277 GiB, 3.39% gc time) 

Something very odd is going on here and I am not sure what it is, since it’s not a GC issue.

I am watching this with interest, as I had some similarly unintuitive results when running agent-based models in thread-parallel.