EnsembleThreads slower than EnsembleSerial or EnsembleDistributed

For my ODE system, EnsembleThreads (6 threads) is slower than either EnsembleSerial or EnsembleDistributed (6 processes). My specs: Windows 10, a 6-core i7-8700K with 32 GB RAM, running Julia v1.6.

I’m running optical Bloch equations as an EnsembleProblem to do Monte Carlo trajectory simulations through an approximation of our experimental setup.
A single trajectory takes about 220 ms, with the actual time depending on the input parameters. To run many trajectories under different parameter sets I use EnsembleProblem; an example is shown below:

using BenchmarkTools
using LinearAlgebra
using Trapz
using DifferentialEquations

function sine_wave(t, frequency, phase)
    # sinusoidal modulation between 0 and 1 (scalar arguments, no broadcasting needed)
    0.5 * (1 + sin(2pi * frequency * t + phase))
end

function prob_func(prob, i, repeat)
    # new parameter vector for trajectory i; Ω1, Δ1, Ω2, Δ2 and params are globals
    remake(prob, p = [Ω1; params[i]; Δ1; Ω2; params[i]; Δ2])
end

function output_func(sol, i)
    # integrate the total population of states 13:16 over the trajectory; keep only that Float64
    return trapz(sol.t, [real(sum(diag(sol.u[j])[13:end])) for j in eachindex(sol.u)]), false
end

Γ = 2pi*1.6e6
Ω1 = Γ
Ω2 = Γ
Δ1 = 0.
Δ2 = 0.
νp1 = 1e6
νp2 = 1e6
n_states = 16

params = ones(100)*2.8e6

p = [Ω1, νp1, Δ1, Ω2, νp2, Δ2]

ρ = zeros(ComplexF64, n_states, n_states)
ρ[1,1] = 1.
tspan = (0,300e-6)

du = zeros(ComplexF64, n_states, n_states)

prob = ODEProblem(Lindblad_rhs!,ρ,tspan,p)

ens_prob = EnsembleProblem(prob, prob_func = prob_func, output_func = output_func)

@benchmark solve(ens_prob, Tsit5(), EnsembleSerial(), save_start = true, save_end = true,
save_everystep = true; trajectories = size(params)[1], dt = 1e-9, adaptive=true,
abstol = 5e-7, reltol = 5e-4)

Here Lindblad_rhs! is the ODE function; it is about 300 lines, so I’ve included a link.
The first part of the function is shown below. What follows after that is a series of matrix-multiplication operations written out element by element, e.g. du[1,2] = … etc. (this code is generated by Python and depends on which states and couplings are included).

function Lindblad_rhs!(du, ρ, p, t)
	@inbounds begin
		Ω1 = p[1]
		νp1 = p[2]
		Δ1 = p[3]
		Ω2 = p[4]
		νp2 = p[5]
		Δ2 = p[6]
		Ω1ᶜ = conj(Ω1)
		Ω2ᶜ = conj(Ω2)
		Px1 = sine_wave(t, νp1, 4.71238898038469)
		Pz1 = sine_wave(t, νp1, 1.5707963267948966)
		Px2 = sine_wave(t, νp2, 1.5707963267948966)
		Pz2 = sine_wave(t, νp2, 4.71238898038469)
		norm1 = sqrt(Px1^2+Pz1^2)
		norm2 = sqrt(Px2^2+Pz2^2)
		Px1 /= norm1
		Pz1 /= norm1
		Px2 /= norm2
		Pz2 /= norm2
		du[1,1] = 2198254.70329198*ρ[14,14] + 2198721.58229457*ρ[15,15] + 2199188.50568063*ρ[16,16] - 1.0*1im*(0.487104585992817*Ω2*ρ[14,1]*Px2 - 0.688943961004229*Ω2*ρ[15,1]*Pz2 - 0.487209306651389*Ω2*ρ[16,1]*Px2 - 0.487104585992817*Ω2ᶜ*ρ[1,14]*Px2 + 0.688943961004229*Ω2ᶜ*ρ[1,15]*Pz2 + 0.487209306651389*Ω2ᶜ*ρ[1,16]*Px2)
		# ... the remaining generated assignments, one per matrix element du[i,j] ...
	end
	nothing
end

I’ve benchmarked this and it runs with 0 allocations.
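For reference, that check looks roughly like this, just timing a single evaluation of the RHS with the arrays defined above:

using BenchmarkTools

du_check = zeros(ComplexF64, n_states, n_states)
ρ_check  = copy(ρ)

# Interpolate the arguments so only the RHS call itself is measured;
# the reported memory estimate is what I mean by "0 allocations".
@btime Lindblad_rhs!($du_check, $ρ_check, $p, 1e-6)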

Running with EnsembleSerial takes about 22 seconds on my machine, whereas EnsembleThreads takes 61 s with 6 threads. EnsembleDistributed (with the appropriate @everywhere definitions added, setup sketched below) with 6 processes takes only 5.9 seconds. That is a speedup of about 3.7x, which, on top of the speedup gained from moving to Julia from SciPy, is already great.
However, I’d like to understand why EnsembleThreads is so much slower; so far I haven’t been able to figure out why. The only thing I can think of is that some thread-safety mechanism impacts the performance.
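For reference, the EnsembleDistributed setup looks roughly like this; the include file name is just a placeholder for loading the definitions shown above on each process:

using Distributed
addprocs(6)  # one worker per physical core

# Everything the workers call has to be defined on them as well:
@everywhere using DifferentialEquations, LinearAlgebra, Trapz
@everywhere include("lindblad_setup.jl")  # placeholder: defines sine_wave, Lindblad_rhs!,
                                          # prob_func, output_func, params and the constants

solve(ens_prob, Tsit5(), EnsembleDistributed();
      trajectories = length(params), dt = 1e-9, adaptive = true,
      abstol = 5e-7, reltol = 5e-4)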

When using EnsembleThreads, CPU usage increases to about the same level as with EnsembleDistributed, as I would expect (i.e. about 6x that of EnsembleSerial), but execution takes a lot longer.

output_func reduces the full solution of a single trajectory to a single Float64, so running out of memory and swapping is not the issue (that would also show up with EnsembleDistributed).

Running the examples from the Parallel Ensemble Simulations page of the DifferentialEquations.jl documentation shows a slight speedup for EnsembleThreads over EnsembleSerial.
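For concreteness, the kind of quick check I ran looks roughly like this; it is a toy problem in the spirit of the documentation example, reproduced from memory rather than copied exactly:

using DifferentialEquations, BenchmarkTools

# small, fast trajectories, so threading overhead is low
toy_prob = ODEProblem((u, p, t) -> 1.01u, 0.5, (0.0, 1.0))
toy_prob_func(prob, i, repeat) = remake(prob, u0 = rand() * prob.u0)
toy_ens = EnsembleProblem(toy_prob, prob_func = toy_prob_func)

@btime solve($toy_ens, Tsit5(), EnsembleSerial(); trajectories = 10_000);
@btime solve($toy_ens, Tsit5(), EnsembleThreads(); trajectories = 10_000);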

I improved the performance of it today, and this is what I get now:

using BenchmarkTools
using LinearAlgebra
using Trapz
using DifferentialEquations

function sine_wave(t, frequency, phase)
    # sinusoidal modulation between 0 and 1 (scalar arguments, no broadcasting needed)
    0.5 * (1 + sin(2pi * frequency * t + phase))
end

function prob_func(prob, i, repeat)
    # new parameter vector for trajectory i; Ω1, Δ1, Ω2, Δ2 and params are globals
    remake(prob, p = [Ω1; params[i]; Δ1; Ω2; params[i]; Δ2])
end

function output_func(sol, i)
    # integrate the total population of states 13:16 over the trajectory; keep only that Float64
    return trapz(sol.t, [real(sum(diag(sol.u[j])[13:end])) for j in eachindex(sol.u)]), false
end

function Lindblad_rhs!(du, ρ, p, t)
	@inbounds begin
		Ω1 = p[1]
		νp1 = p[2]
		Δ1 = p[3]
		Ω2 = p[4]
		νp2 = p[5]
		Δ2 = p[6]
		Ω1ᶜ = conj(Ω1)
		Ω2ᶜ = conj(Ω2)
		Px1 = sine_wave(t, νp1, 4.71238898038469)
		Pz1 = sine_wave(t, νp1, 1.5707963267948966)
		Px2 = sine_wave(t, νp2, 1.5707963267948966)
		Pz2 = sine_wave(t, νp2, 4.71238898038469)
		norm1 = sqrt(Px1^2+Pz1^2)
		norm2 = sqrt(Px2^2+Pz2^2)
		Px1 /= norm1
		Pz1 /= norm1
		Px2 /= norm2
		Pz2 /= norm2
		du[1,1] = 2198254.70329198*ρ[14,14] + 2198721.58229457*ρ[15,15] + 2199188.50568063*ρ[16,16] - 1.0*1im*(0.487104585992817*Ω2*ρ[14,1]*Px2 - 0.688943961004229*Ω2*ρ[15,1]*Pz2 - 0.487209306651389*Ω2*ρ[16,1]*Px2 - 0.487104585992817*Ω2ᶜ*ρ[1,14]*Px2 + 0.688943961004229*Ω2ᶜ*ρ[1,15]*Pz2 + 0.487209306651389*Ω2ᶜ*ρ[1,16]*Px2)
    end
end

Γ = 2pi*1.6e6
Ω1 = Γ
Ω2 = Γ
Δ1 = 0.
Δ2 = 0.
νp1 = 1e6
νp2 = 1e6
n_states = 16

params = ones(100)*2.8e6

p = [Ω1, νp1, Δ1, Ω2, νp2, Δ2]

ρ = zeros(ComplexF64, n_states, n_states)
ρ[1,1] = 1.
tspan = (0,300e-6)

du = zeros(ComplexF64, n_states, n_states)

prob = ODEProblem(Lindblad_rhs!,ρ,tspan,p)

ens_prob = EnsembleProblem(prob, prob_func = prob_func, output_func = output_func)

@benchmark solve(ens_prob, Tsit5(), EnsembleSerial(), save_start = true, save_end = true,
                 save_everystep = true; trajectories = size(params)[1],
                 dt = 1e-9, adaptive=true, abstol = 5e-7, reltol = 5e-4)

@benchmark solve(ens_prob, Tsit5(), EnsembleThreads(), save_start = true, save_end = true,
              save_everystep = true; trajectories = size(params)[1],
              dt = 1e-9, adaptive=true, abstol = 5e-7, reltol = 5e-4)

Serial:

BenchmarkTools.Trial: 
  memory estimate:  30.18 MiB
  allocs estimate:  14322
  --------------
  minimum time:     33.391 ms (0.00% GC)
  median time:      52.354 ms (0.00% GC)
  mean time:        55.542 ms (16.06% GC)
  maximum time:     87.034 ms (32.75% GC)
  --------------
  samples:          91
  evals/sample:     1

Threads:

BenchmarkTools.Trial: 
  memory estimate:  30.19 MiB
  allocs estimate:  14345
  --------------
  minimum time:     6.456 ms (0.00% GC)
  median time:      16.632 ms (0.00% GC)
  mean time:        24.550 ms (32.37% GC)
  maximum time:     193.020 ms (86.70% GC)
  --------------
  samples:          204
  evals/sample:     1

So update your packages and give it a try.
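For completeness, the update and a quick thread-count check look like this (assuming the thread count is set with julia --threads 6 or the JULIA_NUM_THREADS environment variable):

using Pkg
Pkg.update()          # pulls in the latest DifferentialEquations / SciML releases

Threads.nthreads()    # should report 6 when Julia is started with `julia --threads 6`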


With updated packages and the same Lindblad_rhs! function, I do see an improvement using EnsembleThreads over EnsembleSerial:

Serial:

BenchmarkTools.Trial:
  memory estimate:  30.18 MiB
  allocs estimate:  14322
  --------------
  minimum time:     6.780 ms (0.00% GC)
  median time:      8.922 ms (22.01% GC)
  mean time:        8.486 ms (13.97% GC)
  maximum time:     22.557 ms (31.40% GC)
  --------------
  samples:          3530
  evals/sample:     1

Threads:

BenchmarkTools.Trial:
  memory estimate:  30.19 MiB
  allocs estimate:  14355
  --------------
  minimum time:     1.827 ms (0.00% GC)
  median time:      4.166 ms (0.00% GC)
  mean time:        5.854 ms (33.06% GC)
  maximum time:     61.606 ms (92.00% GC)
  --------------
  samples:          5112
  evals/sample:     1

There is, however, no change in performance when I run the full function (5 s Distributed, 22 s Serial, 61 s Threads).

I’ve narrowed it down to a minimal example where EnsembleThreads starts to run into problems (on my machine at least):

function Lindblad_rhs!(du, ρ, p, t)
	@inbounds begin
		Ω1 = p[1]
		νp1 = p[2]
		Δ1 = p[3]
		Ω2 = p[4]
		νp2 = p[5]
		Δ2 = p[6]
		Ω1ᶜ = conj(Ω1)
		Ω2ᶜ = conj(Ω2)
		Px1 = sine_wave(t, νp1, 4.71238898038469)
		Pz1 = sine_wave(t, νp1, 1.5707963267948966)
		Px2 = sine_wave(t, νp2, 1.5707963267948966)
		Pz2 = sine_wave(t, νp2, 4.71238898038469)
		norm1 = sqrt(Px1^2+Pz1^2)
		norm2 = sqrt(Px2^2+Pz2^2)
		Px1 /= norm1
		Pz1 /= norm1
		Px2 /= norm2
		Pz2 /= norm2
		du[1,14] = -5026548.24574367*ρ[1,14] - 1.0*1im*(-0.487104536814559*Ω2*ρ[1,1]*Px2 - 0.674428203801002*Ω2*ρ[1,2]*Pz2 - 0.476946169171229*Ω2*ρ[1,3]*Px2 + 0.283550246614439*Ω2*ρ[1,5]*Pz2 + 0.200508115928818*Ω2*ρ[1,6]*Px2 - 0.43301270380128*Ω2*ρ[1,8]*Px2 + 0.433033582428828*Ω2*ρ[1,9]*Pz2 + 0.176793733186045*Ω2*ρ[1,10]*Px2 + 0.487104536814559*Ω2*ρ[14,14]*Px2 - 0.688943961004228*Ω2*ρ[15,14]*Pz2 - 0.487209355797023*Ω2*ρ[16,14]*Px2 - ρ[1,14]*(Δ2 + 91343.5840148926) - 1245272.14598083*ρ[1,14])
    end
    nothing
end

Serial:

BenchmarkTools.Trial:
  memory estimate:  6.01 GiB
  allocs estimate:  2063222
  --------------
  minimum time:     1.851 s (14.80% GC)
  median time:      1.887 s (15.18% GC)
  mean time:        1.914 s (15.31% GC)
  maximum time:     2.070 s (16.10% GC)
  --------------
  samples:          16
  evals/sample:     1

Threads:

BenchmarkTools.Trial:
  memory estimate:  6.01 GiB
  allocs estimate:  2063255
  --------------
  minimum time:     3.825 s (50.15% GC)
  median time:      4.199 s (47.21% GC)
  mean time:        4.212 s (46.91% GC)
  maximum time:     4.551 s (49.54% GC)
  --------------
  samples:          8
  evals/sample:     1

A single solve in this case takes about 11 ms, versus 140 μs for the case where Threads is indeed faster, so it seems to be related to the time it takes to solve a single problem.
The documentation does mention that Distributed is recommended for cases where trajectories take at least a millisecond each, but why is the threaded parallelism so much slower, and for larger problems even slower than Serial?

The GC is serial. If you multithread and hit the GC, it’ll stop all parallelism and be slower than the serial case if the GC is hit too often. That looks to be what’s going on here.
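One way to reduce that GC pressure, assuming the integrated population of states 13–16 is all you need from each trajectory, is an allocation-light output_func along these lines (a sketch, not the only option):

using Trapz

function output_func(sol, i)
    # one Float64 vector per trajectory instead of a temporary diag copy per saved time point
    vals = Vector{Float64}(undef, length(sol.u))
    for j in eachindex(sol.u)
        s = 0.0
        @inbounds for k in 13:16
            s += real(sol.u[j][k, k])
        end
        vals[j] = s
    end
    return trapz(sol.t, vals), false   # false = do not rerun this trajectory
end

Most of the remaining garbage comes from saving the full 16×16 complex state at every step, so saveat, or a SavingCallback from DiffEqCallbacks that accumulates the population sum during the solve, would cut it down further.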