For my ODE system, EnsembleThreads (6 threads) is slower than either EnsembleSerial or EnsembleDistributed (6 processes). My machine runs Windows 10 with a 6-core i7-8700K and 32 GB RAM, on Julia v1.6.
I’m running optical Bloch equations in an EnsembleProblem to do Monte-Carlo trajectory simulations through an approximation of our experimental setup.
Each single trajectory takes about 220 ms, with the exact time depending on the input parameters. To run many trajectories under different parameter sets I use EnsembleProblem; an example is shown below:
using BenchmarkTools
using LinearAlgebra
using Trapz
using DifferentialEquations
function sine_wave(t, frequency, phase)
    0.5 * (1 + sin(2pi * frequency * t + phase))
end
function prob_func(prob, i, repeat)
    remake(prob, p = [Ω1; params[i]; Δ1; Ω2; params[i]; Δ2])
end
function output_func(sol, i)
    return trapz(sol.t, [real(sum(diag(sol.u[j])[13:end])) for j in 1:length(sol.u)]), false
end
Γ = 2pi*1.6e6
Ω1 = Γ
Ω2 = Γ
Δ1 = 0.
Δ2 = 0.
νp1 = 1e6
νp2 = 1e6
n_states = 16
params = ones(100)*2.8e6
p = [Ω1, νp1, Δ1, Ω2, νp2, Δ2]
ρ = zeros(ComplexF64, n_states, n_states)
ρ[1,1] = 1.
tspan = (0.0, 300e-6)
du = zeros(ComplexF64, n_states, n_states)
prob = ODEProblem(Lindblad_rhs!,ρ,tspan,p)
ens_prob = EnsembleProblem(prob, prob_func = prob_func, output_func = output_func)
@benchmark solve(ens_prob, Tsit5(), EnsembleSerial(); trajectories = length(params),
    save_start = true, save_end = true, save_everystep = true,
    dt = 1e-9, adaptive = true, abstol = 5e-7, reltol = 5e-4)
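The threaded and distributed timings below use the same call with only the ensemble algorithm swapped out, roughly as follows (the distributed case additionally needs using Distributed, addprocs(6), and @everywhere in front of the function and parameter definitions above):
# Threaded across the 6 threads Julia was started with (julia --threads 6):
@benchmark solve(ens_prob, Tsit5(), EnsembleThreads(); trajectories = length(params),
    save_start = true, save_end = true, save_everystep = true,
    dt = 1e-9, adaptive = true, abstol = 5e-7, reltol = 5e-4)
# Distributed over 6 worker processes:
@benchmark solve(ens_prob, Tsit5(), EnsembleDistributed(); trajectories = length(params),
    save_start = true, save_end = true, save_everystep = true,
    dt = 1e-9, adaptive = true, abstol = 5e-7, reltol = 5e-4)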
Here Lindblad_rhs! is the ODE function; it is about 300 lines long, so I’ve included a link instead. The first part of the function is shown below; what follows is a series of matrix multiplication operations written out element by element (du[1,2] = …, etc.). This code is generated by a Python script and depends on which states and couplings are included:
function Lindblad_rhs!(du, ρ, p, t)
    @inbounds begin
        Ω1 = p[1]
        νp1 = p[2]
        Δ1 = p[3]
        Ω2 = p[4]
        νp2 = p[5]
        Δ2 = p[6]
        Ω1ᶜ = conj(Ω1)
        Ω2ᶜ = conj(Ω2)
        Px1 = sine_wave(t, νp1, 4.71238898038469)
        Pz1 = sine_wave(t, νp1, 1.5707963267948966)
        Px2 = sine_wave(t, νp2, 1.5707963267948966)
        Pz2 = sine_wave(t, νp2, 4.71238898038469)
        norm1 = sqrt(Px1^2 + Pz1^2)
        norm2 = sqrt(Px2^2 + Pz2^2)
        Px1 /= norm1
        Pz1 /= norm1
        Px2 /= norm2
        Pz2 /= norm2
        du[1,1] = 2198254.70329198*ρ[14,14] + 2198721.58229457*ρ[15,15] + 2199188.50568063*ρ[16,16] - 1.0*1im*(0.487104585992817*Ω2*ρ[14,1]*Px2 - 0.688943961004229*Ω2*ρ[15,1]*Pz2 - 0.487209306651389*Ω2*ρ[16,1]*Px2 - 0.487104585992817*Ω2ᶜ*ρ[1,14]*Px2 + 0.688943961004229*Ω2ᶜ*ρ[1,15]*Pz2 + 0.487209306651389*Ω2ᶜ*ρ[1,16]*Px2)
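        # … the remaining generated entries (du[1,2], du[1,3], …) follow the
        # same pattern, roughly 300 lines in total.
    end
end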
I’ve benchmarked the RHS function itself and it runs with 0 allocations.
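For reference, the check was along these lines, reusing the du buffer defined above:
# A single in-place RHS call at an arbitrary time within tspan;
# with the setup above this reports 0 allocations.
@benchmark Lindblad_rhs!($du, $ρ, $p, 1e-6)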
Running with EnsembleSerial takes about 22 seconds on my machine, whereas EnsembleThreads takes 61 s with 6 threads. EnsembleDistributed (with the appropriate @everywhere added) with 6 processes takes only 5.9 seconds, a 3.7× speedup over serial, which, on top of the speedup gained from moving to Julia from SciPy, is already great.
However, I’d like to understand why EnsembleThreads is so much slower; so far I haven’t been able to figure it out. The only thing I can think of is that some thread-safety mechanism impacts the performance.
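As a sanity check, I’m considering comparing against a hand-rolled threaded loop; a rough sketch, assuming the setup above (one deepcopy per iteration so no state is shared between tasks):
using Base.Threads

results = Vector{Float64}(undef, length(params))
@threads for i in eachindex(params)
    # deepcopy so each task gets its own problem (and u0 buffer)
    local_prob = remake(deepcopy(prob), p = [Ω1; params[i]; Δ1; Ω2; params[i]; Δ2])
    sol = solve(local_prob, Tsit5(); dt = 1e-9, adaptive = true,
                abstol = 5e-7, reltol = 5e-4)
    results[i] = first(output_func(sol, i))
end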
When using EnsembleThreads, CPU usage increases to about the same level as with EnsembleDistributed, as I would expect (i.e. about 6× that of EnsembleSerial), but execution takes a lot longer.
output_func reduces the full solution of a single trajectory to a single Float64, so running out of memory and swapping is not an issue (and it would also show up in EnsembleDistributed if it were).
Running the examples from the Parallel Ensemble Simulations page of the DifferentialEquations.jl documentation shows a slight speedup of EnsembleThreads over EnsembleSerial.
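For reference, the doc example I tried is essentially the Lorenz ensemble below (reproduced from memory, so details may differ slightly from the current guide):
function lorenz!(du, u, p, t)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end
prob_lorenz = ODEProblem(lorenz!, [1.0, 0.0, 0.0], (0.0, 1.0))
# every trajectory starts from a randomly scaled initial condition
ens_lorenz = EnsembleProblem(prob_lorenz,
    prob_func = (prob, i, repeat) -> remake(prob, u0 = rand(3) .* prob.u0))
@benchmark solve(ens_lorenz, Tsit5(), EnsembleSerial(); trajectories = 1000)
@benchmark solve(ens_lorenz, Tsit5(), EnsembleThreads(); trajectories = 1000)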