For my ODE system, EnsembleThreads (6 threads) is slower than either EnsembleSerial or EnsembleDistributed (6 processes). My machine runs Windows 10 and Julia v1.6 on a 6-core i7-8700K with 32 GB RAM.
I’m running optical Bloch equations in EnsembleProblems to do Monte-Carlo trajectory simulations through an approximation of our experimental setup.
Each single trajectory takes about 220 ms, with the actual time depending on the input parameters. To run multiple trajectories under different parameter sets I use an EnsembleProblem; an example is shown below:
```julia
using BenchmarkTools
using LinearAlgebra
using Trapz
using DifferentialEquations

function sine_wave(t, frequency, phase)
    0.5 * (1 + sin(2 * pi * frequency * t + phase))
end

function prob_func(prob, i, repeat)
    remake(prob, p = [Ω1; params[i]; Δ1; Ω2; params[i]; Δ2])
end

function output_func(sol, i)
    return trapz(sol.t, [real(sum(diag(sol.u[j])[13:end])) for j in 1:length(sol.u)]), false
end

Γ = 2pi * 1.6e6
Ω1 = Γ
Ω2 = Γ
Δ1 = 0.
Δ2 = 0.
νp1 = 1e6
νp2 = 1e6
n_states = 16
params = ones(100) * 2.8e6
p = [Ω1, νp1, Δ1, Ω2, νp2, Δ2]

ρ = zeros(ComplexF64, n_states, n_states)
ρ[1,1] = 1.
tspan = (0, 300e-6)
du = zeros(ComplexF64, n_states, n_states)

prob = ODEProblem(Lindblad_rhs!, ρ, tspan, p)
ens_prob = EnsembleProblem(prob, prob_func = prob_func, output_func = output_func)

@benchmark solve(ens_prob, Tsit5(), EnsembleSerial(),
                 save_start = true, save_end = true, save_everystep = true;
                 trajectories = length(params), dt = 1e-9, adaptive = true,
                 abstol = 5e-7, reltol = 5e-4)
```
Lindblad_rhs! is the ODE function, which is about 300 lines so I’ve included a link.
The first part of the function is shown below; what follows is a series of matrix multiplication operations written out line by line, e.g. du[1,2] = … etc. (this code is generated by Python and depends on which states and couplings are added).
```julia
function Lindblad_rhs!(du, ρ, p, t)
    @inbounds begin
        Ω1 = p[1]
        νp1 = p[2]
        Δ1 = p[3]
        Ω2 = p[4]
        νp2 = p[5]
        Δ2 = p[6]
        Ω1ᶜ = conj(Ω1)
        Ω2ᶜ = conj(Ω2)
        Px1 = sine_wave(t, νp1, 4.71238898038469)
        Pz1 = sine_wave(t, νp1, 1.5707963267948966)
        Px2 = sine_wave(t, νp2, 1.5707963267948966)
        Pz2 = sine_wave(t, νp2, 4.71238898038469)
        norm1 = sqrt(Px1^2 + Pz1^2)
        norm2 = sqrt(Px2^2 + Pz2^2)
        Px1 /= norm1
        Pz1 /= norm1
        Px2 /= norm2
        Pz2 /= norm2
        du[1,1] = 2198254.70329198*ρ[14,14] + 2198721.58229457*ρ[15,15] + 2199188.50568063*ρ[16,16] - 1.0*1im*(0.487104585992817*Ω2*ρ[14,1]*Px2 - 0.688943961004229*Ω2*ρ[15,1]*Pz2 - 0.487209306651389*Ω2*ρ[16,1]*Px2 - 0.487104585992817*Ω2ᶜ*ρ[1,14]*Px2 + 0.688943961004229*Ω2ᶜ*ρ[1,15]*Pz2 + 0.487209306651389*Ω2ᶜ*ρ[1,16]*Px2)
        # … the remaining generated terms continue like this …
```
I’ve benchmarked this and it runs with 0 allocations.
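For reference, this is the kind of allocation check I mean; a minimal sketch using a toy in-place RHS (the names here are illustrative, not the real Lindblad_rhs!):

```julia
# Toy in-place RHS standing in for Lindblad_rhs!: it only writes into du,
# with concrete argument types, so it should not allocate at all.
function toy_rhs!(du, u, p, t)
    @inbounds for i in eachindex(u)
        du[i] = -p[1] * u[i]
    end
    return nothing
end

function check_allocs()
    u  = rand(ComplexF64, 16, 16)
    du = similar(u)
    p  = (1.0,)
    toy_rhs!(du, u, p, 0.0)               # warm up so compilation isn't counted
    return @allocated toy_rhs!(du, u, p, 0.0)
end

println("allocations: ", check_allocs(), " bytes")   # expect 0
```

Wrapping the call in a function keeps the arguments concretely typed, so `@allocated` measures only what the RHS itself allocates.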
Running with EnsembleSerial takes about 22 seconds on my machine, whereas EnsembleThreads takes 61 s with 6 threads. EnsembleDistributed (with the appropriate @everywhere added) with 6 processes takes only 5.9 seconds. That is a speedup of roughly 3.7×, which, on top of the speedup gained from moving to Julia from SciPy, is already great.
However, I’d like to understand why EnsembleThreads is so much slower; so far I haven’t been able to figure it out. The only thing I can think of is that some thread-safety mechanism impacts the performance.
When using EnsembleThreads, CPU usage increases to about the same level as with EnsembleDistributed, as I would expect (i.e. about 6× that of EnsembleSerial), but execution takes much longer.
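As a generic sanity check (independent of DifferentialEquations, so this is just an illustrative sketch), a CPU-bound kernel with no shared state should get faster under Threads.@threads; if even this doesn't scale, the problem would be in the threading setup itself rather than in EnsembleThreads:

```julia
using Base.Threads

# CPU-bound kernel with no shared state, standing in for one trajectory.
function work(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)
    end
    return s
end

function run_serial(ntasks, n)
    out = Vector{Float64}(undef, ntasks)
    for i in 1:ntasks
        out[i] = work(n)
    end
    return out
end

function run_threaded(ntasks, n)
    out = Vector{Float64}(undef, ntasks)
    @threads for i in 1:ntasks
        out[i] = work(n)          # each iteration writes its own slot
    end
    return out
end

run_serial(2, 10); run_threaded(2, 10)            # warm up (compile first)
t_serial   = @elapsed run_serial(24, 2_000_000)
t_threaded = @elapsed run_threaded(24, 2_000_000)
println("threads = ", nthreads(),
        ", serial = ", t_serial, " s, threaded = ", t_threaded, " s")
```

On my machine this toy case does scale with thread count, which is what makes the ensemble behaviour puzzling.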
output_func reduces the full solution for a single trajectory to a single Float64, so running out of memory and swapping is not an issue (and that would also show up in EnsembleDistributed).
Running the examples from the Parallel Ensemble Simulations page of the DifferentialEquations.jl documentation shows a slight speedup for EnsembleThreads over EnsembleSerial.
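For concreteness, the comparison I ran is essentially the Lorenz ensemble from that page (reproduced from memory here, so treat trajectory counts and timings as illustrative):

```julia
using DifferentialEquations

function lorenz!(du, u, p, t)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end

u0 = [1.0, 0.0, 0.0]
prob = ODEProblem(lorenz!, u0, (0.0, 100.0))

# Perturb the initial condition for each trajectory.
prob_func(prob, i, repeat) = remake(prob, u0 = rand(3) .* u0)
ensemble_prob = EnsembleProblem(prob, prob_func = prob_func)

solve(ensemble_prob, Tsit5(), EnsembleSerial(), trajectories = 2)   # warm up
solve(ensemble_prob, Tsit5(), EnsembleThreads(), trajectories = 2)

@time solve(ensemble_prob, Tsit5(), EnsembleSerial(), trajectories = 100)
@time solve(ensemble_prob, Tsit5(), EnsembleThreads(), trajectories = 100)
```

With this problem the threaded version is (slightly) faster for me, so the slowdown seems specific to my Lindblad system rather than to EnsembleThreads in general.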