# Solving ODEs in a loop with DifferentialEquations.jl

Hi there!

I’m working on a problem where I must solve a system of ODEs repeatedly with different parameters and initial conditions. In particular, I’m not really interested in the solution as a whole; I only need to know the final state of the system once a certain condition is met.

Here’s a working example. Suppose I want to solve the Lorenz system 10 times, each time with different initial positions and system parameters. I want to evolve the system until `t = 100.0` or until the particle reaches a sphere of a certain radius `r`, that is, until `x^2 + y^2 + z^2 - r^2 = 0`. Furthermore, I don’t want the solver to save all intermediate steps and interpolation information; I’m only interested in the particle’s position at the final integration time. To that end, one could use the following code:

```julia
using DifferentialEquations

function parameterized_lorenz!(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

function main()
    initialData = [[rand(-10.0:10.0), rand(-10.0:10.0), rand(-10.0:10.0)] for i in 1:10]
    parameters = [[rand(0.0:10.0), rand(0.0:28.0), rand(0.0:10.0/3.0)] for i in 1:10]

    tspan = (0.0, 100.0)
    maxRadius = 50.0  # sphere radius at which integration stops (pick as needed)

    condition(u, t, integrator) = (u[1]^2 + u[2]^2 + u[3]^2) - maxRadius^2
    affect!(integrator) = terminate!(integrator)
    cb = ContinuousCallback(condition, affect!)

    for i in 1:10
        u0 = initialData[i]
        p = parameters[i]

        prob = ODEProblem(parameterized_lorenz!, u0, tspan, p)

        sol = solve(prob,
            Tsit5(),
            maxiters = 1e5,
            abstol = 1.0e-10,
            reltol = 1.0e-10,
            dtmin = 1.0e-12,
            callback = cb,
            dense = false,
            save_on = false,
            save_start = false,
            save_end = true,
            alias_u0 = true,
            calck = false
        )
        println("Endpoint ", i, ": ", sol[end][1], ", ", sol[end][2], ", ", sol[end][3])
    end
end

main()
```

Running this code with `--track-allocation=user`, I see that `solve` and `println` are the only two instructions allocating memory inside the loop. Since `println` is only there for testing purposes, I would like to know what else can be done to minimize allocations (again, given that I’m only interested in the position values at the final time step).

More specifically, is it possible to pre-allocate memory for the solver’s output so that it does not need to create new solution objects at each iteration?

I think this use case of repeatedly solving a system is fairly common, so if I’m missing any existing discussion on the topic please let me know.

For allocations, since the ODE is 3D, you can use `StaticArrays` for `u0`.

Additionally, it might be convenient to use an `EnsembleProblem` to benefit from parallel simulations.
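A sketch of the `StaticArrays` version (the equations are the subscripted Lorenz system from the post above; the out-of-place form returns a fresh `SVector` instead of mutating `du`):

```julia
using StaticArrays

# Out-of-place Lorenz: returns a new SVector instead of mutating `du`,
# which avoids per-step heap allocations for small systems.
function lorenz(u, p, t)
    SVector(p[1] * (u[2] - u[1]),
            u[1] * (p[2] - u[3]) - u[2],
            u[1] * u[2] - p[3] * u[3])
end
```

`u0` then needs to be an `SVector` as well, e.g. `u0 = SA[1.0, 0.0, 0.0]`, and the `ODEProblem` is built from the non-mutating function.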


Yes, the next argument after the algorithm is the vector for the time series. That is currently undocumented and subject to change, but feel free to use it for now: pass `timeseries`, `t`, and `k` to fully pre-allocate.
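A minimal sketch of what that might look like (assuming the positional `solve(prob, alg, timeseries, ts, ks; ...)` form described above; the container element types here are guesses for a 3D `Float64` problem solved with `Tsit5`):

```julia
using DifferentialEquations

function lorenz!(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

prob = ODEProblem(lorenz!, [1.0, 0.0, 0.0], (0.0, 100.0), [10.0, 28.0, 8 / 3])

# Pre-allocated output containers: the undocumented positional arguments
# after the algorithm are timeseries, t, and k. Reusing these across
# repeated solves is the point of the pre-allocation.
timeseries = Vector{Vector{Float64}}()
ts = Float64[]
ks = Vector{Vector{Vector{Float64}}}()

sol = solve(prob, Tsit5(), timeseries, ts, ks; save_on = false, save_end = true)
```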


I don’t seem to be able to use `StaticArrays` for `u0` since they are immutable and `u0` is expected to mutate. What I can do, however, is something like

```julia
function svector_parameterized_lorenz!(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
    SVector{3, eltype(du)}(du)
end
```

But this slightly decreased performance for me.

Ok, nice to know! I was able to pre-allocate, but that did not change performance much for me; in fact, it seems to have made things a bit worse (maybe I’m doing something wrong).

So, I decided to run multiple benchmarks with different versions of the problem. Here are the results for solving the system 1000 times with 20 threads on an Intel i9 and a GeForce GTX 1080:

1. Sequential loop evaluation with static arrays in the problem function:
   `266.445 ms (302345 allocations: 20.89 MiB)`
2. Sequential loop evaluation with normal arrays in the problem function:
   `265.808 ms (306346 allocations: 20.95 MiB)`
3. Sequential loop evaluation with normal arrays and pre-allocation:
   `267.950 ms (321358 allocations: 22.37 MiB)`
4. Parallel evaluation using `@Threads.threads` and normal arrays:
   `28.069 ms (300462 allocations: 20.85 MiB)`
5. Parallel evaluation using `@Threads.threads` and static arrays:
   `28.338 ms (300473 allocations: 20.85 MiB)`
6. Using an `EnsembleProblem` with CUDA arrays via `EnsembleGPUArray`:
   `210.512 ms (1267945 allocations: 71.57 MiB)`
7. Using an `EnsembleProblem` with Threads and normal arrays:
   `25.072 ms (40634 allocations: 7.01 MiB)`

So, `EnsembleProblem` with threads appears to be the clear winner here, although I really expected CUDA to be faster. Here’s the code using `EnsembleProblem`:

```julia
using DifferentialEquations
using StaticArrays
using CUDA
using DiffEqGPU

function parameterized_lorenz!(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

function main()
    dataSize = 1000

    initialData = [[rand(-10.0:10.0), rand(-10.0:10.0), rand(-10.0:10.0)] for i in 1:dataSize]
    parameters = [[rand(0.0:10.0), rand(0.0:28.0), rand(0.0:10.0/3.0)] for i in 1:dataSize]

    tspan = (0.0, 100.0)
    maxRadius = 50.0  # sphere radius at which integration stops (pick as needed)

    condition(u, t, integrator) = (u[1]^2 + u[2]^2 + u[3]^2) - maxRadius^2
    affect!(integrator) = terminate!(integrator)
    cb = ContinuousCallback(condition, affect!)

    function prob_func(problem, i, repeat)
        remake(problem, u0 = initialData[i], p = parameters[i])
    end

    function output_func(solution, i)
        (SVector{3, Float64}(solution[end][1], solution[end][2], solution[end][3]), false)
    end

    # The prototype problem only needs one representative u0 and p;
    # prob_func swaps in the per-trajectory values.
    prob = ODEProblem(parameterized_lorenz!, initialData[1], tspan, parameters[1])

    ensemble_prob = EnsembleProblem(prob, prob_func = prob_func, output_func = output_func, safetycopy = false)

    solve(ensemble_prob,
        Tsit5(),
        EnsembleThreads(),
        trajectories = dataSize,
        maxiters = 1e5,
        abstol = 1.0e-10,
        reltol = 1.0e-10,
        dtmin = 1.0e-12,
        callback = cb,
        dense = false,
        save_on = false,
        save_start = false,
        save_end = true,
        alias_u0 = true,
        calck = false
    )
end

main()
```

So, do you have any idea why CUDA via `EnsembleGPUArray` is performing so badly in contrast with `EnsembleThreads`? Also, what else can be done here to achieve peak performance?


Your example isn’t runnable, and `dataSize` is the main piece that would tell me why CUDA isn’t performing well. Do you have around 1000 trajectories or more? It really needs to fill the GPU to be worth it.

For immutable arrays, you can use the out-of-place form of the vector field:

```julia
function lorenz(u, p, t)
    [p[1] * (u[2] - u[1]),
     u[1] * (p[2] - u[3]) - u[2],
     u[1] * u[2] - p[3] * u[3]]
end
```

but that should not matter much in the end, as you noted.

Sorry about that! I’ve edited the example, and it is runnable now.

Yes I’ve benchmarked `main()` with `@btime` and `dataSize = 1000`. Replacing `EnsembleThreads()` with `EnsembleGPUArray()` produced a slowdown.

Are you using `Float32` or `Float64` for your GPU tests? `Float64` tends to be massively slower on consumer GPUs (the impact is less severe for top-shelf Quadro/RTX models) - try switching everything to `Float32`.

edit: see this (2015) article for head-to-head comparisons: Explaining FP64 performance on GPUs | ArrayFire
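Concretely, switching to `Float32` means converting everything the problem is built from; a minimal sketch (the tolerance values here are illustrative):

```julia
# Float32 versions of the state, parameters, and time span. Everything the
# solver touches has to be Float32, otherwise promotion silently brings
# Float64 back into the computation.
u0 = Float32[1.0, 0.0, 0.0]
p = Float32[10.0, 28.0, 8 / 3]
tspan = (0.0f0, 100.0f0)

# Float32 resolution is about 1.2e-7, so tolerances like 1e-10 cannot be
# met in single precision; loosen them to something representable.
abstol = 1.0f-6
reltol = 1.0f-6
```

Then build the `ODEProblem` and `solve` call from these values as before. Note that the original `abstol = 1e-10` is below `Float32` machine precision (`eps(Float32) ≈ 1.2f-7`), so it would have to be loosened anyway.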


I was using `Float64`. Switching to `Float32` improved things, but only a little: from `210.512 ms` with `Float64` to `204.785 ms` with `Float32`. Still a lot slower than the `25.072 ms` using 20 CPU threads.

The performance difference is indeed significant. Maybe 1000 trajectories is still too few to notice speed gains?


For 3 non-stiff ODEs? Yes. We’re working on a separate type of GPU-based ODE solver for that case, but it’s hard to overcome the GPU overhead of that. Usually I find the best thing is to have like 10+ ODEs, and preferably stiff. That then is where the big speedups come from (like the 175x acceleration demonstrated in the Pfizer case study of Julia Computing, that was a stiff ODE). That just packages a lot more computation into the kernels.


I see. My real problem is ray tracing in general relativity. This involves, like the Lorenz example, solving an ensemble of trajectories for photons that will be composed into an image. The system has only 7 non-stiff equations, but each equation involves quite a lot of intermediate computations with the spacetime metric and its derivatives. Also, for an `N x M` pixel image I solve the system `16 * N * M` times, so hopefully in my real application the GPU overhead will be diluted.

In any case, after the help of everyone here, I think I’m now able to implement the ensemble strategy in my code, and that alone will improve my current performance.

Since in my original post I specifically asked how to pre-allocate memory for the solver’s output, I think it’s only fair to mark Chris’s first response as the “solution” to the discussion.
