Solving ODEs in a loop with DifferentialEquations.jl

Hi there!

I’m working on a problem where I must solve a system of ODEs repeatedly with different parameters and initial conditions. In particular, I’m not really interested in the solution as a whole; I only need the final state of the system once a certain condition is met.

Here’s a working example: let’s suppose that I want to solve the Lorenz system 10 times, each time with different initial positions and system parameters. I want to evolve the system until t = 100.0 or until the particle reaches a sphere of a certain radius r, that is, until x^2 + y^2 + z^2 - r^2 = 0. Furthermore, I don’t want the solver to save all intermediate steps and interpolation information; I’m only interested in the particle’s position at the final integration time. To that end, one could use the following code:

using DifferentialEquations

function parameterized_lorenz!(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end

function main()
  initialData = [[rand(-10.0:10.0), rand(-10.0:10.0), rand(-10.0:10.0)] for i in 1:10] 
  parameters = [[rand(0.0:10.0), rand(0.0:28.0), rand(0.0:10.0/3.0)] for i in 1:10] 

  tspan = (0.0,100.0)
  maxRadius = 5.0
  
  condition(u, t, integrator) = (u[1]^2 + u[2]^2 + u[3]^2) - maxRadius^2
  affect!(integrator) = terminate!(integrator)
  cb = ContinuousCallback(condition, affect!)
  
  for i in 1:10
    u0 = initialData[i]
    p = parameters[i]
    
    prob = ODEProblem(parameterized_lorenz!,u0,tspan,p)

    sol = solve(prob,
                Tsit5(),
                adaptive = true,
                maxiters = 1e5,
                abstol = 1.0e-10,
                reltol = 1.0e-10,
                dtmin = 1.0e-12,
                callback = cb,
                dense = false,
                save_on = false,
                save_start = false,
                save_end = true,
                alias_u0 = true,
                calck = false
                )
    println("Endpoint ", i, ": ", sol[1][1], ", ", sol[1][2], ", ", sol[1][3])
  end
end

main()

Running this code with --track-allocation=user, I see that solve and println are the only two calls allocating memory inside the loop. Since println is only there for testing purposes, I would like to know what else can be done to minimize allocations (again, given that I’m only interested in the position values at the final time step).

More specifically, is it possible to pre-allocate memory for the solver’s output so that it does not need to create new solution objects at each iteration?

I think this use case of repeatedly solving a system is fairly common, so if I’m missing any existing discussion on the topic please let me know.

For allocations, since the ODE is 3D, you can use StaticArrays for u0.

Additionally, it might be convenient to use an EnsembleProblem to benefit from parallel simulations.
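
A minimal sketch of the StaticArrays route (the out-of-place right-hand side and the placeholder initial condition, parameters, and time span below are illustrative, not taken from your code):

using DifferentialEquations, StaticArrays

# Out-of-place Lorenz: takes an SVector state and returns a new SVector,
# so there is no du buffer to mutate and no per-step heap allocation.
lorenz(u, p, t) = SVector(p[1]*(u[2]-u[1]),
                          u[1]*(p[2]-u[3]) - u[2],
                          u[1]*u[2] - p[3]*u[3])

u0 = @SVector [1.0, 0.0, 0.0]   # placeholder initial condition
p  = (10.0, 28.0, 8/3)          # placeholder parameters
prob = ODEProblem(lorenz, u0, (0.0, 100.0), p)
sol = solve(prob, Tsit5(), save_everystep = false)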

2 Likes

Yes, the next argument after the algorithm is the vector for the time series. That is currently undocumented and subject to change, but feel free to use it for now for full pre-allocation: pass timeseries, t, and k to fully pre-allocate.
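
In your loop, that would look roughly like the sketch below. Since this positional-argument form is undocumented, the buffer element types here are illustrative and may need adjusting:

# Hedged sketch: reuse the same output buffers across solves by passing them
# positionally right after the algorithm (undocumented, subject to change).
timeseries = Vector{Vector{Float64}}()
ts = Vector{Float64}()
ks = Vector{Vector{Vector{Float64}}}()

for i in 1:10
    # clear the buffers between solves while keeping their allocated storage
    empty!(timeseries); empty!(ts); empty!(ks)
    prob = ODEProblem(parameterized_lorenz!, initialData[i], tspan, parameters[i])
    sol = solve(prob, Tsit5(), timeseries, ts, ks;
                callback = cb, save_on = false, save_start = false, save_end = true)
end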

2 Likes

I don’t seem to be able to use StaticArrays for u0 since they are immutable and u0 is expected to mutate. What I can do, however, is something like

function svector_parameterized_lorenz!(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
    SVector{3, eltype(du)}(du)
end

But this slightly decreased performance for me.

OK, nice to know! I was able to pre-allocate, but that did not change performance much for me; in fact, it seems to have made it a bit worse (maybe I’m doing something wrong).

So, I have decided to run multiple benchmarks with different versions of the problem. Here are the results for solving the system 1000 times with 20 threads on an Intel i9 and a GeForce GTX 1080:

  1. Sequential loop evaluation with static arrays in the problem function
    266.445 ms (302345 allocations: 20.89 MiB)
  2. Sequential loop evaluation with normal arrays in the problem function
    265.808 ms (306346 allocations: 20.95 MiB)
  3. Sequential loop evaluation with normal arrays and pre-allocation
    267.950 ms (321358 allocations: 22.37 MiB)
  4. Parallel evaluation using @Threads.threads and normal arrays (a sketch of this variant follows the list)
    28.069 ms (300462 allocations: 20.85 MiB)
  5. Parallel evaluation using @Threads.threads and static arrays
    28.338 ms (300473 allocations: 20.85 MiB)
  6. Using an EnsembleProblem with CUDA arrays via EnsembleGPUArray
    210.512 ms (1267945 allocations: 71.57 MiB)
  7. Using an EnsembleProblem with Threads and normal arrays
    25.072 ms (40634 allocations: 7.01 MiB)
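
For reference, the threaded-loop variants (4 and 5) are just the sequential loop wrapped in Threads.@threads, roughly as sketched below. The name endpoints is something I’m introducing here just to collect results; the data, tolerances, and callback are the ones from the code further down:

# Rough sketch of benchmark variant 4: the same per-trajectory solve as before,
# with iterations distributed across threads. Each iteration writes to its own
# slot of endpoints, so there is no data race between iterations.
endpoints = Vector{Vector{Float64}}(undef, dataSize)

Threads.@threads for i in 1:dataSize
    prob = ODEProblem(parameterized_lorenz!, initialData[i], tspan, parameters[i])
    sol = solve(prob, Tsit5(),
                abstol = 1.0e-10, reltol = 1.0e-10,
                callback = cb,
                save_on = false, save_start = false, save_end = true)
    endpoints[i] = sol[end]
end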

So, EnsembleProblem with threads appears to be the clear winner here, although I really expected CUDA to be faster. Here’s the code using EnsembleProblem:

using DifferentialEquations
using StaticArrays
using CUDA
using DiffEqGPU

function parameterized_lorenz!(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end

function main()
    dataSize = 1000

    initialData = [[rand(-10.0:10.0), rand(-10.0:10.0), rand(-10.0:10.0)] for i in 1:dataSize] 
    parameters = [[rand(0.0:10.0), rand(0.0:28.0), rand(0.0:10.0/3.0)] for i in 1:dataSize] 

    tspan = (0.0,100.0)
    maxRadius = 5.0
    
    condition(u, t, integrator) = (u[1]^2 + u[2]^2 + u[3]^2) - maxRadius^2
    affect!(integrator) = terminate!(integrator)
    cb = ContinuousCallback(condition, affect!)
    
    function prob_func(problem, i, repeat)
        remake(problem, u0 = initialData[i], p = parameters[i])
    end

    function output_func(solution, i)
        (SVector{3, Float64}(solution[end][1], solution[end][2], solution[end][3]), false)
    end
    
    prob = ODEProblem(parameterized_lorenz!, initialData[1], tspan, parameters[1])

    ensemble_prob = EnsembleProblem(prob, prob_func = prob_func, output_func = output_func, safetycopy = false)

    solve(ensemble_prob,
          Tsit5(),
          EnsembleThreads(),
          trajectories = dataSize,
          adaptive = true,
          maxiters = 1e5,
          abstol = 1.0e-10,
          reltol = 1.0e-10,
          dtmin = 1.0e-12,
          callback = cb,
          dense = false,
          save_on = false,
          save_start = false,
          save_end = true,
          alias_u0 = true,
          calck = false
          )
end

main()

So, do you have any idea why CUDA via EnsembleGPUArray is performing so badly in contrast with EnsembleThreads? Also, what else can be done here to achieve peak performance?

1 Like

Your example isn’t runnable, and dataSize is the main piece that would tell me why CUDA isn’t performing well. Do you have around 1000 trajectories or more? It really needs to fill the GPU to be worth it.

For immutable arrays, you can use the out-of-place vf:

function lorenz(u, p, t)
    # out-of-place form: returns a new state vector instead of mutating du
    [p[1]*(u[2]-u[1]), u[1]*(p[2]-u[3]) - u[2], u[1]*u[2] - p[3]*u[3]]
end

but that should not matter much in the end as you noted.

Sorry about that! I’ve edited the example and it is runnable now.

Yes, I’ve benchmarked main() with @btime and dataSize = 1000. Replacing EnsembleThreads() with EnsembleGPUArray() produced a slowdown.

Are you using Float32 or Float64 for your GPU tests? Float64 tends to be massively slower on consumer GPUs (the impact is less severe for top-shelf Quadro/RTX models) - try switching everything to Float32.

Edit: see this (2015) article for head-to-head comparisons: https://arrayfire.com/explaining-fp64-performance-on-gpus/
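
A sketch of what that could look like for the setup part of your code (the loosened tolerances are my suggestion, since 1e-10 is tighter than Float32 can resolve, eps(Float32) being about 1.2e-7):

# Build everything in Float32 so the GPU kernels stay in single precision.
dataSize = 1000
initialData = [Float32[rand(-10.0f0:10.0f0), rand(-10.0f0:10.0f0), rand(-10.0f0:10.0f0)] for i in 1:dataSize]
parameters  = [Float32[rand(0.0f0:10.0f0), rand(0.0f0:28.0f0), rand(0.0f0:10.0f0/3.0f0)] for i in 1:dataSize]
tspan = (0.0f0, 100.0f0)
maxRadius = 5.0f0

# ...and in the solve call, correspondingly looser tolerances:
# abstol = 1.0f-7, reltol = 1.0f-7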

1 Like

I was using Float64. Switching to Float32 improved things, but only a little: from 210.512 ms with Float64 to 204.785 ms with Float32. That is still a lot slower than the 25.072 ms using 20 CPU threads.

The performance difference is indeed significant. Maybe 1000 trajectories is still too few to see speed gains?

1 Like

For 3 non-stiff ODEs? Yes. We’re working on a separate type of GPU-based ODE solver for that case, but it’s hard to overcome the GPU overhead there. Usually I find the best thing is to have 10+ ODEs, and preferably stiff. That is where the big speedups come from (like the 175x acceleration demonstrated in Julia Computing’s Pfizer case study, which was a stiff ODE), because it packs a lot more computation into each kernel.

1 Like

I see. My real problem is ray tracing in general relativity. Like the Lorenz example, this involves solving an ensemble of trajectories, here for photons that will be composed into an image. The system has only 7 non-stiff equations, but each equation involves quite a lot of intermediate computation with the spacetime metric and its derivatives. Also, for an N x M pixel image I solve the system 16 * N * M times, so hopefully in my real application the GPU overhead will be diluted.

In any case, with the help of everyone here, I think I’m now able to implement the ensemble strategy in my code, and that alone will improve my current performance.

Since in my original post I specifically asked how to pre-allocate memory for the output, I think it’s only fair to mark Chris’s first response as the “solution” to the discussion.

1 Like