Why the separation of `ODEProblem` and `solve` in DifferentialEquations.jl?

Is there a performance benefit to reusing ODEProblems? I’m curious why the ODEProblem struct/function if the equivalent info could also just be handed off to solve?

I have a problem where I repeatedly need to solve the same large ODE with slightly different conditions repeatedly, and I’d like to avoid reuse work as much as possible if possible. I’m currently putting a lot of pressure on the GC by repeatedly creating ODEProblems and then solveing them. Is there a perf trick that I’m missing?

2 Likes

We need this information in order to know how to dispatch between ODEs, SDEs, DAEs, DDEs, jump equations, etc. In some cases, a function, initial condition, and timespan are not enough to distinguish between these cases. It also let’s us to a lot of engineering and introspection at a higher level. See the coming JuliaCon talk on automated optimization and parallelism which relies on introspection of the problem types pre-solve. Would could have people paste the same 20 arguments in multiple points of their code, but seems decidedly against DRY.

The first thing to note is that if you are creating thousands of really small ODEProblems and want to avoid the dynamic checking going on, then you should do ODEProblem{false}(...) or ODEProblem{true}(...), i.e. directly declare whether it’s in in-place or out-of-place form. Normally this isn’t a huge deal so it’s not mentioned very often in the documentation, but it’s in there since there are scenarios where this helps.

Secondly, note that there are a lot of optimizations in Julia v1.5 (coming out next week) and Julia v1.6 which are specifically designed to optimize this case. Most notably,

  • Immutable structs (including tuples) that contain references can now be allocated on the stack, and allocated inline within arrays and other structs (#33886). This significantly reduces the number of heap allocations in some workloads. Code that requires assumptions about object layout and addresses (usually for interoperability with C or other languages) might need to be updated; for example any object that needs a stable address should be a mutable struct . As a result, Array view s no longer allocate ([#34126]).

On v1.5 the ODEProblem, since it is a fully typed immutable struct, should be completely elided on Julia v1.5. So that issue should completely disappear.

That’s the issue that you know about. Let me just quickly mention though that’s not the real issue: this is only 80 bytes and while annoying it’s not the allocation you’re actually worried about. The one you’re actually worried about a little bit more insidious is:

Essentially previous versions of Julia were unable to rely on constant values inside of keyword arguments, and so because you could do things like save_idxs=1 to change the output from an array to a scalar of just the first value, the existence of these features caused some instabilities which you have to be careful about. We were able to prove this was the case by isolated every issue and get it statically compiling on Julia v1.4, but it required removing a few keyword arguments which we noted were all linked to this idea.

This was fixed in https://github.com/JuliaLang/julia/pull/35976 and it was the last thing added to the v1.5 backports: https://github.com/JuliaLang/julia/pull/36098 so it should be fixed in the RC1. So please test the RC and see if that’s all fixed up. If that doesn’t reduce the last small allocation, I have a hook:

which would allow for specializing the compiled output, but it requires these changes in v1.5 to handle all of the keyword arguments (otherwise it’ll just error outputs which don’t match the default).

That said, one person did find a case which wasn’t specialized by this: https://github.com/JuliaLang/julia/pull/36292#issuecomment-652055307 but we can make sure to open up an issue so this gets fixed in Julia v1.6 if that effects DiffEq users.

This sounds like a good JuliaCon time issue though: let’s get this all cleaned up.

5 Likes

Thank you for such a thorough answer, Chris! I’ll start digging through each of these references further. Looking forward to v1.5!

After some more detailed profiling, I found that in fact performance of the solve on the problem is fast; it’s actually the solve on the ODEAdjointProblem that is much more problematic. Here are some numbers on a very small, toy example:

forward:  0.000738 seconds (4.65 k allocations: 958.266 KiB)
adjoint:  0.007064 seconds (98.62 k allocations: 5.837 MiB)

This is just running the adjoint process on the “u” variables, no parameters yet.

Can I get that example? Reverse mode will always need to allocate unless it’s a reversible AD simply because it needs to cache values for the reverse pass, but I don’t know if it should be that high. I can take a look and profile it some and see if there’s any major offender to fix.

Note that if calculations are small enough to consider the allocations of the ODEProblem itself, maybe you should be using forward sensitivities?

1 Like

Yeah, absolutely. I’m super curious to figure out what’s going on! Here’s a small MWE: https://gist.github.com/samuela/8d7cf55cf921decfffc7559691bfad12.

On this toy problem obv everything is already very fast, but in more realistic scenarios I’m seeing 10-100x slowdowns. Forward solves that take less than 100ms, and corresponding adjoint solves that take 10s. That’s on a 12-dimensional quadrotor system defined in much the same way. Happy to point you towards some example code for that as well if you’re interested!

https://diffeq.sciml.ai/latest/analysis/sensitivity/

TrackerAdjoint is able to use a TrackedArray form with out-of-place functions du = f(u,p,t) but requires an Array{TrackedReal} form for f(du,u,p,t) mutating du . The latter has much more overhead, and should be avoided if possible. Thus if solving non-ODEs with lots of parameters, using TrackerAdjoint with an out-of-place definition may be the current best option.

The same is true about the vjps, though if you don’t have branching you can ReverseDiffVJP(true) to compile the backpass. But I think your problem is probably much better off doing

function aug_dynamics!(z, policy_params, t)
    x = @view z[2:end]
    u = policy(x, policy_params)
    [x' * x + u' * u;u]
end

to accommodate for those factors in reverse mode AD (or use the AoS->SoA conversion and convert back).

1 Like

Gotcha, I misunderstood in-place dynamics as always being faster than out-of-place ones. I’m getting ~2x speedup in the adjoint passes now with the OOP version:

forward:  0.001867 seconds (9.42 k allocations: 7.273 MiB)
adjoint:  0.020280 seconds (45.02 k allocations: 43.124 MiB, 38.89% gc time)

(x_dim = 64 this time.) But it’s still ~10x slower. Is that normal? It still seems a bit high since the adjoint process is the same size as the forward with nearly identical dynamics.

That’s always true for forward passes, but with reverse passes it can sometimes mix better with the vjp calculation to do it OOP. That should go away with the new AD framework that’s being worked on though IIRC.

That does still seem high. I’ll want to take a deep look at that example.

Ok, I just update the gist to OOP dynamics and for clarity!

Here’s the @btime results:

[ Info: forward
  1.443 ms (9420 allocations: 7.27 MiB)
[ Info: BacksolveAdjoint
  50.525 ms (115033 allocations: 113.32 MiB)
[ Info: InterpolatingAdjoint
  30.859 ms (81025 allocations: 75.65 MiB)
[ Info: QuadratureAdjoint
  21.833 ms (50868 allocations: 48.77 MiB)

Perfect, thanks.

In case you’re curious, here’s the actual problem I’m trying to speed up atm: https://gist.github.com/samuela/9a1daca41fd46ce5e67f5df150933373.

Curiously I’ve found that in-place is actually quite a bit faster than out-of-place for this system:

In-place:
[ Info: forward
  20.009 ms (23413 allocations: 58.50 MiB)
[ Info: BacksolveAdjoint
  5.056 s (28059882 allocations: 5.93 GiB)
[ Info: InterpolatingAdjoint
  1.596 s (9013271 allocations: 1.89 GiB)
[ Info: QuadratureAdjoint
  473.475 ms (2782989 allocations: 599.98 MiB)
  
Out-of-place:
[ Info: forward
  22.901 ms (45777 allocations: 60.26 MiB)
[ Info: BacksolveAdjoint
  7.333 s (25147876 allocations: 3.42 GiB)
[ Info: InterpolatingAdjoint
  5.475 s (18755710 allocations: 2.54 GiB)
[ Info: QuadratureAdjoint
  1.659 s (5810587 allocations: 807.58 MiB)

For that problem I would suggest use QuadratureAdjoint(autojacvec=ReverseDiffVJP(true))

For reference, before:

15.938 ms (50871 allocations: 48.77 MiB)

after:

6.676 ms (65565 allocations: 1.66 MiB)

with discussion here: https://github.com/SciML/DiffEqSensitivity.jl/issues/316

We can probably make this faster still with a MTK generated VJP.