Differences in ODE Solver behaviour inside and outside Adam Optimizer

Good morning,
I am working with hybrid systems (universal differential equations) in the SciML ecosystem. I apologize if this is a basic question, but I am puzzled by the behaviour of a stiff problem I’m working on; a code snippet is attached below.

From what I can see, the integration completes, albeit with a warning message, when it is carried out inside the cost function of an Adam optimizer (line 77 of my code): the first integration reaches the final time and returns a value, while the subsequent ones give an ‘Inf’ cost. However, performing exactly the same integration, with identical parameters, outside the optimizer (line 88) fails.

Could you kindly help me understand the difference between the two integrations?

Thank you in advance, Stefano

snippet_fail_integrating_outside_optim.jl (4.0 KB)
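
In case the attachment is inconvenient, here is a stripped-down sketch of the shape of the code (the toy dynamics, network size, names, and failure handling below are illustrative, not the actual snippet):

using DifferentialEquations, SciMLSensitivity
using Lux, ComponentArrays, StableRNGs
using Optimization, OptimizationOptimisers, Zygote

rng = StableRNG(1111)

# Toy stand-in for the stiff UDE: a small neural network correcting known linear dynamics.
nn = Lux.Chain(Lux.Dense(2, 16, tanh), Lux.Dense(16, 2))
p_nn, st = Lux.setup(rng, nn)
p0 = ComponentArray(p_nn)

function ude(u, p, t)
    nn_out = first(nn(u, p, st))
    return [-0.5 * u[1] + nn_out[1], -2.0 * u[2] + nn_out[2]]
end

u0 = [1.0, 0.5]
tspan = (0.0, 5.0)
tsave = range(tspan[1], tspan[2]; length = 50)
prob = ODEProblem(ude, u0, tspan, p0)

# Cost function used by Adam: solve the UDE and penalize failed solves.
function loss(p, _)
    sol = solve(remake(prob; p = p), Rosenbrock23(); saveat = tsave)
    if sol.retcode != :Success
        return Inf   # solve failures show up as an ‘Inf’ cost here
    end
    return sum(abs2, Array(sol))
end

# Integration *inside* the optimizer (roughly what my line 77 does):
optf = OptimizationFunction(loss, Optimization.AutoZygote())
optprob = OptimizationProblem(optf, p0)
res = Optimization.solve(optprob, OptimizationOptimisers.Adam(0.01); maxiters = 100)

# The “same” integration *outside* the optimizer, with the parameters Adam returned (roughly my line 88):
sol_outside = solve(remake(prob; p = res.u), Rosenbrock23(); saveat = tsave)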

I’m using Julia v1.9.0, with the following packages:

ComponentArrays v0.13.12
DifferentialEquations v7.7.0
Lux v0.4.53
Optimization v3.14.2
OptimizationOptimisers v0.1.2
SciMLSensitivity v7.32.0
StableRNGs v1.0.0

Using Rosenbrock23 with adjoints is a bad idea. The performance properties of the forward pass don’t necessarily match the performance properties of the reverse pass.

That said, did you check if this was parameter dependent?

Thank you very much for your response:

  1. Regarding Rosenbrock23 with adjoints, I appreciate the advice; I had just assumed that using AD without adjoints would be much slower. Could you suggest an alternative approach suitable for stiff problems?

  2. When you mention “parameter dependent,” are you referring to whether the difference is influenced by the parameters of the neural network (i.e. the seed used)?

TRBDF2, FBDF. They have a bit better scaling.
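
For example, keeping the loss the same and only swapping the stiff solver (illustrative, reusing the toy sketch from earlier in the thread):

# inside the loss:
sol = solve(remake(prob; p = p), TRBDF2(); saveat = tsave)

# or
sol = solve(remake(prob; p = p), FBDF(); saveat = tsave)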

Yes

I apologize for the delayed response; I was out of the office. I’ve tried a different seed (rng = StableRNG(64)) and the two integrations are still slightly different (on the order of 10^-14, which could also just be floating-point arithmetic).
Obviously, that difference is fine; what I want to understand is what happens when the cost function is Inf, because I am running into problems with the numerical stability of the solution of hybrid models during the optimization process.

I’ve analyzed the code using Infiltrator, and it seems that every iteration of the optimiser performs the integration twice. The first call occurs at line 31 of OptimizationOptimisers.jl:

f.grad(G, θ, d...)

The second call is made on the subsequent line (line 32) with the following code:

x = f.f(θ, prob.p, d...)

The results obtained are slightly different (and the results this second call returns are equal to the ones I get outside the optimiser).

The second call is the one without the gradient computation. Indeed, the gradient call can change the forward steps if it’s something like forward mode. What gradient type are you using?
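
For reference, the gradient type is the AD choice passed as the second argument to OptimizationFunction, e.g. (illustrative, with the loss from the sketch above):

# Reverse mode via Zygote: gradients of the ODE solve go through SciMLSensitivity.
optf = OptimizationFunction(loss, Optimization.AutoZygote())

# Forward mode via ForwardDiff: the whole loss, including the ODE solve, is
# differentiated with dual numbers, which can change the steps the solver takes.
optf = OptimizationFunction(loss, Optimization.AutoForwardDiff())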