ForwardDiff + Adaptive ODE Solvers: Timestep Issue Leads to Incorrect Derivatives

Hello dear Julia community,

TL;DR: Passing dual numbers (ForwardDiff.jl) into an adaptive-step ODE solver (OrdinaryDiffEq.jl) affects the timestep choice, making the derivative incorrect.

Let’s say we have a function whose value requires solving an ODE, for example f(u_0) = u(u_0, 1), where u(u_0, t) is obtained by integrating \dot u = u from time 0 to time t. If we want to do some optimization/shooting, we may want to compute the derivative of this function w.r.t. u_0.

The solution to the ODE is u(t) = u_0 e^t, therefore f(u_0) = u_0 e^1 = e u_0, and f'(u_0) = e. We will choose u_0 = 1 so that f(u_0) = e and f'(u_0) = e.

One simple way to do this is using ForwardDiff.jl and its high-level interface, e.g. ForwardDiff.jacobian in this case. It computes derivatives by using dual numbers, which are passed to the function to differentiate. Unlike finite differences, ForwardDiff.jl provides exact derivatives of the given function. But for this to work properly, the control flow inside the function should not depend on the dual part, and in this case it does: the timesteps chosen by the integrator change depending on whether there is a dual part, and on its value if applicable. In other words, ForwardDiff.jl is not differentiating the expected function, but a slightly different one.

MRE:


using Pkg
Pkg.activate(; temp=true)
Pkg.add(["OrdinaryDiffEq", "ForwardDiff"])

using OrdinaryDiffEq, ForwardDiff

# s gives the whole solution to the ODE problem
# u evaluates that solution at time t
# f takes the value of the solution at the final time
# (s(u0, 1)[end] != s(u0, 1)(1)... for another time)
s(u0, t=1.0) = solve(ODEProblem((du, u, p, t) -> (du .= u), u0, (0.0, t), ()))
u(u0, t) = s(u0, t)(t)
f(u0) = u(u0, 1.0)

u0 = Float64[1]
f_val = f(u0)
f_grad = ForwardDiff.jacobian(f, u0)

@show only(f_val) - only(f_grad) # Should be zero but is not

Further investigations:

  1. Building by hand the dual numbers gives the same result as ForwardDiff.jacobian:

cfg = ForwardDiff.JacobianConfig(f, u0)
dual_u0(u0, seed) = eltype(cfg.duals).(u0, seed)
u0_dual = dual_u0(u0, cfg.seeds)
f_dual = f(u0_dual)

@show only(f_dual).value - only(f_dual).partials[1] # Is zero
@show only(f_grad) - only(f_dual).partials[1] # Is zero

  2. Using s, we can see the timesteps are different depending on the input, even if the real part is the same:

@show s(u0).t[2] - s(u0_dual).t[2] # Should be zero but is not

  3. Changing the seed (x100) also changes the timesteps:

@show s(u0_dual).t[2] - s(dual_u0(u0, 100 .* cfg.seeds)).t[2]

  4. Setting the seed to zero isn’t equivalent to having no seed:

@show s(u0).t[2] - s(dual_u0(u0, 0 .* cfg.seeds)).t[2]

For now I am avoiding this problem by using constant timesteps (solve(prob, Tsit5(); adaptive=false, dt=1e-2)), but I would prefer that the choice of the next timestep not depend on the dual part. A possibly more problematic consequence is that some ODEs may become stiff when computing their derivative.
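For reference, the fixed-step workaround dropped into the MRE above looks roughly like this (a sketch; u_fixed and f_fixed are just illustrative names):

# Same problem as in the MRE, but with adaptivity turned off so the step
# sequence cannot depend on the dual part of u0.
u_fixed(u0, t) = solve(ODEProblem((du, u, p, t) -> (du .= u), u0, (0.0, t), ()),
                       Tsit5(); adaptive=false, dt=1e-2)(t)
f_fixed(u0) = u_fixed(u0, 1.0)

@show only(f_fixed(u0)) - only(ForwardDiff.jacobian(f_fixed, u0)) # should now be (numerically) zero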

Is this known and expected behavior? Should I open an issue on OrdinaryDiffEq.jl?

Thank you,
abavoil

No, it changes the time step choice so that the derivative is correct. In fact, this is a property that is required in order to achieve convergence towards the correct derivative. This is discussed in lots of detail in this talk:

No again, exactly the opposite. If you don’t adjust the time stepping based on the derivative, then the derivative part being stiff isn’t accounted for with smaller time steps / integrator changes, which would mean that you’d just get an incorrect or unstable derivative calculation. So this behavior is exactly designed to fix this.

You can turn this off via internalnorm = (u,t)->norm(u), but of course, for the reasons described in the video and in the paper [2406.09699] Differentiable Programming for Differential Equations: A Review, if you turn this off you get “the normal” behavior but you lose the guarantee that the derivative converges to the correct answer and that the time stepping is stable.

That’s why we default to the version that is guaranteed to be correct. What you’re asking for is demonstrably incorrect, if what you’re talking about as “the correct derivative” is the derivative of the ODE being approximated.


It’s the difference between the differentiate-then-discretize and discretize-then-differentiate approaches, no? Essentially, the question is whether you want the approximate derivative of the exact ODE solution, or the exact derivative of the approximate ODE solution. i.e. do you want to differentiate the discretization error?

DifferentialEquations is implementing the former perspective, as I understand it. There are cases where the latter is useful, too (because optimization methods can sometimes get confused by discretization errors differing between the derivative and the objective function, which can happen if you are solving to low accuracy), but implementing the discretize-then-differentiate approach tends to be less flexible about the choice of discretization.

In principle, is there a good way to implement forward-mode discretize-then-differentiate in DifferentialEquations? e.g. by passing a custom norm to be used in the error-tolerance checks, where the norm ignores the derivative components? (There might be other circumstances where a custom error norm is useful too.)

PS. However, I’m skeptical of the need for discretize-then-differentiate in an adaptive scheme. The basic problem is that, in an adaptive scheme, the discretization error is generally not even continuous, much less differentiable, as a function of the parameters of the ODE. Discretize-then-differentiate is much more useful in low-accuracy non-adaptive methods, e.g. PDE solutions on a fixed grid or mesh. (Nevertheless, @wsmoses and I implemented discretize-then-differentiate in QuadGK.jl#110, despite QuadGK being adaptive, because it turned out to have other practical advantages for us. e.g. if the user sets an absolute error tolerance for the integrand, it’s not clear how to translate that to an absolute error tolerance for the derivative, which has different units. It also saved on time and allocations in reverse mode.)

Yes, effectively. DifferentialEquations.jl defaults to using a norm hack to include the duals in the adaptivity norm in order to give you the same time stepping as if applied to the forward sensitivity equations, i.e. differentiate-then-discretize. You can change the norm to get the standard AD discretize-then-differentiate behavior, though for adaptive integrators there are some strong reasons to prefer the former which is the reason for the default.
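To make that concrete, here is a rough sketch of the two kinds of norms being described (my own illustration of the idea, not the actual DiffEqBase code; dual_sse, dual_aware_norm, and plain_norm are made-up names):

using ForwardDiff: Dual, value, partials
using LinearAlgebra: norm

# A norm that folds the dual partials into the error estimate, in the spirit
# of the default "norm hack" described above (illustrative only).
dual_sse(x::Dual) = abs2(value(x)) + sum(abs2, partials(x))
dual_sse(x::Real) = abs2(x)
dual_aware_norm(u, t) = sqrt(sum(dual_sse, u) / length(u))

# The plain norm from the internalnorm workaround: on duals, the step-size
# comparisons only see the value part, so the partials are effectively ignored.
plain_norm(u, t) = norm(u)

u_test = [Dual(1.0, 100.0)]
dual_aware_norm(u_test, 0.0)      # ≈ 100.005: the large partial inflates the estimate, forcing smaller dt
value(plain_norm(u_test, 0.0))    # 1.0: adaptivity sees the same error as the undifferentiated solve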

Yes that’s what I showed in the talk and linked above:

sol = solve(prob, alg, internalnorm = (u,t)->norm(u))
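Applied to the MRE from the first post, that could look something like the following sketch (u_plainnorm and f_plainnorm are just illustrative names; norm comes from LinearAlgebra):

using LinearAlgebra: norm

# Same problem as in the MRE, but with the dual partials excluded from the
# adaptivity norm, i.e. the discretize-then-differentiate behavior.
u_plainnorm(u0, t) = solve(ODEProblem((du, u, p, t) -> (du .= u), u0, (0.0, t), ()),
                           Tsit5(); internalnorm = (u, t) -> norm(u))(t)
f_plainnorm(u0) = u_plainnorm(u0, 1.0)

ForwardDiff.jacobian(f_plainnorm, u0)  # the step sequence should now match the plain solve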

I don’t think that’s necessarily the issue. The issue is that adaptivity is always done with respect to some error metric and is also crucial to the stability of the integrator. What can go wrong is that, if the adaptivity scheme is only dependent on the original process, you have no guarantee of correctness, or even stability or convergence, of the derivative.

In the paper and in the talk I give a linear ODE system where as you send abstol->0, reltol->0, your derivative estimator error for discretize-then-differentiate does not converge to zero because the adaptivity of the primal equation is unaffected. This is an extreme case, but it highlights that with discretize-then-differentiate you might be able to say:

“this solver is adapting time steps based on tol = X and so it’s automatically choosing some good time steps to give me a relative error I want”,

but you cannot say

“this solver is adapting time steps based on tol = X and so it’s automatically choosing good time steps to give me the relative error in the derivative I want”

because, again, there are examples showing that you can set tol = 1e-300 and still get a derivative error of 100, without getting any warning or any error. Decoupling the adaptivity of the solver from the derivative calculation is simply not guaranteed to converge. And while the truly bad cases might be rare (you might say the non-convergent cases are uncommon), the more troubling part is that even in “not as bad” cases, i.e. cases where you do converge to the correct derivative if the tolerance gets low enough, at standard solver tolerances the derivative can still end up doing something bad and give an unreliable result. This is because one of the main purposes of time stepping adaptivity is to ensure that your ODE solve is stable, i.e. choosing a time step small enough that you don’t hit an instability. In the normal operation of the solver, if it detects instability in a step, it rejects that step and pulls back to use a smaller dt. But if you don’t include the derivative values in that adaptivity norm, then it’s possible to take one “bad” derivative step that’s too large, after which the rest of your ODE solve is just giving you junk. With discretize-then-differentiate, there are also simple cases you can construct where the primal ODE looks fine, but the derivative ODE has dt in the unstable range and thus gives you some derivative of 100,000,000,000 without any warning or error.

However, if you include the derivative in the adaptivity norm by default, then you are guaranteed to have a stable derivative calculation (or the solver will error out with a return code), you have convergence as tol -> 0, and your tolerance also means something about the accuracy of the derivative. This is a choice of course, and you can easily swap from one to the other with one line of code, but my view is that if I’m offering an adaptive solver that’s supposed to be as correct as possible automatically, that is the more correct behavior.


I think the reason for the difference is that integration is linear while solving ODEs is nonlinear. As such, for quadgk, even if you don’t know the scale of the derivative, you know that as your tolerance decreases, the derivative tolerance also decreases. ODEs are very different because the sensitivities can require totally different error control.
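A minimal sketch of that contrast, using the forward sensitivity picture mentioned above: for an integral I(p) = \int_a^b f(x, p) dx, the derivative dI/dp = \int_a^b \partial f/\partial p dx is just another integral over the same domain, so driving the quadrature tolerance down also drives down the error scale of the derivative. For an ODE \dot u = g(u, p), the sensitivity s = \partial u/\partial p satisfies its own ODE, \dot s = (\partial g/\partial u) s + \partial g/\partial p, whose stiffness and error behavior can be completely different from those of the primal equation.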


Thank you @ChrisRackauckas @stevengj and @Oscar_Smith for your in-depth replies.

because optimization methods can sometimes get confused by discretization errors differing between the derivative and the objective function

That is what worried me, and indeed using internalnorm = (u,t)->norm(u) solved a problem I had: I use Ipopt to solve an optimization problem, and when computing a Jacobian I would sometimes have to wait 5 minutes only to get a maxiter message. Now that case is fixed and I get a solution. For this specific problem the derivative is good enough, even if it is not controlled by the tolerance.

I’ll take a look at @ChrisRackauckas’ workshop when I find the time, thanks again.
