ForwardDiff.jl / Optim.jl: Newton Trust Region failing for an unclear reason

I am trying to use Optim.jl’s NewtonTrustRegion with forward-mode AD for computing the gradient and Hessian. This fails for a reason that I don’t quite understand.

I run

optimize(fn_closure, theta0, NewtonTrustRegion(), autodiff=:forward)

and get

 * Status: failure

 * Candidate solution
    Final objective value:     6.729368e-03

 * Found with
    Algorithm:     Newton's Method (Trust Region)

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 3.53e-06 ≰ 1.0e-08

 * Work counters
    Seconds run:   1485  (vs limit Inf)
    Iterations:    39
    f(x) calls:    40
    ∇f(x) calls:   40
    ∇²f(x) calls:  21

For what it’s worth, I’ve checked that the following commands give stable results (same input, same output).

ForwardDiff.gradient(fn_closure, theta0)
ForwardDiff.hessian(fn_closure, theta0) 

Can you provide a reproducible example? The status looks like the optimizer reached a point it could not move away from, but which did not satisfy the convergence tolerances. (The gradient norm is 3.53e-06, above the 1.0e-08 tolerance.) That accuracy might be sufficient for your purposes, though.

Thank you for your reply! Unfortunately, the function I’m optimizing is very complicated so I can’t put it into a MWE.

Something puzzling to me is that when I run the optimization again starting from the endpoint (res is the optimize result from my first post), it moves away from this point (and again fails after some time):

theta_hat = Optim.minimizer(res)
optimize(fn_closure, theta_hat, NewtonTrustRegion(), autodiff=:forward)

Perhaps also relevant: when I run

jac2 = ForwardDiff.gradient(fn_closure, theta_hat)
hes2 = ForwardDiff.hessian(fn_closure, theta_hat)

I get the following for the gradient:

DUAL GRAD : [100.1423680297688, 86.31602164943403, 9.595571170571354, 78.95150913951403, 50.169014609224156, 19.889109806411856, 62.50709572344895]
7-element Vector{Float64}:
  7.110210258949758e-7
 -5.679306872613234e-7
  1.2229708034192441e-8
  1.399793442295446e-6
  3.530252664533079e-6
  3.6471794999852296e-7
  1.7155964792730036e-7

and for the Hessian

DUAL HESS : [100.1423680297688, 86.31602164943403, 9.595571170571354, 78.95150913951403, 50.169014609224156, 19.889109806411856, 62.50709572344895]
7×7 Matrix{Float64}:
  4.01904e-6   5.29512e-6  -6.58453e-6  -1.91412e-7   1.43204e-6   5.66231e-7  -7.2336e-6
  5.29512e-6   8.3286e-6   -7.55979e-6  -1.67395e-7   1.91752e-6   7.72732e-7  -8.9965e-6
 -6.58453e-6  -7.55979e-6   1.69931e-5   1.71959e-7  -2.24058e-6  -1.03791e-6   1.07632e-5
 -1.91412e-7  -1.67395e-7   1.71959e-7   1.04498e-6  -6.03856e-7   1.95477e-7   1.71902e-7
  1.43204e-6   1.91752e-6  -2.24058e-6  -6.03856e-7   2.72165e-6  -1.22485e-7  -2.71248e-6
  5.66231e-7   7.72732e-7  -1.03791e-6   1.95477e-7  -1.22485e-7   1.53187e-6  -9.78581e-7
 -7.2336e-6   -8.9965e-6    1.07632e-5   1.71902e-7  -2.71248e-6  -9.78581e-7   1.46438e-5

Isn’t g(x) the gradient of the objective? In that case, it seems like the problem is badly conditioned and the gradient of the objective cannot be sufficiently reduced in all dimensions.
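A quick way to check that, as a sketch (assuming the hes2 you computed above is still in scope):

using LinearAlgebra

# Conditioning diagnostics on the Hessian posted above
cond(hes2)                  # condition number of the Hessian
eigvals(Symmetric(hes2))    # tiny or negative eigenvalues indicate trouble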

I apologize, I am not following. I guess I do not understand why optimize exits. Shouldn’t at least one of the following be true?

julia> Optim.x_converged(res)
false

julia> Optim.f_converged(res)
false

julia> Optim.g_converged(res)
false

julia> Optim.iteration_limit_reached(res)
false

Is the idea that at the last point, optimize tries to do a line search (or something like that) and does not find any x that yields a lower value of the objective, even though that theoretically should be the case?

What may be relevant about my problem is that computing the objective function requires solving a fixed-point problem by iteration. This inner loop has a tolerance parameter (which is small, at 1e-12), so there must indeed exist regions of the parameter space where the objective is discontinuous because the number of iterations changes.
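Schematically, the setup looks something like this (a toy contraction map standing in for my actual problem; all names here are illustrative):

function inner_fixed_point(p; tol = 1e-12, maxiter = 100_000)
    x = 0.0
    for _ in 1:maxiter
        xnew = 0.5 * cos(x) + p    # toy contraction map
        abs(xnew - x) < tol && return xnew
        x = xnew
    end
    return x
end

# The number of iterations (and hence the returned value) changes discretely
# as p varies, which is where the discontinuities come from.
objective(p) = (inner_fixed_point(p[1]) - 0.3)^2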

My guess is that the fixed-point solver is causing auto-diff to fail. I’ve also encountered this problem in the past (see: confused by this root finding example · Issue #595 · JuliaDiff/ForwardDiff.jl · GitHub). In my case, the issue was that the root-finder terminated in a different number of steps with floating-point values than with Duals, which is what ForwardDiff.jl uses to compute derivatives, so the gradient you are getting is possibly not right.

One thing you could do is reformulate your problem as a constrained one: instead of solving an inner fixed point for each parameter value, define your objective over both the parameters and some “residuals”, and define a constraint function c(parameters, residuals). Then, in addition to maximizing your objective over the parameters + residuals, you’d impose the constraint that 0 = c(parameters, residuals), etc. In structural IO (hi gabriel!) we call this the “MPEC” approach: https://doi.org/10.3982/ECTA7925. I’ve had some success with this in my own work, but it’s definitely not foolproof.
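To make that concrete, here is a minimal sketch of the MPEC pattern on a made-up fixed point (JuMP + Ipopt; the map tanh(p*x) + 0.1 and the objective are placeholders, not your problem):

using JuMP, Ipopt

model = Model(Ipopt.Optimizer)
@variable(model, p, start = 0.5)    # structural parameter
@variable(model, x, start = 0.0)    # fixed-point ("residual") variable
# Impose the fixed point 0 = F(x, p) - x as a constraint instead of iterating:
@NLconstraint(model, x == tanh(p * x) + 0.1)
@NLobjective(model, Min, (x - 1)^2 + p^2)
optimize!(model)
value(p), value(x)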

The other approach would be to use the ImplicitAD.jl package, which requires you to somewhat reformulate your problem, but not nearly as much as the full-blown MPEC setup. ImplicitAD will again require you to define a residual function, but lets you keep it inside your existing objective, and then auto-diff ought to work fine. In other settings, I have found some success with this too.
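The pattern looks roughly like the following. This is a sketch based on my reading of the ImplicitAD.jl README (toy residual and solver; double-check the exact signatures against the package docs):

using ImplicitAD

# residual!(r, y, x, p): r = F(y, x) - y for the toy map F(y, x) = tanh.(x .* y) .+ 0.1
residual!(r, y, x, p) = (r .= tanh.(x .* y) .+ 0.1 .- y)

function solve(x, p)    # plain fixed-point iteration, runs in Float64
    y = zero(x)
    for _ in 1:10_000
        ynew = tanh.(x .* y) .+ 0.1
        maximum(abs, ynew .- y) < 1e-12 && return ynew
        y = ynew
    end
    return y
end

function objective(x)
    y = implicit(solve, residual!, x)    # AD sees the implicit-function rule,
    return sum(abs2, y)                  # not the inner iterations
end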


You could try to write your own gradient too. If your fixed-point solution x is given by

F(x,p) - x = 0

where p are your parameters to optimize and x depends on p, then differentiating w.r.t. p gives

F_x \, x_p + F_p - x_p = 0

where F_x is the Jacobian of x \mapsto F(x,p) and F_p the Jacobian of p \mapsto F(x,p), and x_p is the Jacobian of p \mapsto x(p). Rearranging to solve for x_p gives

x_p = (I - F_x)^{-1} \, F_p

which you can then pass on to your cost function. If your cost function is of the form

c(x, p) (where again x depends on p)

then its gradient (at the fixed-point x) is given by

g(x, p) = c_x \, x_p + c_p

where c_x and c_p are the Jacobians of x\mapsto c(x,p) and p\mapsto c(x,p), respectively. Substituting x_p gives

g(x, p) = c_x \, (I - F_x)^{-1} \, F_p + c_p

which might be more accurate than ForwardDiff differentiating through your fixed-point solver.
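On a toy problem, the whole recipe might look like this (F, c, and all names here are made up for illustration; the inner Jacobians are computed with ForwardDiff for convenience, but only at the converged fixed point, never through the iteration):

using ForwardDiff, LinearAlgebra

A = [0.3 0.1; -0.2 0.4]                 # fixed matrix for the toy map
F(x, p) = tanh.(A * x) .+ p             # toy fixed-point map
c(x, p) = sum(abs2, x) + sum(abs2, p)   # toy cost

function fixed_point(p; tol = 1e-12, maxiter = 10_000)
    x = zero(p)
    for _ in 1:maxiter
        xnew = F(x, p)
        norm(xnew - x) < tol && return xnew
        x = xnew
    end
    error("fixed-point iteration did not converge")
end

function cost_gradient(p)
    x  = fixed_point(p)
    Fx = ForwardDiff.jacobian(x_ -> F(x_, p), x)    # F_x
    Fp = ForwardDiff.jacobian(p_ -> F(x, p_), p)    # F_p
    xp = (I - Fx) \ Fp                              # x_p = (I - F_x)^{-1} F_p
    cx = ForwardDiff.gradient(x_ -> c(x_, p), x)    # c_x
    cp = ForwardDiff.gradient(p_ -> c(x, p_), p)    # c_p
    return xp' * cx + cp                            # g = c_x x_p + c_p, as a column vector
end

cost_gradient([0.1, -0.2])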


Thanks very much @tcovert and @briochemc!! These are super interesting points.

Earlier, I was planning on using ImplicitDifferentiation.jl (which seems to be an older version of ImplicitAD.jl?) and was just about to do that, but I tested on a super simple example and noticed that ForwardDiff.jl was actually getting it right (and much faster). I don’t know whether this is also true for my more complicated function.

I am still confused about (at least) one thing. Even if I use adjoint methods to compute the gradient and Hessian correctly, it will still be the case that the objective function itself has discontinuities where the number of iterations changes. I wonder if this might lead to failures from line searches. (The gradient + Hessian suggest going in direction D, but the function is discontinuous in that direction very close to the current point.)

I am thinking about this because when I run Newton() it also fails, but with a more informative message: specifically, it says that the line search failed.

I am trying a simpler “solution,” namely using a fixed and very high number of iterations in the fixed point problem. If that doesn’t help I’ll consider the implicit differentiation and I’ll report back!
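That is, something like this (toy map standing in for my actual fixed point):

# Always run the same number of iterations, so Float64 and Dual evaluations
# take identical paths through the loop.
function fixed_point_fixed_iters(p; iters = 5_000)
    x = zero(p)
    for _ in 1:iters
        x = 0.5 .* cos.(x) .+ p    # toy contraction map
    end
    return x
end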

I think ImplicitDifferentiation.jl and ImplicitAD.jl are actually distinct code bases with slightly different underlying approaches, but they are both based on the idea that instead of auto-diffing through a fixed point, you should just compute the adjoint, and they provide an auto-diff friendly way to do that for you, instead of you computing it yourself. @briochemc’s suggestion that you write it out yourself will surely result in faster code than using an adjoint/auto-diff approach, but maybe it’s too hard in your case, I don’t know.

Whether your existing setup (an objective function that includes an internal fixed-point step) will “work” with ForwardDiff.jl, provided that you guarantee the fixed-point always uses the same number of iterations, I’m not sure. This was the idea suggested in the link I posted above, but when I tried it on my toy examples it still failed, so either I had the software versions wrong, or I was somehow missing the point.

You stated that your objective function has some discontinuities. I think that is still “fine” regardless of how you solve your fixed point problem: the derivatives will be well-defined everywhere except those discontinuities, so unless your optimizer happens to pick the exact discontinuity point as an evaluation point (and that seems unlikely?) you will be ok.

It turns out discontinuities may lead to failure with Newton() and NewtonTrustRegion() in some cases. The idea is that the algorithm may step right up to the discontinuity, where the local gradient and Hessian “tell” it to go further, but any step across the jump increases the objective.

Here is an MWE:

using Optim, Plots

# Quadratic with an upward jump of 0.5 at x = -0.05
function f(x)
	return x[1]^2 + (x[1] > -0.05) * 0.5
end

mygrid = -0.5:0.01:0.5
plot(mygrid, f.(mygrid))

# Start on the low side of the jump; the local quadratic model points across it
testop = optimize(f, [-0.5], NewtonTrustRegion(), autodiff=:forward)

[Plot of f over mygrid: a parabola with an upward jump of 0.5 at x = -0.05.]

I’ve checked, and this is what happens in my case. I need to go back to the drawing board to think about how to avoid these discontinuities in my problem. (My other options are (1) using Newton despite the failure, because the solution is pretty good, and (2) using a derivative-free method.)

Is it necessary to solve for a fixed point separately from the optimization? Why not let the optimizer simultaneously find the fixed point as a constraint while optimizing?

If your fixed point is supposed to have smooth continuation, the discontinuity could arise from the optimizer and fixed point algorithms fighting each other. For example, there could be a jump in the number of fixed point iterations, which confuses the optimizer. On the other hand, if the discontinuity is intended, then I wouldn’t expect optim to work well, nor would “gradient-free” methods be a solution.

I also concur with the suggestions to put together a reproducible MWE; otherwise everyone is just grasping at straws.