Help with gradient-based optimizers

I’m using Optimization.jl and Optim.jl to solve an optimization problem for which I have gradient information (via AD). What I’ve found is that the methods that use the gradient information perform worse than the derivative-free methods. The gradient-based methods seem to get stuck, despite a large gradient value.

Why is this? I would expect that the gradient methods could get stuck if they find a point in parameter space where the derivative is zero, but here it’s not. If they’re not moving, is it something with the line-search portion of the algorithm that is deciding to take an infinitesimally small step?

To give a snippet of the run script,

using Optimization, OptimizationOptimJL  # OptimizationOptimJL provides NelderMead, BFGS, LBFGS

optf = OptimizationFunction(quasisym, Optimization.AutoForwardDiff());
prob_x0 = OptimizationProblem(optf, x0, params);

qs_x0 = quasisym(x0, params)

@time sol_x0_nm = solve(prob_x0, NelderMead(); maxtime = 3600);
@time sol_x0_bfgs = solve(prob_x0, BFGS(); maxtime = 3600);
@time sol_x0_lbfgs = solve(prob_x0, LBFGS(); maxtime = 3600);

This produces the following solutions. I’ve also included a plot at the bottom, showing that Nelder-Mead is the only method that moves significantly.

julia> qs_x0
2.193829699196393

julia> sol_x0_nm.original
 * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Final objective value:     2.465799e-03

 * Found with
    Algorithm:     Nelder-Mead

 * Convergence measures
    √(Σ(yᵢ-y)²)/n ≰ 1.0e-08

 * Work counters
    Seconds run:   1590  (vs limit 3600)
    Iterations:    1000
    f(x) calls:    5812


julia> sol_x0_bfgs.original
 * Status: success

 * Candidate solution
    Final objective value:     2.166061e+00

 * Found with
    Algorithm:     BFGS

 * Convergence measures
    |x - x'|               = 5.29e-23 ≰ 0.0e+00
    |x - x'|/|x'|          = 5.29e-23 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 3.30e+05 ≰ 1.0e-08

 * Work counters
    Seconds run:   3569  (vs limit 3600)
    Iterations:    8
    f(x) calls:    360
    ∇f(x) calls:   360


julia> sol_x0_lbfgs.original
 * Status: success

 * Candidate solution
    Final objective value:     2.189750e+00

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 8.27e-25 ≰ 0.0e+00
    |x - x'|/|x'|          = 8.27e-25 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 8.81e+05 ≰ 1.0e-08

 * Work counters
    Seconds run:   2905  (vs limit 3600)
    Iterations:    7
    f(x) calls:    312
    ∇f(x) calls:   312

Have you evaluated the gradient to see if it does in fact work? I’ve had similar issues in Turing models and found the gradient had NaNs.

Yes. In the figure below, the gradient is shown at two different points. The ~~blue~~ orange circles are at the initial point, and ~~orange~~ blue are at the final step of BFGS.

Also,

julia> sum(isnan.(df_x0))
0

Interesting. I’m following along but don’t have further suggestions other than BOBYQA from NLopt worked well for me in my similar case.

Hm, are you sure your gradient is correct? The stopping criteria seem to indicate it stopped because the change in function value was small enough (both the absolute and the relative change were zero), and the last step was of size 1e-25, so basically no step.

This usually happens when the gradient is wrong (and hence so is the Hessian approximation in (L)BFGS), and as a result the Armijo line search does not find a proper step size (larger than, say, 1e-3), while on the other hand the gradient norm (above 1e5) is still very large. So I suspect your gradient might be wrong.

One can also see that the line search (Armijo, or whatever was used) tried a lot of steps: while doing only 7 or 8 iterations, (L)BFGS made around 312 function and gradient calls. I would usually expect something like 5-6 calls per iteration, not 50.
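The effect described above can be sketched in plain Julia with a minimal Armijo backtracking line search (this is a toy quadratic with a deliberately sign-flipped gradient, not the actual problem or Optim's implementation): with the correct gradient the step is accepted almost immediately, while with the wrong gradient the search direction is actually an ascent direction, so the backtracking loop burns dozens of function calls shrinking the step toward machine precision.

```julia
# Sketch: Armijo backtracking with a correct vs. a deliberately wrong gradient.
f(x) = sum(abs2, x)          # toy objective
grad_true(x)  = 2 .* x       # correct gradient
grad_wrong(x) = -2 .* x      # sign-flipped "bug": -grad is an ascent direction

function armijo(f, g, x; c = 1e-4, shrink = 0.5, maxiter = 60)
    gx = g(x)
    d = -gx                          # proposed "descent" direction
    fx = f(x)
    slope = -sum(abs2, gx)           # would be ⟨∇f, d⟩ if g were the true gradient
    α, calls = 1.0, 0
    for _ in 1:maxiter
        calls += 1
        if f(x .+ α .* d) ≤ fx + c * α * slope
            return α, calls          # sufficient-decrease condition met
        end
        α *= shrink                  # condition failed: shrink the step
    end
    return α, calls                  # gave up with a vanishingly small step
end

x0 = [1.0, -2.0]
α_ok,  calls_ok  = armijo(f, grad_true,  x0)  # accepts a large step in a couple of calls
α_bad, calls_bad = armijo(f, grad_wrong, x0)  # dozens of calls, step near machine precision
```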

But from the run time it also seems your problem is much larger than the ones I test, and I usually do not use Optim, so I cannot tell you more than these theoretical remarks.

Edit: I am also a bit surprised that the one entry in the gradient even increases so much, but that might be misleading in such high-dimensional problems.

The fact that the two gradient-based methods actually converged to some local minimum suggests that there is more likely something wrong with your problem, i.e. it is not convex in the region where you are searching for the solution.

So, I checked with FiniteDifferences.jl, and indeed, it does not match the ForwardDiff result.

Right again. I mistakenly swapped the colors.

I guess I’m suspicious of the FD result, too. It’s the result of `qs_grad = grad(central_fdm(5, 1), qs, x0)`.

If I take a small enough step along either gradient, the objective improves slightly (by ~1e-11).
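For cross-checks like this, a directional central difference needs no extra packages: compare the finite-difference slope along a few random directions against ⟨∇f, v⟩ from the candidate gradient. A minimal sketch, where `qs` and `qs_grad` are toy stand-ins (not the actual objective from this thread):

```julia
# Sketch: directional-derivative check of a gradient against central differences.
qs(x) = sum(abs2, x) + sin(x[1])                 # toy stand-in objective
qs_grad(x) = [2x[1] + cos(x[1]); 2 .* x[2:end]]  # its analytic gradient

function check_gradient(f, g, x; h = 1e-6, ntrials = 5)
    errs = Float64[]
    for _ in 1:ntrials
        v = randn(length(x)); v ./= sqrt(sum(abs2, v))   # random unit direction
        fd = (f(x .+ h .* v) - f(x .- h .* v)) / (2h)    # central difference
        push!(errs, abs(fd - sum(g(x) .* v)))            # compare to ⟨∇f, v⟩
    end
    return maximum(errs)
end

x0 = [0.3, -1.2, 0.7]
err = check_gradient(qs, qs_grad, x0)   # tiny for a correct gradient, O(1) for a wrong one
```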

Can you maybe provide a complete runnable code, which includes the definition of your function?


Hmm. That might be tough. Let me see what I can do.

It’s actually the same project that you were helping us with before.


A deeper dive on this exposed a bug in the original target function that was missed by our original test case. This has now been fixed, but a similar question remains. When performing gradient-based optimization, the optimizer exits successfully but reports that the gradient is still non-zero. I would like to ask a (probably elementary) question about this: to me it doesn’t make sense that a local minimum with a non-zero gradient can exist. The optimizer appears to be stopping because it cannot take a further step; how is that possible if the gradient is non-zero? I do expect this problem to have many local minima.

The returned Optim code is below:

 * Status: success

 * Candidate solution
    Final objective value:     7.549600e+01

 * Found with
    Algorithm:     Fminbox with L-BFGS

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 5.10e-01 ≰ 1.0e-08

 * Work counters
    Seconds run:   14  (vs limit Inf)
    Iterations:    2
    f(x) calls:    194
    ∇f(x) calls:   18

Is it a constrained problem? If so, you may just be hitting the boundary.


You’re right, it is constrained and it does look like it’s hitting the boundary. That was indeed an elementary question. We don’t yet have easy access to the Hessian for the implicitly differentiated problem, so we haven’t yet been able to use interior point or trust region methods.
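One quick way to confirm this is to check which components of the solution sit at their bounds. A sketch, where `u`, `lb`, and `ub` are made-up stand-ins for `sol.u` and the box constraints:

```julia
# Sketch: find which components of a solution sit at their box bounds.
u  = [0.0, 0.37, 1.0, 0.99]   # stand-in for sol.u
lb = [0.0, 0.0, 0.0, 0.0]     # lower bounds
ub = [1.0, 1.0, 1.0, 1.0]     # upper bounds

at_lower = findall(i -> isapprox(u[i], lb[i]; atol = 1e-8), eachindex(u))
at_upper = findall(i -> isapprox(u[i], ub[i]; atol = 1e-8), eachindex(u))
# components in at_lower / at_upper are the active constraints
```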

Another point might be your stopping criteria. As mentioned above, you stop when either your step (in x) is very small in absolute or relative terms (the first two convergence measures), or the same holds for the cost (the third and fourth). Those are, at least for me, not really convergence criteria but just stopping criteria; they do not indicate convergence to a stationary point.
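If you want the solver to keep going until the gradient criterion is met rather than stopping on tiny steps or tiny cost changes, you can zero out the step- and cost-based tolerances. A hedged configuration sketch using Optim's `Options` (keyword names may differ between Optim.jl versions; older releases used `x_tol`/`f_tol`, newer ones `g_abstol` instead of `g_tol`, so check the docs for your installed version):

```julia
using Optim  # assumes Optim.jl is installed

opts = Optim.Options(
    x_abstol = 0.0,   # do not stop on small absolute steps in x
    x_reltol = 0.0,   # ...or small relative steps
    f_abstol = 0.0,   # do not stop on small absolute cost changes
    f_reltol = 0.0,   # ...or small relative changes
    g_tol    = 1e-8,  # keep only the gradient-norm criterion
)
# result = Optim.optimize(f, g!, x0, LBFGS(), opts)
```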

If you take very small steps, you might also just be in a “mean” region of your cost function (cf. the narrow valley of the Rosenbrock problem) or, more likely, still have a bug somewhere.

And sure, for a constrained problem you of course have complementarity: at the boundary, the gradient of the objective and the gradient of the active constraint (weighted by the dual variable) sum to zero, but the objective gradient itself need not be zero.
Or, as a perhaps illustrative example: imagine your minimiser is slightly outside your feasible region. The negative gradient still points toward it even when you are at the boundary, but the gradients of the active constraints counter that and keep you from going there.
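This picture can be reproduced in a few lines of plain Julia (a toy sketch, not Optim's Fminbox): minimise f(x) = x² over x ≥ 1 with projected gradient descent. The unconstrained minimiser 0 is infeasible, so the iterates stop at the boundary x = 1, which is the constrained minimiser even though f′(1) = 2 ≠ 0.

```julia
# Sketch: a constrained minimiser on the boundary has a non-zero objective gradient.
f(x) = x^2
df(x) = 2x

function project_descent(x; η = 0.1, iters = 200)
    for _ in 1:iters
        x = max(1.0, x - η * df(x))  # gradient step, then project onto x ≥ 1
    end
    return x
end

xstar = project_descent(3.0)  # converges to the boundary point 1.0, where df == 2.0 ≠ 0
```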

You can also do a trust region method with an approximate Hessian, e.g. one built in quasi-Newton fashion from the collected gradients, similar to L-BFGS.


Thank you @kellertuer for the helpful comments and advice, I appreciate it.

The resolution is that the gradient calculation is indeed incorrect. A finite difference approximation with the corrected target function does not yield the same gradient as the autodiff calculation. Running a small optimization with the finite difference gradient calculation does not have the issues I was seeing before.

@gdalle This may be due to me incorrectly using ImplicitDifferentiation; I will try to come up with as minimal a working example as possible and open an issue.
