I’m using Optimization.jl and Optim.jl to solve an optimization problem for which I have gradient information (via AD). What I’ve found is that the methods that use the gradient information perform worse than the derivative-free methods. The gradient-based methods seem to get stuck, despite a large gradient value.
Why is this? I would expect that the gradient methods could get stuck if they find a point in parameter space where the derivative is zero, but here it’s not. If they’re not moving, is it something with the line-search portion of the algorithm that is deciding to take an infinitesimally small step?
Yes. In the figure below, the gradient is shown at two different points. The blue circles are at the initial point, and the orange ones at the final step of BFGS.
Hm, are you sure your gradient is correct? The stopping criteria seem to indicate it stopped because the change in function value was small enough (both the absolute and the relative change were zero), and the last step was of size 1e-25, so basically no step.
This usually happens when the gradient is wrong (and hence, in (L)BFGS, the Hessian approximation as well): the Armijo line search then does not find a proper step size (larger than, say, 1e-3), while on the other hand the gradient norm (above 1e-5) is still very large. So I suppose your gradient might be wrong.
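One quick way to check is to compare the AD gradient against finite differences at a few representative points; a minimal sketch, where `f` and `x0` stand in for your objective and a test point:

```julia
using ForwardDiff, FiniteDiff

g_ad = ForwardDiff.gradient(f, x0)                   # AD gradient
g_fd = FiniteDiff.finite_difference_gradient(f, x0)  # finite-difference reference

maximum(abs, g_ad - g_fd)  # should be small, roughly sqrt(eps()) scale
```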
One can also see that the line search (Armijo, or whichever one was used) tried a lot of steps: while doing only 7 or 8 iterations, (L)BFGS made something like 312 gradient and function calls. I would usually expect about 5-6 calls per iteration, not 50.
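You can read these counters off the Optim result directly; a small sketch, again with `f` and `x0` as placeholders:

```julia
using Optim

res = optimize(f, x0, LBFGS(); autodiff = :forward)

Optim.iterations(res)  # number of iterations
Optim.f_calls(res)     # objective evaluations
Optim.g_calls(res)     # gradient evaluations; far more than ~6 per iteration hints at a struggling line search
```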
But from the run time it also seems your problem is much larger than the ones I test on, and I usually do not use Optim, so I cannot directly tell you more than these theoretical things I just wrote.
Edit: I am also a bit surprised that the one entry in the gradient even increases so much, but that might be misleading in such high-dimensional problems.
The fact that the two gradient-based methods actually converged to some local minima suggests that something is more likely wrong with your problem itself, i.e. it is not convex in the region where you are searching for the solution.
A deeper dive on this exposed a bug in the original target function that was missed by our original test case. This has now been fixed, but a similar question remains. When performing gradient-based optimization, the optimizer exits successfully but reports that the gradient is still non-zero. I would like to ask a (probably elementary) question about this, as it doesn’t make sense to me that a local minimum with a non-zero gradient can exist. The optimizer appears to be stopping because it cannot take a further step; how is that possible if the gradient is non-zero? I do expect this problem to have many local minima.
You’re right, it is constrained, and it does look like it’s hitting the boundary. That was indeed an elementary question. We don’t yet have easy access to the Hessian for the implicitly differentiated problem, so we haven’t yet been able to use interior-point or trust-region methods.
Another point might be your stopping criteria. As mentioned above, if you stop because your step (in x) is very small in relative or absolute terms (the first two convergence measures), or the same holds for the cost (the 3rd and 4th criterion), then those are – at least for me – not really convergence criteria but just stopping criteria: they do not indicate convergence to a stationary point.
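To see which criterion actually fired, you can query the result; a sketch with `f` and `x0` as placeholders, where only the last flag indicates a (first-order) stationary point:

```julia
using Optim

res = optimize(f, x0, LBFGS(), Optim.Options(g_tol = 1e-8); autodiff = :forward)

# which stopping rule triggered: step size, cost change, or gradient norm
Optim.x_converged(res), Optim.f_converged(res), Optim.g_converged(res)
```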
If you only take very small steps, you might also just be in a nasty region of your cost function (cf. the Rosenbrock problem) or, more likely, still have a bug somewhere.
And sure, for a constrained problem you of course have complementarity: at the boundary, the gradient of the objective and the (dual-variable-weighted) gradients of the active constraints sum to zero, but the objective gradient itself need not be zero.
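In formulas, this is the (KKT) stationarity condition at a constrained minimiser $x^*$ with constraints $g_i(x) \le 0$:

$$
\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0, \qquad \lambda_i \ge 0, \qquad \lambda_i\, g_i(x^*) = 0,
$$

so $\nabla f(x^*)$ itself only has to vanish when no constraint is active.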
Or, as a maybe illustrative example: imagine your minimiser lies slightly outside your feasible region. The (negative) gradient still points towards it even when you are at the boundary, but the gradients of the (active) constraints counter that and prevent you from going there.
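A hypothetical 1-D version of this picture in Optim, just to make it concrete: the unconstrained minimiser x = 2 lies outside the box [0, 1], so the box-constrained solver stops at the boundary, where the objective gradient is -2, not 0.

```julia
using Optim

f(x) = (x[1] - 2)^2
res = optimize(f, [0.0], [1.0], [0.5], Fminbox(LBFGS()))

x_opt = Optim.minimizer(res)  # ≈ [1.0], the boundary
2 * (x_opt[1] - 2)            # gradient of f at the solution, ≈ -2.0
```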
You can also do trust-region methods with an approximate Hessian, for example one built from the collected gradients, similar to the quasi-Newton approximation in L-BFGS.
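As far as I know Optim does not ship a quasi-Newton trust region, but if a ForwardDiff Hessian happens to be affordable for your problem, a trust-region run would be a one-liner (sketch, with `f` and `x0` as placeholders):

```julia
using Optim

# NewtonTrustRegion avoids the line search entirely;
# autodiff = :forward builds gradient and Hessian via ForwardDiff.
res = optimize(f, x0, NewtonTrustRegion(); autodiff = :forward)
```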
Thank you @kellertuer for the helpful comments and advice, I appreciate it.
The resolution is that the gradient calculation is indeed incorrect. A finite-difference approximation with the corrected target function does not match the autodiff gradient, and running a small optimization with the finite-difference gradient does not show the issues I was seeing before.
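For reference, swapping the AD backend for finite differences in Optimization.jl is just a change of the AD type; a minimal sketch, with `f` and `x0` standing in for my objective and starting point:

```julia
using Optimization, OptimizationOptimJL, FiniteDiff

# AutoFiniteDiff() replaces the AD backend with finite differences
optf = OptimizationFunction((x, p) -> f(x), Optimization.AutoFiniteDiff())
prob = OptimizationProblem(optf, x0)
sol  = solve(prob, LBFGS())
```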
@gdalle This may be due to me using ImplicitDifferentiation incorrectly; I will try to come up with as minimal a working example as possible and open an issue.