I’m using Optimization.jl and Optim.jl to solve an optimization problem for which I have gradient information (via AD). What I’ve found is that the methods that use the gradient information perform worse than the derivative-free methods. The gradient-based methods seem to get stuck, despite a large gradient value.
Why is this? I would expect that the gradient methods could get stuck if they find a point in parameter space where the derivative is zero, but here it’s not. If they’re not moving, is it something with the line-search portion of the algorithm that is deciding to take an infinitesimally small step?
Yes. In the figure below, the gradient is shown at two different points. The blue circles are at the initial point, and the orange ones at the final step of BFGS.
Hm, are you sure your gradient is correct? The stopping criteria seem to indicate it stopped because the change in function value was small enough (both the absolute and the relative change were zero), and the last step was of size 1e-25, so basically no step.
This usually happens when the gradient is wrong (and hence, in (L)BFGS, the Hessian approximation as well): the Armijo line search then does not find a proper step size (larger than, say, 1e-3), while on the other hand the gradient norm (above 1e-5) is still very large. So I suppose your gradient might be wrong.
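One quick way to check is to compare the AD gradient against finite differences at a few representative points; a minimal sketch, where `f` and `x0` stand in for your objective and a test point:

```julia
using ForwardDiff, FiniteDiff

g_ad = ForwardDiff.gradient(f, x0)                   # AD gradient
g_fd = FiniteDiff.finite_difference_gradient(f, x0)  # finite-difference reference

maximum(abs, g_ad - g_fd)  # should be small, roughly sqrt(eps()) scale
```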
One can also see that the line search (Armijo, or whichever one was used) tried a lot of steps: while doing only 7 or 8 iterations, (L)BFGS made something like 312 gradient and function calls. I would usually expect about 5-6 calls per iteration, not 50.
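You can read these counters off the Optim result directly; a small sketch, again with `f` and `x0` as placeholders:

```julia
using Optim

res = optimize(f, x0, LBFGS(); autodiff = :forward)

Optim.iterations(res)  # number of iterations
Optim.f_calls(res)     # objective evaluations
Optim.g_calls(res)     # gradient evaluations; far more than ~6 per iteration hints at a struggling line search
```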
But from the run time it also seems your problem is much larger than the ones I test on, and I usually do not use Optim, so I cannot directly tell you more than these theoretical things I just wrote.
Edit: I am also a bit surprised that the one entry in the gradient even increases so much, but that might be misleading in such high-dimensional problems.
The fact that the two gradient-based methods actually converged to some local minima suggests that something is more likely wrong with your problem itself, i.e. it is not convex in the region where you are searching for the solution.
A deeper dive on this exposed a bug in the original target function that was missed by our original test case. This has now been fixed, but a similar question remains. When performing gradient-based optimization, the optimizer exits successfully but reports that the gradient is still non-zero. I would like to ask a (probably elementary) question about this, as it doesn’t make sense to me that a local minimum with a non-zero gradient can exist. The optimizer appears to be stopping because it cannot take a further step; how is that possible if the gradient is non-zero? I do expect this problem to have many local minima.
You’re right, it is constrained, and it does look like it’s hitting the boundary. That was indeed an elementary question. We don’t yet have easy access to the Hessian for the implicitly differentiated problem, so we haven’t yet been able to use interior-point or trust-region methods.
Another point might be your stopping criteria. As mentioned above, if you stop because your step (in x) is very small in relative or absolute terms (the first two convergence measures), or the same holds for the cost (the 3rd and 4th criterion), then those are – at least for me – not really convergence criteria but just stopping criteria: they do not indicate convergence to a stationary point.
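To see which criterion actually fired, you can query the result; a sketch with `f` and `x0` as placeholders, where only the last flag indicates a (first-order) stationary point:

```julia
using Optim

res = optimize(f, x0, LBFGS(), Optim.Options(g_tol = 1e-8); autodiff = :forward)

# which stopping rule triggered: step size, cost change, or gradient norm
Optim.x_converged(res), Optim.f_converged(res), Optim.g_converged(res)
```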
If you only take very small steps, you might also just be in a nasty region of your cost function (cf. the Rosenbrock problem) or, more likely, still have a bug somewhere.
And sure, for a constrained problem you of course have complementarity: at the boundary, the gradient of the objective and the (dual-variable-weighted) gradients of the active constraints sum to zero, but the objective gradient itself need not be zero.
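In formulas, this is the (KKT) stationarity condition at a constrained minimiser $x^*$ with constraints $g_i(x) \le 0$:

$$
\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0, \qquad \lambda_i \ge 0, \qquad \lambda_i\, g_i(x^*) = 0,
$$

so $\nabla f(x^*)$ itself only has to vanish when no constraint is active.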
Or, as a maybe illustrative example: imagine your minimiser lies slightly outside your feasible region. The (negative) gradient still points towards it even when you are at the boundary, but the gradients of the (active) constraints counter that and prevent you from going there.
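A hypothetical 1-D version of this picture in Optim, just to make it concrete: the unconstrained minimiser x = 2 lies outside the box [0, 1], so the box-constrained solver stops at the boundary, where the objective gradient is -2, not 0.

```julia
using Optim

f(x) = (x[1] - 2)^2
res = optimize(f, [0.0], [1.0], [0.5], Fminbox(LBFGS()))

x_opt = Optim.minimizer(res)  # ≈ [1.0], the boundary
2 * (x_opt[1] - 2)            # gradient of f at the solution, ≈ -2.0
```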
You can also do trust-region methods with an approximate Hessian, for example one built from the collected gradients, similar to the quasi-Newton approximation in L-BFGS.
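As far as I know Optim does not ship a quasi-Newton trust region, but if a ForwardDiff Hessian happens to be affordable for your problem, a trust-region run would be a one-liner (sketch, with `f` and `x0` as placeholders):

```julia
using Optim

# NewtonTrustRegion avoids the line search entirely;
# autodiff = :forward builds gradient and Hessian via ForwardDiff.
res = optimize(f, x0, NewtonTrustRegion(); autodiff = :forward)
```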
Thank you @kellertuer for the helpful comments and advice, I appreciate it.
The resolution is that the gradient calculation is indeed incorrect. A finite-difference approximation with the corrected target function does not match the autodiff gradient, and running a small optimization with the finite-difference gradient does not show the issues I was seeing before.
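For reference, swapping the AD backend for finite differences in Optimization.jl is just a change of the AD type; a minimal sketch, with `f` and `x0` standing in for my objective and starting point:

```julia
using Optimization, OptimizationOptimJL, FiniteDiff

# AutoFiniteDiff() replaces the AD backend with finite differences
optf = OptimizationFunction((x, p) -> f(x), Optimization.AutoFiniteDiff())
prob = OptimizationProblem(optf, x0)
sol  = solve(prob, LBFGS())
```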
@gdalle This may be due to me using ImplicitDifferentiation incorrectly; I will try to come up with as minimal a working example as possible and open an issue.