Hi — I want to use the Optim.jl package to minimize some function. Say the gradient of my objective is “easy” to approximate using some additional procedure. Now, when appealing to first-order methods like gradient descent or (L-)BFGS, I don’t really need to compute the value of my objective at the iterates. Sure, doing so, one can monitor the convergence etc. But say my objective is hard to compute. Can I “hack” the usual functions of Optim.jl to bypass the need of providing an objective function ? Or are you aware of other packages to do so ? Thanks !
That’s not true for general nonlinear optimisation. Typically, the gradient is used to get a search direction and then a line search is performed to choose the next iterate that improves the objective value. The second step requires the objective to be evaluated. You can use a static line search with a fixed step size and then you technically don’t need the objective value but given how Optim is implemented, it uses the objective value to judge convergence. You can return 0 for the objective value all the time but this will cause immediate convergence due to lack of change in the objective. You can try defining a function that always gives a decrease in the objective value compared to the previous one and then use it along with the static line search. That might do what you want but it’s very hackish.
You don’t, but algorithms usually do.
If your objective is convex and smooth (ie has Lipschitz gradient) then there’s hope: just use a (fast) gradient method with fixed stepsize 1/L, where L is the Lipschitz constant of the gradient.
Otherwise I’m pretty sure evaluating the objective value is quite important to ensure convergence.
Perhaps you would be interested in NLsolve, particularly if you can also get a cheap proxy for your Hessian (/jacobian of your gradient).
It’s a painful proposition for most optimization people but it is actually very common amongst practitioners to ignore the objective. Sometimes - as you say - it can be difficult to compute, but sometimes it is just noisy.
you have two choices: (1) choose a fixed step-size - if LineSearchers.jl doesn’t support this, then it should be. You may wish to use a momentum descent type method then. (2) Alternatively, you can implement a line-search where instead of minimizing the objective you enforce approximate orthogonality of the gradient with the search direction. This works suprisingly well in practice … often but not always … and in particular there are no theoretical guarantees unless you have a convex objective. This is definitely not implemented in LineSearches.jl.
But neither would be a hack - just a different choice of line search.
you could try giving the norm of the gradient as the objective function, and the optimization procedure should land you at the minimum of that … and hence probably near the extremum of some hard to compute or define objective
The problem with that is that -grad would not be a descent direction so the standard linesearch schemes would fail.
Of course you can just treat it globally this way then just have a standard nonlinear least squares sYstem. Plenty of solvers for that but again no convergence guarantees - in the sense that minima need not satisfy grad = 0
I like this idea. The objective value is the gradient norm and the gradient is the true gradient. The search direction will still be along the negative gradient passed and the line search will try to minimise the gradient norm along the direction. In convex problems, this should be fine. In non-convex problems, you may end up converging to a local maximum or a saddle point instead of a minimum but that’s probably a risk you are taking in non-convex optimisation anyways when you don’t evaluate the objective. So just don’t trust the gradient too much in the line search and set a maximum step size inversely proportional to the Lipchitz constant of the function.
Edit: scratch this, see the comments below.
Just asking to learn here because I’m not an optimizer, but would that really be preferable to just using a zero finder with the gradient directly, at least assuming that the Hessian or some decent Hessian approximation were available? I feel like the obvious answer to this question is not to pretend like you have the objective function, but to just use a zero finder for the gradient.
The problem with zero finders is that they have no preference between local minima, local maxima and saddle points. The idea of optimisation is that you are searching along a descent direction so a step small enough is guaranteed to give you a decrease in the objective for continuous functions. If you get a guaranteed finite decrease in each iteration and run the optimisation forever, it will eventually converge to a local minimum assuming the objective is bounded from below (doesn’t go to -Inf). So the convergence proof of optimisation algorithms relies on the incremental descent property of line searches. Convergence rate proofs need more assumptions about how fast the function you are optimising is changing in the vicinity of the local minimum. A measure of how fast a function is changing is the Lipchitz constant of the function hence the need to consider this when choosing a maximum step size for the line search not to trust the gradient “too far” from where it’s evaluated (think of a high frequency sine wave) or trust it too little (think of a nearly flat function with little curvature so the gradient isn’t changing much). Too much and too little trust in the gradient can both slow down convergence. Too much trust can also cause complete lack of convergence in some cases. Good line search algorithms are ones that can handle various “topologies” in the function optimised where sometimes it’s more of a high frequency sine wave and sometimes it’s a low curvature function in different neighbourhoods or wrt different decision variables. Adapting the maximum step size and the number of “test points” to check along the search direction is a key feature of many line search algorithms. Of course, typically we have the objective function to evaluate and compare test points and only move to a new iterate if it has a lower objective. So the negative gradient or search direction is only a guide but we don’t have to take any step in that direction if there is no improvement in the objective value. When we don’t have the objective value and only have the gradient (weird ), I think your best bet would be to trust the negative gradient or search direction but be conservative with step sizes.
Thanks for the explanation! What you explain about line searches is vaguely familiar to me (which is to say, I’ve read the line search chapter of Nocedal and Wright once or twice and that’s the extent of my literacy), but I didn’t really think about the local maxima problem with zero finders. My first reaction to your post was to think that checking the Hessian should help avoid that, but I can see how that isn’t really the same as helping to avoid maxima in the first place the way that having the objective and a non-constant line search step size does. I guess I’ve just been spoiled to have optimization problems that, while very nonconvex and unpleasant, don’t really have local maxima problems and stuff.
Thanks again for the response!
It really is not a good idea - I’m sorry. -grad is NOT generally a descent direction for |grad| and therefore there is no guarantee that a linesearch will even terminate. Please don’t do this.
On second thought, you are right. In non-convex cases, this can easily fail in any concave section. So I doubt one can do much better than just following the negative gradient with a small step size when the objective isn’t available, which is similar enough to what SGD, ADAM, etc. does in deep learning.
I asked a similar question here: How can I use Optim/BFGS on functions I can't evaluate but whose gradients I can compute? - #11 by marius311
In the end I settled on using a zero-finder on the gradient, specifically Broyden.
One caveat to note is that a zero-finder doesn’t assume the gradient vector is a conservative vector field, even though you secretly know that it is since its a gradient. This generally means these algorithms won’t be exactly equivalent to a hypothetical optimizer that ignores the objective. That said this might just be an academic point, at least on my problems I didn’t find a big difference in practice.
I do wish there was an easy way to have Optim use a fixed step size and to never try and evaluate the object. It seems that LineSearches.Static exists for fixed step size, but I don’t think Optim has an option to never use the objective, at least last time I checked. It would be nice to have that.