Momentum methods in Optim - do you use them?

pkofod · October 12, 2017, 6:14pm

As the title suggests, I’m currently wondering if Accelerated Gradient Descent and Momentum Gradient Descent should be (re)moved from Optim.

“Momentum methods” certainly have their use, and machine learning seems to have adopted them to avoid getting stuck at saddle points/local minima, but I’m not so sure the two methods are a good fit for Optim: certainly not in their current form, and maybe not in any form. We don’t really do SGD-type methods, and I doubt many people are using Optim in the situations where AGD and MGD are popular.

So, this post is just to get some feedback, and to see if twenty people reply: “Don’t remove it! I use it all the time!” or not. Should we remove them from Optim, they would go into another package, see https://github.com/JuliaNLSolvers/LegacyOptim.jl for how that might look. Of course, they would stay for a major version release cycle with proper deprecation warnings for their constructors, etc.

Best,
Patrick

Tamas_Papp · October 12, 2017, 6:37pm

To be honest, this is the first time I’ve heard of them, and I use Optim.jl regularly.

jonathanBieler · October 12, 2017, 6:54pm

Fine for me. I think there should be a package dedicated to gradient descent, with all the bells and whistles and just a simple version in Optim.

pkofod · October 13, 2017, 9:09am

Yeah… they’re sort of not advertised at the moment, but they’ve been there for as long as I’ve been involved in the project.

samuelpowell · October 13, 2017, 10:05am

These methods used to be advertised in the documentation and I do make occasional use of them. In particular, it is quite useful to be able to swap from a standard gradient descent to momentum/accelerated versions for certain problems I deal with which are mildly (!?) stochastic, simply to assess the performance of the alternative methods.

Is the problem that having so many methods creates additional work in developing the Optim API, or is this more philosophical?

pkofod · October 13, 2017, 10:21am

It’s more of a maintenance question, but it’s interesting to hear you’ve had success using them. Do you find that they work well with the default values, or do you change to something like a Static() line search, for example?

I should note that the change I have in mind would simply require you to do something like

using Optim
optimize(yourthing, x0, GradientDescent())
# oh I want to try AGD
using MomentumOptim
optimize(yourthing, x0, AcceleratedGradientDescent())

The thing with keeping them in Optim is that if no one is really motivated to write documentation, make sure the code is correct, etc., there’s a risk that Optim carries these half implementations around. I’ve personally never used them, and I only found two bugs in AGD, for example, completely by accident. That’s why I was wondering if anybody actually did use them.

samuelpowell · October 13, 2017, 10:32am

Yes, I tend to have to use a static line search, as the noise which motivates the use of these methods breaks Hager-Zhang (backtracking is sometimes okay).
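For readers who have not met the combination, here is a minimal sketch of momentum (heavy-ball) gradient descent with a fixed step length, i.e. with no line search at all. It is plain Julia and does not use Optim: the function name `momentum_descent`, the keyword names `alpha` and `mu`, and the toy quadratic are all assumptions made for illustration, not Optim's API or its implementation. The point is only that a fixed step never compares noisy function values, which is what a bracketing line search has to do.

```julia
# Heavy-ball momentum descent with a fixed step length (no line search).
# `momentum_descent`, `alpha` and `mu` are illustrative names, not Optim's API.
function momentum_descent(grad, x0; alpha = 0.1, mu = 0.5, iterations = 200)
    x = float.(copy(x0))
    v = zero(x)                          # velocity: decaying sum of past steps
    for _ in 1:iterations
        v = mu .* v .- alpha .* grad(x)  # keep a fraction mu of the previous step
        x = x .+ v                       # fixed step: no function values compared
    end
    return x
end

# Toy problem (an assumption): f(x) = sum((x .- 1).^2) / 2, minimised at x = 1.
grad_exact(x) = x .- 1.0

# The same gradient with a little additive noise, standing in for a
# mildly stochastic objective.
grad_noisy(x) = x .- 1.0 .+ 0.01 .* randn(length(x))

x_exact = momentum_descent(grad_exact, [5.0, -3.0])
x_noisy = momentum_descent(grad_noisy, [5.0, -3.0])
```

With exact gradients the iterates settle on the minimiser; with noisy gradients they hover close to it rather than converging exactly, since the noise is re-injected at every step.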

Your proposal sounds fine - it’s far more important that we have high quality conventional GD/CG/QN/GN methods than a smorgasbord of subtly different GD approaches. As you mentioned, this is more the domain of stochastic optimisation.

pkofod · October 13, 2017, 11:11am

I figured. I was actually thinking that if we did decide to leave it in, we should maybe explicitly require a line search to be specified for these methods (though I think it would be better to move them to MomentumOptim and explicitly and up front warn about the choice of LS).

samuelpowell · October 13, 2017, 11:24am

Would it make sense for these alternative methods to go into a package called StochasticOptim? Perhaps that would couple the methods too tightly to only one interpretation of their use.

jonathanBieler · October 13, 2017, 12:32pm

BlackBoxOptim.jl already plays that role to some extent.

I think that gradient descent is a big enough topic to have its own package (maybe based on SGDOptim.jl). It would be nice to have a parametric type in which you can mix and match methods, e.g.

g = GradientDescent(
    MiniBatch(100),
    AdaGrad(),
    Hypergradient()
)

optimize(f,g,pinit,...)

Edit: it seems OnlineStats also has some of these: http://joshday.github.io/OnlineStats.jl/latest/pages/api.html#OnlineStats.StatLearn

pkofod · October 13, 2017, 3:54pm

Flux as well, I think.

pkofod · October 13, 2017, 3:56pm

Though we do have to be a bit careful: some of these packages play significantly different roles/serve different purposes.
