Momentum methods in Optim - do you use them?


As the title suggests, I’m currently wondering if Accelerated Gradient Descent and Momentum Gradient Descent should be (re)moved from Optim.

“Momentum methods” certainly have their uses, and machine learning seems to have adopted them to avoid getting stuck at saddle points/local minima. But I’m not so sure the two methods are a good fit for Optim, at least not in their current form, and maybe not in any form. We don’t really do SGD-type methods, and I doubt many people are using Optim in the situations where AGD and MGD are popular.
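For readers who haven’t met them: the update these methods build on is just a velocity-smoothed gradient step. A minimal self-contained sketch on a toy quadratic (illustrative only, not Optim’s actual implementation):

```julia
# Heavy-ball (momentum) gradient descent on a simple quadratic.
# Illustrative sketch only; not Optim's implementation.
function momentum_gd(∇f, x0; η = 0.1, β = 0.9, iters = 200)
    x, v = copy(x0), zero(x0)
    for _ in 1:iters
        v = β .* v .- η .* ∇f(x)   # velocity: decaying average of past gradients
        x = x .+ v                 # step along the velocity, not the raw gradient
    end
    return x
end

# f(x) = sum(x.^2)/2 has gradient x; the minimizer is the origin.
momentum_gd(x -> x, [4.0, -3.0])
```

The point of the velocity term is that it damps oscillation across narrow valleys and can carry the iterate past flat regions, which is why the ML crowd likes it for saddle-point-rich landscapes.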

So, this post is just to get some feedback, and to see whether twenty people reply “Don’t remove them! I use them all the time!” or not. Should we remove them from Optim, they would go into another package. Of course, they would stay for a major version release cycle, with proper deprecation warnings for their constructors, etc.



To be honest, this is the first time I hear about them, and I use Optim.jl regularly.


Fine for me. I think there should be a package dedicated to gradient descent, with all the bells and whistles, and just a simple version in Optim.


Yeah… they’re sort of not advertised at the moment, but they’ve been there for as long as I’ve been involved in the project.


These methods used to be advertised in the documentation and I do make occasional use of them. In particular, it is quite useful to be able to swap from a standard gradient descent to momentum/accelerated versions for certain problems I deal with which are mildly (!?) stochastic, simply to assess the performance of the alternative methods.

Is the problem that having so many methods creates additional work in developing the Optim API, or is this more philosophical?


It’s more of a maintenance question, but it’s interesting to hear you’ve had success using them. Do you find that they work well with the default values, or do you switch to something like a Static() line search, for example?

I should note that the change I have in mind would simply require you to do something like

using Optim
optimize(yourthing, x0, GradientDescent())
# oh I want to try AGD
using MomentumOptim
optimize(yourthing, x0, AcceleratedGradientDescent())

The thing with keeping them in Optim is that if no one is really motivated to write documentation, make sure the code is correct, and so on, there’s a risk that Optim just carries these half-implementations around. I’ve personally never used them, and the two bugs I did find in AGD, for example, were found completely by accident. That’s why I was wondering whether anybody actually uses them.


Yes, I tend to have to use a static line search, as the noise which motivates the use of these methods breaks Hager-Zhang (backtracking is sometimes okay).
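To illustrate why a fixed step can survive noise that breaks line-search acceptance tests, here is a toy sketch with made-up numbers (nothing to do with Optim’s internals): a constant step length makes steady progress toward a neighborhood of the minimizer even when every gradient evaluation is corrupted.

```julia
using Random

# Fixed-step ("static") gradient descent with a noisy gradient oracle.
# Illustrative sketch only; not Optim's implementation.
function static_gd(∇f, x0; η = 0.05, iters = 500)
    x = copy(x0)
    for _ in 1:iters
        x = x .- η .* ∇f(x)    # same step length every iteration, no line search
    end
    return x
end

rng = MersenneTwister(0)
# True objective f(x) = sum(x.^2)/2, so the exact gradient is x itself;
# the solver only ever sees it corrupted by noise.
noisy_grad(x) = x .+ 0.1 .* randn(rng, length(x))
static_gd(noisy_grad, [4.0, -3.0])   # lands near the origin, within the noise floor
```

If memory serves, the swap in Optim itself goes through the `linesearch` keyword, e.g. `GradientDescent(linesearch = LineSearches.Static())` with LineSearches.jl loaded.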

Your proposal sounds fine - it’s far more important that we have high-quality conventional GD/CG/QN/GN methods than a smorgasbord of subtly different GD approaches. As you mentioned, this is more the domain of stochastic optimisation.


I figured. I was actually thinking that if we did decide to leave them in, we should maybe require a line search to be explicitly specified for these methods (though I think it would be better to move them to MomentumOptim and warn, explicitly and up front, about the choice of line search).


Would it make sense for these alternative methods to go into a package called StochasticOptim? Perhaps that would couple the methods too tightly to only one interpretation of their use.


BlackBoxOptim.jl already plays that role to some extent.

I think that gradient descent is a big enough topic to have its own package (maybe based on SGDOptim.jl). It would be nice to have a parametric type in which you can mix and match methods, e.g.

g = GradientDescent(stepsize = ConstantStep(0.1), momentum = Nesterov(0.9))  # hypothetical names, for illustration

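To make the mix-and-match idea concrete, here is one hypothetical shape such a parametric type could take. Every name below is invented for illustration; none of this exists in Optim or SGDOptim.jl:

```julia
# Hypothetical composable gradient-descent configuration.
# All names invented for illustration; not an existing API.
abstract type StepSize end
struct ConstantStep <: StepSize
    η::Float64
end

abstract type Momentum end
struct NoMomentum <: Momentum end
struct HeavyBall <: Momentum
    β::Float64
end
struct Nesterov <: Momentum
    β::Float64
end

# The solver type is parametric in its building blocks, so update
# methods can dispatch on the exact combination being used.
struct GD{S<:StepSize, M<:Momentum}
    step::S
    momentum::M
end

GD(ConstantStep(0.1), NoMomentum())    # plain gradient descent
GD(ConstantStep(0.1), Nesterov(0.9))   # accelerated flavour
```

The appeal of the parametric design is that adding a new momentum rule is just a new subtype plus one small method, rather than a whole new solver.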

Edit: it seems OnlineStats also has some of these.


Flux as well, I think.


Though we do have to be a bit careful: some of these packages play significantly different roles/serve different purposes.