How to train dense nets several times faster than with Adam

I had the pleasure of giving a lecture on the basics of neural nets at a university, and I deliberately used Julia for the practical lessons as well.

As a side effect, I wanted to explain to my students how an optimizer can be derived from scratch and why current first-order optimizers use EMA. So I deliberately designed a non-EMA optimizer, with the unforeseen result that in all test cases from the lecture it was a factor of 2 to 20 faster than Adam.

I therefore wrote a proper scientific article about it and also published its Julia source code, so you can easily try it out if you like:

The paper can be found via its DOI:

By the way, it is impressive how quickly the students learned Julia (and also how fast it is compared to PyTorch). About 1/3 of my lecture was live coding, and the interactive PlotlyJS is a killer feature for that.

Sorry if that is a beginner question, but I just do Optimization and not so much ML. What is EMA?

edit: Ah, never mind. While your paper did not load on Zenodo, the PDF is in the repo: exponential moving average.
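
For anyone else wondering where the EMA shows up in a first-order optimizer: below is a minimal sketch of the standard Adam update on a 1-D toy problem. It is not SurpriseOpt and not code from the paper; the function name `adam_minimize` and the quadratic test function are assumptions made purely for illustration.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function with Adam, given its gradient `grad`."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # EMA of the gradients
        v = beta2 * v + (1 - beta2) * g * g    # EMA of the squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy problem (assumed for illustration): f(x) = (x - 3)^2, gradient 2 * (x - 3)
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
```

The two lines marked "EMA" are the moving averages in question: each new gradient only nudges `m` and `v`, so the step direction and scale are smoothed over many iterations.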

Still: which non-EMA solver did you use? In optimization, something like a constant stepsize is not enough, and usually you want to use something like (L)BFGS.
In ML I can understand that (full) gradients, and hence also BFGS, are a bit too expensive. But if you use these second-order solvers, of course they are better?

while your paper did not load on Zenodo

Hmm, works for me, there is also a direct download link:

Which non-EMA solver did you use?

SurpriseOpt, my new optimizer and the topic of my paper, is the non-EMA solver.

like (L)BFGS

(L)BFGS cannot be used with mini-batching and is computationally expensive. However, in Appendix A.1 of my paper I describe how SurpriseOpt in combination with (L)BFGS can decrease the time to solution for PINNs, where (L)BFGS is a very good choice. The test I described is also available in my repository.

Zenodo was down most of the day for me, but it works now. Looks super interesting!

Nice, it’s always interesting to see what people come up with regarding optimization.

A constant stepsize can be enough in stochastic optimization: https://arxiv.org/pdf/1704.00116
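
To make that concrete, here is a small sketch of constant-stepsize SGD on a noisy 1-D quadratic. It is a toy setup I am assuming for illustration (minimizer at 2, unit Gaussian gradient noise, hypothetical function name `constant_step_sgd`), not the method from the linked paper. The individual iterates keep bouncing around the minimizer, but the average over the tail of the run settles close to it.

```python
import random

def constant_step_sgd(lr=0.1, steps=6000, burn_in=1000, seed=0):
    """Run SGD with a fixed stepsize on f(x) = 0.5 * (x - 2)^2 with noisy gradients."""
    rng = random.Random(seed)
    x = 0.0
    tail = []
    for k in range(steps):
        # Exact gradient (x - 2) plus zero-mean noise, mimicking mini-batch sampling
        noisy_grad = (x - 2.0) + rng.gauss(0.0, 1.0)
        x -= lr * noisy_grad                   # the stepsize never decays
        if k >= burn_in:
            tail.append(x)                     # keep iterates after the transient
    return x, sum(tail) / len(tail)

last_iterate, tail_average = constant_step_sgd()
```

With a smaller `lr` the cloud of iterates around the minimizer gets tighter, at the price of a longer transient.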

Second-order solvers are definitely better for smooth and preferably convex objectives, but that’s usually not what ML has to deal with. Stochastic variants of second-order methods are being developed, but it’s not a simple thing.

There is quite a bit of research on how to extend L-BFGS to work with mini-batching. This looks fairly promising, for example: "A Proximal Stochastic Quasi-Newton Algorithm with Dynamical Sampling and Stochastic Line Search" (Journal of Scientific Computing). Estimating (s, y) pairs is definitely a tricky thing, but at least it doesn’t look prohibitively expensive to me.
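
A rough sketch of why the (s, y) pairs are tricky, under assumptions of my own (a 2-D convex quadratic, and independent Gaussian noise on each gradient to mimic evaluating the two gradients on different mini-batches). With exact gradients the curvature condition s·y > 0 that BFGS-type updates rely on always holds here; with noisy gradients it is violated for a fair share of the pairs.

```python
import random

H = [[3.0, 1.0], [1.0, 2.0]]  # symmetric positive definite Hessian (assumed toy problem)

def grad(x, noise=0.0, rng=None):
    """Gradient of f(x) = 0.5 * x^T H x, optionally perturbed by Gaussian noise."""
    g = [H[0][0] * x[0] + H[0][1] * x[1], H[1][0] * x[0] + H[1][1] * x[1]]
    if rng is not None:
        g = [gi + rng.gauss(0.0, noise) for gi in g]
    return g

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def curvature(x_old, x_new, noise=0.0, rng=None):
    """Return s.y for s = x_new - x_old and y = grad(x_new) - grad(x_old)."""
    s = [n - o for n, o in zip(x_new, x_old)]
    g_old = grad(x_old, noise, rng)
    g_new = grad(x_new, noise, rng)
    y = [n - o for n, o in zip(g_new, g_old)]
    return dot(s, y)

x_old, x_new = [1.0, 1.0], [0.9, 0.95]
exact = curvature(x_old, x_new)  # y = H s exactly, so s.y > 0

rng = random.Random(0)
violations = sum(curvature(x_old, x_new, noise=0.5, rng=rng) <= 0 for _ in range(1000))
```

Stochastic quasi-Newton methods have to deal with this somehow, for example by computing both gradients on the same batch or by skipping pairs that fail the check.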