How to train dense nets several times faster than with Adam

uwestoehr · May 7, 2026, 2:40am

I had the pleasure of giving a lecture on the basics of neural nets at a university, and I deliberately used Julia for the practical lessons as well.

As a side effect, I wanted to explain to my students how an optimizer can be derived from scratch and why current first-order optimizers use EMA. So I deliberately designed a non-EMA optimizer, with the unforeseen result that in all test cases from the lecture it was a factor of 2 to 20 faster than Adam.

I therefore wrote a proper scientific article about it and also published its Julia source code, so you can easily try it out if you like:

The paper can be found via its DOI:

By the way, it is impressive how quickly the students learned Julia (and also how fast it is compared to PyTorch). About 1/3 of my lecture was live coding, and the interactive PlotlyJS is a killer feature for that.

kellertuer · May 7, 2026, 6:46am

Sorry if that is a beginner question, but I just do Optimization and not so much ML. What is EMA?

edit: Ah, never mind. While your paper did not load on Zenodo, the PDF is in the repo: exponential moving average.

Still: which non-EMA solver did you use? In optimization, something like a constant stepsize is not enough, and usually you want to do something like (L)BFGS.
In ML I can understand that (full) gradients, and hence also BFGS, are a bit too expensive. But if you use these second-order solvers, of course they are better?
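
For readers with the same question, here is a minimal sketch of where the EMA shows up in an Adam-style update. It follows the standard published Adam rule, not anything from the paper discussed here; the names `AdamState` and `adam_step!` are made up for illustration, and only Base Julia is used.

```julia
# Sketch of the standard Adam update, written out to show the two
# exponential moving averages (EMAs). All names here are illustrative.
mutable struct AdamState
    m::Vector{Float64}   # EMA of the gradients (first moment)
    v::Vector{Float64}   # EMA of the squared gradients (second moment)
    t::Int               # step counter, needed for bias correction
end

AdamState(n::Integer) = AdamState(zeros(n), zeros(n), 0)

function adam_step!(x, g, s::AdamState; lr=1e-3, β1=0.9, β2=0.999, ϵ=1e-8)
    s.t += 1
    @. s.m = β1 * s.m + (1 - β1) * g      # EMA of the gradient
    @. s.v = β2 * s.v + (1 - β2) * g^2    # EMA of the squared gradient
    m_hat = s.m ./ (1 - β1^s.t)           # bias correction for the zero init
    v_hat = s.v ./ (1 - β2^s.t)
    @. x -= lr * m_hat / (sqrt(v_hat) + ϵ)
    return x
end

# Toy usage: minimize f(x) = sum(x.^2)/2, whose gradient is x itself.
x = [1.0, -2.0]
state = AdamState(length(x))
for _ in 1:2000
    adam_step!(x, copy(x), state; lr=1e-2)
end
```

A non-EMA optimizer, in the sense used in this thread, would have to get its step direction and scaling without the two running averages `m` and `v`.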

uwestoehr · May 7, 2026, 3:11pm

while your paper did not load on Zenodo

Hmm, works for me, there is also a direct download link:

Which non-EMA solver did you use?

SurpriseOpt, my new optimizer and the topic of my paper, is the non-EMA solver.

like (L)BFGS

(L)BFGS cannot be used with mini-batching and is computationally expensive. However, in Appendix A.1 of my paper I describe how SurpriseOpt in combination with (L)BFGS can reduce the time to solution for PINNs, for which (L)BFGS is a very good choice. The test I described is also available in my repository.

mbaz · May 7, 2026, 11:29pm

Zenodo was down most of the day for me, but it works now. Looks super interesting!

mateuszbaran · May 8, 2026, 10:36am

Nice, it’s always interesting to see what people come up with regarding optimization.

A constant stepsize can be enough in stochastic optimization: https://arxiv.org/pdf/1704.00116 .

Second-order solvers are definitely better for smooth and preferably convex objectives, but that’s usually not what ML has to deal with. Stochastic variants of second-order methods are being developed, but it’s not a simple thing.

There is quite a bit of research on how to extend L-BFGS to work with mini-batching. This one looks fairly promising, for example: “A Proximal Stochastic Quasi-Newton Algorithm with Dynamical Sampling and Stochastic Line Search” (Journal of Scientific Computing). Estimating (s, y) pairs is definitely a tricky thing, but at least it doesn’t look prohibitively expensive to me.
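
To illustrate why the (s, y) pairs are tricky under mini-batching, here is a small sketch on a toy least-squares problem. It is not taken from the linked paper; `batch_grad`, the data, and the batch ranges are all made up for illustration. It compares a gradient difference taken on the same batch at both points with one taken across two different batches.

```julia
using LinearAlgebra, Random

# Mini-batch gradient of the least-squares loss 1/(2|B|) * ||A[B,:]*x - b[B]||^2.
batch_grad(A, b, x, B) = A[B, :]' * (A[B, :] * x - b[B]) / length(B)

rng = MersenneTwister(1)
A, b = randn(rng, 200, 5), randn(rng, 200)   # synthetic data (assumption)
x_old = randn(rng, 5)
x_new = x_old - 0.1 * batch_grad(A, b, x_old, 1:20)   # one SGD step on batch 1:20

s = x_new - x_old

# (a) Same batch at both points: y is a gradient difference of one fixed function.
y_same = batch_grad(A, b, x_new, 1:20) - batch_grad(A, b, x_old, 1:20)

# (b) Different batches: y mixes curvature information with sampling noise.
y_mixed = batch_grad(A, b, x_new, 21:40) - batch_grad(A, b, x_old, 1:20)

# For this convex loss, dot(s, y_same) equals ||A[B,:]*s||^2 / |B| and is therefore
# nonnegative; dot(s, y_mixed) has no such guarantee.
curvature_same  = dot(s, y_same)
curvature_mixed = dot(s, y_mixed)
```

Quasi-Newton updates rely on the curvature condition dot(s, y) > 0. Evaluating both gradients on the same batch preserves it for convex batch losses, at the cost of an extra gradient evaluation per stored pair.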
