How to train dense nets several times faster than with Adam

I had the pleasure of giving a lecture on the basics of neural nets at a university and deliberately used Julia for the practical lessons as well.

As a side effect I wanted to explain to my students how an optimizer can be derived from scratch and why current first-order optimizers use an EMA. So I deliberately designed a non-EMA optimizer, with the unforeseen result that in all test cases from the lecture it was a factor of 2 to 20 faster than Adam.
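
For context, here is a minimal sketch of where the EMAs enter a textbook first-order update (this is plain Adam, not the optimizer from the paper; `grad` is a placeholder for whatever returns the mini-batch gradient of the loss at θ):

```julia
# Textbook Adam step: two exponential moving averages (EMAs) of the gradient.
# `grad(θ)` is assumed to return the (mini-batch) gradient of the loss at θ.
function adam_step!(θ, grad, m, v, t; η=1e-3, β1=0.9, β2=0.999, ϵ=1e-8)
    g = grad(θ)
    @. m = β1 * m + (1 - β1) * g      # EMA of the gradient (first moment)
    @. v = β2 * v + (1 - β2) * g^2    # EMA of the squared gradient (second moment)
    mhat = m ./ (1 - β1^t)            # bias correction for the EMA warm-up
    vhat = v ./ (1 - β2^t)
    @. θ -= η * mhat / (sqrt(vhat) + ϵ)
    return θ
end
```

A "non-EMA" optimizer in the sense above is one whose update is computed from the current step alone, without carrying the running averages m and v between steps.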

I therefore wrote a proper scientific article about it and also published the Julia source code, so that you can easily try it out if you like:

The paper can be found via its DOI:

By the way, it is impressive how quickly the students learned Julia (and also how fast it is compared to PyTorch). About 1/3 of my lecture was live coding, and the interactive plotlyJS plots are a killer feature for that.

Sorry if this is a beginner question, but I mostly do Optimization and not so much ML. What is EMA?

edit: Ah, never mind. Your paper did not load for me on Zenodo, but the PDF is in the repo: EMA = exponential moving average.
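
(For anyone else wondering: the EMA is the running average m_t = β·m_{t-1} + (1−β)·g_t of the gradients, which is the state that optimizers like Adam carry between steps.)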

Still: which non-EMA solver did you use? I mean, in Optimization something like a constant step size is not enough, and usually you want to do something like (L)BFGS.
In ML I can understand that full gradients, and hence also BFGS, are a bit too expensive. But if you do use these second-order solvers, they are of course better?
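
To make concrete what I mean by such a solver, here is a minimal Optim.jl sketch on a toy deterministic problem (not a neural net, and full gradients assumed cheap):

```julia
using Optim

# Classic Rosenbrock test function; with no gradient supplied,
# Optim.jl falls back to finite-difference gradients.
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

result = optimize(rosenbrock, zeros(2), LBFGS())
println(Optim.minimizer(result))   # converges to ≈ [1.0, 1.0]
```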