Using LBFGS to train Flux models

Hey everyone,

I have found that LBFGS is widely used for Physics-Informed Neural Networks (PINNs), and I am wondering whether it is also worth using for ordinary MLPs. Is there a way to use it in Flux?

Thank you for your answers!

FluxOptTools.jl lets you use Optim with Flux, and Optim has LBFGS.


Interesting. Do you know if FluxOptTools will work when training on a GPU?

I don’t know, sorry, though it should be quick to test.

Thanks, it works really well. It’s quite impressive to see how efficient this optimizer is. I wonder why it isn’t taught in ML lectures, given how well it works.

Now, one last question: we are calling Optim’s optimize function, which does everything in a black-box manner. What is its stopping condition? Is there a way to change it?

I did try, but ran into trouble, which I didn’t make a serious effort to solve. I will look into it a bit more.

See the options available in Optim, e.g. iterations and callback.
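To make "stopping condition" concrete, here is a minimal sketch of the kinds of rules a black-box optimizer typically checks: an iteration cap, a gradient-norm tolerance, and a user callback that can request an early stop. It is plain Python and deliberately generic. The function `minimize` and its parameters `max_iterations`, `g_tol`, and `callback` are hypothetical names chosen for illustration, not Optim's API; for the real settings, see the options in Optim's documentation.

```python
def minimize(grad, x0, step=0.1, max_iterations=1000, g_tol=1e-8, callback=None):
    """Fixed-step gradient descent with three common stopping rules.

    Illustrative only: the names and defaults are assumptions, not Optim's.
    Returns (x, iterations_used, reason).
    """
    x = list(x0)
    for iteration in range(max_iterations):
        g = grad(x)
        g_norm = max(abs(gi) for gi in g)  # infinity norm of the gradient

        # Rule 1: the gradient is small enough, so we call it converged.
        if g_norm <= g_tol:
            return x, iteration, "gradient tolerance reached"

        # Rule 2: the user's callback asks to stop (e.g. on a validation metric).
        if callback is not None and callback(iteration, x, g_norm):
            return x, iteration, "stopped by callback"

        x = [xi - step * gi for xi, gi in zip(x, g)]

    # Rule 3: the iteration budget ran out before the other rules fired.
    return x, max_iterations, "iteration limit reached"


# Usage on f(x) = x[0]^2 + x[1]^2, whose gradient is 2*x.
grad = lambda x: [2.0 * x[0], 2.0 * x[1]]
print(minimize(grad, [1.0, 1.0])[2])
print(minimize(grad, [1.0, 1.0], max_iterations=5)[2])
print(minimize(grad, [1.0, 1.0], callback=lambda k, x, g: k >= 3)[2])
```

Whichever rule fires first ends the run, so it is worth checking which one actually stopped a run before trusting the result.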

Probably for two reasons. First, in traditional ML you are often doing stochastic optimization, where on each iteration you are sampling a batch of the training data (as opposed to full-batch training), both for efficiency and to avoid overfitting (see also this nice lecture on stochastic gradient descent), and usually you are converging to relatively low accuracy (again to avoid overfitting: you don’t want a neural net trained to 10 decimal places). There are stochastic L-BFGS variants, but my (vague, possibly wrong) impression is that their advantages over Adam etc. are not as clear as in a non-stochastic high-accuracy setting.
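As a small illustration of the full-batch versus mini-batch distinction, the sketch below computes the gradient of a mean-squared-error loss for a one-parameter model y ≈ w·x, once over all the data and once per mini-batch. The data, the model, and the helper name `grad_mse` are all made up for this example. The per-batch gradients scatter around the full-batch one, and that scatter is the noise a stochastic method has to live with.

```python
def grad_mse(w, xs, ys):
    """Gradient with respect to w of mean((w*x - y)^2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data, roughly y = 2x with small perturbations (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8]
w = 0.5

# Full-batch gradient: uses every sample on every iteration.
full_grad = grad_mse(w, xs, ys)

# Mini-batch gradients: each one sees only batch_size samples.
batch_size = 2
batch_grads = [
    grad_mse(w, xs[i:i + batch_size], ys[i:i + batch_size])
    for i in range(0, len(xs), batch_size)
]

print("full-batch gradient:", full_grad)
print("mini-batch gradients:", batch_grads)
```

With equal-sized batches that partition the data, the batch gradients average to the full-batch gradient, but each individual one can be far from it.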

Second, explaining how BFGS works is quite difficult. Even deriving a BFGS step in a basic form requires a lot of linear algebra and convex-optimization theory, then you have L-BFGS via low-rank approximation, and convergence proofs of BFGS are even harder. Most ML classes don’t assume the necessary background for students to cover any of this in detail.
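For reference, the update itself is short to state even though deriving and analyzing it is not. This is the standard textbook form, using the usual notation $s_k$, $y_k$, $H_k$ (symbols not defined elsewhere in this thread). With the step and the change in gradient

$$
s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k), \qquad \rho_k = \frac{1}{y_k^{\top} s_k},
$$

BFGS updates its approximation $H_k$ of the inverse Hessian so that the secant condition $H_{k+1} y_k = s_k$ holds:

$$
H_{k+1} = \left(I - \rho_k s_k y_k^{\top}\right) H_k \left(I - \rho_k y_k s_k^{\top}\right) + \rho_k s_k s_k^{\top}.
$$

L-BFGS never forms $H_k$ explicitly. It stores only the most recent $m$ pairs $(s_k, y_k)$ and applies the update implicitly when computing the search direction $-H_k \nabla f(x_k)$.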

(But of course, they could mention L-BFGS as a black-box algorithm, handwave a bit about what it does, and show it off in an example.)


Speculating and spitballing quite a bit: (L)BFGS builds up some cross second-derivative information, so it seems plausible that it can descend into a local minimum faster than Adam or similar methods. The limited-memory version seems very appropriate for neural nets, which often have far more parameters than many other problems.

It seems conceivable to me that LBFGS could potentially get trapped in a local minimum more easily than more traditional methods for neural nets.

The caveat here is that the second-derivative information in BFGS becomes more accurate only as you approach a local minimum where the function is approximately quadratic, so sometimes the acceleration only kicks in when you are trying to refine the optimum to higher accuracies (which you often don’t do in typical ML, but do for PINNs). And the benefit might be quickly degraded by noise in stochastic optimization problems.
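To see the curvature effect in the idealized setting described above (a function that is exactly quadratic, no noise), here is a minimal L-BFGS sketch in plain Python, compared against fixed-step gradient descent on a small ill-conditioned quadratic with a cross term. Several things here are assumptions made for illustration: the matrix `A`, the starting point, the step size, and the exact line search, which is only possible because the function is quadratic. Real implementations such as the one in Optim use an inexact line search, and fixed-step gradient descent is only a stand-in, not Adam.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def lbfgs_direction(g, s_hist, y_hist):
    """Two-loop recursion: returns -H*g, with H built from the stored (s, y) pairs."""
    q = list(g)
    alphas = []
    # First loop: newest pair to oldest.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        a = dot(s, q) / dot(y, s)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = dot(y, q) / dot(y, s)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def lbfgs_quadratic(A, x0, memory=5, g_tol=1e-8, max_iterations=100):
    """Minimize f(x) = 0.5 * x^T A x with L-BFGS; returns (x, iterations)."""
    x, g = list(x0), matvec(A, x0)
    s_hist, y_hist = [], []
    for iteration in range(max_iterations):
        if max(abs(gi) for gi in g) <= g_tol:
            return x, iteration
        d = lbfgs_direction(g, s_hist, y_hist)
        alpha = -dot(g, d) / dot(d, matvec(A, d))  # exact step, quadratic only
        x_new = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = matvec(A, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(y, s) > 1e-12:  # keep only pairs with positive curvature
            s_hist.append(s)
            y_hist.append(y)
            if len(s_hist) > memory:
                s_hist.pop(0)
                y_hist.pop(0)
        x, g = x_new, g_new
    return x, max_iterations


def gradient_descent_quadratic(A, x0, step, g_tol=1e-8, max_iterations=100000):
    """Fixed-step gradient descent on the same quadratic; returns (x, iterations)."""
    x = list(x0)
    for iteration in range(max_iterations):
        g = matvec(A, x)
        if max(abs(gi) for gi in g) <= g_tol:
            return x, iteration
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_iterations


# Symmetric positive definite, with a cross term and a large condition number.
A = [[100.0, 9.0], [9.0, 1.0]]
x0 = [1.0, 1.0]
x_lbfgs, n_lbfgs = lbfgs_quadratic(A, x0)
x_gd, n_gd = gradient_descent_quadratic(A, x0, step=0.01)
print("L-BFGS iterations:", n_lbfgs)
print("gradient descent iterations:", n_gd)
```

On this toy problem L-BFGS needs only a handful of iterations while fixed-step gradient descent needs thousands. That is the best case for quasi-Newton methods, and it says nothing about how the comparison goes on a noisy, non-convex training loss.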

(The “momentum” terms in Adam etc. also incorporate some second-derivative information, being closely related to conjugate-gradient methods.)

The way I see it, rapidly moving into a local minimum is not much of a benefit if you end up trapped there, and stochastic gradients are an important tool for avoiding that. So the performance of LBFGS relative to Adam etc. at finding a good local minimum for a neural net, not just any local minimum, is probably an empirical question, and it may depend on the starting point.

Regarding second derivatives, I believe the momentum feature of Adam effectively forms an approximation to the own (diagonal) second derivatives. LBFGS also incorporates cross-partial information, which seems to me to be a notable difference.
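For comparison, the standard Adam update, written in the usual notation (gradient $g_t$, parameters $\theta_t$, step size $\eta$, decay rates $\beta_1, \beta_2$, small constant $\epsilon$; none of these symbols appear elsewhere in this thread), is

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t,
$$

$$
\hat m_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^{\,t}}, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
$$

where $\odot$ and the division are elementwise. Whatever one takes $v_t$ to approximate, the rescaling acts on each coordinate separately, so it behaves like a diagonal preconditioner. The BFGS inverse-Hessian approximation is instead a full matrix whose off-diagonal entries couple different parameters, and those entries are where the cross-partial information lives.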