Preconditioner for L-BFGS in Optim.jl

I’m stumped, and I’m not sure if my problem is a lack of understanding of the L-BFGS implementation or the Optim.jl interface.

I’m trying to figure out how to create a preconditioner that is equivalent to the approximate Hessian L-BFGS had built up by the time it converged for a previous input value.

The goal is to use the previous approximate Hessian to accelerate convergence while continuing along a bifurcation branch. I’m following a paper whose authors rolled their own L-BFGS in C++, and their code isn’t available. The paper says that by using the previous approximate Hessian they reduced the number of iterations by factors of 10–20 in many cases, so I’m really interested in this optimization. I’m not sure if this is relevant, but they also store twice as many Hessian updates as there are columns in the Hessian, which is what I set m to in the LBFGS options.

I was considering using precondprep to update the preconditioner, but since precondprep only passes the current value of x, this would require recomputing the gradient, which involves solving a PDE and a handful of FFTs. So it doesn’t make sense to add extra evaluations of the most expensive routine when the result has already been computed.
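For context, this is roughly how those hooks appear in the LBFGS constructor as I read the Optim.jl docs (keyword names are worth double-checking against your installed version):

    # Rough sketch of the preconditioning hooks in Optim.jl's LBFGS options, as I
    # understand the docs. precondprep is called with only (P, x), which is why
    # updating P there would force an extra gradient evaluation (PDE solve + FFTs).
    using Optim

    method = LBFGS(m = 20,                           # history length; I set this to 2x the Hessian columns
                   P = nothing,                      # preconditioner, e.g. something saved from a previous run
                   precondprep = (P, x) -> nothing)  # update hook that only sees the current x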

I’m hoping that I’m just missing something in the docs or source, but I’ve pored over both and the forums already, so hopefully someone knows how to do this.

Let me know if I need to clarify anything. Thanks!

Could you attach a link to the paper you mentioned? I would be happy to figure out how to do this.

Also, in my experience the full BFGS works much better in Julia and is not much slower. The default linesearch also matters; for me backtracking works best.

Sure, here are the two papers where they reveal the most information about the specifics of their convergence results for BFGS:

https://www.pnas.org/content/pnas/107/8/3361.full.pdf

“In our continuation algorithm, we initialize the approximate Hessian with that of the previous minimization step (rather than the identity matrix), which leads to a tremendous reduction in the number of iterations required to converge (by factors of 10–20 in many cases). We use the limited memory feature of the code for the opposite reason it was originally intended: We store twice as many Hessian updates as there are columns in the matrix before cyclically overwriting them, which gives the algorithm more time to achieve superlinear convergence in the final iterations. The cost of the linear algebra in the BFGS algorithm is dwarfed by the PDE solves required to compute G and ∇G, so there is no benefit to using fewer Hessian updates. On the other hand, using more than twice as many columns does not seem to improve convergence rates.”

pg 4, second-to-last paragraph, and it’s really just a blurb. So my understanding is that, in using L-BFGS, their goal is to store more history than a typical BFGS implementation would. The first paper says the following about their implementation:

“In our implementation, we wrote a C++ wrapper around J. Nocedal’s L-BFGS Fortran code released in 1990, but we turn off the limited memory aspect of the code since computing G takes more time than the linear algebra associated with updating the full Hessian matrix.”

pg 15, 4.2, first sentence.

Here’s a quick outline of the algorithm.

To evaluate F:

  • Evolve the initial condition according to the PDE
  • Use the solution to compute the functional

To evaluate ∇F:

  • Use that solution as the initial condition for the original PDE’s adjoint equation, and evolve it
  • Use this solution to compute the variational derivative

Solving the PDE is the expensive part, and starting over from the identity matrix visibly slows down convergence once I move away from “easy” solutions.

Since the minimization aspect is supposed to be black-boxed, an MWE could be as simple as retrieving the history/approximate Hessian built up while finding a root of a polynomial.

Thanks for the reply, and I appreciate any help that you can contribute :grin:

After looking at libLBFGS and LBFGS++, I can’t see an interface that provides access to the approximate Hessian, which is most likely why the authors just wrote a C++ wrapper around Nocedal’s Fortran code. I will try reaching out on the Optim.jl GitHub page to see if they have any recommendations, or whether this is too niche and something I should try to hack together on my own.

I’m all ears to suggestions though.

We provide a pure Julia LBFGS operator here: Tutorial · LinearOperators.jl. Let me know if you have any questions.

I’ve implemented L-BFGS as an update!-able linear operator here in case you’re interested: ProximalAlgorithms.jl/lbfgs.jl at master · JuliaFirstOrder/ProximalAlgorithms.jl · GitHub


I would just use the normal BFGS method if the linear algebra is not the bottleneck. Standard BFGS does keep the Hessian from previous steps, so this should be equivalent to what you are looking for. As far as I can tell, you cannot get better performance from a modified L-BFGS than from normal BFGS.
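If memory allows, I believe the BFGS constructor also takes an initial_invH keyword (check the docs for your Optim version), so you could seed it with the inverse Hessian saved from the previous continuation step. A rough sketch of the idea:

    # Sketch only: seed full BFGS with a previously saved inverse-Hessian
    # approximation via the initial_invH keyword (a function of x returning H0⁻¹).
    # H_prev here is a placeholder for whatever you stored from the last solve.
    using Optim, LinearAlgebra

    H_prev = Matrix{Float64}(I, 2, 2)            # placeholder initial inverse Hessian
    method = BFGS(initial_invH = x -> H_prev)

    rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
    res = optimize(rosenbrock, [0.0, 0.0], method)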

@dpo

From what I can tell, it looks like your LBFGS forward operator only supports initializing the approximate Hessian to a scalar multiple of the identity matrix. Is that right?

@lostella

This definitely would allow me to retain the approximate Hessian between bifurcations, but it looks like it isn’t possible to adjust the line search algorithm directly.

@fastwave

So, the “trick” the authors of the paper I’m using are relying on is actually to increase the history, keeping track of more updates than there are columns in the matrix in a normal implementation of BFGS. At least that’s what I’ve been able to work out. I’m not sure how this helps, but those guys are rock stars in the applied math water waves community, so I believe them when they say it has a significant impact.

I reached out on Gitter and @antoine-levitt responded with a potential solution that I will hopefully have time to test today. Here it is for anyone else interested:

optimize(d::D, initial_x::Tx, method::M,
                  options::Options = Options(;default_options(method)...),
                  state = initial_state(method, options, d, initial_x))

This should allow me to build a matrix to precondition future optimization calls by having access to the history of dg and dx.
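To make that concrete for anyone who finds this later, here’s a minimal sketch on a toy problem (the state’s field names, e.g. the dx/dg history, are Optim.jl internals and may differ between versions):

    # Sketch of the idea on a toy problem: construct the LBFGS state explicitly,
    # run optimize with it, and keep a reference so the accumulated dx/dg history
    # can seed a preconditioner for the next continuation step.
    using Optim

    f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
    function g!(G, x)
        G[1] = -2.0 * (1.0 - x[1]) - 400.0 * x[1] * (x[2] - x[1]^2)
        G[2] = 200.0 * (x[2] - x[1]^2)
    end

    x0     = [0.0, 0.0]
    method = LBFGS(m = 2 * length(x0))                 # 2x the number of columns, as in the paper
    opts   = Optim.Options(g_tol = 1e-10)
    d      = OnceDifferentiable(f, g!, x0)

    state = Optim.initial_state(method, opts, d, x0)   # normally built internally by optimize
    res   = Optim.optimize(d, x0, method, opts, state)

    # After convergence the state still holds the curvature history; the field
    # names are internals, so inspect them with fieldnames(typeof(state)).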

Currently yes; it’s the typical initialization. However, it would be easy to let the user provide an arbitrary initial approximation.

I don’t recommend using full BFGS unless your problem is very small, as that will require you to store an explicit dense matrix.


Well, I think that by using LBFGS the way I am, I’m storing up to 4x what BFGS would regularly store, since I keep twice as much history as there are columns :sweat_smile:

I think I’m going to go with the Optim.jl approach, but I need to figure out how to use the dx and dg history to build a preconditioner. I was going to try implementing something similar to the routine they have in their twoloop!.
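For reference, this is the standard two-loop recursion (Nocedal & Wright, Algorithm 7.4) I’d adapt, where S and Y stand in for the stored dx and dg histories; applying it to a gradient gives the action of the approximate inverse Hessian:

    # Standard L-BFGS two-loop recursion (Nocedal & Wright, Algorithm 7.4).
    # S[i] = x_{i+1} - x_i and Y[i] = ∇f_{i+1} - ∇f_i, oldest first; returns H⁻¹ * q.
    using LinearAlgebra

    function two_loop(q::AbstractVector, S::Vector{<:AbstractVector},
                      Y::Vector{<:AbstractVector})
        m = length(S)
        ρ = [1 / dot(Y[i], S[i]) for i in 1:m]
        α = zeros(m)
        r = copy(q)
        for i in m:-1:1                              # first loop: newest to oldest
            α[i] = ρ[i] * dot(S[i], r)
            r  .-= α[i] .* Y[i]
        end
        r .*= dot(S[m], Y[m]) / dot(Y[m], Y[m])      # usual H₀ = γI scaling
        for i in 1:m                                 # second loop: oldest to newest
            β  = ρ[i] * dot(Y[i], r)
            r .+= (α[i] - β) .* S[i]
        end
        return r
    end

Wrapping that in an object that supports whatever operations Optim expects from P is the part I still need to sort out.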

I am not sure I get their point about not using “conventional” methods. I did not even find the word preconditioner in the two papers referenced above :sweat_smile: In any case, if you want to try an alternative method, the bifurcation package PseudoArcLengthContinuation.jl should provide the means to replicate (some of) these results.


I’m not sure what you mean by “conventional”, but I think you’re asking how/why they wrote a wrapper around L-BFGS? I think it’s because of the exact problem I’m having: it’s not typical for users to care about the approximate Hessian once L-BFGS converges. The vortex sheet paper is the first time they mention using different parameters for m, and if it truly makes a difference of a factor of 10–20 in the number of iterations required to converge, that’s what’s motivating me to try to implement this on my end.

Yeah, they don’t refer to it as a preconditioner. I was using the verbiage of the Optim.jl package, and unless I’m mistaken, the optional keyword P for LBFGS is what you use to initialize the approximate Hessian. Please correct me if I’m wrong, because that would change my approach here.

YES, I’m very interested (and excited) about the PseudoArcLengthContinuation.jl package! Haha, but I haven’t had time to learn how to adapt my problem to its interface between classes and research. It’s definitely on my list of things to try out. Only one more month before graduation and my time will be greatly freed up to focus on research and trying to find a job amid this craziness.


Why don’t they try to find the periodic orbit either by

  • Fourier collocation in time
  • finite differences in time
  • shooting methods

??

This is very briefly mentioned in the second paper you reference.

Instead, they opt for extremizing a functional. That’s the point I am not getting.

With my package, I observe (as many have before) that you need a good preconditioner if you don’t use shooting methods. For shooting methods applied to dissipative systems, a preconditioner is not needed as much.


Ah, well, to be quite honest, I’m not sure either :sweat_smile:

I had a quick chat with my advisor to see if, after talking it over together, we could give you a better answer; here’s what my RA had to say:

  1. Finite differences aren’t accurate enough.
  2. Fourier collocation works well for traveling wave solutions but it’s not clear to me how to generalize that to nontrivial time periodic solutions.
  3. You could do shooting methods for sure. But they are hard to use especially if you are looking for a special thing.

Ambrose and Wilkening used GNU MPFR to get up to 26 places of precision for some of their routines. So far we’ve done fine with double precision, but we’re interested in eventually using DoubleFloats.jl.

So, neither of us has a good answer for you, unfortunately. I’m sure that if you e-mailed Jon Wilkening at Berkeley he could summarize why. My RA and I have held off on contacting him so far; we wanted to make as much progress on the material as we could before conferring with him about specifics. I have everything but the continuation down at this point, so I hope to be in touch with him soon.

If you have the time stepper and spectral collocation in space as required for the second reference, you have what is required to apply (multiple) shooting. You can also use (nonlinear) deflation to find non-trivial solutions.

I downloaded the “Deflation techniques for finding distinct solutions of nonlinear partial differential equations” paper that you reference in the PseudoArcLengthContinuation.jl deflation tutorial, and I’m also looking at the Brusselator and Complex Ginzburg-Landau pages as well. Hopefully I can dive in next month and start trying out some of these methods. Do you have any literature or textbook recommendations?

Thanks again for sharing some knowledge :grin:

The papers by Farrell on deflated continuation are very good. I have not pushed my deflated continuation code yet, although it is sitting on my computer; it should be done this month.

I will also release a new version of the package soon with automatic branch switching, improved periodic orbit computations, and a new interface for specifying problems. Hopefully this lands right around the time you start looking at this.

As for general continuation methods, it is a huge field:

  • Govaerts, Numerical Methods for Bifurcations of Dynamical Equilibria.
  • Mei, Numerical Bifurcation Analysis for Reaction-Diffusion Equations.
  • Kuznetsov, Elements of Applied Bifurcation Theory.
  • Waugh, Illingworth, and Juniper, “Matrix-Free Continuation of Limit Cycles for Bifurcation Analysis of Large Thermoacoustic Systems.”
  • Dijkstra et al., “Numerical Bifurcation Methods and Their Application to Fluid Dynamics.”
  • Keller, “The Bordering Algorithm and Path Following Near Singular Points of Higher Nullity.”