Making LBFGS in `Optim.jl` as fast as in `NLopt.jl`

I have a (somewhat expensive to calculate) loss function f(x) for which I can compute exact gradients using Zygote.jl and wish to minimise it.

So far I have been using the LBFGS implementation in NLopt.jl to do that, but there appear to be some problems with the loss functions causing NLopt to abort optimisation in some cases and return the return code :FORCED_STOP (this happens in roughly one third of cases). Because NLopt.jl calls some C code inside, it is very hard to debug to what exactly is going wrong, so I wanted to switch to a pure julia implementation of the LBFGS method in Optim.jl. However, I found that this was about 3 times slower than the NLopt implementation:

opt = NLopt.Opt(:LD_LBFGS, nparams)
# loss_fun.for_nlopt(x, grad) writes the gradient at x to grad and returns f(x)
objective = loss_fun.for_nlopt 
opt.min_objective = objective
opt.xtol_rel = 1e-8
opt.ftol_rel = 1e-8
opt.maxeval = 5000

@time ret_nlopt = NLopt.optimize(opt, x0)
@show opt.numevals
@show ret_nlopt[1];
@show ret_nlopt[3]

#= output
  1.534183 seconds (2.96 M allocations: 975.899 MiB, 4.01% gc time)
opt.numevals = 493
ret_nlopt[1] = -16.68266898954586
ret_nlopt[3] = :FTOL_REACHED

whereas in Optim.jl:

options = Optim.Options(x_reltol=1e-8, f_reltol=1e-8, iterations=5000)
method = Optim.LBFGS(linesearch=LineSearches.BackTracking())
# loss_fun.g!(grad, x) writes the gradient at x into grad
# loss_fun.fg!(grad, x) writes the gradient at x into grad and returns f(x)
once_diff = Optim.OnceDifferentiable(loss_fun, loss_fun.g!, loss_fun.fg!, x0)

@time ret_optim = Optim.optimize(od, x0, method, options)
@show ret_optim

#= output
  3.373001 seconds (7.98 M allocations: 2.412 GiB, 3.59% gc time)
ret_optim =  * Status: success

 * Candidate solution
    Final objective value:     -1.677771e+01

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 3.12e-04 ≰ 0.0e+00
    |x - x'|/|x'|          = 7.54e-05 ≰ 1.0e-08
    |f(x) - f(x')|         = 1.55e-07 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 9.21e-09 ≤ 1.0e-08
    |g(x)|                 = 3.24e-04 ≰ 1.0e-08

 * Work counters
    Seconds run:   3  (vs limit Inf)
    Iterations:    967
    f(x) calls:    998
    ∇f(x) calls:   968

Is there a way to use exactly the same settings as in NLopt for Optim.jl for LBFGS?

I have already looked online for a solution and found

but none of the tips mentioned there helped.


By curiosity, have you also compared with the L-BFGS implemented in GitHub - JuliaSmoothOptimizers/JSOSolvers.jl ?

1 Like

I just tried that, but couldn’t figure out how to pass my loss function to the lbfgs() function. Is there some easy function that allows one to create a <: AbstractNLPModel from a function and its gradient?

1 Like

There is an example here: How to create a model from the function and its derivatives


Okay, got JSOSolvers.jl also working, but that appears to be considerably worse:

nlp = ManualNLPModels.NLPModel(x0, loss_fun, grad=loss_fun.g!)

@time ret_jso = JSOSolvers.lbfgs(nlp, x=x0, rtol=1e-8, max_eval=5000, max_time=60.)
@show ret_jso.iter
@show ret_jso.objective
#= output
 16.703706 seconds (40.87 M allocations: 12.349 GiB, 3.71% gc time)
ret_jso.iter = 4699
ret_jso.objective = -16.91551990240162

Note that here I am definitely paying a 50% extra runtime because I am not passing loss_fun.fg! in which allows to compute the function value in one forward and backward pass. But not enough to explain a whole factor of 10 difference.

Looks like an interesting problem. If you can share more it would help us a lot to figure out why it is so much slower.
Also, I think I can make a quick change to have fg!.

Unfortunately it is a quite big code base, but later today I can try extract a MWE from it that has the same behaviour.

1 Like

It looks like it made 4699 iterations, while Optim variant 967 iterations that probably explains it.
In JSOSolvers.jl, the memory parameter is mem and equals 5 by default. No idea, what is its value for the others.

Okay, I created an MWE for the problem. It is the simulation of a variational quantum circuit and then minimisation of an expectation value:

using Yao
using Optim 
using NLopt
using ChainRulesCore
using Zygote
using LineSearches
using JSOSolvers
using ManualNLPModels

# Define a whole lot of code to set up the loss functions

struct LossFunction{D,
        RT<:AbstractRegister} <: Function

@inline function _call(f::LossFunction, x)
    dispatch!(f.circuit, x)
    fill!(f.register.state, zero(eltype(f.register.state)))
    f.register.state[1] = one(eltype(f.register.state))
    apply!(f.register, f.circuit)
    return real(expect(f.hamiltonian, f.register))

(f::LossFunction)(x) = _call(f, x)

function ChainRulesCore.rrule(f::LossFunction, x)
    dispatch!(f.circuit, x)
    fill!(f.register.state, zero(eltype(f.register.state)))
    f.register.state[1] = one(eltype(f.register.state))
    y = real(expect(f.hamiltonian, f.register => f.circuit))

    function lossfunction_pullback(Δy)
        stateδ, paramδ = expect'(f.hamiltonian, f.register => f.circuit)
        return (NoTangent(), Δy .* paramδ)

    return y, lossfunction_pullback

# to be consistent with the `expect'(ham, register => circuit)` notation 
# of Yao.jl
Base.adjoint(f::LossFunction) = x -> gradient(f, x)


A version of the cost function that is suitable for `NLopt.jl` and writes the 
gradient ∇f(x) to the `grad` argument. Note that you need `size(x) == size(grad)`.
function _nlopt_costfunction(f::LossFunction, x::AbstractVector, grad::AbstractVector)
    y, y_pullback = pullback(f, x)
    grad .= only(y_pullback([one(y)]))
    return y

nlopt_costfunction(f::LossFunction, x, grad) = _nlopt_costfunction(f, x, grad)

_getproperty(f::LossFunction, ::Val{s}) where {s} = getfield(f, s)
_getproperty(f::LossFunction, ::Val{:for_nlopt}) = (x, grad) -> nlopt_costfunction(f, x, grad)
_getproperty(f::LossFunction, ::Val{:wavefunction}) = x -> wavefunction(f, x)
_getproperty(f::LossFunction, ::Val{:g!}) = (grad, x) -> (grad .= f'(x)[1])
_getproperty(f::LossFunction, ::Val{:fg!}) = (grad, x) -> nlopt_costfunction(f, x, grad)
Base.getproperty(f::LossFunction, s::Symbol) = _getproperty(f, Val{s}())
function Base.propertynames(::FT, private::Bool = false) where {FT<:LossFunction}
    return (fieldnames(FT)..., :for_nlopt, :fg!, :g!)

# now actually create the loss functions
nq = 10 
p = 5

ham = EasyBuild.heisenberg(nq)
circuit = EasyBuild.variational_circuit(nq, p)
loss_fun = LossFunction(circuit, ham, zero_state(nq))
nparams = nparameters(circuit)
x0 = rand(nparams);

# set up the NLopt problem
opt = NLopt.Opt(:LD_LBFGS, nparams)
objective = loss_fun.for_nlopt
opt.min_objective = objective
opt.xtol_rel = 1e-8
opt.ftol_rel = 1e-8
opt.maxeval = 5000

# run the NLopt problem
@time ret_nlopt = NLopt.optimize(opt, x0)
@show opt.numevals
@show ret_nlopt[1]
@show ret_nlopt[3]

# set up the Optim problem
options = Optim.Options(x_reltol=1e-8, f_reltol=1e-8, iterations=5000)
method = Optim.LBFGS(linesearch=LineSearches.BackTracking())
od = Optim.OnceDifferentiable(loss_fun, loss_fun.g!, loss_fun.fg!, x0)

# run the Optim problem
@time ret_optim = Optim.optimize(od, x0, method, options)
@show ret_optim

# set up the JSO problem
nlp = ManualNLPModels.NLPModel(x0, cost_fun, grad=cost_fun.g!);

# and run the JSO problem
@time ret_jso = JSOSolvers.lbfgs(nlp, x=x0, rtol=1e-8, max_eval=5000, max_time=60.)
@show ret_jso.iter
@show ret_jso.objective

The exact times it takes to run both optimisations at the end depends on x0, but as a rule of thumb
Optim.jl takes around three times as long as NLopt.jl and JSOSolvers.jl takes even longer.

We just released ManualNLPModels 0.1.3 with objgrad support. Should improve a little bit, but I think tweaking on the LBFSG parameters might help as well (Solvers · JSOSolvers.jl).

I noticed that the other solvers terminated with f stalling and we don’t check that, so it might be necessary to add a callback if you really want that stopping condition.
The final gradient is around 1e-4, so rtol and atol can also be changed.

After the MWE I can leave more comments

1 Like

Thanks for the MWE. Since you use random initial guess, it might be useful to set a seed

using Random

I noticed that the solutions obtained are returning quite different norms of the gradient.

Besides, trying to guess a mem parameter comparable to the others got:

@time ret_jso = JSOSolvers.lbfgs(nlp, mem = 7, x=x0, atol = norm(grad(nlp, ret_nlopt[2])), rtol=eps(), max_eval=5000, max_time=60.)

so that the number of iterations is similar to the others, and checking the norm of gradients:

using LinearAlgebra, NLPModels
norm(grad(nlp, ret_nlopt[2])), norm(grad(nlp, ret_optim.minimizer)), norm(grad(nlp,ret_jso.solution))


(0.0002431544585082576, 0.0011227761758756743, 0.00022974580470524832)

Edit: commentating on @abelsiqueira update

# loss_fun.fg2! should return (fx, gx)
nlp = ManualNLPModels.NLPModel(x0, loss_fun, grad=loss_fun.g!, objgrad=loss_fun.fg2!)

seems to help as well.

Here is a version using callbacks. The times varies a lot depending on the starting points. Sometimes we are faster, sometimes NLOpt is faster.
Let me know if something similar works better in your original problem.

It might also be good for comparison to run the solver once before @time to precompile, using smaller limits.
In that case, add a reset!(nlp) before the second call.

Current changes:

# set up the JSO problem
objgrad(gx, x) = begin
  f = loss_fun.fg!(gx, x)
  return f, gx
nlp = ManualNLPModels.NLPModel(x0, loss_fun, grad=loss_fun.g!, objgrad=objgrad);

# and run the JSO problem
global f_old = 100.0
cb(nlp, solver, stats) = begin
  f = stats.objective
  if abs(f_old - f) < 1e-8 * abs(f)
    stats.status = :user
  global f_old = f
@time ret_jso = JSOSolvers.lbfgs(nlp, x=x0, mem=7, rtol=1e-8, max_eval=5000, max_time=60., callback=cb)
1 Like

Thanks for adding the objgrad option! However I am still failing to get JSOSolvers to be as fast as Optim.jl or NLopt.jl.

After some more testing it seems the fastest option is to actually use the BFGS solver from Optim.jl because my real problem has at most 100 variables, but takes a couple seconds to compute. So the dense matrix inversion in BFGS doesn’t contribute much to the overall cost.

Of course, that still doesn’t answer my initial question. If anyone more well-versed with Optim.jl knows how to make it behave exactly the same as NLopt.jl that would still be appreciated.

It may very well be that the different versions are using a different memory setting for LBFGS. That can have an effect on how many iterations are needed. I suggest running the example fixing the memory parameter to be the same for the different solvers.