I have a (somewhat expensive to calculate) loss function f(x) that I want to minimise, and for which I can compute exact gradients using Zygote.jl.
So far I have been using the LBFGS implementation in NLopt.jl to do that, but there appear to be some problems with the loss function that cause NLopt to abort the optimisation in some cases and return the code :FORCED_STOP (this happens in roughly one third of cases).
Because NLopt.jl calls into C code internally, it is very hard to debug what exactly is going wrong, so I wanted to switch to the pure-Julia LBFGS implementation in Optim.jl.
However, I found that this was about three times slower than the NLopt implementation:
using NLopt

# nparams is the number of parameters, x0 the initial parameter vector
opt = NLopt.Opt(:LD_LBFGS, nparams)
# loss_fun.for_nlopt(x, grad) writes the gradient at x into grad and returns f(x)
# (a rough sketch of this wrapper is shown after the output below)
opt.min_objective = loss_fun.for_nlopt
opt.xtol_rel = 1e-8
opt.ftol_rel = 1e-8
opt.maxeval = 5000
@time ret_nlopt = NLopt.optimize(opt, x0)
@show opt.numevals
@show ret_nlopt[1]
@show ret_nlopt[3]
#= output
1.534183 seconds (2.96 M allocations: 975.899 MiB, 4.01% gc time)
opt.numevals = 493
ret_nlopt[1] = -16.68266898954586
ret_nlopt[3] = :FTOL_REACHED
=#
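For context, loss_fun.for_nlopt is just a thin wrapper around the Zygote gradient. A rough sketch of what it does, using a hypothetical toy loss f in place of my real, much more expensive one:

using Zygote

# toy stand-in for my actual loss function
f(x) = sum(abs2, x .- 1)

# NLopt-style objective: fill grad in place (when requested) and return the loss value
function nlopt_objective(x, grad)
    val, back = Zygote.pullback(f, x)
    if length(grad) > 0
        grad .= back(one(val))[1]
    end
    return val
end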
The corresponding run in Optim.jl looks like this:
using Optim, LineSearches

options = Optim.Options(x_reltol=1e-8, f_reltol=1e-8, iterations=5000)
method = Optim.LBFGS(linesearch=LineSearches.BackTracking())
# loss_fun.g!(grad, x) writes the gradient at x into grad
# loss_fun.fg!(grad, x) writes the gradient at x into grad and returns f(x)
# (rough sketches of these wrappers are shown after the output below)
od = Optim.OnceDifferentiable(loss_fun, loss_fun.g!, loss_fun.fg!, x0)
@time ret_optim = Optim.optimize(od, x0, method, options)
@show ret_optim
#= output
3.373001 seconds (7.98 M allocations: 2.412 GiB, 3.59% gc time)
ret_optim = * Status: success
* Candidate solution
Final objective value: -1.677771e+01
* Found with
Algorithm: L-BFGS
* Convergence measures
|x - x'| = 3.12e-04 ≰ 0.0e+00
|x - x'|/|x'| = 7.54e-05 ≰ 1.0e-08
|f(x) - f(x')| = 1.55e-07 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 9.21e-09 ≤ 1.0e-08
|g(x)| = 3.24e-04 ≰ 1.0e-08
* Work counters
Seconds run: 3 (vs limit Inf)
Iterations: 967
f(x) calls: 998
∇f(x) calls: 968
=#
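For completeness, loss_fun.g! and loss_fun.fg! are analogous thin wrappers around the Zygote gradient; roughly, with the same toy f as above:

# in-place gradient, matching Optim's g!(grad, x) convention
function g!(grad, x)
    grad .= Zygote.gradient(f, x)[1]
    return grad
end

# loss value plus in-place gradient in a single backward pass, matching fg!(grad, x)
function fg!(grad, x)
    val, back = Zygote.pullback(f, x)
    grad .= back(one(val))[1]
    return val
end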
Is there a way to use exactly the same settings for LBFGS in Optim.jl as in NLopt?
I have already looked online for a solution, but none of the tips I found helped.