`Optimization.LBFGS` fails to converge while `Optim.NelderMead()` works

Reproducer: MWE.converge.jl · GitHub


interestingly, this is reporting success while clearly failing:

optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :bfgs) = retcode: Success
u: [1.0e-5, 0.001, 0.001, 0.001, 0.001, 0.001]
Final objective value:     7142.997313963175

Using Optim.jl directly works though:

using Optimization, OptimizationOptimJL
using Optim
using Zygote
using ForwardDiff

@. ATLAS_f_6p(x, p) = p[1]*(1-x)^p[2] * x^(p[3] + p[4]*log(x) + p[5]*log(x)^2 + p[6]*log(x)^3)

function chi2(os, cs, σs)
    sum(@. (os - cs)^2/σs^2)
end

function optim_fit(func::T, xs, ys, σs, p0; algo) where T
    F = function (p, _)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end
    optf = OptimizationFunction(F, AutoForwardDiff())
    prob = OptimizationProblem(optf, p0; lb = zeros(length(p0)) .- 1e5, ub = zeros(length(p0)) .+ 1e5)
    if algo == :nm
        solve(prob, NelderMead(); x_tol=1e-7, g_tol=1e-7, maxiters=10^5)
    elseif algo == :lbfgs
        solve(prob, Optimization.LBFGS())
    end
end


function optim_fit2(func::T, xs, ys, σs, p0) where T
    F = function (p)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end

    G!(g, x) = !isnothing(g) ? g .= ForwardDiff.gradient(F, x) : x
    return @time Optim.optimize(F, G!, p0, LBFGS())
end
################ example data

function main()
    xs = collect(121:2:329) ./ 13600
    ys = [202.0, 198.0, 221.0, 211.0, 195.0, 169.0, 147.0, 168.0, 170.0, 169.0, 156.0, 164.0, 147.0, 135.0, 143.0, 136.0, 121.0, 137.0, 128.0, 129.0, 124.0, 109.0, 100.0, 104.0, 104.0, 99.0, 95.0, 102.0, 90.0, 94.0, 89.0, 80.0, 87.0, 65.0, 98.0, 80.0, 79.0, 59.0, 63.0, 73.0, 73.0, 65.0, 61.0, 54.0, 53.0, 47.0, 53.0, 54.0, 59.0, 55.0, 46.0, 61.0, 41.0, 46.0, 40.0, 47.0, 47.0, 44.0, 32.0, 46.0, 44.0, 29.0, 34.0, 33.0, 36.0, 35.0, 37.0, 30.0, 45.0, 42.0, 23.0, 20.0, 27.0, 39.0, 32.0, 24.0, 18.0, 23.0, 31.0, 18.0, 15.0, 28.0, 14.0, 19.0, 26.0, 20.0, 28.0, 25.0, 22.0, 13.0, 14.0, 14.0, 21.0, 19.0, 18.0, 17.0, 17.0, 23.0, 17.0, 20.0, 17.0, 12.0, 12.0, 13.0, 15.0]
    sigmas = sqrt.(ys)
    
    p0_6para = [1e-5; 0.001*ones(5)]
    @show optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo=:nm)
    @show optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo=:lbfgs)
    r = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para)
    @show r
    @show r.minimizer
end

julia> main()
optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :nm) = retcode: Success
u: [0.03756371444660228, -7.340228730228301, 1.0019315916033942, 0.1351637703868156, -0.32867928276850933, -0.04894659548629979]
Final objective value:     98.5061556041032

optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :lbfgs) = retcode: Success
u: [1.0e-5, 0.001, 0.001, 0.001, 0.001, 0.001]
Final objective value:     7142.997313963177

  0.040708 seconds (93.77 k allocations: 40.363 MiB)
r =  * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Final objective value:     9.850700e+01

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 9.03e-03 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.51e-03 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.00e-05 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.02e-07 ≰ 0.0e+00
    |g(x)|                 = 1.15e+02 ≰ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    1000
    f(x) calls:    2743
    ∇f(x) calls:   2743

r.minimizer = [0.3162552206761995, -1.0013989067701674, 2.575248450283215, 0.6800532460951797, -0.23266037063777448, -0.04227604261832813]
6-element Vector{Float64}:
  0.3162552206761995
 -1.0013989067701674
  2.575248450283215
  0.6800532460951797
 -0.23266037063777448
 -0.04227604261832813


Well, “works”: it still hit 1000 iterations, but sure, the cost is close to what one would expect. LBFGS should finish (much) faster.

I looked a bit at the gradient, and while it is probably correct, it seems to be unstable.
For the following figure: the gradient can be used to provide a linear approximation (first-order Taylor) of the cost. The figure plots the approximation error and checks whether it has slope 2 (the dashed line is that check). This is the linear approximation at the start point p0.
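
A minimal sketch of such a first-order Taylor check, reusing chi2, ATLAS_f_6p, and the data/start point from the MWE above (assuming xs, ys, sigmas, p0_6para are pulled out into scope; the direction d and the step sizes ts are arbitrary choices, not the ones behind the original figure):

using ForwardDiff

F(p) = chi2(ys, ATLAS_f_6p(xs, p), sigmas)
g0 = ForwardDiff.gradient(F, p0_6para)
d  = randn(length(p0_6para)); d ./= sqrt(sum(abs2, d))

ts   = 10.0 .^ range(-8, 0; length = 17)
errs = [abs(F(p0_6para .+ t .* d) - F(p0_6para) - t * sum(g0 .* d)) for t in ts]
# On a log-log plot of errs vs. ts, the slope should be ≈ 2 wherever the
# first-order Taylor model is accurate.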

I also looked at the gradient norms during a (similar) LBFGS run (from Manopt.jl, for the simple reason that I could get the following printout easily):

# 119    f(x): 98.521056   |grad f(p)|:112.59798058575441
# 120    f(x): 98.521053   |grad f(p)|:0.11722103028124436
# 121    f(x): 98.521053   |grad f(p)|:0.10405872433131812
# 122    f(x): 98.521053   |grad f(p)|:0.36430340562416974
# 123    f(x): 98.521053   |grad f(p)|:0.04506270919617104
# 124    f(x): 98.521053   |grad f(p)|:0.05409064169436933
# 125    f(x): 98.521053   |grad f(p)|:0.14359850776461958
[...]
# 997    f(x): 98.509641   |grad f(p)|:71.62963904974352
# 998    f(x): 98.509630   |grad f(p)|:44.0010184317672
# 999    f(x): 98.509627   |grad f(p)|:147.03947404681605
# 1000   f(x): 98.509622   |grad f(p)|:92.35763900759153

So I would carefully conclude that the gradient computation is not that stable.

edit: with using Manopt, Manifolds this is what I added to the original snippet above:

    elseif algo == :lbfgsManopt
        M = Euclidean(length(p0))
        f = p -> F(p, 1)
        mn_f = (M, p) -> f(p)
        mn_g = (M, p) -> ForwardDiff.gradient(f, p)
        p3 = quasi_Newton(M, mn_f, mn_g, p0;
            debug = [:Iteration, " ", :Cost, "   ", :GradientNorm, 1, "\n", :Stop],
            # return_state = true,
        )
        return p3
    end

But careful: this prints one line per iteration for the 1000 iterations, and (uncommenting the return_state line) it eventually returns the state, where one can also see that it hits the maximum number of iterations.


I also checked whether the difference is due to ForwardDiff vs. Zygote, but both work.

Ideally, Optimization.jl should just wrap Optim.jl, but with default options Optimization.jl fails with both backends.
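
For concreteness, a minimal sketch of the wrapped call being discussed, reusing ATLAS_f_6p, chi2, and the data from the MWE above (unconstrained here for simplicity; assumes xs, ys, sigmas, and p0_6para are in scope, and the closure name obj is mine):

obj = (p, _) -> chi2(ys, ATLAS_f_6p(xs, p), sigmas)
for adtype in (AutoForwardDiff(), AutoZygote())
    optf = OptimizationFunction(obj, adtype)
    prob = OptimizationProblem(optf, p0_6para)
    # Optim's LBFGS used through the Optimization.jl interface (OptimizationOptimJL)
    sol = solve(prob, Optim.LBFGS(); maxiters = 10^5)
    @show adtype sol.objective
end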


Does that mean I should reformulate the problem somehow, to avoid the gradient being unstable?

This is a pretty standard chi2 fit, so I don’t know what I’m doing wrong.

Chi2 has a nice gradient part in the chain rule (with respect to cs); your ATLAS_f_6p function is not really nice and is probably the culprit. But for me it is too late today to try to give you the analytical gradient of that.
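
That “nice gradient part” with respect to cs is just the outer chain-rule factor; a one-line sketch (the helper name dchi2_dcs is mine):

# derivative of sum((os - cs)^2 / σs^2) with respect to each cs
dchi2_dcs(os, cs, σs) = @. -2 * (os - cs) / σs^2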


I know of cases where SciML secretly switches the AD backend out from under you. Could this be one of those @Vaibhavdixit02?

So, here is an approach to the gradient. Since you called it ATLAS_f_6p, we can even do the general case.

We consider, for given data x, y, and \sigma,

f_n(p) = p_1(1-x)^{p_2}x^{g_n(p)} with g_n(p) = p_3 + p_4\log x + \ldots + p_n(\log x)^{n-3}

as well as X_2(x) = \frac{(y-x)^2}{\sigma^2}, then X_2'(x) = -\frac{2(y-x)}{\sigma^2}.

Then the gradient of g_n is easy, since all variables appear linear, we have

\nabla g_n(p) = \begin{pmatrix} 0\\0\\1\\\log x\\\vdots\\(\log x)^{n-3}\end{pmatrix}

and hence \nabla f_n(p) has the components (\nabla f_n(p))_i with (in vector form maybe a bit spacey):

  • (1-x)^{p_2}x^{g_n(p)} for i=1
  • p_1\log(1-x)(1-x)^{p_2}x^{g_n(p)} for i=2
  • p_1(1-x)^{p_2}\log(x)x^{g_n(p)}(\nabla g_n(p))_i = p_1(1-x)^{p_2}\log(x)x^{g_n(p)}(\log x)^{i-3} for i=3,\ldots,n

The final gradient of F_n(p) = X_2(f_n(p)) is now another chain rule. Each entry reads as
(\nabla F_n(p))_i = -\frac{2(y-f_n(p))}{\sigma^2}(\nabla f_n(p))_i

If one implements these step by step (gn, grad_gn, fn, grad_fn), the gradient is given in closed form (see the sketch after the edits below).
The only warning on this is that the computation was done after sundown (and not yet in December, when that is 2pm here in Trondheim, but more like 9pm).

edit: If one has multiple data points x_j, y_j, \sigma_j and applies this X_2 element-wise and sums, the overall gradient is the sum of the single ones, \nabla F_{n,x_j,y_j,\sigma_j}(p).

edit2: The first of a few things I missed is that when 0&lt;x&lt;1 and numerically p_2&lt;0, this all does not work.
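
A minimal Julia sketch of that closed-form gradient for the six-parameter case, following the derivation and the two edits above (the helper names fn, grad_fn, and grad_F are mine; it assumes 0 < x < 1 so that log(1 - x) is defined):

function fn(x, p)
    lx = log(x)
    g  = p[3] + p[4]*lx + p[5]*lx^2 + p[6]*lx^3
    return p[1] * (1 - x)^p[2] * x^g
end

function grad_fn(x, p)
    lx = log(x)
    g  = p[3] + p[4]*lx + p[5]*lx^2 + p[6]*lx^3
    f  = p[1] * (1 - x)^p[2] * x^g
    return [
        (1 - x)^p[2] * x^g,   # ∂f/∂p₁
        log(1 - x) * f,       # ∂f/∂p₂
        f * lx,               # ∂f/∂p₃ = f · log(x) · (log x)⁰
        f * lx^2,             # ∂f/∂p₄
        f * lx^3,             # ∂f/∂p₅
        f * lx^4,             # ∂f/∂p₆
    ]
end

# Sum the per-data-point chain-rule contributions, as in the first edit above.
function grad_F(xs, ys, σs, p)
    g = zeros(length(p))
    for (x, y, σ) in zip(xs, ys, σs)
        g .+= (-2 * (y - fn(x, p)) / σ^2) .* grad_fn(x, p)
    end
    return g
end

Comparing grad_F against ForwardDiff.gradient on the same points is a quick sanity check.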


Thanks for the insight. So if I update your code to increase iterations to 10^5:

    return Optim.optimize(F, p0, LBFGS(), Optim.Options(;iterations=10^5); autodiff=:forward)

I get

Status: failure (line search failed)

 * Candidate solution
    Final objective value:     9.802206e+01

 * Found with
    Algorithm:     L-BFGS

Btw, even with more iterations, the results from NelderMead and LBFGS don’t converge to the same minimizer:

function optim_fit2(func::T, xs, ys, σs, p0; algo = LBFGS()) where T
    F = function (p)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end

    return Optim.optimize(F, p0, algo, Optim.Options(;iterations=5000); autodiff=:forward)
end

sol_nm = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = NelderMead())
sol_lbfgs = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = LBFGS())


julia> sol_nm.minimum, sol_lbfgs.minimum
(98.51662613370482, 98.0220580738921)

julia> sol_nm.minimizer, sol_lbfgs.minimizer
([0.014343016762772673, 0.36918264249014077, 0.08258611548972156, -0.019076422764126248, -0.31125906069802134, -0.044992442492889535], [11207.052120386572, -913.9928401545511, 165.74801872441768, 91.61206638696663, 17.418590874571905, 1.1299718078901382])

If we use BFGS + the stopping criteria from Connection between BFGS and ROOT.Minuit. Stopping criteria - #4 by Yuan-Ru-Lin, we get yet another minimizer:

julia> sol_lbfgs.minimizer |> print
[3.724592263478385, 0.7652684260382869, 4.556910181502994, 1.2983875079831064, -0.1450263191617602, -0.037556051055012134]
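
For reference, a hedged sketch of what that BFGS run could look like as the optimize call inside optim_fit2 above (the g_tol value is an assumption, not the setting from the linked thread):

# BFGS with an explicit gradient-norm stopping tolerance
return Optim.optimize(F, p0, BFGS(),
                      Optim.Options(; iterations = 10^5, g_tol = 1e-5);
                      autodiff = :forward)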

I think this is consistent with what @ChrisRackauckas said about the line search in Optim being “bad” (not sure what that means exactly).

Yeah, I’m okay with Optimization.LBFGS() failing, I guess, but right now Optim.LBFGS() also fails when used from Optimization.jl, which is not ideal.