`Optimization.LBFGS` fails to converge while `Optim.NelderMead()` works

Reproducer: MWE.converge.jl · GitHub


interestingly, this is reporting success while clearly failing:

optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :bfgs) = retcode: Success
u: [1.0e-5, 0.001, 0.001, 0.001, 0.001, 0.001]
Final objective value:     7142.997313963175

Using Optim.jl directly works though:

using Optimization, OptimizationOptimJL
using Optim
using Zygote
using ForwardDiff

@. ATLAS_f_6p(x, p) = p[1]*(1-x)^p[2] * x^(p[3] + p[4]*log(x) + p[5]*log(x)^2 + p[6]*log(x)^3)

function chi2(os, cs, σs)
    sum(@. (os - cs)^2/σs^2)
end

function optim_fit(func::T, xs, ys, σs, p0; algo) where T
    F = function (p, _)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end
    optf = OptimizationFunction(F, AutoForwardDiff())
    prob = OptimizationProblem(optf, p0; lb = zeros(length(p0)) .- 1e5, ub = zeros(length(p0)) .+ 1e5)
    if algo == :nm
        solve(prob, NelderMead(); x_tol=1e-7, g_tol=1e-7, maxiters=10^5)
    elseif algo == :lbfgs
        solve(prob, Optimization.LBFGS())
    end
end


function optim_fit2(func::T, xs, ys, σs, p0) where T
    F = function (p)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end

    G!(g, x) = !isnothing(g) ? g .= ForwardDiff.gradient(F, x) : x
    return @time Optim.optimize(F, G!, p0, LBFGS())
end
################ example data

function main()
    xs = collect(121:2:329) ./ 13600
    ys = [202.0, 198.0, 221.0, 211.0, 195.0, 169.0, 147.0, 168.0, 170.0, 169.0, 156.0, 164.0, 147.0, 135.0, 143.0, 136.0, 121.0, 137.0, 128.0, 129.0, 124.0, 109.0, 100.0, 104.0, 104.0, 99.0, 95.0, 102.0, 90.0, 94.0, 89.0, 80.0, 87.0, 65.0, 98.0, 80.0, 79.0, 59.0, 63.0, 73.0, 73.0, 65.0, 61.0, 54.0, 53.0, 47.0, 53.0, 54.0, 59.0, 55.0, 46.0, 61.0, 41.0, 46.0, 40.0, 47.0, 47.0, 44.0, 32.0, 46.0, 44.0, 29.0, 34.0, 33.0, 36.0, 35.0, 37.0, 30.0, 45.0, 42.0, 23.0, 20.0, 27.0, 39.0, 32.0, 24.0, 18.0, 23.0, 31.0, 18.0, 15.0, 28.0, 14.0, 19.0, 26.0, 20.0, 28.0, 25.0, 22.0, 13.0, 14.0, 14.0, 21.0, 19.0, 18.0, 17.0, 17.0, 23.0, 17.0, 20.0, 17.0, 12.0, 12.0, 13.0, 15.0]
    sigmas = sqrt.(ys)
    
    p0_6para = [1e-5; 0.001*ones(5)]
    @show optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo=:nm)
    @show optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo=:lbfgs)
    r = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para)
    @show r
    @show r.minimizer
end

julia> main()
optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :nm) = retcode: Success
u: [0.03756371444660228, -7.340228730228301, 1.0019315916033942, 0.1351637703868156, -0.32867928276850933, -0.04894659548629979]
Final objective value:     98.5061556041032

optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :lbfgs) = retcode: Success
u: [1.0e-5, 0.001, 0.001, 0.001, 0.001, 0.001]
Final objective value:     7142.997313963177

  0.040708 seconds (93.77 k allocations: 40.363 MiB)
r =  * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Final objective value:     9.850700e+01

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 9.03e-03 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.51e-03 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.00e-05 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.02e-07 ≰ 0.0e+00
    |g(x)|                 = 1.15e+02 ≰ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    1000
    f(x) calls:    2743
    ∇f(x) calls:   2743

r.minimizer = [0.3162552206761995, -1.0013989067701674, 2.575248450283215, 0.6800532460951797, -0.23266037063777448, -0.04227604261832813]
6-element Vector{Float64}:
  0.3162552206761995
 -1.0013989067701674
  2.575248450283215
  0.6800532460951797
 -0.23266037063777448
 -0.04227604261832813


Well, “works”: it still hit 1000 iterations, but sure, the cost is close to what one would expect. LBFGS should finish (much) faster.

I looked a bit at the gradient, and while it is probably correct, it seems to be unstable.
For the following figure: the gradient can be used to provide a linear approximation (first-order Taylor) of the cost. The figure plots the approximation error and checks whether it has slope 2 (the dashed line is that check). This is the linear approximation at the start point p0.
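
A minimal sketch of such a first-order Taylor check, reusing chi2, ATLAS_f_6p, and the data/start point from the MWE above (assuming xs, ys, sigmas, p0_6para are pulled out into scope; the direction d and the step sizes ts are arbitrary choices, not the ones behind the original figure):

using ForwardDiff

F(p) = chi2(ys, ATLAS_f_6p(xs, p), sigmas)
g0 = ForwardDiff.gradient(F, p0_6para)
d  = randn(length(p0_6para)); d ./= sqrt(sum(abs2, d))

ts   = 10.0 .^ range(-8, 0; length = 17)
errs = [abs(F(p0_6para .+ t .* d) - F(p0_6para) - t * sum(g0 .* d)) for t in ts]
# On a log-log plot of errs vs. ts, the slope should be ≈ 2 wherever the
# first-order Taylor model is accurate.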

I also looked at the gradient norms during a (similar) LBFGS run (from Manopt.jl, for the simple reason that I could get the following printout easily):

# 119    f(x): 98.521056   |grad f(p)|:112.59798058575441
# 120    f(x): 98.521053   |grad f(p)|:0.11722103028124436
# 121    f(x): 98.521053   |grad f(p)|:0.10405872433131812
# 122    f(x): 98.521053   |grad f(p)|:0.36430340562416974
# 123    f(x): 98.521053   |grad f(p)|:0.04506270919617104
# 124    f(x): 98.521053   |grad f(p)|:0.05409064169436933
# 125    f(x): 98.521053   |grad f(p)|:0.14359850776461958
[...]
# 997    f(x): 98.509641   |grad f(p)|:71.62963904974352
# 998    f(x): 98.509630   |grad f(p)|:44.0010184317672
# 999    f(x): 98.509627   |grad f(p)|:147.03947404681605
# 1000   f(x): 98.509622   |grad f(p)|:92.35763900759153

So I would carefully conclude that the gradient computation is not that stable.

edit: with using Manopt, Manifolds this is what I added to the original snippet above:

    elseif algo == :lbfgsManopt
        M = Euclidean(length(p0))
        f = p -> F(p, 1)
        mn_f = (M, p) -> f(p)
        mn_g = (M, p) -> ForwardDiff.gradient(f, p)
        p3 = quasi_Newton(M, mn_f, mn_g, p0;
            debug = [:Iteration, " ", :Cost, "   ", :GradientNorm, 1, "\n", :Stop],
            # return_state = true,
        )
        return p3
    end

But careful: this prints one line per iteration for the 1000 iterations, and (uncommenting the return_state line) it eventually returns the state, where one can also see that it hits the maximum number of iterations.


I also checked whether the difference is due to ForwardDiff vs. Zygote, but both work.

Ideally, Optimization.jl should just wrap Optim.jl, but with default options Optimization.jl fails with both backends.
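
For concreteness, a minimal sketch of the wrapped call being discussed, reusing ATLAS_f_6p, chi2, and the data from the MWE above (unconstrained here for simplicity; assumes xs, ys, sigmas, and p0_6para are in scope, and the closure name obj is mine):

obj = (p, _) -> chi2(ys, ATLAS_f_6p(xs, p), sigmas)
for adtype in (AutoForwardDiff(), AutoZygote())
    optf = OptimizationFunction(obj, adtype)
    prob = OptimizationProblem(optf, p0_6para)
    # Optim's LBFGS used through the Optimization.jl interface (OptimizationOptimJL)
    sol = solve(prob, Optim.LBFGS(); maxiters = 10^5)
    @show adtype sol.objective
end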


Does that mean I should reformulate the problem somehow, to avoid the gradient being unstable?

This is a pretty standard chi2 fit, so I don’t know what I’m doing wrong.

Chi2 has a nice gradient part in the chain rule (with respect to cs); your ATLAS_f_6p function is not really nice and is probably the culprit. But for me it is too late today to try to give you the analytical gradient of that.
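
That “nice gradient part” with respect to cs is just the outer chain-rule factor; a one-line sketch (the helper name dchi2_dcs is mine):

# derivative of sum((os - cs)^2 / σs^2) with respect to each cs
dchi2_dcs(os, cs, σs) = @. -2 * (os - cs) / σs^2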


I know of cases where SciML secretly switches the AD backend out from under you. Could this be one of those @Vaibhavdixit02?

So, here is an approach to the gradient. Since you called it ATLAS_f_6p, we can even do the general case.

We consider, for given data x, y, and \sigma,

f_n(p) = p_1(1-x)^{p_2}x^{g_n(p)} with g_n(p) = p_3 + p_4\log x + \ldots + p_n(\log x)^{n-3}

as well as X_2(x) = \frac{(y-x)^2}{\sigma^2}, then X_2'(x) = -\frac{2(y-x)}{\sigma^2}.

Then the gradient of g_n is easy, since all variables appear linear, we have

\nabla g_n(p) = \begin{pmatrix} 0\\0\\1\\\log x\\\vdots\\(\log x)^{n-3}\end{pmatrix}

and hence \nabla f_n(p) has the components (\nabla f_n(p))_i with (in vector form maybe a bit spacey):

  • (1-x)^{p_2}x^{g_n(p)} for i=1
  • p_1\log(1-x)(1-x)^{p_2}x^{g_n(p)} for i=2
  • p_1(1-x)^{p_2}\log(x)x^{g_n(p)}(\nabla g_n(p))_i = p_1(1-x)^{p_2}\log(x)x^{g_n(p)}(\log x)^{i-3} for i=3,\ldots,n

The final gradient of F_n(p) = X_2(f_n(p)) is now another chain rule. Each entry reads as
(\nabla F_n(p))_i = -\frac{2(y-f_n(p))}{\sigma^2}(\nabla f_n(p))_i

If one implements these step by step (gn, grad_gn, fn, grad_fn), the gradient is given in closed form (see the sketch after the edits below).
The only warning on this is that the computation was done after sundown (and not yet in December, when that is 2pm here in Trondheim, but more like 9pm).

edit: If one has multiple data points x_j, y_j, \sigma_j and applies this X_2 element-wise and sums, the overall gradient is the sum of the single ones, \nabla F_{n,x_j,y_j,\sigma_j}(p).

edit2: The first of a few things I missed is that when 0&lt;x&lt;1 and numerically p_2&lt;0, this all does not work.
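
A minimal Julia sketch of that closed-form gradient for the six-parameter case, following the derivation and the two edits above (the helper names fn, grad_fn, and grad_F are mine; it assumes 0 < x < 1 so that log(1 - x) is defined):

function fn(x, p)
    lx = log(x)
    g  = p[3] + p[4]*lx + p[5]*lx^2 + p[6]*lx^3
    return p[1] * (1 - x)^p[2] * x^g
end

function grad_fn(x, p)
    lx = log(x)
    g  = p[3] + p[4]*lx + p[5]*lx^2 + p[6]*lx^3
    f  = p[1] * (1 - x)^p[2] * x^g
    return [
        (1 - x)^p[2] * x^g,   # ∂f/∂p₁
        log(1 - x) * f,       # ∂f/∂p₂
        f * lx,               # ∂f/∂p₃ = f · log(x) · (log x)⁰
        f * lx^2,             # ∂f/∂p₄
        f * lx^3,             # ∂f/∂p₅
        f * lx^4,             # ∂f/∂p₆
    ]
end

# Sum the per-data-point chain-rule contributions, as in the first edit above.
function grad_F(xs, ys, σs, p)
    g = zeros(length(p))
    for (x, y, σ) in zip(xs, ys, σs)
        g .+= (-2 * (y - fn(x, p)) / σ^2) .* grad_fn(x, p)
    end
    return g
end

Comparing grad_F against ForwardDiff.gradient on the same points is a quick sanity check.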


Thanks for the insight. So if I update your code to increase iterations to 10^5:

    return Optim.optimize(F, p0, LBFGS(), Optim.Options(;iterations=10^5); autodiff=:forward)

I get

Status: failure (line search failed)

 * Candidate solution
    Final objective value:     9.802206e+01

 * Found with
    Algorithm:     L-BFGS

Btw, even with more iterations, the results from NelderMead and LBFGS don’t converge to the same minimizer:

function optim_fit2(func::T, xs, ys, σs, p0; algo = LBFGS()) where T
    F = function (p)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end

    return Optim.optimize(F, p0, algo, Optim.Options(;iterations=5000); autodiff=:forward)
end

sol_nm = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = NelderMead())
sol_lbfgs = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = LBFGS())


julia> sol_nm.minimum, sol_lbfgs.minimum
(98.51662613370482, 98.0220580738921)

julia> sol_nm.minimizer, sol_lbfgs.minimizer
([0.014343016762772673, 0.36918264249014077, 0.08258611548972156, -0.019076422764126248, -0.31125906069802134, -0.044992442492889535], [11207.052120386572, -913.9928401545511, 165.74801872441768, 91.61206638696663, 17.418590874571905, 1.1299718078901382])

If we use BFGS + the stopping criteria from Connection between BFGS and ROOT.Minuit. Stopping criteria - #4 by Yuan-Ru-Lin, we get yet another minimizer:

julia> sol_lbfgs.minimizer |> print
[3.724592263478385, 0.7652684260382869, 4.556910181502994, 1.2983875079831064, -0.1450263191617602, -0.037556051055012134]
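
For reference, a hedged sketch of what that BFGS run could look like as the optimize call inside optim_fit2 above (the g_tol value is an assumption, not the setting from the linked thread):

# BFGS with an explicit gradient-norm stopping tolerance
return Optim.optimize(F, p0, BFGS(),
                      Optim.Options(; iterations = 10^5, g_tol = 1e-5);
                      autodiff = :forward)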

I think this is consistent with what @ChrisRackauckas said about the line search in Optim being “bad” (not sure what that means exactly).

Yeah, I’m okay with Optimization.LBFGS() failing, I guess, but right now Optim.LBFGS() also fails when used from Optimization.jl, which is not ideal.