Reproducer: MWE.converge.jl · GitHub

interestingly, this is reporting success while clearly failing:

```
optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :bfgs) = retcode: Success
u: [1.0e-5, 0.001, 0.001, 0.001, 0.001, 0.001]
Final objective value: 7142.997313963175
```

Using Optim.jl directly works though:

```
using Optimization, OptimizationOptimJL
using Optim
using Zygote
using ForwardDiff
@. ATLAS_f_6p(x, p) = p[1]*(1-x)^p[2] * x^(p[3] + p[4]*log(x) + p[5]*log(x)^2 + p[6]*log(x)^3)
function chi2(os, cs, σs)
    sum(@. (os - cs)^2 / σs^2)
end

function optim_fit(func::T, xs, ys, σs, p0; algo) where T
    F = function (p, _)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end
    optf = OptimizationFunction(F, AutoForwardDiff())
    prob = OptimizationProblem(optf, p0; lb = zeros(length(p0)) .- 1e5, ub = zeros(length(p0)) .+ 1e5)
    if algo == :nm
        solve(prob, NelderMead(); x_tol = 1e-7, g_tol = 1e-7, maxiters = 10^5)
    elseif algo == :lbfgs
        solve(prob, Optimization.LBFGS())
    end
end

function optim_fit2(func::T, xs, ys, σs, p0) where T
    F = function (p)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end
    G!(g, x) = !isnothing(g) ? g .= ForwardDiff.gradient(F, x) : x
    return @time Optim.optimize(F, G!, p0, LBFGS())
end
################ example data
function main()
    xs = collect(121:2:329) ./ 13600
    ys = [202.0, 198.0, 221.0, 211.0, 195.0, 169.0, 147.0, 168.0, 170.0, 169.0, 156.0, 164.0, 147.0, 135.0, 143.0, 136.0, 121.0, 137.0, 128.0, 129.0, 124.0, 109.0, 100.0, 104.0, 104.0, 99.0, 95.0, 102.0, 90.0, 94.0, 89.0, 80.0, 87.0, 65.0, 98.0, 80.0, 79.0, 59.0, 63.0, 73.0, 73.0, 65.0, 61.0, 54.0, 53.0, 47.0, 53.0, 54.0, 59.0, 55.0, 46.0, 61.0, 41.0, 46.0, 40.0, 47.0, 47.0, 44.0, 32.0, 46.0, 44.0, 29.0, 34.0, 33.0, 36.0, 35.0, 37.0, 30.0, 45.0, 42.0, 23.0, 20.0, 27.0, 39.0, 32.0, 24.0, 18.0, 23.0, 31.0, 18.0, 15.0, 28.0, 14.0, 19.0, 26.0, 20.0, 28.0, 25.0, 22.0, 13.0, 14.0, 14.0, 21.0, 19.0, 18.0, 17.0, 17.0, 23.0, 17.0, 20.0, 17.0, 12.0, 12.0, 13.0, 15.0]
    sigmas = sqrt.(ys)
    p0_6para = [1e-5; 0.001*ones(5)]
    @show optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :nm)
    @show optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :lbfgs)
    r = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para)
    @show r
    @show r.minimizer
end
```

```
julia> main()
optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :nm) = retcode: Success
u: [0.03756371444660228, -7.340228730228301, 1.0019315916033942, 0.1351637703868156, -0.32867928276850933, -0.04894659548629979]
Final objective value: 98.5061556041032
optim_fit(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = :lbfgs) = retcode: Success
u: [1.0e-5, 0.001, 0.001, 0.001, 0.001, 0.001]
Final objective value: 7142.997313963177
0.040708 seconds (93.77 k allocations: 40.363 MiB)
r = * Status: failure (reached maximum number of iterations)
* Candidate solution
Final objective value: 9.850700e+01
* Found with
Algorithm: L-BFGS
* Convergence measures
|x - x'| = 9.03e-03 ≰ 0.0e+00
|x - x'|/|x'| = 3.51e-03 ≰ 0.0e+00
|f(x) - f(x')| = 1.00e-05 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 1.02e-07 ≰ 0.0e+00
|g(x)| = 1.15e+02 ≰ 1.0e-08
* Work counters
Seconds run: 0 (vs limit Inf)
Iterations: 1000
f(x) calls: 2743
∇f(x) calls: 2743
r.minimizer = [0.3162552206761995, -1.0013989067701674, 2.575248450283215, 0.6800532460951797, -0.23266037063777448, -0.04227604261832813]
6-element Vector{Float64}:
0.3162552206761995
-1.0013989067701674
2.575248450283215
0.6800532460951797
-0.23266037063777448
-0.04227604261832813
```

Well, “works”: it still hits the 1000-iteration cap, but at least the cost is close to what one would expect. LBFGS should finish (much) faster.

I looked a bit at the gradient, and it seems that, while probably correct, it is numerically unstable.

For the following figure: the gradient can be used to build a linear approximation (first-order Taylor expansion) of the cost at the start point `p0`. The figure plots the approximation error against the step size and checks whether it has slope 2 on a log-log scale (the dashed line is that reference slope).

(figure omitted: log-log plot of the Taylor approximation error vs. step size)
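The slope-2 check described above can be sketched as follows. The cost function, point, and direction here are stand-ins chosen for illustration; in the original it is the chi2 cost and its AD gradient at `p0`:

```julia
using LinearAlgebra

# Stand-in cost and its exact gradient (assumptions for illustration);
# in the original this would be the chi2 cost and its AD gradient.
F(p) = exp(p[1]) + sin(p[2])
gradF(p) = [exp(p[1]), cos(p[2])]

p0 = [0.3, 0.7]
d = normalize([1.0, -2.0])  # fixed probe direction

# first-order Taylor error; for a correct gradient, err(t) ~ C * t^2
err(t) = abs(F(p0 .+ t .* d) - F(p0) - t * dot(gradF(p0), d))

# slope-2 check: halving the step should quarter the error
ratio = err(1e-3) / err(5e-4)  # ≈ 4 if the gradient is correct
```

If the gradient were wrong, the error would contain a term linear in `t` and the ratio would come out near 2 instead of 4.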

I also looked at the gradient norms during a (similar) LBFGS run (from Manopt.jl, for the simple reason that I could easily produce the following printout):

```
# 119 f(x): 98.521056 |grad f(p)|:112.59798058575441
# 120 f(x): 98.521053 |grad f(p)|:0.11722103028124436
# 121 f(x): 98.521053 |grad f(p)|:0.10405872433131812
# 122 f(x): 98.521053 |grad f(p)|:0.36430340562416974
# 123 f(x): 98.521053 |grad f(p)|:0.04506270919617104
# 124 f(x): 98.521053 |grad f(p)|:0.05409064169436933
# 125 f(x): 98.521053 |grad f(p)|:0.14359850776461958
[...]
# 997 f(x): 98.509641 |grad f(p)|:71.62963904974352
# 998 f(x): 98.509630 |grad f(p)|:44.0010184317672
# 999 f(x): 98.509627 |grad f(p)|:147.03947404681605
# 1000 f(x): 98.509622 |grad f(p)|:92.35763900759153
```

So I would cautiously conclude that the gradient computation is not that stable.

edit: with `using Manopt, Manifolds`, this is what I added to the original snippet above:

```
elseif algo == :lbfgsManopt
    M = Euclidean(length(p0))
    f = p -> F(p, 1)
    mn_f = (M, p) -> f(p)
    mn_g = (M, p) -> ForwardDiff.gradient(f, p)
    p3 = quasi_Newton(M, mn_f, mn_g, p0;
        debug = [:Iteration, " ", :Cost, " ", :GradientNorm, 1, "\n", :Stop],
        # return_state = true,
    )
    return p3
end
```

but careful: this prints one line per iteration, for all 1000 iterations. It eventually returns the state (if you uncomment the `return_state = true` line), where one can also see that it hits the maximum number of iterations.

I also checked whether the difference is due to ForwardDiff vs. Zygote, but both give the same result.

Ideally, Optimization.jl should just wrap Optim.jl, but with default options Optimization.jl fails either way.

Does that mean I should reformulate the problem somehow, to avoid the gradient being unstable?

This is a pretty standard chi2 fit, so I don’t know what I’m doing wrong.

Chi2 has a nice gradient part in the chain rule (with respect to `cs`); your ATLAS magic function is not really nice and is probably the culprit. But for me it is too late tonight to try to give you the analytical gradient of that.

I know of cases where SciML secretly switches the AD backend out from under you. Could this be one of those @Vaibhavdixit02?

So, here is an approach to the gradient. Since you called it `atlas_6p`, we can even do the general case.

We consider, for given data x, y, and \sigma,

f_n(p) = p_1(1-x)^{p_2}x^{g_n(p)} \quad\text{with}\quad g_n(p) = p_3 + p_4\log x + \ldots + p_n(\log x)^{n-3},

as well as X_2(x) = \frac{(y-x)^2}{\sigma^2}, for which X_2'(x) = -\frac{2(y-x)}{\sigma^2}.

The gradient of g_n is easy: since all variables appear linearly, we have

\nabla g_n(p) = \begin{pmatrix} 0\\0\\1\\\log x\\\vdots\\(\log x)^{n-3}\end{pmatrix}

and hence \nabla f_n(p) has the components (\nabla f_n(p))_i with (in vector form maybe a bit spacey):

- (1-x)^{p_2}x^{g_n(p)} for i=1
- p_1\log(1-x)(1-x)^{p_2}x^{g_n(p)} for i=2
- p_1(1-x)^{p_2}\log(x)\,x^{g_n(p)}(\nabla g_n(p))_i = p_1(1-x)^{p_2}\log(x)\,x^{g_n(p)}(\log x)^{i-3} for i=3,\ldots,n

The final gradient of F_n(p) = X_2(f_n(p)) is now another chain rule. Each entry reads

(\nabla F_n(p))_i = -\frac{2(y-f_n(p))}{\sigma^2}(\nabla f_n(p))_i

If one implements these step by step (`gn`, `grad_gn`, `fn`, `grad_fn`), the gradient is given in closed form.
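A sketch of the n = 6 case of these formulas (the names `g6`, `f6`, `grad_f6`, `grad_F6`, `grad_chi2` are mine, not from the thread; it requires 0 < x < 1 so that log(x) and log(1 - x) are defined):

```julia
# g_6(p) = p3 + p4*log(x) + p5*log(x)^2 + p6*log(x)^3
g6(x, p) = p[3] + p[4]*log(x) + p[5]*log(x)^2 + p[6]*log(x)^3

# f_6(p) = p1 * (1 - x)^p2 * x^g6
f6(x, p) = p[1] * (1 - x)^p[2] * x^g6(x, p)

# closed-form gradient of f6 with respect to p, component by component
function grad_f6(x, p)
    lx = log(x)
    base = (1 - x)^p[2] * x^g6(x, p)          # f6 / p[1]
    [base,                                     # i = 1
     p[1] * log(1 - x) * base,                 # i = 2
     (p[1] * base * lx * lx^(i - 3) for i in 3:6)...]  # i = 3,…,6
end

# chain rule through X2: gradient of one chi2 term (y - f)^2 / σ^2
grad_F6(x, y, σ, p) = (-2 * (y - f6(x, p)) / σ^2) .* grad_f6(x, p)

# summed over all data points (the overall gradient is the sum of the per-point ones)
grad_chi2(xs, ys, σs, p) = sum(grad_F6(x, y, σ, p) for (x, y, σ) in zip(xs, ys, σs))
```

A quick sanity check against central finite differences of `f6` at some test point in the valid domain should agree to several digits.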

The only warning on this is that the computation was done after sundown (and not yet in December, where that is 2pm here in Trondheim, but more like 9pm).

*edit*: If one has multiple data points x_j, y_j, \sigma_j, applies X_2 element-wise, and sums the results, the overall gradient is the sum of the individual ones, \sum_j \nabla F_{n,x_j,y_j,\sigma_j}(p).

*edit2:* One of the first things I missed is that when 0 < x < 1 *and* numerically p_2 < 0, this all does not work.

Thanks for the insight. So if I update your code to increase the iteration limit to 10^5:

```
return Optim.optimize(F, p0, LBFGS(), Optim.Options(;iterations=10^5); autodiff=:forward)
```

I get

```
Status: failure (line search failed)
* Candidate solution
Final objective value: 9.802206e+01
* Found with
Algorithm: L-BFGS
```

btw, even with more iterations, NelderMead and LBFGS don’t converge to the same result:

```
function optim_fit2(func::T, xs, ys, σs, p0; algo = LBFGS()) where T
    F = function (p)
        cs = func(xs, p)
        return chi2(ys, cs, σs)
    end
    return Optim.optimize(F, p0, algo, Optim.Options(; iterations = 5000); autodiff = :forward)
end

sol_nm = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = NelderMead())
sol_lbfgs = optim_fit2(ATLAS_f_6p, xs, ys, sigmas, p0_6para; algo = LBFGS())
julia> sol_nm.minimum, sol_lbfgs.minimum
(98.51662613370482, 98.0220580738921)
julia> sol_nm.minimizer, sol_lbfgs.minimizer
([0.014343016762772673, 0.36918264249014077, 0.08258611548972156, -0.019076422764126248, -0.31125906069802134, -0.044992442492889535], [11207.052120386572, -913.9928401545511, 165.74801872441768, 91.61206638696663, 17.418590874571905, 1.1299718078901382])
```

If we use BFGS plus the stopping criteria from Connection between BFGS and ROOT.Minuit. Stopping criteria - #4 by Yuan-Ru-Lin, we get yet another set of `minimizer` values…

```
julia> sol_lbfgs.minimizer |> print
[3.724592263478385, 0.7652684260382869, 4.556910181502994, 1.2983875079831064, -0.1450263191617602, -0.037556051055012134]
```

I think this is consistent with what @ChrisRackauckas said about the line search in Optim being “bad” (not sure what that means exactly).

Yeah, I’m okay with `Optimization.LBFGS()` failing, I guess, but right now `Optim.LBFGS()` also fails when used from Optimization.jl, which is not ideal.