 # Differences between NLsolve and Optim in solving system of equations

I am solving a system of equations using 2 methods. Solving it “directly” with NLsolve, and solving it by minimizing the norm of the residuals of the system. Here’s an example:

using NLsolve
using Optim
using LinearAlgebra
using Random

Random.seed!(9876)

const szE = 2
const szG = 2
const szA = 9
const β = 0.92
const δ = 0.10

Π = 1000 .* rand(9,2,9,2)
Sm = 1000 .* rand(9,2)
Sf = 1000 .* rand(9,2)

function maxT(am, af; Z=9)
Z - max(am, af) + 1
end

function solvnl!(res, μ0; Sm=Sm, Sf=Sf, Π=Π)
res2 = reshape(res, (szA, szE, szG))

μ0′ = reshape(μ0, (szA, szE, szG))

prodsummand = similar(μ0, (szA, szE, szA, szE))
for idx in CartesianIndices(prodsummand)
am, em, af, ef = Tuple(idx)
prodsummand[idx] = Π[am,em,af,ef]*sqrt(Sm[am,em]*Sf[af,ef])*prod( abs((μ0′[am+k,em,1]*μ0′[af+k,ef,2])/(Sm[am+k,em]*Sf[af+k,ef]))^(0.5*(β*(1-δ))^k) for k in 0:(maxT(am,af; Z=9)-1) )
end

for idx in CartesianIndices(res2)
am, em = Tuple(idx)
res2[am, em, 1] = Sm[am, em] - μ0′[am,em,1] - sum( prodsummand[am,em,af,ef] for af in 1:szA, ef in 1:szE )
end
for idx in CartesianIndices(res2)
af, ef = Tuple(idx)
res2[af, ef, 2] = Sf[af, ef] - μ0′[af,ef,2] - sum( prodsummand[am,em,af,ef] for am in 1:szA, em in 1:szE )
end
end

function solvopt!(μ0; Sm=Sm, Sf=Sf, Π=Π)
res2 = similar(μ0, (szA, szE, szG))

μ0′ = reshape(μ0, (szA, szE, szG))

prodsummand = similar(μ0, (szA, szE, szA, szE))
for idx in CartesianIndices(prodsummand)
am, em, af, ef = Tuple(idx)
prodsummand[idx] = Π[am,em,af,ef]*sqrt(Sm[am,em]*Sf[af,ef])*prod( abs((μ0′[am+k,em,1]*μ0′[af+k,ef,2])/(Sm[am+k,em]*Sf[af+k,ef]))^(0.5*(β*(1-δ))^k) for k in 0:(maxT(am,af; Z=9)-1) )
end

for idx in CartesianIndices(res2)
am, em = Tuple(idx)
res2[am, em, 1] = Sm[am, em] - μ0′[am,em,1] - sum( prodsummand[am,em,af,ef] for af in 1:szA, ef in 1:szE )
end
for idx in CartesianIndices(res2)
af, ef = Tuple(idx)
res2[af, ef, 2] = Sf[af, ef] - μ0′[af,ef,2] - sum( prodsummand[am,em,af,ef] for am in 1:szA, em in 1:szE )
end

return norm(res2)
end

init = rand(9,2,2)

nlsol = nlsolve(solvnl!, init; iterations=10_000, autodiff=:forward, method=:newton)

optsol = optimize(solvopt!, init, BFGS(), Optim.Options(iterations=10_000); autodiff=:forward)


The problem is that, while nlsolve finds a zero, optimize does not. In fact, it is strange that nlsol.zero does obtain a zero for solvopt! that Optim can’t find. Even starting very close to the zero, the solution by optimization goes somewhere else:

julia> optsol = optimize(solvopt!, nlsol.zero .+ (10 .* rand((szA, szE, szG))), ConjugateGradient(), Optim.Options(iterations=10_000, f_tol=1e-10, x_tol=1e-10, g_tol=1e-10); autodiff=:forward)
* Status: success

* Candidate solution
Final objective value:     3.816617e+06

* Found with

* Convergence measures
|x - x'|               = 0.00e+00 ≤ 1.0e-10
|x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 1.0e-10
|g(x)|                 = 2.30e+09 ≰ 1.0e-10

* Work counters
Seconds run:   0  (vs limit Inf)
Iterations:    15
f(x) calls:    67
∇f(x) calls:   59


What could be happening?

What happens if you start Optim from the solution of NLsolve?

1 Like

says line search failed:

julia> optsol = optimize(solvopt!, nlsol.zero, BFGS(), Optim.Options(iterations=10_000); autodiff=:forward)
* Status: failure (line search failed)

* Candidate solution
Final objective value:     1.925837e-04

* Found with
Algorithm:     BFGS

* Convergence measures
|x - x'|               = 0.00e+00 ≤ 0.0e+00
|x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|         = 1.93e-04 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 1.00e+00 ≰ 0.0e+00
|g(x)|                 = 1.63e+08 ≰ 1.0e-08

* Work counters
Seconds run:   0  (vs limit Inf)
Iterations:    1
f(x) calls:    87
∇f(x) calls:   87


Weird, it reports that the objective value is low (1.925837e-04), but the gradient is large (norm = 1.63e+08). Usually, the line search fails when the solution of the subproblem is not a descent direction.
All in all, this suggests that the gradients are wrong 1 Like

Are you sure your initial perturbation for the initial guess is actually small, in the case of optimization? Small relative to the second derivatives of f?

Generally it’s not a good idea to solve f(x)=0 by minimizing the norm of the residual \|f(x)\|, neither for univariate f nor f : R^n \rightarrow R^n, n>1. You’re trading finding the intersection of a curve with a line for finding the minimum of a quadratic, locally.

4 Likes

I tried with numerical derivatives, so I don’t know how the gradient is wrong.

julia> optsol = optimize(solvopt!, nlsol.zero .+ (10 .* rand((szA, szE, szG))), ConjugateGradient(), Optim.Options(iterations=10_000, f_tol=1e-10, x_tol=1e-10, g_tol=1e-10); autodiff=:finite)
* Status: success

* Candidate solution
Final objective value:     3.266168e+06

* Found with

* Convergence measures
|x - x'|               = 0.00e+00 ≤ 1.0e-10
|x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 1.0e-10
|g(x)|                 = 9.22e+03 ≰ 1.0e-10

* Work counters
Seconds run:   1  (vs limit Inf)
Iterations:    29
f(x) calls:    294
∇f(x) calls:   275


Generally, don’t do that. Besides the problem mentioned by @John_Gibson, you also square the condition number. For solving a nonlinear system, use a nonlinear rootfinder.

4 Likes

Apart from NLsolve, any other package suggestions?

(WIP, but I use it for my own work)

3 Likes

Fantastic, thank you! and Judd for the theory again?

One more thing: root solvers for the n-dimensional systems f(x) = 0 utilize the n \times n derivative \partial f_i/\partial x_j, but an optimizer applied to \|f(x)\|^2 is guided by the n-dimensional gradient \nabla \| f(x) \|^2 = 2 \nabla f\cdot f or componentwise, 2\sum_j f_j \, \partial f_i/\partial x_j. You lose information about the location of the solution when you contract from the n^2-dimensional derivative (root finding) to the n-dimensional gradient (optimization), making the optimization harder.

6 Likes

I see that they mention those problems in my references. But I thought I had seen that approach before, hence why I was trying it.

1 Like

I would recommend Numerical Optimization (2nd ed) by Nocedal & Wright instead for an intro. I find it intuitive, accessible, and practical.

3 Likes

I use NLSolvers for this. i had this function defined,as NLSolvers provides the bare minimum:

using NLSolvers,ForwardDiff
function nlsolve(f!,x0,method=TrustRegion(Newton(), Dogleg()),options=NEqOptions())
len = length(x0)
xcache = zeros(eltype(x0),len)
Fcache = zeros(eltype(x0),len)
JCache = zeros(eltype(x0),len,len)
jconfig = ForwardDiff.JacobianConfig(f!,x0,x0)
function j!(J,x)
#@show J
ForwardDiff.jacobian!(J,f!,Fcache,x,jconfig)
end
function fj!(F,J,x)
#@show J,F
ForwardDiff.jacobian!(J,f!,F,x,jconfig)
F,J
end

function jv!(x)
function JacV(Fv,v)
ForwardDiff.jacobian!(JCache,f!,Fcache,v,jconfig)
Fv .= Jcache * v
end
return LinearMap(JacV,length(x))
end
vectorobj = NLSolvers.VectorObjective(f!,j!,fj!,jv!)
vectorprob = NEqProblem(vectorobj)
res = solve(vectorprob, x0,method , options)
end

1 Like

@longemen3000 I’m looking a the NLSolvers package. What is that jv! function? It is not described in the documentation.

It’s a Jacobian-vector product (J*v). The idea behind this is that this operation could be written without allocating a full Jacobian (my version is naive,as it just calculates the Jacobian in an internal cache)

1 Like

I see. I will have to read more about that. But you mentioned that it’s naive because you actually compute the Jacobian, which is supposed to be bypassed by the Jacobian-vector product operation. But, still that is quicker than computing finite differences?

I would say yes, as there are only length(x) *k function evaluations where:

• if n<=12, k= 1 (ForwardDiff evaluates all partials at once)
• n > 12, k = div(n, 12)+1 (ForwardDiff evaluates the partials in chunks of twelve)
Finite differences, if I’m correct, requires around n^2 evaluations
1 Like

Ok. One final question: that Jacobian-vector product arises in Forward mode AD, right? as I understand, it’s not part of the algorithm itself for solving the system, nor it arises in Reverse mode AD, is that correct?

1 Like

It does not come from any optimization algorithm needs, but from the need to reduce the amount of computation needed to obtain derivative information. I don’t know if reverse AD is different here

1 Like