Differences between NLsolve and Optim in solving system of equations

I am solving a system of equations in two ways: “directly” with NLsolve, and by minimizing the norm of the residuals of the system with Optim. Here’s an example:

using NLsolve
using Optim
using LinearAlgebra
using Random

Random.seed!(9876)

const szE = 2
const szG = 2
const szA = 9
const β = 0.92
const δ = 0.10

Π = 1000 .* rand(9,2,9,2)
Sm = 1000 .* rand(9,2)
Sf = 1000 .* rand(9,2)

function maxT(am, af; Z=9)
    Z - max(am, af) + 1
end

function solvnl!(res, μ0; Sm=Sm, Sf=Sf, Π=Π)
    res2 = reshape(res, (szA, szE, szG))

    μ0′ = reshape(μ0, (szA, szE, szG))

    prodsummand = similar(μ0, (szA, szE, szA, szE))
    for idx in CartesianIndices(prodsummand)
        am, em, af, ef = Tuple(idx)
        prodsummand[idx] = Π[am,em,af,ef] * sqrt(Sm[am,em]*Sf[af,ef]) *
            prod( abs((μ0′[am+k,em,1]*μ0′[af+k,ef,2]) / (Sm[am+k,em]*Sf[af+k,ef]))^(0.5*(β*(1-δ))^k)
                  for k in 0:(maxT(am,af; Z=9)-1) )
    end

    for idx in CartesianIndices(res2)
        am, em = Tuple(idx)
        res2[am, em, 1] = Sm[am, em] - μ0′[am,em,1] - sum( prodsummand[am,em,af,ef] for af in 1:szA, ef in 1:szE )
    end
    for idx in CartesianIndices(res2)
        af, ef = Tuple(idx)
        res2[af, ef, 2] = Sf[af, ef] - μ0′[af,ef,2] - sum( prodsummand[am,em,af,ef] for am in 1:szA, em in 1:szE )
    end
end

function solvopt!(μ0; Sm=Sm, Sf=Sf, Π=Π)
    res2 = similar(μ0, (szA, szE, szG))

    μ0′ = reshape(μ0, (szA, szE, szG))

    prodsummand = similar(μ0, (szA, szE, szA, szE))
    for idx in CartesianIndices(prodsummand)
        am, em, af, ef = Tuple(idx)
        prodsummand[idx] = Π[am,em,af,ef] * sqrt(Sm[am,em]*Sf[af,ef]) *
            prod( abs((μ0′[am+k,em,1]*μ0′[af+k,ef,2]) / (Sm[am+k,em]*Sf[af+k,ef]))^(0.5*(β*(1-δ))^k)
                  for k in 0:(maxT(am,af; Z=9)-1) )
    end

    for idx in CartesianIndices(res2)
        am, em = Tuple(idx)
        res2[am, em, 1] = Sm[am, em] - μ0′[am,em,1] - sum( prodsummand[am,em,af,ef] for af in 1:szA, ef in 1:szE )
    end
    for idx in CartesianIndices(res2)
        af, ef = Tuple(idx)
        res2[af, ef, 2] = Sf[af, ef] - μ0′[af,ef,2] - sum( prodsummand[am,em,af,ef] for am in 1:szA, em in 1:szE )
    end

    return norm(res2)
end

init = rand(9,2,2)

nlsol = nlsolve(solvnl!, init; iterations=10_000, autodiff=:forward, method=:newton)

optsol = optimize(solvopt!, init, BFGS(), Optim.Options(iterations=10_000); autodiff=:forward)

The problem is that, while nlsolve finds a zero, optimize does not. What is strange is that nlsol.zero is (essentially) a zero of solvopt!, yet Optim can’t find it. Even starting very close to that zero, the optimization wanders off somewhere else:

julia> optsol = optimize(solvopt!, nlsol.zero .+ (10 .* rand((szA, szE, szG))), ConjugateGradient(), Optim.Options(iterations=10_000, f_tol=1e-10, x_tol=1e-10, g_tol=1e-10); autodiff=:forward)
 * Status: success

 * Candidate solution
    Final objective value:     3.816617e+06

 * Found with
    Algorithm:     Conjugate Gradient

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 1.0e-10
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 1.0e-10
    |g(x)|                 = 2.30e+09 ≰ 1.0e-10

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    15
    f(x) calls:    67
    ∇f(x) calls:   59

What could be happening?

What happens if you start Optim from the solution of NLsolve?

It says the line search failed:

julia> optsol = optimize(solvopt!, nlsol.zero, BFGS(), Optim.Options(iterations=10_000); autodiff=:forward)
 * Status: failure (line search failed)

 * Candidate solution
    Final objective value:     1.925837e-04

 * Found with
    Algorithm:     BFGS

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 1.93e-04 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.00e+00 ≰ 0.0e+00
    |g(x)|                 = 1.63e+08 ≰ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    1
    f(x) calls:    87
    ∇f(x) calls:   87

Weird, it reports that the objective value is low (1.925837e-04), but the gradient is large (norm = 1.63e+08). Usually, the line search fails when the solution of the subproblem is not a descent direction.
All in all, this suggests that the gradients are wrong :face_with_raised_eyebrow:
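
One quick way to test that hypothesis, assuming solvopt!, nlsol, and the size constants from the code above are in scope (the helper name fd_gradient is just illustrative): compare the forward-mode gradient against a simple central finite difference at a point away from the root, where the objective is smooth.

```julia
using ForwardDiff, LinearAlgebra

# Gradient check at a generic (smooth) point; a large relative mismatch would point
# to wrong derivatives rather than to the optimizer.
x0 = nlsol.zero .+ 10 .* rand(szA, szE, szG)
g_ad = ForwardDiff.gradient(solvopt!, x0)

function fd_gradient(f, x; h=1e-6)
    g = similar(x)
    for i in eachindex(x)
        d = h * (1 + abs(x[i]))          # step scaled to the magnitude of x[i]
        xp = copy(x); xm = copy(x)
        xp[i] += d; xm[i] -= d
        g[i] = (f(xp) - f(xm)) / (2d)
    end
    return g
end

g_fd = fd_gradient(solvopt!, x0)
norm(g_ad - g_fd) / norm(g_fd)   # should be ≪ 1 if the AD gradient is right
```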

Are you sure the perturbation you add to the initial guess is actually small in the optimization case? Small relative to the second derivatives of f?

Generally it’s not a good idea to solve f(x)=0 by minimizing the norm of the residual \|f(x)\|, neither for univariate f nor for f : R^n \rightarrow R^n, n>1. You’re trading finding the intersection of a curve with a line for finding the minimum of a quadratic, locally.
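
A one-dimensional sketch of that trade-off (toy function, purely illustrative): near a simple root the residual crosses zero with a full-strength slope, but the squared residual is locally a flat-bottomed quadratic whose gradient vanishes at the root, which is roughly why minimization typically resolves the root less accurately than a root finder does.

```julia
# Toy 1-D comparison: the root finder still sees a usable residual and slope,
# while the optimizer's gradient of f(x)^2, namely 2*f(x)*f'(x), fades to zero.
f(x) = x - 1.0
x = 1.0 + 1e-8
f(x)            # 1.0e-8: a Newton step f(x)/f'(x) removes the remaining error
2 * f(x) * 1.0  # 2.0e-8: the gradient of the squared residual is already tiny
```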

I also tried with numerical derivatives, so I don’t see how the gradient could be wrong:

julia> optsol = optimize(solvopt!, nlsol.zero .+ (10 .* rand((szA, szE, szG))), ConjugateGradient(), Optim.Options(iterations=10_000, f_tol=1e-10, x_tol=1e-10, g_tol=1e-10); autodiff=:finite)
 * Status: success

 * Candidate solution
    Final objective value:     3.266168e+06

 * Found with
    Algorithm:     Conjugate Gradient

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 1.0e-10
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 1.0e-10
    |g(x)|                 = 9.22e+03 ≰ 1.0e-10

 * Work counters
    Seconds run:   1  (vs limit Inf)
    Iterations:    29
    f(x) calls:    294
    ∇f(x) calls:   275

Generally, don’t do that. Besides the problem mentioned by @John_Gibson, you also square the condition number. For solving a nonlinear system, use a nonlinear rootfinder.
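
To see the condition-number point concretely (a minimal sketch with a toy linear residual): for f(x) = Ax - b, a root finder works with A itself, while minimizing \|Ax - b\|^2 works with the Hessian A^T A, whose condition number is the square of A’s.

```julia
using LinearAlgebra

A = [1.0  0.0;
     0.0  1e-4]          # mildly ill-conditioned toy Jacobian
cond(A)                  # 1.0e4 -> what a root finder has to cope with
cond(A' * A)             # 1.0e8 -> what the least-squares objective hands the optimizer
```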

Apart from NLsolve, any other package suggestions?

(WIP, but I use it for my own work)

Fantastic, thank you! And Judd for the theory again?

One more thing: root solvers for an n-dimensional system f(x) = 0 utilize the n \times n derivative \partial f_i/\partial x_j, but an optimizer applied to \|f(x)\|^2 is guided by the n-dimensional gradient \nabla \| f(x) \|^2 = 2 (\nabla f)^T f, or componentwise, 2\sum_j f_j \, \partial f_j/\partial x_i. You lose information about the location of the solution when you contract from the n^2-dimensional derivative (root finding) to the n-dimensional gradient (optimization), which makes the optimization harder.
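
For a concrete check of that identity on a toy residual (sketch; the function here is illustrative, not the model from the question):

```julia
using ForwardDiff, LinearAlgebra

f(x) = [x[1]^2 + x[2] - 3, sin(x[1]) * x[2]]      # toy f : R^2 -> R^2
x = [0.7, 1.3]

J = ForwardDiff.jacobian(f, x)                    # the n×n derivative a root solver uses
g = ForwardDiff.gradient(y -> sum(abs2, f(y)), x) # the n-vector an optimizer of ‖f‖² uses

g ≈ 2 * J' * f(x)                                 # true: the optimizer only ever sees Jᵀf
```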

I see that they mention those problems in my references. But I thought I had seen that approach before, which is why I was trying it.

I would recommend Numerical Optimization (2nd ed.) by Nocedal & Wright instead as an intro. I find it intuitive, accessible, and practical.

I use NLSolvers for this. I have this function defined, as NLSolvers provides only the bare minimum:

using NLSolvers, ForwardDiff, LinearMaps

function nlsolve(f!, x0, method = TrustRegion(Newton(), Dogleg()), options = NEqOptions())
    len = length(x0)
    Fcache = zeros(eltype(x0), len)
    JCache = zeros(eltype(x0), len, len)
    jconfig = ForwardDiff.JacobianConfig(f!, x0, x0)
    # Jacobian only
    function j!(J, x)
        ForwardDiff.jacobian!(J, f!, Fcache, x, jconfig)
    end
    # residual and Jacobian together
    function fj!(F, J, x)
        ForwardDiff.jacobian!(J, f!, F, x, jconfig)
        F, J
    end
    # Jacobian-vector product J(x) * v (naive: builds the full Jacobian into a cache)
    function jv!(x)
        function JacV(Fv, v)
            ForwardDiff.jacobian!(JCache, f!, Fcache, x, jconfig)
            Fv .= JCache * v
        end
        return LinearMap(JacV, len)
    end
    vectorobj = NLSolvers.VectorObjective(f!, j!, fj!, jv!)
    vectorprob = NEqProblem(vectorobj)
    return solve(vectorprob, x0, method, options)
end

@longemen3000 I’m looking at the NLSolvers package. What is that jv! function? It is not described in the documentation.

It’s a Jacobian-vector product (J*v). The idea is that this operation can be written without allocating a full Jacobian (my version is naive, as it just calculates the Jacobian into an internal cache).
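
For what it’s worth, here is roughly what a genuinely matrix-free version could look like with ForwardDiff, seeding the direction v into the dual parts so a single residual evaluation yields J(x)*v. The name jvp! and the toy residual are just for illustration, not part of NLSolvers:

```julia
using ForwardDiff

# Matrix-free Jacobian-vector product for an in-place residual f!(F, x):
# J(x)*v = d/dt f(x + t*v) at t = 0, computed with one dual-number pass.
function jvp!(Fv, f!, x, v)
    xdual = ForwardDiff.Dual.(x, v)          # value = x, single dual part seeded with v
    Fdual = similar(xdual)
    f!(Fdual, xdual)                         # one evaluation of the residual
    Fv .= ForwardDiff.partials.(Fdual, 1)    # extract the directional derivative
    return Fv
end

# Toy check: for f(x) = x.^2 .- 1 the Jacobian is Diagonal(2x), so J*v is [2, 0, 0] here.
toyf!(F, x) = (F .= x .^ 2 .- 1; F)
jvp!(zeros(3), toyf!, [1.0, 2.0, 3.0], [1.0, 0.0, 0.0])
```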

I see, I will have to read more about that. You mentioned that it’s naive because you actually compute the Jacobian, which the Jacobian-vector product is supposed to avoid. But is that still quicker than computing finite differences?

I would say yes, as there are only about length(x) * k function evaluations, where:

  • if n <= 12, k = 1 (ForwardDiff evaluates all partials at once)
  • if n > 12, k = div(n, 12) + 1 (ForwardDiff evaluates the partials in chunks of twelve)

Finite differences, if I’m correct, requires around n^2 evaluations. See the sketch below for a rough way to count the calls.
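
Here is a rough way to count them with a toy in-place residual (illustrative only, not the model from the thread):

```julia
using ForwardDiff

const ncalls = Ref(0)                # counts residual evaluations
function toy!(F, x)
    ncalls[] += 1
    F .= x .^ 2 .- 1
    return F
end

n = 30
x, F, J = rand(n), zeros(n), zeros(n, n)

ncalls[] = 0
ForwardDiff.jacobian!(J, toy!, F, x)
ncalls[]                             # ≈ cld(n, chunk size): one dual pass per chunk
```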

Ok, one final question: that Jacobian-vector product arises in forward-mode AD, right? As I understand it, it’s not part of the system-solving algorithm itself, nor does it arise in reverse-mode AD. Is that correct?

It does not come from the needs of any optimization algorithm, but from the need to reduce the amount of computation required to obtain derivative information. I don’t know whether reverse-mode AD is different here.