Newton optimization with Optim doing strange things


#1

I’m having some issues optimizing an objective with Newton from Optim. I started a brief discussion on Slack, which led me to isolate the problem:

Okay, I am getting closer to the issue. Consider this gradient and Hessian:

st0=[160.2, -54.8, -210.8, -0.8, 106.2, -0.0];
st1=[137.6 -34.4 -34.4 -34.4 -34.4 -1.0; -34.4 137.6 -34.4 -34.4 -34.4 -1.0; -34.4 -34.4 137.6 -34.4 -34.4 -1.0; -34.4 -34.4 -34.4 137.6 -34.4 -1.0; -34.4 -34.4 -34.4 -34.4 137.6 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]

The Newton step I expect to happen is

Main> st1\st0
6-element Array{Float64,1}:
  0.931395
 -0.318605
 -1.22558
 -0.00465116
  0.617442
  4.90803e-14

which (after two iterations) gets close to my optimum. The Optim package uses the PositiveFactorizations approach (because st1 is not positive definite) and does (basically) this:

Main> F=cholfact(Positive, LowerTriangular(st1))
Base.LinAlg.Cholesky{Float64,Array{Float64,2}} with factor:
[11.7303 0.0 … 0.0 0.0; -2.93258 11.3578 … 0.0 0.0; … ; -2.93258 -3.78594 … 1.0 0.0; -0.0852493 -0.110056 … 0.0 0.340997]

Main> F\st0
6-element Array{Float64,1}:
  -0.457849
  -1.70785
  -2.61483
  -1.3939
  -4.12115e-13
 -26.55

(which corresponds to the values I see in the trace of the Newton output). This does not converge to the right solution. What is the problem here?

The two approaches boil down to small differences in the Hessian that is used in the Newton step:

Main> F.factors*ctranspose(F.factors)
6×6 Array{Float64,2}:
 137.6  -34.4  -34.4  -34.4  -34.4  -1.0
 -34.4  137.6  -34.4  -34.4  -34.4  -1.0
 -34.4  -34.4  137.6  -34.4  -34.4  -1.0
 -34.4  -34.4  -34.4  137.6  -34.4  -1.0
 -34.4  -34.4  -34.4  -34.4  138.6   4.0
  -1.0   -1.0   -1.0   -1.0    4.0   0.232558

Main> st1
6×6 Array{Float64,2}:
 137.6  -34.4  -34.4  -34.4  -34.4  -1.0
 -34.4  137.6  -34.4  -34.4  -34.4  -1.0
 -34.4  -34.4  137.6  -34.4  -34.4  -1.0
 -34.4  -34.4  -34.4  137.6  -34.4  -1.0
 -34.4  -34.4  -34.4  -34.4  137.6  -1.0
  -1.0   -1.0   -1.0   -1.0   -1.0   0.0
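For anyone who wants to poke at the numbers without the original session, the plain Newton step and the indefiniteness of st1 (which is what triggers the PositiveFactorizations code path) are easy to reproduce. A NumPy sketch (illustrative only, not Optim’s actual code):

```python
import numpy as np

# st0 (gradient) and st1 (Hessian) as posted above
g = np.array([160.2, -54.8, -210.8, -0.8, 106.2, -0.0])
H = np.full((6, 6), -34.4)
np.fill_diagonal(H, 137.6)
H[5, :] = -1.0
H[:, 5] = -1.0
H[5, 5] = 0.0

# The plain Newton step st1 \ st0
p = np.linalg.solve(H, g)

# st1 is indefinite: it has one negative eigenvalue,
# which is why Optim falls back to PositiveFactorizations
eigvals = np.linalg.eigvalsh(H)
```

This reproduces the step [0.9314, -0.3186, -1.2256, -0.0047, 0.6174, ~0] quoted above.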

Below is one iteration using Optim.Newton (showing values that correspond to the PositiveFactorizations step):

Main> res=Optim.optimize(logLC, g!, h!, zeros(m+1), show_trace=true, extended_trace=true, method=Optim.Newton(linesearch=LineSearches.Static(alpha=1.0)), iterations=1)
Iter     Function value   Gradient norm
     0     1.348190e+03     2.108000e+02
 * Current step size: 1.0
 * g(x): [160.2, -54.8, -210.8, -0.8, 106.2, -0.0]
 * h(x): [137.6 -34.4 -34.4 -34.4 -34.4 -1.0; -34.4 137.6 -34.4 -34.4 -34.4 -1.0; -34.4 -34.4 137.6 -34.4 -34.4 -1.0; -34.4 -34.4 -34.4 137.6 -34.4 -1.0; -34.4 -34.4 -34.4 -34.4 137.6 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
     1     9.554691e+02     7.972837e+01
 * Current step size: 1.0
 * g(x): [10.9235, -30.7894, -2.10531, -31.0505, -79.7284, -6.17442]
 * h(x): [79.4839 -23.51 -23.3 -21.8481 -10.8258 -1.0; -23.51 135.839 -54.0743 -41.7406 -16.5137 -1.0; -23.3 -54.0743 138.641 -45.3491 -15.9179 -1.0; -21.8481 -41.7406 -45.3491 124.487 -15.5493 -1.0; -10.8258 -16.5137 -15.9179 -15.5493 58.8067 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [0.457849, 1.70785, 2.61483, 1.3939, 4.12115e-13, 26.55]
Results of Optimization Algorithm
 * Algorithm: Newton's Method
 * Starting Point: [0.0,0.0,0.0,0.0,0.0,0.0]
 * Minimizer: [0.45784883720971925,1.7078488372097196, ...]
 * Minimum: 9.554691e+02
 * Iterations: 1
 * Convergence: false
   * |x - x'| ≤ 1.0e-32: false
     |x - x'| = 2.66e+01
   * |f(x) - f(x')| ≤ 1.0e-32 |f(x)|: false
     |f(x) - f(x')| = 4.11e-01 |f(x)|
   * |g(x)| ≤ 1.0e-08: false
     |g(x)| = 7.97e+01
   * Stopped by an increasing objective: false
   * Reached Maximum Number of Iterations: true
 * Objective Calls: 2
 * Gradient Calls: 2
 * Hessian Calls: 2

#2

The point of PositiveFactorizations, as far as I understand, is to guarantee a descent direction, which pretty much ensures convergence to a local minimum. What’s the problem here? Does it fail to converge when you let it run for several iterations? Or is it just less efficient than “standard” Newton?


#3

No, it does not converge when I use the Optim.Newton method. I’ve been bashing my head trying to figure out why, because when I do the Newton steps manually (inv(st1)*st0) I converge in 2 steps. I’ve narrowed it down to the PositiveFactorizations approach.


#4

I opened an issue to see if I can request a plain Newton step rather than the PositiveFactorizations step:


#5

This is unfortunate, and I think you should make an issue at PositiveFactorizations. I intend to enable other factorizations, just never got around to it. Maybe now is a good time :slight_smile:


#6

So what does it actually do when you increase the number of iterations? Can you post the full trace? The positive factorization is supposed to guarantee a descent direction, so I don’t see how it can fail to converge.


#7

The problem is that it produces NaNs, so there might be a bug


#8

An MWE would help a lot in debugging this.


#9

I’ve provided the gradient and Hessian that produce the strange PositiveFactorizations step; what more do you need?


#10

I’m not sure Tamas saw your example, but I may of course be wrong. I’m not too familiar with the exact implementation in PositiveFactorizations, but I plan to allow user-chosen factorizations very soon, so that should at least let people choose another globalization strategy.


#11

I still can’t see logLC, sorry if it is there and I am just missing it.


#12

Ah, sure, you can’t run that part of the code. What I viewed as the MWE was the gradient and Hessian combination that causes the issue.


#13

I was interested in the original problem because of the following:

  1. Optimization should be robust to small (or even medium-sized) errors in the derivatives, since they are sometimes calculated by finite differences, or the input can be noisy (numerically or inherently). Most packages in widespread use cope with this just fine. When that fails, the whole problem is very useful as a test case.

  2. I am not an expert on the theory of (quasi-)Newton methods, and while the original justification for PositiveFactorizations.jl makes sense, I am not aware of a proof (cf the literature on quasi-Newton with line search). So I am curious about failures from that perspective: is it the line search that fails?

  3. Finally, in my experience Newton’s method (even with line search) is not very robust; I would instead recommend (L)BFGS or CG. So it would be interesting to try them on this problem.
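Point 3 is easy to try on a standard test problem; in SciPy terms, for instance (since the thread’s logLC is not available, this uses the Rosenbrock function as a stand-in):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Quasi-Newton with limited-memory BFGS updates
res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

# Newton with an (inexact) CG solve of the Newton system
res_newton = minimize(rosen, x0, jac=rosen_der, hess=rosen_hess,
                      method="Newton-CG")
```

Both converge to the minimizer at all-ones here; the interesting question is how they compare on an objective like the one in this thread.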


#14

Okay, I guess it depends if you read this as a “why can’t I optimize this function” question or “why does PositiveFactorizations.jl not work here?” question :slight_smile:


#15

I don’t know what you want proved, but if the matrix G is SPD, then f(x - alpha G nabla f(x)) < f(x) for alpha small enough, so (with a proper line search) the objective can only go down, which usually ensures convergence to a local minimum. I’m also interested in understanding why PositiveFactorizations doesn’t work in this case.
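The SPD-implies-descent claim is easy to check numerically: if G is SPD, then p = inv(G)*g satisfies dot(g, p) > 0, so moving along -p decreases the objective to first order. A minimal NumPy check on random SPD matrices (illustrative, unrelated to the thread’s objective):

```python
import numpy as np

rng = np.random.default_rng(0)
vals = []
for _ in range(100):
    A = rng.standard_normal((6, 6))
    G = A @ A.T + 1e-3 * np.eye(6)   # symmetric positive definite
    grad = rng.standard_normal(6)
    p = np.linalg.solve(G, grad)     # (modified-)Newton direction
    # directional derivative along -p is -grad @ p, negative iff grad @ p > 0
    vals.append(grad @ p)
```

Every entry of vals is positive, confirming that -p is always a descent direction.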


#16

The first thing I would like to understand is what “does not converge” means in this context. Does it (ultimately) converge to a different but valid local extremum? Or fail somehow?

In any case, I don’t think that asking for an MWE was an unreasonable request.


#17

In my experience PositiveFactorizations actually helped ensure convergence (as pointed out above, it is needed to make sure the algorithm takes a descent direction).

Every time I couldn’t get it to converge to a minimum, it was because I had made a mistake in the Hessian computation, so it would be good to have an MWE where it’s easy to check that the provided Hessian is correct.


#18

LBFGS works great on this problem. I was trying to get Newton-Raphson to work to compare against it. I’m confident I have coded the gradient and Hessian correctly, because they match what ForwardDiff.gradient and ForwardDiff.hessian give.

The code is far too complex and lengthy to paste here as an “MWE”. That said, I am not trying to debug my optimization: as I said, LBFGS works fine, Newton-Raphson works fine with the standard Newton step, and the gradients and Hessians seem correct. All that seems wonky is the PositiveFactorizations step.
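For readers without the original code, the derivative check mentioned here can be done generically; the sketch below uses central finite differences on a toy function (same spirit as comparing against ForwardDiff.gradient, just cruder):

```python
import numpy as np

def max_grad_error(f, grad, x, eps=1e-6):
    """Largest deviation between grad(x) and a central finite difference."""
    g_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])
    return float(np.max(np.abs(g_fd - grad(x))))

# Toy objective with a known gradient
f = lambda x: x @ x + np.sin(x[0])
grad = lambda x: 2 * x + np.array([np.cos(x[0]), 0.0, 0.0])

err = max_grad_error(f, grad, np.array([0.3, -1.2, 2.0]))
```

If err is not tiny (roughly 1e-6 or smaller here), the hand-coded gradient is suspect.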


#19

Interesting. Can you check by hand whether the value of the objective function decreases with every step of the Newton algorithm with PositiveFactorizations?


#20

So here is the Newton-Raphson trace output with and without the PositiveFactorizations approach. Note that the objective function does decrease with PF, but I think that’s because overflow occurred in my objective function:

julia> @time cml_nr(n,m,r,s,f, false, false, true)
Iter     Function value   Gradient norm
     0     1.348190e+03     2.108000e+02
 * Current step size: 1.0
 * g(x): [160.2, -54.8, -210.8, -0.8, 106.2, -0.0]
 * h(x): [137.6 -34.4 -34.4 -34.4 -34.4 -1.0; -34.4 137.6 -34.4 -34.4 -34.4 -1.0; -34.4 -34.4 137.6 -34.4 -34.4 -1.0; -34.4 -34.4 -34.4 137.6 -34.4 -1.0; -34.4 -34.4 -34.4 -34.4 137.6 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
     1    -2.320286e+06    6.083194e+301
 * Current step size: 120.68070968502755
 * g(x): [5.44665e180, -9.51189e33, -2800.07, -7.52919e66, -6.08319e301, -745.133]
 * h(x): [-Inf NaN NaN Inf Inf -1.0; NaN -3.03611e65 NaN NaN NaN -1.0; NaN NaN 1.04026e-45 NaN NaN -1.0; Inf NaN NaN -1.90231e131 -Inf -1.0; Inf NaN NaN -Inf -Inf -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [55.2535, 206.104, 315.559, 168.216, 4.97343e-11, 3204.07]
ERROR: Initial value and slope must be finite
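The x values after the step are in the thousands (step size 120.68 times a direction component of 26.55 gives x₆ ≈ 3204). If logLC involves exponentials of the parameters (an assumption; the objective isn’t shown in the thread), Float64 overflows long before arguments that large, which would explain the Inf/NaN entries above:

```python
import numpy as np

with np.errstate(over="ignore"):
    big = np.exp(3204.07)   # exp overflows Float64 above ~709.78 -> inf
    ok = np.exp(709.0)      # still finite
```

Once an Inf appears, expressions like Inf - Inf produce the NaNs seen in the Hessian trace.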

without positive factorization:

julia> @time cml_nr(n,m,r,s,f, false, false, true)
Iter     Function value   Gradient norm
     0     1.348190e+03     2.108000e+02
 * Current step size: 1.0
 * g(x): [160.2, -54.8, -210.8, -0.8, 106.2, -0.0]
 * h(x): [137.6 -34.4 -34.4 -34.4 -34.4 -1.0; -34.4 137.6 -34.4 -34.4 -34.4 -1.0; -34.4 -34.4 137.6 -34.4 -34.4 -1.0; -34.4 -34.4 -34.4 137.6 -34.4 -1.0; -34.4 -34.4 -34.4 -34.4 137.6 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
     1     1.102441e+03     4.917147e+01
 * Current step size: 1.2213854333908536
 * g(x): [8.48736, -17.6975, 49.1715, -26.7709, -13.1904, -4.44089e-16]
 * h(x): [63.4333 -18.038 -16.9569 -16.4714 -11.967 -1.0; -18.038 131.916 -51.2141 -38.5868 -24.0776 -1.0; -16.9569 -51.2141 132.1 -40.6341 -23.2946 -1.0; -16.4714 -38.5868 -40.6341 117.364 -21.6712 -1.0; -11.967 -24.0776 -23.2946 -21.6712 81.0105 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [-1.13759, 0.389139, 1.49691, 0.00568086, -0.754134, -5.9946e-14]
     2     1.091575e+03     7.564867e-01
 * Current step size: 0.9515390676584067
 * g(x): [0.756487, -0.539144, -0.278554, -0.0014293, 0.0626405, -0.0]
 * h(x): [59.351 -16.3114 -16.2906 -15.3438 -11.4052 -1.0; -16.3114 138.994 -54.642 -41.8713 -26.1698 -1.0; -16.2906 -54.642 144.362 -46.3867 -27.0428 -1.0; -15.3438 -41.8713 -46.3867 127.793 -24.1915 -1.0; -11.4052 -26.1698 -27.0428 -24.1915 88.8093 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [-1.24602, 0.470332, 1.2328, 0.16681, -0.623927, 6.02991e-13]
     3     1.091570e+03     1.272630e-03
 * Current step size: 1.0039156456976184
 * g(x): [0.000311023, 0.00127263, -6.37752e-5, -0.000822663, -0.000697215, -0.0]
 * h(x): [58.8431 -16.1746 -16.1288 -15.216 -11.3237 -1.0; -16.1746 139.011 -54.7283 -41.924 -26.1841 -1.0; -16.1288 -54.7283 144.273 -46.3901 -27.0263 -1.0; -15.216 -41.924 -46.3901 127.725 -24.1954 -1.0; -11.3237 -26.1841 -27.0263 -24.1954 88.7295 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [-1.25612, 0.474913, 1.23598, 0.168406, -0.623179, -3.57198e-12]
     4     1.091570e+03     1.583828e-09
 * Current step size: 0.999999752811262
 * g(x): [4.02963e-10, 4.77087e-10, -1.58383e-9, -1.58423e-10, 9.03981e-10, -0.0]
 * h(x): [58.8431 -16.1746 -16.1288 -15.216 -11.3237 -1.0; -16.1746 139.011 -54.7283 -41.924 -26.1841 -1.0; -16.1288 -54.7283 144.273 -46.3901 -27.0263 -1.0; -15.216 -41.924 -46.3901 127.725 -24.1954 -1.0; -11.3237 -26.1841 -27.0263 -24.1954 88.7295 -1.0; -1.0 -1.0 -1.0 -1.0 -1.0 0.0]
 * x: [-1.25613, 0.474906, 1.23598, 0.168411, -0.623172, -4.25406e-12]
  3.995424 seconds (2.88 M allocations: 130.049 MiB, 0.95% gc time)