Highly Asymmetric Hessian after optimization

The Hessian after optimization is not even close to symmetric; for example, H[1, 2] is 16.7 while H[2, 1] is 1.4, and H[12, 13] is -3945 while H[13, 12] is -16 (I’m rounding the numbers).

I’m using automatic differentiation, so exactly where things are going wrong is unclear. These gross asymmetries appear only when Newton() is the optimization method; BFGS() does not (by eye) produce them. Both methods arrive at essentially the same solution, though BFGS reports convergence failure. BFGS also produces generally larger standard error estimates, and no negative variances.

Or is num_hess in the code below just a reference into f, so that when f is later modified or garbage-collected I’m left pointing at stale or garbage data?

I’m using development code for StatsModels 0.6, though I haven’t updated it in a couple of weeks.

This may be related to my older, though still open, report of asymmetry (https://github.com/JuliaMath/Calculus.jl/issues/91), but that was in a much different context. And in my old report the asymmetry was small. See also https://github.com/JuliaDiff/ForwardDiff.jl/issues/253#issue-252103867.

This is not a nice, self-contained example, but the key code is:

using DataFrames, StatsModels
using NamedArrays
using Optim, NLSolversBase
using LinearAlgebra  # for diag

function mysolve(formula, df, weight=nothing)::AbstractResult
    nsamp = size(df, 1)
    if isnothing(weight)
        w = ones(nsamp)
    else
        w = nsamp/sum(weight)*weight
    end
    xp = prep(formula, df, w)
    myll(β) = -myllike(β, xp)
    f = TwiceDifferentiable(myll, zeros(size(xp.mm, 2)); autodiff=:forward)
    r = optimize(f, zeros(size(xp.mm, 2)), Newton(), Optim.Options(store_trace=true))
    p = Optim.minimizer(r)
    num_hess = hessian!(f, p)  # this is the Hessian I reported
    vcv = inv(num_hess)
    v = diag(vcv)
    sd = v
    sd[sd .< 0.0] .= 0.0  # hack for something that should not happen
    sd = sqrt.(sd)
    withname(x) = NamedArray(x, (xp.nms,))
    with2name(x) = NamedArray(x, [xp.nms, xp.nms])

    return SimpleResult(withname(p), withname(sd), with2name(vcv),
        with2name(num_hess), r, xp)
end
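
(The check I’m doing on num_hess, just before mysolve returns, is essentially:)

    maximum(abs.(num_hess - num_hess'))  # largest |H[i, j] - H[j, i]|; large here
    issymmetric(num_hess)                # false, and not just by rounding error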

More package versions:

  [324d7699] CategoricalArrays v0.5.4 
  [a93c6f00] DataFrames v0.18.3
  [f6369f11] ForwardDiff v0.10.3
  [38e38edf] GLM v1.1.2 [`dev\GLM`]
  [54eb57ff] InteractiveCodeSearch v0.3.1
  [d41bc354] NLSolversBase v7.3.1
  [86f7a689] NamedArrays v0.9.3
  [429524aa] Optim v0.18.1
  [df47a6cb] RData v0.6.0
  [3eaba693] StatsModels v0.6.0 [`dev\StatsModels`]
  [bd369af6] Tables v0.2.6

P.S. When I copy and paste out of the Juno REPL, many line breaks vanish. Is there a way to fix that? I’m on Windows.

What happens if you use hessian(f, p)?

Edit: Also, what if you use :forwarddiff instead of :forward?
Edit 2: What about hessian!!(f, p)?

With hessian, without the !:

ERROR: MethodError: no method matching hessian(::TwiceDifferentiable{Float64,Array{Float64,1},Array{Float64,2},Array{Float64,1}}, ::Array{Float64,1})

I tried copy(hessian!(...)) to try to get around possible object lifetime issues; the results were identical.

hessian!! is the method that hessian! dispatches to if the results are not already in the cache. Interestingly, it gives a different, better answer. The two examples I gave originally now match to all printed digits: H[1, 2] = H[2, 1] = 16.7 and H[12, 13] = H[13, 12] = -3945. The largest absolute discrepancy |H[i, j] - H[j, i]| is now 1.5e-11. It seems reasonable to infer:

  1. The TwiceDifferentiable object thinks it has the results for the final solution in its cache.
  2. But the value in the cache is invalid.
  3. The values in the upper triangle (j>i) of the invalid data match the ones that came out when hessian!! forced a fresh computation, suggesting they are correct.
  4. It seems possible that somewhere in the optimization the upper triangle is being used and the lower triangle ignored. There are good reasons to do so; it would be wasteful to update both for most purposes.
  5. Even the improved values show minor asymmetry; LinearAlgebra.issymmetric() remains false.

If the Hessian were kept in a more specialized structure intended for symmetric matrices, it seems to me most of this could be avoided. If numerical methods produce different values for H[i, j] and H[j, i], there’s no obvious way to decide the true value, and such a structure would arbitrarily favor the j>i or i>j answer. That seems to me the lesser of several evils.

With this change from hessian!() to hessian!!(), the results of BFGS and Newton are virtually identical for both the solution and the standard errors; only Newton thinks it has converged.
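
For reference, the comparison I ended up doing looks roughly like this (reusing f, r, and p from the code in my first post):

    p = Optim.minimizer(r)
    H_cached = copy(hessian!(f, p))    # may hand back the stale cached value
    H_fresh = hessian!!(f, p)          # forces a fresh evaluation
    maximum(abs.(H_cached - H_fresh))  # large; only the upper triangles agree
    maximum(abs.(H_fresh - H_fresh'))  # about 1.5e-11 in my run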

I’m glad it worked, but your analysis is not 100% correct (and that’s not weird; I’m afraid this isn’t documented anywhere).

The problem is that upon convergence we skip the hessian evaluation: we only need the gradient to check first-order convergence. This means that the hessian cache does not hold the hessian upon convergence. The reason is that, at that point, the allocated memory has already been used to do the factorization in place when finding the search direction.

The hessian evaluation is often expensive, so I figured I was being smart here. However, I’ve since realized that it can cause people to be confused about this. In the next major version of Optim (which I hope to release not too long after the summer, and which I will describe in some detail at JuliaCon in a few weeks) this will be changed, such that the Hessian is always evaluated. Since the hessian evaluation often shares considerable calculations with the gradient and value, it’s almost always better to evaluate them together (and with automatic differentiation you can actually get the gradient and value for free once you’ve decided to evaluate the hessian, so…). You only save some calculations in the last iteration, but all other iterations become more expensive, which seems like a stupid trade-off. Luckily, I only have myself to blame :slight_smile:

I’ve surveyed a few packages/libraries that support Newton’s method (I didn’t write the results down…) and some update always, some only update if convergence does not occur (measured by the gradient norm).

Edit: Just because it might interest some people, Newton’s method is going to change in Optim as well. You’ll still be able to choose the line search version, but if you just specify Newton(...) without the line search, it will default to a trust region approach. It’s much easier to work with IMO, and it works great for problems where the objective and derivatives are expensive to evaluate (because only one such evaluation is done per “iteration”).
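
If you want to try a trust region Newton today, I believe the current release already has one as NewtonTrustRegion(), so it’s a one-line change in your call (sketch, reusing your f and xp):

    r = optimize(f, zeros(size(xp.mm, 2)), NewtonTrustRegion(), Optim.Options(store_trace=true))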

OTOH I think that quasi-Newton users should understand that the “Hessian” built up by the method (e.g. BFGS) may not have much to do with the “real” Hessian. I think that simply not providing an interface for it, or discouraging its use, could also be a viable option.

Hessians from optimization methods are side effects at best; IMO it is a very roundabout API design to obtain one from optimization.

What I wrote is irrelevant to quasi-Newton methods. You cannot get the BFGS approximation to the inverse of the hessian from Optim.hessian(sometwicedifferentiable) in Optim at all, and I intentionally did not just put it into a “hessian” or even an “inverse hessian” field in the types, exactly because using it post hoc is problematic at best. If you want it out, you have to go through the hacky way of initializing the caches up front and then use that struct to get them out after the fact. That said, the approximation can actually be quite useful to track and look at in case something goes wrong, so I’m not sure I agree that it should be very hard to get.
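
Roughly, the hacky route is to construct the solver state yourself and read its invH field afterwards. A sketch (with a stand-in objective; initial_state and the invH field are internal and undocumented, so the details may change):

    using Optim, NLSolversBase

    rosen(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2  # stand-in objective

    od = OnceDifferentiable(rosen, zeros(2); autodiff = :forward)
    method = BFGS()
    options = Optim.Options()
    state = Optim.initial_state(method, options, od, zeros(2))  # internal API
    r = optimize(od, zeros(2), method, options, state)
    state.invH  # the BFGS approximation to the inverse hessian at termination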

I think the only reason why OP looked at the BFGS approximation was that the actual “hessian” (which happened to contain information from the factorization attempt without their knowledge) was so weird.

Thanks for the clarification. Is this a cache invalidation issue then?

Just being curious, do you gain a lot from preallocated buffers for gradients and Hessians so that it is worth the code complexity?

Sometimes yes, sometimes no. But to give the answer you didn’t ask for, the new version will have a significantly simplified cache setup (and a fully not-in-place code path as well that accepts numbers, arrays, static arrays, …)

Please let me know when the rewrite is available as a branch/PR. I face these API design problems for my own packages and would be interested in seeing how you do it.

@pkofod I actually used BFGS first, and turned to Newton only because BFGS reported convergence failure. Also, I figured since I had the second derivative for “free”, why not use it? I only started getting noticeably weird Hessians with Newton.

I do realize that just because I have the second derivative doesn’t necessarily mean using it is a good idea, and in my limited tests Newton runs significantly slower and probably uses more memory than BFGS. It arrives at the same solution as BFGS but, unlike BFGS, does report convergence. So, apparently: safer but slower.

As a practical matter for the short-run, should using hessian!! instead of hessian! in my code after optimization be reliable for either Newton or BFGS? E.g.,

    f = TwiceDifferentiable(myll, zeros(size(xp.mm, 2)); autodiff=:forward)
    r = optimize(f, zeros(size(xp.mm, 2)), Newton(), Optim.Options(store_trace=true))
    p = Optim.minimizer(r)
    # hessian! should suffice, but the value in the cache is no good
    # alternatively, I could copy the upper triangle over the lower one
    num_hess = hessian!!(f, p)
    vcv = inv(num_hess)

Also, do your remarks mean the upper triangle from hessian! is not necessarily correct either, and hence that my comment in the code is wrong?

Just a blue-sky thought, but how about an optimizer that measures the time and memory required to evaluate the objective and the available derivatives, and then picks an appropriate strategy?

I’d bet somebody has already thought of this, and I realize getting reliable timings and memory usage is a bit dicey.

You don’t, though. The BFGS inverse hessian approximation cannot be proved to converge to the real inverse hessian as the procedure converges. This means that when the optimize call stops, you really shouldn’t use the resulting invH to calculate standard errors, for example. I know you will find this done from time to time, but there’s no guarantee of the correctness of that approach.
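
If you want standard errors, compute them from a fresh hessian of the negative log-likelihood at the minimizer instead, along these lines (a sketch reusing the f and r from your code):

    p = Optim.minimizer(r)
    H = hessian!!(f, p)     # fresh hessian of the negative log-likelihood
    vcv = inv(H)            # asymptotic variance-covariance estimate
    se = sqrt.(diag(vcv))   # standard errors, assuming H is positive definite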

But aren’t you also getting your Hessian using finite differences? Finite difference hessians can work, especially if you have known sparsity patterns, but they’re usually very slow. That’s probably why it’s slower. If your hessian isn’t positive definite, there’s also quite a bit of extra work involved in finding a search direction. Newton’s method isn’t necessarily safer either, but given a starting point that’s sufficiently close to the solution, it can be much faster (unless your hessian takes a long time to calculate, of course).

Those functions work directly on the objective function representation; they have nothing to do with either Newton or BFGS. You can optimize your function with NelderMead() if you want and still plug the resulting estimates into hessian!!. But to answer the question: yes, that should work.

I’d use hessian!! yes. That function should always return what you’re looking for.
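
That is, something like this is fine regardless of which solver produced the estimates (a self-contained sketch with a stand-in objective):

    using Optim, NLSolversBase

    obj(x) = sum(abs2, x .- [1.0, 2.0, 3.0])   # stand-in for your negative log-likelihood

    f = TwiceDifferentiable(obj, zeros(3); autodiff = :forward)
    r = optimize(obj, zeros(3), NelderMead())  # derivative-free solve
    p = Optim.minimizer(r)
    H = hessian!!(f, p)                        # hessian via ForwardDiff, independent of the solver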

I thought you said earlier that the interface provided no way to get to BFGS’s internal hessian. I was getting the result from hessian!(f, p) with BFGS() as the optimization method in my original code. Are you saying that was unreliable?

Perhaps you misunderstood my original “for free” remark. That wasn’t a reference to the approximation to the Hessian that (I think) BFGS builds up; it was referring to the fact that with automatic differentiation I didn’t need to work out the derivatives explicitly to get, in effect, analytic derivatives. So, since it was easy to get a second derivative, I could try Newton.

I thought using autodiff=:forward for TwiceDifferentiable meant I was using automatic differentiation for all derivatives. Are you saying that’s not what the code I wrote will do?

I am calling those functions on f, a TwiceDifferentiable object that includes caches that apparently are used by the optimizers. hessian!, at least, will return values from the cache in some circumstances, which is how I got into this problem. Even hessian!! isn’t being called directly with the objective function, though presumably it gets there eventually. So your statement doesn’t seem quite right, but since you know a lot more about it than I do, maybe I’m misunderstanding.

You can get the internal BFGS inverse hessian approximation out, but it requires you to manually provide the cache types. I think it’s better to avoid doing it though.

Okay, then I think we’re talking past each other. I’m just saying that hessian!(f, p) is independent of the optimization routine. Even before running optimize, you can call hessian!(f, p) for whichever p you want.

Yes, I misunderstood then.

Sorry, I don’t know why I thought you said :finiteforward. Yes, that is indeed ForwardDiff.jl being used in the background.

I misunderstood what you meant by “using it with [insert solver]”. You’re just saying that you run optimize and then call hessian!! afterwards. My bad.

Thanks to @pkofod and @Tamas_Papp for your help and explanations.

So using hessian!! instead of hessian! solves the problem of highly asymmetric Hessians, and future releases may restore hessian! to a safe choice.

What about the minimal asymmetries that still occur? It seems odd to be getting “Hessians” that aren’t actually symmetric.

I guess those questions are for the AD package maintainers… :slight_smile:

No, that is to be expected (numerical error). Enforce symmetry with LinearAlgebra.Symmetric.

Well, I guess expectations depend on the expector. From the standpoint of statistics or real analysis, the results should be exactly symmetric except for degenerate cases I would not expect in practice.

From an implementation or numerical perspective, differences due to numerical error can only arise if elements on both sides of the diagonal are computed. I would expect that in the interests of efficiency and sanity of the results (that is, avoiding estimated Hessians that aren’t symmetric) the quantities of interest would be computed once only.

That should be handled on the package side, though. For example, with DiffEqDiffTools we at least use copytri! to enforce it: https://github.com/JuliaDiffEq/DiffEqDiffTools.jl/blob/master/src/hessians.jl#L123
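
(If you want the same thing on the user side, the unexported LinearAlgebra.copytri! does it in place, e.g. on the num_hess from the original code:)

    using LinearAlgebra
    LinearAlgebra.copytri!(num_hess, 'U')  # copy the upper triangle over the lower one, in place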

That is indeed something reasonable to expect, but even subsequent, ostensibly symmetry-preserving operations on initially symmetric matrices can mess this up.

I think that a Symmetric wrapper would be preferable (because of what I said above: you can enforce symmetry in your output, but it can get lost easily again).
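
For example (a sketch reusing num_hess from the original code; Symmetric reads only the upper triangle by default):

    using LinearAlgebra

    Hs = Symmetric(num_hess)  # wrapper: only the upper triangle is ever accessed
    vcv = inv(Hs)             # the inverse comes back as a Symmetric, so it is exactly symmetric
    sd = sqrt.(diag(vcv))     # standard errors computed from an exactly symmetric vcv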