Gradient norm does not change in `Optim` using `autodiff`

I’m trying to figure out how to interpret the occasional failure of an optimization routine to move from a candidate point. Below (at bottom) is output from Optim using NewtonTrustRegion with autodiff = true. Neither the objective value (log likelihood) nor the gradient norm moves in iterations 8 through 11, but both start moving again afterwards. There are also related behaviors (perhaps variants of this one, perhaps distinct) in which the problem simply gets stuck (LL and gradient norm never move), or gets stuck with the gradient norm returning NaN while the LL still returns a valid value.

I haven’t figured out how to write an MWE that replicates this behavior, and I am unsure how to troubleshoot it, or even whether it is actually a problem (though I think it is).

Here are important details:

  • The objective function is a nested logit, so it is not very quadratic and not concave. My intuition is that a particular parameter lam enters the choice probabilities like this: P = exp(u1/lam) / (exp(u1/lam) + exp(u2/lam)), so there are overflow concerns as lam → 0. And this value does pop up in practice.
  • I was concerned about NaN values messing up gradient construction, so I had coded in isnan(P) logic to deal with this. However, I am unsure how isnan plays with autodiff.
  • So I instead modified the choice probabilities to P = 1.0 / (1.0 + exp((u2 - u1)/lam)), which seems fully robust to a ±Inf value in at most one of {u1, u2}, but is not robust to, e.g., {Inf, Inf}.
  • I am aware of LogExpFunctions.jl and use its functions where I can, but it does not have a function that necessarily works in every situation that arises when writing choice probabilities (see the sketches after this list).
  • The objective function accumulates ll += logeps(P) to deal with P = 0.0, where, to disallow -Inf, I define:
# Clamp x away from zero so that log never returns -Inf
function logeps(x::T) where T
    log(max(eps(T), x))
end
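
For reference, a minimal sketch of how I would probe question 2 below, assuming :forward means ForwardDiff (and assuming ForwardDiff's Dual type supports eps, which I have not verified):

using ForwardDiff

# Away from the clamp, logeps should differentiate like log;
# on the clamped branch, max(eps(T), x) is locally constant in x,
# so the derivative should come out as exactly zero.
ForwardDiff.derivative(logeps, 2.0)   # expect 1/2
ForwardDiff.derivative(logeps, 0.0)   # expect 0.0 (clamped branch)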
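
And on the overflow side (question 1 below), here is a sketch of one possible direction, which I have not tested in my code: accumulate log-probabilities directly via LogExpFunctions.log1pexp so that P is never formed.

using LogExpFunctions: log1pexp

# With P = 1 / (1 + exp((u2 - u1)/lam)), we have
#   log(P) = -log(1 + exp((u2 - u1)/lam)) = -log1pexp((u2 - u1)/lam),
# which avoids the intermediate exp overflow and never yields P = 0.0.
logP(u1, u2, lam) = -log1pexp((u2 - u1) / lam)

# the accumulation ll += logeps(P) would then become ll += logP(u1, u2, lam)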

To summarize, I think my primary questions are:

  1. What is the best autodiff-compatible way to deal with overflow in choice probabilities in a setting like this (e.g., nested logit)?

  2. Is this definition of logeps compatible with autodiff?

  3. It may not be autodiff that is causing this behavior. If not, what are the candidate causes?

  4. Does anyone have ideas for how I can troubleshoot this further? A lot of Optim and autodiff feels like a black box to me (I understand things in theory, but not always the implementation details).

  5. Are there functions/behaviors to avoid when writing complex objective functions intended to be used with autodiff?

Reference Optim output:

Iter     Function value   Gradient norm 

...

     6     7.631247e+05     6.546539e+04
 * time: 22.29205012321472
     7     6.942308e+05     1.115398e+05
 * time: 26.592118978500366
     8     6.100967e+05     1.183649e+06
 * time: 31.020194053649902
     9     6.100967e+05     1.183649e+06
 * time: 31.09648108482361
    10     6.100967e+05     1.183649e+06
 * time: 31.188799142837524
    11     6.100967e+05     1.183649e+06
 * time: 31.308167934417725
    12     5.997016e+05     4.227758e+05
 * time: 35.722825050354004

Do you have constraints? Perhaps the solver is working on decreasing infeasibility and the objective simply does not move.
My guess for the constant step norm is that 1.183649e+06 is the current trust-region radius, and that the three consecutive steps land on the trust-region boundary, making that constraint active.
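
One way to check that hypothesis (a sketch; I have not verified exactly which metadata NewtonTrustRegion stores in its trace):

using Optim

# Re-run with the trace stored and extended, then inspect the per-iteration
# metadata for something like the trust-region radius.
res = optimize(tdf, bni, NewtonTrustRegion(),
               Optim.Options(store_trace = true, extended_trace = true, g_tol = 1e-4))
tr = Optim.trace(res)     # per-iteration OptimizationState records
keys(tr[end].metadata)    # which extra fields the solver recorded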

Thanks @cvanaret. No explicit constraints (see NB below), though I do use the old-school lam = exp.(b[n:m]) trick to ensure some parameters are positive. So, as b[n:m] → -Inf, lam → 0, and hence my overflow worry.

But that’s only a couple of parameters; most parameters do not behave that way, so I’m surprised that the gradient norm does not move at all.
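
For what it’s worth, the direct check I have in mind is to bypass Optim at a stuck point (a sketch; b_stuck is a placeholder for the candidate vector at a stalled iteration):

using ForwardDiff

# Evaluate the gradient directly at the stuck candidate and look for
# NaN/Inf components that would implicate overflow in the lam parameters.
g = ForwardDiff.gradient(vars -> ll_emsimp(vars, choices, tt, hvec), b_stuck)
findall(!isfinite, g)     # indices of any NaN or ±Inf entries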

NB: I’ve never quite gotten the Fminbox syntax down in my standard use (which uses an anonymous function assignment plus autodiff), e.g.:

tdf = TwiceDifferentiable(vars -> ll_emsimp(vars, choices, tt, hvec), bni; autodiff = :forward)

optimize(tdf, bni, NewtonTrustRegion(), Optim.Options(iterations = 200, show_trace = true, show_every = 1, g_tol = 1e-4))
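
For completeness, my understanding of the Fminbox pattern is roughly the following, though I have not verified it (lower and upper are illustrative bounds, I believe Fminbox wants a first-order inner optimizer like LBFGS, and whether the autodiff keyword is forwarded here is an assumption on my part):

using Optim

# Sketch of a box-constrained call (bounds illustrative).
lower = fill(-10.0, length(bni))
upper = fill( 10.0, length(bni))
res = optimize(vars -> ll_emsimp(vars, choices, tt, hvec),
               lower, upper, bni,
               Fminbox(LBFGS()),
               Optim.Options(iterations = 200, show_trace = true, g_tol = 1e-4);
               autodiff = :forward)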