I’m trying to figure out how to interpret the occasional failure of an optimization routine to move from a candidate point. Below (at bottom) is output from Optim using `NewtonTrustRegion`, with `autodiff = true`. Neither the objective value (log likelihood) nor the gradient norm moves in iterations 7 through 11, but then both start moving again. There are related behaviors (perhaps variants, perhaps distinct) where the problem simply gets stuck (LL and gradient norm never move), or gets stuck with the gradient norm returning `NaN` while the LL still gives a valid value.
I haven’t figured out how to write a MWE that replicates this behavior, and I am unsure how to troubleshoot it, or even whether it is actually a problem (though I think it is).
Here are important details:
The objective function is a nested logit, and so is very non-quadratic and not concave. My intuition is that a particular parameter `lam` enters choice probabilities like this: `P = exp(u1/lam) / (exp(u1/lam) + exp(u2/lam))`, and so there are overflow concerns as `lam → 0`, and this value does pop up during optimization. I was concerned about `NaN` values messing up gradient construction, so I had coded in `isnan(P)` logic to deal with this. However, I am unsure how `isnan` plays with `autodiff`. So I instead modified the choice probabilities to be `P = 1.0 / (1.0 + exp((u2 - u1)/lam))`, which seems fully robust to `±Inf` values from no more than one of `{u1, u2}`, but is not robust to, e.g., `{Inf, Inf}`.
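To illustrate the overflow concern with hypothetical numbers (the utilities and `lam` below are made up, not from my model):

```julia
u1, u2, lam = 500.0, 499.0, 1e-3

# Naive form: both exponentials overflow to Inf, and Inf/(Inf + Inf) gives NaN.
P_naive = exp(u1 / lam) / (exp(u1 / lam) + exp(u2 / lam))  # NaN

# Difference form: only the utility gap is exponentiated, so this stays finite.
P_diff = 1.0 / (1.0 + exp((u2 - u1) / lam))  # exp(-1000.0) underflows to 0.0, so P_diff == 1.0
```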
I am aware of LogExpFunctions.jl and use its functions where I can, but there is no function that necessarily works in all situations when trying to write choice probabilities. The objective function accumulates `ll += logeps(P)` to deal with `P = 0.0`, where, to disallow `-Inf`, I define:

```julia
function logeps(x::T) where T
    log(max(eps(T), x))
end
```
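One pattern I have tried with LogExpFunctions.jl, shown here only as a sketch for the two-alternative form above: accumulate the log-probability directly via `log1pexp`, so that `P` is never formed and `P = 0.0` never arises:

```julia
using LogExpFunctions  # for log1pexp(x) = log(1 + exp(x)), computed without overflow

# log P = log(1 / (1 + exp((u2 - u1)/lam))) = -log1pexp((u2 - u1)/lam)
loglik_term(u1, u2, lam) = -log1pexp((u2 - u1) / lam)

loglik_term(500.0, 499.0, 1e-3)  # ≈ 0.0, where the naive form would give log(NaN)
```

This still does not resolve the `{Inf, Inf}` case, though, and I don't know whether it interacts any better with `autodiff`.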
To summarize, I think my primary questions are:

1. What is the best `autodiff`-compatible way to deal with overflow in choice probabilities in a setting like this (e.g., nested logit)?
2. Is this definition of `logeps` compatible with `autodiff`?
3. It may not be that `autodiff` is causing this behavior. If not, what are candidates?
4. Does anyone have ideas for how I can troubleshoot this more? A lot of Optim and `autodiff` feels like a black box (I understand things in theory, but not always the implementation details).
5. Are there functions/behaviors to avoid when writing complex objective functions intended to be used with `autodiff`?
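For reference, the call producing the output below looks roughly like this (a sketch; `f` and `x0` stand in for my actual objective and starting values, and the exact spelling of the `autodiff` option may differ across Optim versions):

```julia
using Optim

res = optimize(f, x0, NewtonTrustRegion(),
               Optim.Options(show_trace = true);
               autodiff = :forward)
```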
Reference Optim output:

```
Iter     Function value   Gradient norm
...
     6     7.631247e+05     6.546539e+04
 * time: 22.29205012321472
     7     6.942308e+05     1.115398e+05
 * time: 26.592118978500366
     8     6.100967e+05     1.183649e+06
 * time: 31.020194053649902
     9     6.100967e+05     1.183649e+06
 * time: 31.09648108482361
    10     6.100967e+05     1.183649e+06
 * time: 31.188799142837524
    11     6.100967e+05     1.183649e+06
 * time: 31.308167934417725
    12     5.997016e+05     4.227758e+05
 * time: 35.722825050354004
```