MAP optimization in Turing with LBFGS doesn't move at all

I’m using maximum_a_posteriori to find a MAP fit for a complex model that I can’t share at the moment.

If I run it with SAMIN() simulated annealing, or the BBO_de_... algorithms, it does move around and makes partial progress from a hand-picked starting point.

If I ask it to do LBFGS(), it runs a few iterations, maybe a single line search (I know because the model has some print statements), and then immediately returns the initial conditions unchanged.

It makes me think there’s a problem with computing the gradient or the Hessian or something, so that it thinks the current position is already an optimum. How do I debug this problem?

I have some further info.

By following procedures here: Recommended way to extract logprob and build gradients in Turing.jl?

I was able to confirm that yes, it does calculate the log density and the gradient at the initial condition, and that the printing I see is probably due to that initial gradient evaluation.

It still doesn’t work.

However, I’m running an IPNewton() run and it’s taking much, much more computing time… so I have hope that it will suddenly give me a meaningful answer! I will probably be crushed when it returns something meaningless, but I have been succeeding in debugging a lot anyway by using the BlackBoxOptim optimizers, thinking hard about the problem, and trying lots of things… so I’ll post back here and let people know if IPNewton worked.

Ok, IPNewton took many many minutes, and then returned the initial conditions.

Any suggestions for alternatives? I thought about trying NLopt algorithms, but every time I tried with lower and upper bounds it complained that it wouldn’t work.

Added info: now I can somehow run NLopt algorithms even with lower and upper bounds. I’m not at all sure why!

  1. LD_LBFGS aborts quickly with a failure to converge, similar to LBFGS() from Optim.jl
  2. LN_NELDERMEAD runs for a while and moves a couple of the axes, but in the end accomplishes little
  3. LD_TNEWTON fails
  4. LN_COBYLA runs a while; it didn’t do much, but it didn’t fail either

I tried various others and had very little in the way of success. I’m in the process of trying LN_BOBYQA at the moment.

Ok, crazily enough, LN_BOBYQA not only works but does so quickly and perfectly. It finds a perfect fit for my simulated data in tens of seconds to a minute or two. The residuals are chef’s kiss.

But it still makes no sense to me why LBFGS or many NLopt algorithms don’t work. I’m suspicious of the autodiff stuff giving wrong answers or something.

If you suspect autodiff is the issue, one minimal-effort check is to switch the autodiff backend and see what happens. You can do that like so:

maximum_likelihood(model, LBFGS(); adtype=adtype)

Here adtype would be e.g. ADTypes.AutoReverseDiff().
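For concreteness, a hedged sketch of a few backends one might rotate through (constructor spellings are from ADTypes.jl; the commented loop just shows where they would plug in, with `model` standing in for your Turing model):

```julia
using ADTypes  # assumes ADTypes.jl is available

# Backend markers the adtype keyword accepts; `compile=true` asks
# ReverseDiff to record and compile the tape once up front.
backends = (AutoForwardDiff(), AutoReverseDiff(), AutoReverseDiff(compile=true))

# for adtype in backends
#     maximum_a_posteriori(model, LBFGS(); adtype=adtype)
# end
```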

If you want to go a bit deeper, you could create the objective function for the optimisation like this, and then call the gradient function of a chosen AD backend on it to check that it’s correct. An example of how one might do that is this one.

Thanks! I definitely tried both ForwardDiff and ReverseDiff at different points. Under ReverseDiff I sometimes got an error about not being able to convert an irrational constant to a tracked real (I think it was log2pi).

I did not look into your problem at all, but here are a few basic questions that you should clear up:

  • are you certain your problem is differentiable?
  • if so, have you checked your gradient against simple finite differences at many points?
  • you’re not saying whether your problem has constraints, but you could try to check first-order optimality yourself at the initial guess;
  • have you tried other initial guesses (to see if solvers make progress from them)?
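On the second bullet, a minimal sketch of such a check on a toy function (the function, point, and tolerances here are invented for illustration; swap in the actual log density):

```julia
using ForwardDiff  # assumes ForwardDiff.jl is available

# Toy stand-in for the log density; any smooth function works here.
f(x) = sum(abs2, x) + sin(x[1])
x0 = [0.3, -1.2, 2.0]

g_ad = ForwardDiff.gradient(f, x0)

# Central finite differences, one coordinate at a time.
h = 1e-6
g_fd = [begin
            xp = copy(x0); xp[i] += h
            xm = copy(x0); xm[i] -= h
            (f(xp) - f(xm)) / (2h)
        end for i in 1:length(x0)]

# A large relative error in any component flags a broken gradient.
relerr = maximum(abs.(g_ad .- g_fd) ./ max.(abs.(g_ad), 1e-8))
```

At a handful of random points, relerr should be tiny (on the order of 1e-9 or smaller for a smooth function); components where it isn’t point at the offending part of the model.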

Methods for smooth problems such as LBFGS and IpNewton will be completely thrown off by inaccurate gradients.

Note that neither LBFGS nor IpNewton checks second-order optimality, so the Hessian has nothing to do with them stopping (LBFGS doesn’t even use it).

Is my problem differentiable? I believe so. It’s a several-hundred-parameter Bayesian model. The priors are all differentiable, the likelihood is Poisson and differentiable with respect to the rate, and the rate is computed from differentiable functions like the logistic, multiplication, division, exponentiation, and so forth, so I can’t think of any reason it wouldn’t be differentiable. In fact, there’s lots of data, so I pretty much expect the posterior to be asymptotically normal around most of the parameters. There are probably some significant correlations between the parameters, though.

I’ve tried from lots of different initial conditions in the past. Most recently I’ve been trying from hand-tuned initial conditions based on adjusting the parameters in an interactive Pluto session. These are decent but pretty obviously not optimal.

The “manual” autodiff gradient I computed by the method linked above has large components (on the order of a few hundred to a thousand), so the starting point is nowhere near a first-order optimum according to ForwardDiff. That matches the Pluto graphs: the fit is pretty obviously not perfect, just not way off.

I am constraining many of the parameters at the moment to be near zero while trying to get it to optimize the other parameters independently. So yes it has a big high dimensional box constraint.

I haven’t tried finite differences at a point; I only just discovered how to extract the logp and gradient yesterday. I can do that.

The fact that BOBYQA absolutely NAILS the fit in a minute or so tells me that this problem is essentially a big parabola, so I would expect it to be not hard to optimize. There have been some problems with the model in the past, but having debugged a lot of that, BOBYQA just nails the parameters to the wall, while LBFGS refuses to even start.

I’m suggesting finite differences to check that the gradient is accurate, not necessarily to use as a substitute for the actual gradient.

The fact that methods for derivative-free optimization (BOBYQA, SAMIN, BB, etc) make progress whereas methods for smooth optimization (LBFGS, IpNewton) don’t points to the gradient as a likely culprit; it could be inaccurate or not even exist. But that’s just a first hunch.

What implementation of LBFGS are you using? By default, LBFGS does not treat bound constraints. There is a variant, called LBFGS-B, that does.

Yes, I understood that, but thanks for making sure though.

Yes, I too think that BOBYQA making not just progress but amazing progress while LBFGS didn’t suggests a gradient issue. It also suggests that the function shouldn’t be “too hard” to optimize: BOBYQA uses quadratic approximations and goes BANG, right to the optimum, so I feel like this should be approximately a big normal distribution, probably with a dense covariance matrix, lots of correlated parameters, etc.

This is based on the maximum_a_posteriori interface in Turing, so underlying that it’s using Optimization.jl which specifies algorithms in terms of passing structs.

I can pass the LBFGS() struct, in which case I believe Optimization.jl uses Fminbox with LBFGS inside it… so Fminbox is basically adding penalties, running LBFGS, and then adjusting the penalties in a loop, I think.

It’s entirely possible that this Fminbox is the problem? I don’t know. But that wouldn’t explain why LD_LBFGS in NLopt doesn’t work, though perhaps there it wasn’t supporting the box constraints… There are many things that confuse me.
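For reference, a hedged, minimal sketch of the Fminbox-wrapping-LBFGS pattern in plain Optim.jl, on a toy box-constrained quadratic (the objective and bounds are invented for illustration):

```julia
using Optim  # assumes Optim.jl is available

# Fminbox runs LBFGS on a barrier-augmented objective inside the box,
# re-tightening the barrier in an outer loop.
f(x) = sum(abs2, x .- 0.5)
lower = fill(0.0, 3)
upper = fill(1.0, 3)
x0    = fill(0.9, 3)

res  = optimize(f, lower, upper, x0, Fminbox(LBFGS()))
xmin = Optim.minimizer(res)   # should land near fill(0.5, 3)
```

If this pattern works on a toy problem but not on the real model, the barrier loop itself is probably not the culprit.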

One thing I may try: fit the full model with fewer constraints using BOBYQA, find an optimum, perturb it a bit, and then run LBFGS from this perturbed optimum with NO constraints to see if it fits. Perhaps Fminbox is the real problem and LBFGS would work fine if it were started close enough to the proper optimum.

I’ll do some experimentation, but a 12 hour marathon session yesterday means I probably will take today off. I’ll be back in the next few days with any info I can figure out though.

Ok, so I ran this code:


using Printf, Random, Distributions             # @printf, rand, Normal
using LogDensityProblems, LogDensityProblemsAD  # logdensity(...); ADgradient lives in LogDensityProblemsAD in newer versions

function testgradient(model, init)
    vec = init.values.array
    ld = LogDensityFunction(model)   # from Turing/DynamicPPL
    ldg = ADgradient(:ForwardDiff, ld)
    (ldval, ldgrad) = LogDensityProblems.logdensity_and_gradient(ldg, vec)
    @show (ldval, ldgrad)

    println("testing gradient at initial condition: ")
    @show init.values.array

    dx = 1e-7
    for i in 1:length(vec)
        dvec = copy(vec)
        dvec[i] += dx
        lddy = LogDensityProblems.logdensity(ldg, dvec)
        gradi = (lddy - ldval) / dx
        @printf("gradient estimate for dimension %d is %.3f , ADest = %.3f, relerr = %.3f\n",
                i, gradi, ldgrad[i], (gradi - ldgrad[i]) / ldgrad[i])
    end

    println("testing at perturbed condition: ")
    vec = init.values.array .+ 0.01 * rand(length(vec))

    # NB: ldval and ldgrad below are still the values from the *initial*
    # condition, not the perturbed point, so these finite differences mostly
    # measure (logdensity(perturbed) - logdensity(initial)) / dx -- which is
    # why the estimates in the second block below blow up
    for i in 1:length(vec)
        dvec = copy(vec)
        dvec[i] += dx
        lddy = LogDensityProblems.logdensity(ldg, dvec)
        gradi = (lddy - ldval) / dx
        @printf("gradient estimate for dimension %d is %.3f , ADest = %.3f, relerr = %.3f\n",
                i, gradi, ldgrad[i], (gradi - ldgrad[i]) / ldgrad[i])
    end
end

and at the optimum found by BOBYQA I get

gradient estimate for dimension 1 is 17704.084 , ADest = NaN, relerr = NaN
gradient estimate for dimension 2 is 17875.809 , ADest = NaN, relerr = NaN
gradient estimate for dimension 3 is 21775.342 , ADest = NaN, relerr = NaN
gradient estimate for dimension 4 is 6974.457 , ADest = NaN, relerr = NaN
gradient estimate for dimension 5 is -229.566 , ADest = NaN, relerr = NaN
gradient estimate for dimension 6 is -6428.048 , ADest = NaN, relerr = NaN
gradient estimate for dimension 7 is 34741.851 , ADest = NaN, relerr = NaN
gradient estimate for dimension 8 is 18674.355 , ADest = NaN, relerr = NaN
gradient estimate for dimension 9 is 26003.116 , ADest = NaN, relerr = NaN
gradient estimate for dimension 10 is 26944.482 , ADest = NaN, relerr = NaN
gradient estimate for dimension 11 is 10235.465 , ADest = NaN, relerr = NaN

Then, at the perturbed value, where I add a small random perturbation to everything (second part of the function):

testing at perturbed condition: 
gradient estimate for dimension 1 is 17297751523.070 , ADest = NaN, relerr = NaN
gradient estimate for dimension 2 is 17297750512.030 , ADest = NaN, relerr = NaN
gradient estimate for dimension 3 is 17297754910.220 , ADest = NaN, relerr = NaN
gradient estimate for dimension 4 is 17297738140.167 , ADest = NaN, relerr = NaN
gradient estimate for dimension 5 is 17297733625.846 , ADest = NaN, relerr = NaN
gradient estimate for dimension 6 is 17297727349.401 , ADest = NaN, relerr = NaN
gradient estimate for dimension 7 is 17297768334.215 , ADest = NaN, relerr = NaN
gradient estimate for dimension 8 is 17297750531.370 , ADest = NaN, relerr = NaN
gradient estimate for dimension 9 is 17297758827.056 , ADest = NaN, relerr = NaN
gradient estimate for dimension 10 is 17297756425.504 , ADest = NaN, relerr = NaN
gradient estimate for dimension 11 is 17297743769.864 , ADest = NaN, relerr = NaN
gradient estimate for dimension 12 is 17297735312.730 , ADest = NaN, relerr = NaN
gradient estimate for dimension 13 is 17297731805.592 , ADest = NaN, relerr = NaN

umm… wow?

Then I tried changing the code to use an even smaller normal perturbation…

    println("testing at perturbed condition: ")
    vec = init.values.array .+ rand(Normal(0.0,.001),length(vec))

testing at perturbed condition: 
gradient estimate for dimension 1 is -28645839.783 , ADest = NaN, relerr = NaN
gradient estimate for dimension 2 is -28645392.270 , ADest = NaN, relerr = NaN
gradient estimate for dimension 3 is -28641709.860 , ADest = NaN, relerr = NaN
gradient estimate for dimension 4 is -28656684.783 , ADest = NaN, relerr = NaN
gradient estimate for dimension 5 is -28663605.792 , ADest = NaN, relerr = NaN
gradient estimate for dimension 6 is -28669807.431 , ADest = NaN, relerr = NaN
gradient estimate for dimension 7 is -28628833.286 , ADest = NaN, relerr = NaN

So, the gradients exist, but AD doesn’t calculate them. So it’s no surprise that gradient-based methods don’t work.

Why doesn’t ForwardDiff calculate gradients? What can cause that to go wrong?

I decided to try out AutoReverseDiff(false) with the normal perturbation:

gradient estimate for dimension 1 is 17704.084 , ADest = 17704.086, relerr = -0.000
gradient estimate for dimension 2 is 17875.809 , ADest = 17875.812, relerr = -0.000
gradient estimate for dimension 3 is 21775.342 , ADest = 21775.345, relerr = -0.000
gradient estimate for dimension 4 is 6974.457 , ADest = 6974.464, relerr = -0.000
gradient estimate for dimension 5 is -229.566 , ADest = -229.564, relerr = 0.000
gradient estimate for dimension 6 is -6428.048 , ADest = -6428.046, relerr = 0.000
gradient estimate for dimension 7 is 34741.851 , ADest = 34741.855, relerr = -0.000
gradient estimate for dimension 8 is 18674.355 , ADest = 18674.362, relerr = -0.000
gradient estimate for dimension 9 is 26003.116 , ADest = 26003.124, relerr = -0.000
gradient estimate for dimension 10 is 26944.482 , ADest = 26944.499, relerr = -0.000
gradient estimate for dimension 11 is 10235.465 , ADest = 10235.470, relerr = -0.000
gradient estimate for dimension 12 is 1808.442 , ADest = 1808.446, relerr = -0.000
gradient estimate for dimension 13 is -2605.592 , ADest = -2605.592, relerr = 0.000
gradient estimate for dimension 14 is -817.344 , ADest = -817.344, relerr = 0.000
gradient estimate for dimension 15 is -1745.760 , ADest = -1745.760, relerr = 0.000
gradient estimate for dimension 16 is -1324.381 , ADest = -1324.381, relerr = -0.000
gradient estimate for dimension 17 is -706.838 , ADest = -706.838, relerr = 0.000
gradient estimate for dimension 18 is -211.529 , ADest = -211.529, relerr = 0.000
gradient estimate for dimension 19 is 44712.491 , ADest = 44712.499, relerr = -0.000
gradient estimate for dimension 20 is 42743.151 , ADest = 42743.166, relerr = -0.000
gradient estimate for dimension 21 is 50166.946 , ADest = 50166.962, relerr = -0.000
gradient estimate for dimension 22 is 35201.962 , ADest = 35201.991, relerr = -0.000
gradient estimate for dimension 23 is 6663.269 , ADest = 6663.279, relerr = -0.000
gradient estimate for dimension 24 is -7842.856 , ADest = -7842.848, relerr = 0.000
gradient estimate for dimension 25 is -389.426 , ADest = -389.425, relerr = 0.000
gradient estimate for dimension 26 is 123.694 , ADest = 123.694, relerr = 0.000
gradient estimate for dimension 27 is -428.161 , ADest = -428.161, relerr = 0.000
gradient estimate for dimension 28 is 1898.894 , ADest = 1898.895, relerr = -0.000
gradient estimate for dimension 29 is 456.530 , ADest = 456.531, relerr = -0.000
gradient estimate for dimension 30 is -1991.917 , ADest = -1991.917, relerr = 0.000
gradient estimate for dimension 31 is -378.437 , ADest = -378.437, relerr = -0.000
gradient estimate for dimension 32 is -304.205 , ADest = -304.205, relerr = 0.000
gradient estimate for dimension 33 is -0.038 , ADest = NaN, relerr = NaN
gradient estimate for dimension 34 is -54.432 , ADest = NaN, relerr = NaN
gradient estimate for dimension 35 is 67.982 , ADest = NaN, relerr = NaN
gradient estimate for dimension 36 is -46.776 , ADest = NaN, relerr = NaN
gradient estimate for dimension 37 is 16.698 , ADest = NaN, relerr = NaN
gradient estimate for dimension 38 is 18.293 , ADest = NaN, relerr = NaN
gradient estimate for dimension 39 is 76.991 , ADest = NaN, relerr = NaN
gradient estimate for dimension 40 is -24.460 , ADest = NaN, relerr = NaN
gradient estimate for dimension 41 is 50.048 , ADest = NaN, relerr = NaN
gradient estimate for dimension 42 is -245.711 , ADest = NaN, relerr = NaN
gradient estimate for dimension 43 is 101.773 , ADest = NaN, relerr = NaN

So it calculates SOME of the gradient dimensions but not others.

How about at the perturbed location? There are MASSIVE relative errors between the finite differences and AutoReverseDiff, but it still has NaN from dimension 33 onwards.

testing at perturbed condition: 
gradient estimate for dimension 1 is -438338326.307 , ADest = 17704.086, relerr = -24760.162
gradient estimate for dimension 2 is -438338283.146 , ADest = 17875.812, relerr = -24522.307
gradient estimate for dimension 3 is -438334189.793 , ADest = 21775.345, relerr = -20130.839
gradient estimate for dimension 4 is -438349504.072 , ADest = 6974.464, relerr = -62851.636
gradient estimate for dimension 5 is -438356333.853 , ADest = -229.564, relerr = 1909511.965
gradient estimate for dimension 6 is -438362551.859 , ADest = -6428.046, relerr = 68194.309
gradient estimate for dimension 7 is -438321282.970 , ADest = 34741.855, relerr = -12617.519
gradient estimate for dimension 8 is -438337508.705 , ADest = 18674.362, relerr = -23473.690
gradient estimate for dimension 9 is -438329918.286 , ADest = 26003.124, relerr = -16857.818
gradient estimate for dimension 10 is -438329783.466 , ADest = 26944.499, relerr = -16268.876
gradient estimate for dimension 11 is -438345883.202 , ADest = 10235.470, relerr = -42827.161
gradient estimate for dimension 12 is -438354309.549 , ADest = 1808.446, relerr = -242393.802
gradient estimate for dimension 13 is -438358719.526 , ADest = -2605.592, relerr = 168236.655
gradient estimate for dimension 14 is -438356918.963 , ADest = -817.344, relerr = 536317.929
gradient estimate for dimension 15 is -438357869.928 , ADest = -1745.760, relerr = 251097.580
gradient estimate for dimension 16 is -438357376.581 , ADest = -1324.381, relerr = 330989.418
gradient estimate for dimension 17 is -438356813.879 , ADest = -706.838, relerr = 620164.968
gradient estimate for dimension 18 is -438356319.889 , ADest = -211.529, relerr = 2072324.635
gradient estimate for dimension 19 is -438311174.549 , ADest = 44712.499, relerr = -9803.878
gradient estimate for dimension 20 is -438313499.289 , ADest = 42743.166, relerr = -10255.587
gradient estimate for dimension 21 is -438305534.833 , ADest = 50166.962, relerr = -8737.936

These coefficients from 33 onward are used to construct ApproxFun objects, which create smooth Chebyshev polynomials in space to affect the predictions based on the spatial location of the flows we’re estimating.

So apparently you can’t differentiate through Chebyshev polynomials? That seems weird, because the polynomials are smooth functions of the coefficients.

EDIT:

At its core, ApproxFun doesn’t seem to be a major problem for ForwardDiff?

using ApproxFun, ForwardDiff

function f(v)
    (a, b, c, x) = v
    G = Fun(Chebyshev(), [a, b, c])
    G(x)
end

x = [1.0, 1.0, 2.0, 0.5]
f(x)
ForwardDiff.gradient(f, x)


Gives what looks like sensible answers.

It’s hard to be more specific without seeing details, but at least the above explains why derivative-based methods give up. Unfortunately, I don’t think Julia has a mechanism to abort when a NaN is generated. Could it be that something in your model gradient overflows? I very recently encountered a case where I could evaluate the objective at a point, but the gradient computation overflowed (with AD, but also when implemented manually).

Yep, could be something like that. I’d think you’d get Inf for overflow, but maybe there’s some kind of sqrt(-1) or similar. I’ll see what I can do to trace down what causes it.

Well, interestingly, if I take 2000 data points and run the BOBYQA optimizer, then my testgradient works. At 3000 it still runs… at 4000 it gives NaN gradients.

So somehow as the size of the dataset increases, the gradients eventually fail.

Our full problem has 800,000 rows :sweat_smile:

Not saying you’re as stupid as I am, but have you checked for NaNs in the full data set? I spent about 3 hours the other day working out what went wrong in an optimisation problem that worked on some of my data but not all of it, and it turned out there was an unexpected NaN in the data (stock prices, so a NaN value really didn’t make any sense at all…).

I guess this isn’t the case, as you seem to be able to solve the problem on the full data with some optimizers, but I thought I’d mention it anyway.

And to throw something else into the mix, this old discussion about potential ways of catching NaNs as they appear:
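In the same spirit, a hedged sketch of one low-tech way to do it: wrap the objective so it throws, with the offending input, the first time it returns a non-finite value (`nan_guard` is a hypothetical helper, not from that discussion):

```julia
# Hypothetical helper: guard an objective against silent NaN/Inf.
function nan_guard(f)
    return function (x)
        y = f(x)
        isfinite(y) || error("non-finite objective value $y at x = $x")
        return y
    end
end

# Example: 0.0 * Inf produces NaN, and the guard catches it.
g = nan_guard(x -> x[1] * exp(x[2]))
g([1.0, 0.0])      # fine, returns 1.0
# g([0.0, 1000.0]) # would throw: exp overflows to Inf, and 0 * Inf is NaN
```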


I’d think you’d get Inf for overflow

Yes, but 0 * Inf would give you NaN.
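A two-line illustration of that failure mode:

```julia
# exp overflows to Inf well before its argument looks alarming;
# a zero weight then turns Inf into NaN instead of 0.
x = exp(1000.0)   # Inf
y = 0.0 * x       # NaN
```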

I figured I’d better check for NaN in the data as you suggested, but no, there’s nothing.

It’s a little too early in the morning for me to think why something might overflow but I’ll look into it later today.


Don’t know if this helps, but I had another similar post about optimizers and finding NaNs. It turns out, maybe obviously, that you also need to protect against 0/0s, which produce NaNs on the derivative path of code that may be automatically differentiated by the optimizers. It gets tricky to find those, because you have to overload the check functions not just for normal floats but also for the dual numbers (or whatever else is happening in the automatic differentiation). I ended up triggering asserts when I found them and placing the checks throughout my code. Then I’d use the symbolic libraries to calculate the gradients or Jacobians to find the offending math, and then I’d be able to add protections. Things like caps were a problem, I recall, when hitting up against the limits. Stuff like that. I’m going from memory; I’d have to go dig up the posts.
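To make that concrete, a hedged sketch of the kind of guard being described; `safediv` is a hypothetical helper, written generically so the same method also accepts AD dual numbers:

```julia
# Plain 0/0 yields NaN, which then propagates through the whole gradient.
# Guarding the degenerate case keeps the derivative path finite; returning
# zero(x) for the 0/0 limit is a deliberate modeling choice, not a universal rule.
safediv(x, y) = (iszero(x) && iszero(y)) ? zero(x) : x / y
```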

This is really kind of driving me crazy. I started seeing behavior where the optimizer wouldn’t do anything, even BOBYQA; then I went for a walk with the dog, and it started working normally again.

Then I tried including the file and calling my function, and it worked… then I tried calling it again… and it DIDN’T work.

What I’ve traced this behavior down to is that it only works if I’ve called seed! before calling my function. If I put the seed! into my function, it consistently does the optimization properly.

At first I thought it does this no matter what seed I use (I can even use seed!(round(Int64, time()))), but it appears that’s not quite right (see below).

This is driving me a bit crazy, and has nothing to do with the gradients, which are still not working right

I tried running it in succession, always with 11000 data point subset, but different seed! values.

Random.seed!(1234890);(vals,modval) = runtest() ## this works
(vals,modval) = runtest() ## this one doesn't, it's equivalent to some unknown seed
Random.seed!(1229);(vals,modval) = runtest() ## this didn't work
Random.seed!(1221);(vals,modval) = runtest() ## this worked
Random.seed!(1222);(vals,modval) = runtest() ## nope
Random.seed!(1223);(vals,modval) = runtest() ## nope
Random.seed!(1224);(vals,modval) = runtest() ## nope
Random.seed!(1225);(vals,modval) = runtest() ## nope
Random.seed!(1226);(vals,modval) = runtest() ## YES
Random.seed!(1227);(vals,modval) = runtest() ## YES

So, roughly, it works with around 50% of seeds.

Ok, so I’m not sure if this has to do with how it’s subsetting my data (do some subsets of my data make things suck?) or maybe with how the BOBYQA optimizer works and it’s making a random decision differently (but then it probably has its own randomizer, not the Julia one?) or what.

My guess is maybe subsets of my data suck. (which is weird because I’m testing on synthetic data, the real data is not releasable and my collaborator has it)

Other note: if I perturb the initial conditions by a fair amount on seeds that “don’t work”, I’ve also found it “does work”… maybe. Basically weirdness, but possibly because I have “decent” hand-tuned starting points, some initial conditions appear to the optimizer as “sufficiently good” that it doesn’t try beyond just one step or something.

Another weirdness: if I choose a seed where it works, it works when I take a subset of 11000, 12000, 20000, or 22000 points. Each of these is a completely different subset, and they all worked; it wasn’t as if 50% of them failed the way 50% of seeds failed at a fixed data size… So it makes me think it’s NOT the data subsets. UUUUGH.

BOBYQA is a deterministic solver; to my knowledge, there is no randomization in it. Which implementation of it are you using? I recommend the one in PRIMA.jl, where bugs in the original BOBYQA have been fixed.

I was also wondering why the random seed matters. Is it only due to the sub-samples that are chosen? If the algorithm is non-random, and the objective function is also non-random, and a given sample is used, then the random seed should not matter, no?

On the other hand, if the objective function has random elements, that could perhaps explain the difficulties.