Unstable ODE within Turing model

I have a fairly complicated ODE system I am using within Turing for HMC sampling. Sampling seems to work fine: traces and metrics look good in general. However, when I extract a few random samples from the chain and run the same line of code I have in the Turing model, for some of the I get the following warning:

solve( prob, u0=ρ0, p=θ, saveat=Δt)
┌ Warning: Instability detected. Aborting
└ @ SciMLBase ~/.julia/packages/SciMLBase/NjslX/src/integrator_interface.jl:626

for any of these solutions, this normally means a considerable chunk of the solution is essentially zero, where I should have larger positive numbers.

So the questions are:

  1. are these samples being accepted despite having a low likelihood, and why?
  2. or alternatively: are different things happening when running the solver within versus outside of the Turing model?

I’d appreciate any comments or suggestions on how to ascertain what is happening.

You want to set them up to reject by checking the retcode like is shown in the tutorial.

It does a little bit more conservative. To check, you can just run the same solve outside of Turing and see how it goes.

Thanks for the quick reply, @ChrisRackauckas.

That’s all good, but since it’s not really an error, but a solution that is numerically valid, but clearly incorrect, this should be rejected for its low likelihood, or otherwise the solver should be able to generate a correct-looking solution for the set of parameters. Strangely, leaving it to its defaults generates the warning and the wrong solution (as does alg=AutoTsit5(Rosenbrock23()) or alg=AutoVern7(Rodas4)), but alg=BS3() generates a correct solution.

quote=“ChrisRackauckas, post:2, topic:115419”]
To check, you can just run the same solve outside of Turing and see how it goes.

That’s exactly what I said I’m doing, and hence the discrepancy.

I’m not sure I understand this, if it’s more conservative than whatever I specify, if I run it outside of Turing it will not be the same as within. Could you clarify?

Can you isolate that?

That’s required because the derivatives may require a stricter time step to hit tolerance.

I haven’t tried all combinations within and out of Turing, neither tested all solvers for all samples I extracted from the chain, but with Turing-sampled parameters taken out of Turing and used directly with DiffEqs, both auto-switch pairs mentioned fail to produce a stable solution in at least a few cases where BS3 works fine – I haven’t seen any one case where the opposite is true.
So a despite not being completely systematic about it, it seems like BS3 produces “good” solutions more consistently outside of Turing for “random” parameter sets.

That makes a lot of sense, but I would have to know what are the default tolerances for both DiffEqs.jl as a standalone solver, and what they become when inside Turing. Is that in their docs? Thanks.

There will be a giant review on how derivatives are calculated and that will describe it in more detail.

Did you try a lower tolerance? Is your ODE only differentiable 3 times?

Yes. That seems to help, but ideally more details would be needed to make the inner workings of the chain reproducible – if that is eventually going to be documented in detail it would be great. Thanks for the heads up.

I’m guessing you are bringing that up because of how some of the solvers do error control, and you are referring to the solution variable itself [e.g. with respect to y in \frac{dy}{dx} = f(y)]. In general it is not, it depends on the exact parameter values, but there’s a polynomial component which should have nonzero derivative indefinitely. In practice it’s a bit more complicated than that, so it is possible that it the derivative could be zero at some point of some of the equations, and it’s probably not possible to know that before the solution is actually computed.

Yes, so that would be why BS3 does better as higher order ODE solvers are assuming f is more differentiable which if false can effect the time stepping.

Yea, that’s a possibility, but it would be a real hassle to look into specific time points and specific parameter values to find out where and when the derivative may be vanishing, and that would just explain it, but not necessarily point to a general solution.

Still, it’s unlikely that it’s actually zero, but it may be underflowing and causing the whole thing to collapse. The pragmatic thing to do is probably to just adjust the tolerances to something that makes prevents the problem from happening as often as it does. Thanks again.