Hi,
Semi-new user of Julia and Turing here! I have been using Flux together with Turing to sample from a small NN with NUTS. I have come across various discussions online about where to find information on divergences, but found very little in the official documentation. From reading those threads, I have some questions about how divergent transitions are handled and presented in Turing (highlighted in bold):

According to the thread Documentation for internals in Chains object · Issue #339 · TuringLang/MCMCChains.jl · GitHub, the numerical_error field in the chain object directly corresponds to divergences encountered during sampling (the Hamiltonian error exceeding the set threshold). That is, for each sample in the chain we get a 1.0 if a divergence occurred during the leapfrog steps, and 0.0 otherwise. **I assume that Turing (like Stan) never accepts the divergent sample as the new sample, and instead restarts the proposal process from the latest sample. Is that correct? And in that case, what happens if we have multiple numerical errors in a row? Does numerical_error count them up?** It seems I never get more than 1.0 here: even for chains hundreds of samples long with many numerical errors, the value never climbs above 1.0.
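For what it's worth, my understanding (hedged, since I couldn't find this spelled out in the docs either) is that numerical_error is a per-iteration flag, not a counter: it records whether *any* numerical error occurred during that transition's trajectory, which would explain why it never exceeds 1.0. A minimal sketch of reading it out of a chain, assuming the chain was produced by Turing's NUTS:

```julia
# Minimal sketch, assuming Turing.jl's NUTS; `toy` is just a dummy model.
using Turing

@model function toy()
    x ~ Normal(0, 1)
end

chain = sample(toy(), NUTS(0.65), 500; progress = false)

# :numerical_error is stored with the sampler diagnostics: 1.0 for
# iterations where the Hamiltonian error exceeded the threshold, 0.0
# otherwise -- so summing it counts divergent *iterations*.
n_divergent = sum(chain[:numerical_error])
println("divergent iterations: ", Int(n_divergent))
```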

Secondly, it seems that Turing provides (or at some point provided) warnings about numerical errors, as shown in this thread: Numerical errors in logit normal model using Turing.jl. **Can we get warnings during/after sampling (like in Stan), or can we only extract this information from the resulting chain object afterwards?**

Finally, during my experiments I suspect that some chains get stuck in a bad region of the posterior, as some take a very long time to complete. When sampling many chains simultaneously it would therefore be nice to be able to shut down chains that accumulate a lot of divergences. **Is there a way to get information about divergences from a chain during sampling, and in that case shut it down?**
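In case it helps anyone reading along: a safer alternative to aborting chains mid-run is to sample several chains and inspect per-chain divergence counts afterwards. A hedged sketch (the multi-chain indexing here assumes MCMCChains' usual iterations-by-chains layout):

```julia
# Sketch: run 4 chains, then flag heavily-divergent ones after the fact
# instead of killing them mid-run. `toy` is a dummy model.
using Turing

@model function toy()
    x ~ Normal(0, 1)
end

chains = sample(toy(), NUTS(0.65), MCMCThreads(), 500, 4; progress = false)

# chains[:numerical_error] is iterations x chains; sum down each column
# to get a divergence count per chain.
per_chain = vec(sum(Array(chains[:numerical_error]), dims = 1))
for (i, n) in enumerate(per_chain)
    println("chain $i: $(Int(n)) divergent iterations")
end
```

You could then drop or re-run the flagged chains by hand, which at least makes the selection explicit rather than baked into the sampler.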

I am even newer than you! Started using Turing today. I can only say that conditionally aborting chains will break the sampler's detailed-balance assumptions and bias your distributions. It is (of course) better to consider how to modify your likelihood or transform your variables to avoid such regions if you can. Your project sounds cool!

Thanks for the input, much appreciated!
I agree that avoiding those troublesome regions is the best solution! However, since I am using a neural net, completely avoiding divergences is (as far as I know) nearly impossible. When you say that aborting chains that diverge will bias the distributions, are you referring to the fact that we would then only include posterior samples from regions that are easy to sample from (that do not produce divergences)? Or is there something else at play here?

If I understand the discussion in this thread correctly (Are outputs with divergent transitions not at all useful? - #9 by llx - General - The Stan Forums), the presence of divergences means we no longer have a complete quantification of the posterior. So if our chains contain divergences, our sampling is already biased. But to your point, it seems to me that the posterior estimate becomes less biased if we include the chains that encountered divergences, rather than keeping only the chains without any.

Consider sampling from a Beta(α, α) distribution. In the limit as α goes to zero you get two δ-functions, at p = 0 and p = 1. With very small α, your sampler is supposed to get stuck around 0 and 1 in order to recover the very large relative mass around those points. If you terminate the sampling whenever it gets stuck, you will generate samples from a completely different distribution, one that isn't even "close" (in a KL-divergence sense) to the distribution you are targeting: it won't have mass at 0 and 1! Note that a logit coordinate transformation can remove these singularities and make sampling much better, except when α is exactly 0, which is pathological in a deeper sense (it is then a discrete distribution).
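You can see the mass piling up at the endpoints directly by drawing from a small-α Beta (this uses Distributions.jl; the 0.05 and the edge width 0.01 are just illustrative choices):

```julia
# Illustration: for Beta(0.05, 0.05) the vast majority of the mass sits
# within a hair of p = 0 and p = 1.
using Distributions

samples = rand(Beta(0.05, 0.05), 10_000)
near_edges = count(p -> p < 0.01 || p > 0.99, samples) / length(samples)
println("fraction within 0.01 of an endpoint: ", near_edges)
```

A sampler that bails out whenever it lingers near 0 or 1 would systematically miss exactly that fraction of the target.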

This is just to illustrate the terminating-introduces-bias point. Your case is a bit different, though, since you are building your own likelihood with a net. Maybe try something like a quadratic function with tanh activations to ensure a bounded output, so that your likelihood doesn't diverge when you generalize outside your training set, or a kernelized mixture of experts that falls back to a default concave (Gaussian) log-likelihood when none of your experts are confident. I don't know the details, but failure to mix isn't Turing's fault, because NUTS is very good; it's the fault of the log-likelihood you built. At the very least, it has to go to minus infinity at the data boundaries!
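One reading of the tanh suggestion, sketched in Flux (the layer sizes here are made up, and whether a bounded mean actually tames your divergences depends on the rest of your likelihood):

```julia
# Sketch of the "bounded output" idea: a final tanh keeps the network's
# prediction in (-1, 1), so a Normal likelihood centered on it cannot
# run off to infinity when extrapolating outside the training data.
using Flux

net = Chain(
    Dense(2 => 16, tanh),   # hidden layer
    Dense(16 => 1, tanh),   # tanh on the output bounds predictions
)

x = randn(Float32, 2, 5)    # 5 dummy inputs
y = net(x)
@assert all(abs.(y) .< 1)   # outputs stay strictly inside (-1, 1)
```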