# Advices on convergence issues Turing

Hi all,

I am estimating ~10 ODE parameters using some data with Turing.
While I ran the sampling for quite a long time (~ 10.000 steps) for 10 chains, I got some convergence issues. After analyzing the posteriors, I realized that two of the parameters are certainly creating an issue since they are highly correlated and seem to not have a lot of impacts on the ODE.

corrplots:

traceplot in log space for one parameters for 10 chains:

Given these plots it seems that somes chains are stick in a space and that the sampling is not traversing the whole space. I use NUTS with an acceptance rate of 0.65. Given the information I have, I wonder if increasing the acceptance rate would help to solve the convergence issue? If yes, how much I can use for it?

Another possiblity would be to constrain the priors for these two parameters to one mode but it is not really the direction I want to take since I do not have enought information on them.

If you have any other advices on how I can solve this issue it would be great!
Thanks!

You have two problems. Yes, sampling is not converging, but also these two parameters are non-identifiable. Consider the most extreme case where the data informs only that the sum of two parameters is approximately a constant. They will follow a similar line to the one you are seeing, but because one can form that constant by shifting both parameters in opposite directions by the same arbitrary amount, your posterior variance for both parameters will be very high, and you wonāt be able to say much about them.

This is an example of the folk theorem of statistical computing, where a model problem manifests as a sampling problem, and the right solution here is almost always to reparameterize if possible. e.g. here you could try sampling one of the parameters and the difference between the two parameters. But maybe thereās a good reason they are correlated; maybe one depends on the other, or they both depend on some other parameter not being modeled. In those cases, reparameterizing can actually lead to a more informative model.

You can try increasing the acceptance rate, but this will likely make sampling slower and will not resolve identifiability issues. Large values like 0.9 or 0.99 are acceptable. But I would try this second.

Also, I noticed that your parameters have very low values. Typically, you want to reparametetrize so that parameters will be on similar scales in the posterior. Often this means using prior information to identify ātypicalā values of parameters. e.g. if sampling the radius of a star, donāt sample in meters. Identify a ātypicalā radius of a star, and sample the unitless radius of your star divided by that star. Otherwise, you need a longer warm-up period to learn that your parameters need to be sampled at very different scales.

Lastly, the trace getting stuck like that may indicate multimodality or divergences. Iād also check for what parameter values divergences are appearing (`chn[:numerical_error]`) those can give you a hint for any other problems.

4 Likes

Thanks a lot Seth for these meaningfull comments.

On parametrization, I completely agree with you! This model has some identifiability issues. But I already re-parametrized the model and reduced the number of parameters and wanted to check sampling issues first. I will try to modify the model again if my issue persit.

Sampling the differences between the two parameters seems a good idea indead, I will try that!

About the acceptance rate, do you have good documentation on that? I didnāt find any really advanced resources on the NUTS solver speaking about acceptance rate, especially on how to choose it depending on your problem.

For low values and scale, I log transform the parameters and perform the sampling in log space and with log of data as well in order to avoid scaling issue so I donāt think it is coming from this.

About Divergences, I read that reducing the step size is a good way to avoid it, correct? But NUTS isnāt dynamically changing the step size already? Is there a way to have a maximum step size implemented or something related?

On multimodality, do you any advices or resources on how to implement it in Turing for only 2 parameters and not the wole set?

Thanks a lot!

Thereās some info on this from the Stan docs: Brief Guide to Stanās Warnings. See also Diagnosing Biased Inference with Divergences and https://arxiv.org/pdf/1411.6669.pdf (Figure 2).

Reducing step size can be a good way to avoid it, but itās not guaranteed to. If for some reason you have essentially a discontinuity in your log density, divergences will appear for all step sizes. NUTS does indeed dynamically change the step size, and in fact the target acceptance rate (delta, or adapt_delta depending on the PPL) is used to adapt the step size (larger values will adapt smaller step sizes). No, thereās no way to set a maximum step size.

Multimodality is a property of the posterior distribution. What do you want to implement, precisely?

1 Like

Also, if youāre able to share your model or an MWE using simulated data that shows the same issues, I may be able to offer more useful advice.

2 Likes

With the help of Tor, I got an ordered constraint working which might be useful for your model too: Bayesian Latent Profile Analysis (mixture modeling) - Huijzer.xyz

Iām actually surprised that then NUTS run is able to finish here. NUTS usually has lots of problems with multimodality

Thanks Seth and Rikh,

@sethaxen, I already started implement some of the advices and I wonder if you have any idea on convergence diagnostic for multimodal proposals?

@rikh, really usufull link, thanks! Are you aware of any other sampler that are better than NUTS for dealing with multimodality? I am trying HMC right now and it seems indead a bit better as mentionned by your link.

All I know about it is sort of in that blog post. Iāve read that SMC should also be very good for dealing with multimodality somewhere at a Stan forum, but I found HMC more capable for the problem at hand then. When working in Pluto, I found it quite easy to test out different samplers and see what works best

For multimodal proposals, R-hat should be fine, but I donāt think it would find all failures to converge. But then again thatās not guaranteed even for unimodal. If you intend to pool the draws, then you need them to explore all modes equally well. If a given chain transitions only rarely between modes, then it may have sampled each mode well, but it has not well learned the relative weighting of the modes. The Bayesian workflow paper has some suggestions for what to do here: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf. Namely, if better priors and reparameterization donāt get rid of the multimodality, as long as each chain individually has a good R-hat, you could try a re-weighting scheme such as stacking to combine chains.

I suspect weāre conflating some things here. AdvancedHMC uses adaptation to 1) tune a āmetricā, a positive definite matrix which is used to attempt to reparameterize the actual posterior that is sampled to have a unit covariance matrix. The metric can be āunitā (un-adapted), ādiagonalā (just rescales all parameters), or ādenseā (also decorrelates them). 2) it tunes a step size, using a target āacceptance rate.ā

Consider a case that we have a bimodal bivariate posterior, say a mixture model with equally weighted bivariate normal distributions with their means on the x-axis and with a unit covariance that are close enough to allow (somewhat rare) transitions. During initial short warm-up phases, a chain probably enters a single mode and stays there. The step size adapts for exploration of that mode. Then after a short while, the sample covariance is taken, and that is used to set the metric. Now everything is well tuned to explore that mode. The next warm-up phase is longer, so it has a higher chance of suddenly escaping to a new mode. The step size is fine in this new mode, because this mode looks the same, but the sample variance in the x direction is much larger than the variance in the y-direction, so when the metric is updated, it compresses the posterior in the x direction. Now the step size is bad for exploring either mode, so it re-adapts. But note that the new compressed posterior has also lengthened the two Normal components in the y-direction, so that theyāre correlated, so theyāre harder to sample. This is why adaptation has a hard time with multiple modes.

All this generically applies to HMC. NUTS just adds a dynamic termination criterion to each HMC transition, so I canāt think of a reason for HMC to perform better than NUTS here. I suspect that the reason for the different performance is because in Turing `HMC` defaults to a unit metric (no metric), while `NUTS` defaults to a diagonal metric, which would raise the problems above. If thatās the case, Iād suggest using `NUTS` with a unit metric, which should be better than `HMC` on the same problem.

4 Likes

This was something we discussed on the Turing Slack recently (@cpfiffer). Nice that there is now an idea on what could be causing the difference

Thanks @sethaxen.

I think I am going somewhere with multimodality and still using NUTS. I tried using HMC to see if it is improving, but actually the convergence is worse.

Do you think the following is a good framwork for a general MCMC procedure with multimodality for one parameter for example:

1. Run an initial sampling using naive priors (unimodal).
2. Built a new prior for the multimodal parameter using the previous posterior with two distribution corresponding to the two modes:
``````Ī¼1  = LogNormal(log(1),0.1)
Ī¼2  = LogNormal(log(2),0.5)
Ī¼   ~ truncated(MixtureModel(LogNormal[Ī¼1,Ī¼2],[0.5,0.5]),0,5)
``````
1. new sampling using the MixtureModel distribution.

For now it is really improving the convergence even for a quick run. I still need to run it for a better amount of steps. I wonder if I should have a better weigth estimation using a Dirichlet distribution:

``````Ī¼1  = LogNormal(log(1),0.1)
Ī¼2  = LogNormal(log(2),0.5)
w   ~ Dirichlet(2,1)
Ī¼   ~ truncated(MixtureModel(LogNormal[Ī¼1,Ī¼2],w),0,5)
``````

Not sure if it is a conventional way of doing it but at least it is improving the convergence without solving the identifiability that is another problem to solve.

In general, Iād be very cautious creating a new prior from the posterior and then conditioning on the same data, as thatās effectively conditioning the same data twice. Thereās a subtle but important difference between this and the iterative approach to model building usually recommended. If you observe multimodality in your posterior that should cause you to re-examine the assumptions in the model, i.e. your prior information, which can lead to a new prior, but here you are not using the initial posterior draws to define the new prior; rather, they exposed a problem with your initial prior. A question to ask is whether one of the modes makes more sense than the other. If so, thereās likely information you left out of your model. If not, is there other information you didnāt use that could help inform which of the modes is more reasonable? Other data?

If not, thatās fine; sometimes a parameter really is non-identifiable. Then I think you want to look at methods like stacking.

1 Like

Thanks a lot Seth!