I have some autocorrelation issues in my chains and tried different things that didn’t help much. My issue is certainly internal to the model I used, where I have some parameter correlations.
I read some piece of work about thinning the chains. So I ran an important amount of chains with p steps and then combine the posteriors to have new longer chains and then randomly sample them to have at the end new chains of size p. It really helps to reduce the autocorrelation.
I wonder if it is still a valid way of Thinning and if it is okay to apply this?
The question you should be asking is why are you concerned about autocorrelation in the first place?
The typical goal when doing MCMC is computing expectations with respect to some target probability density, and autocorrelation does not matter for that. This means that a chain with less autocorrelation is as useful as a (accordingly longer) chain with autocorrelation.
The relevant metric here is the effective sample size.
Autocorrelation therefore only matters for computational efficiency: you need more draws to reach a target effective sample size and need to wait longer to get the results you are interested in.
Thinning was popular in a time when algorithms were inefficient (before Hamilton Monte Carlo) and computer memory was limited. The recommendation today is to only apply thinning if your samples would otherwise not fit into RAM, as it would just mean to throw away information.
There is one big caveat though and that is computational faithfulness: What you describe sounds like chains have not properly converged and the autocorrelation you observe is the result of pathological sampling behavior.
If this is the case no amount of thinning can help you because you have every reason to believe that you are not sampling from the distribution that you think you are.
You should then reevaluate your algorithms (are you using something other than HMC?) but more importantly take a close look at your model again and understand where that behavior is coming from (posting the code would help).
It is unfortunately difficult to give specifics here, I would take a look at a pairs plot of the relevant parameters first, for example you could see a funnel structure as one of your variance parameters goes towards zero.
The question you should be asking is why are you concerned about autocorrelation in the first place?
With my understanding, strong autocorrelation can be a result if a chain did not converge properly.
In my case, there are strong correlation between some model parameters, creating some identifiability issue inherent to the type of models I use. Then we can have some patterns in the chain due to this correlation between parameters, increasing autocorrelation of the chains. Actually, I used other models with the same data and did obtain good convergence of the chains with low autocorrelation, so I really think it is then related to the model parameters entanglement.
The typical goal when doing MCMC is computing expectations with respect to some target probability density, and autocorrelation does not matter for that. This means that a chain with less autocorrelation is as useful as a (accordingly longer) chain with autocorrelation.
The relevant metric here is the effective sample size.
Autocorrelation therefore only matters for computational efficiency: you need more draws to reach a target effective sample size and need to wait longer to get the results you are interested in.
I totally agree that I will have similar results with longer chains, but running these chains takes too much time, and it is why I decided to ran smaller chains and then combines them. When thinned, it helps increases ESS and then autocorrelation is lower.
Thinning was popular in a time when algorithms were inefficient (before Hamilton Monte Carlo) and computer memory was limited. The recommendation today is to only apply thinning if your samples would otherwise not fit into RAM, as it would just mean to throw away information.
There is one big caveat though and that is computational faithfulness: What you describe sounds like chains have not properly converged and the autocorrelation you observe is the result of pathological sampling behavior.
If this is the case no amount of thinning can help you because you have every reason to believe that you are not sampling from the distribution that you think you are.
You should then reevaluate your algorithms (are you using something other than HMC?
I am using NUTS with an acceptance ratio of 0.65.
I was considering changing to another sampler too or to modify the acceptance ratio.
When thinned, it helps increases ESS and then autocorrelation is lower.
I think you are mixing up what thinning means. It means to save every n th iteration (and discarding the others). You cannot take a long chain and then sample from that chain. Maybe this link to a chapter in the Stan documentation helps clearing things up: 16.4 Effective sample size | Stan Reference Manual, particularly section 16.4.4 on thinning.
The definition of effective sample size (and autocorrelation) depends on the lags, if you shuffle a chain around you are inflating your effective sample size estimates, because the computation of the effective sample size assumes that the iterations are given sequentially.
If you observe chains getting “stuck” at different regions of the parameter space, the high autocorrelation is a symptom, not the cause of your problem. If you have a posterior that is difficult to sample from, you probably have to bite the bullet and look for opportunities to reparameterize the model.
Maybe you could show some plots or model code so it becomes easier to see what’s going on.
I do not totally understand what is the difference given the link you just sent (section 16.4.4). By thinning I meant, you have a chain of let say n = 50.000 and you re-sampled it every 10 steps (final n = 5000), isn’t it what they define as thinning?
Attached an example of 7 chains. The ESS is low (~336) associated with high autocorr, but the chains converged to the same regions.
Yes, but this resampled chain will have a lower effective sample size than your initial longer one. Given k chains with n samples each, there is no procedure you can apply that will yield a higher effective sample size than just taking all your chains with all iterations.
But we are way beyond caring about effective sample size, your problem in this case is computational faithfulness.
Are both of these runs with the same data? The first run looks okayish, sampling isn’t great but an ESS of 336 should be informative enough for most use cases. Is your \hat{R} looking good for the first run (for all parameters)?
Can you tell my what kind of parameter that is in your model? Is it a variance parameter, is the model hierarchical?
Is this parameter modeled on the log scale? Do you get any divergent transitions?
A band-aid fix could be to increase your acceptance ratio to 0.99. This will make sampling slower, but maybe you’ll see better behavior during warmup.
I am not sure, what do you mean by “computational faithfulness”?
Yes the \hat{R} are all goods.
There are parameters from an ODE model. Yes, some model parameters are log-transform for the inference. And I don’t think I got any divergent transitions.
Attached is what I obtain with an acceptance ratio of 0.90, with 2000 burning period.
It just means “do you actually sample from the posterior that you think you are” and is one of the steps in a typical Bayesian workflow, for example see Towards A Principled Bayesian Workflow.
But that plot is interesting. With smaller step-sizes, you get even worse behavior.
I’m only guessing, but this might happen because the numerical accuracy of the ODE part of the model is not high enough and the ragged structure induced by the ODE leads to faulty sampling behavior. You could maybe tighten the tolerance for that? This might get better with a lower acceptance ratio, does it work better with an acceptance ratio of 0.5?
Another thing that might be useful to understand what’s going on are traceplots of the warm-up samples.
But maybe someone else could comment on that, I don’t have a lot of experience with ODE models.
Yes, I agree that the problem may come from the ODE itself. ODE can have stiff behavior with some small parameter changes, and then the sampling is bad. I am considering changing the ODE integrator or the tolerance. Another problem may be the fact that some parameters are sloppy and others are stiff and the correlation between these parameters create this weird sampling behavior?
I am going to try an acceptance ratio of 0.5
Here are a few other ideas. NUTS works best when the posterior distribution is unit multivariate normal. If your priors are very constrained, oddly shaped, or your likelihood is unusual, sampling might be difficult. You can plot your likelihood as a function of each parameter to see if it is irregular in some way.
If your model does not fit the empirical data well, convergence might be difficult. Try fitting to simulated data generated by your model. If it works with simulated data, but not for empirical data, your model might be misspecified.