Okay, great! No divergences is good.
R-hats far from 1 when there are no divergences usually indicate that the posterior geometry is hard to sample, e.g. multimodality (which seems to be there from your posterior plot), high correlations, heavy tails, or even an improper posterior. Draws with a bad R-hat are still useful for diagnosing issues, but at this point I wouldn't put any stock in expectations computed from the posterior (e.g. the mean), even if they agree with what you would expect.
Before you try to fix this, it would be useful to know whether it happens only with real data or also with simulated data. Have you tried simulating data from the prior and likelihood and then fitting the model to it? Does that sample fine?
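Here's a minimal sketch of that check with Turing.jl, using a hypothetical stand-in model (substitute your own model and priors):

```julia
using Turing, Distributions, Random

Random.seed!(1234)

# Hypothetical toy model; replace with your actual model.
@model function mymodel(y)
    μ ~ Normal(0, 1)
    σ ~ truncated(Normal(0, 1); lower=0)
    for i in eachindex(y)
        y[i] ~ Normal(μ, σ)
    end
end

# Draw "true" parameters from the priors, then data from the likelihood.
μ_true = rand(Normal(0, 1))
σ_true = rand(truncated(Normal(0, 1); lower=0))
y_sim = rand(Normal(μ_true, σ_true), 100)

# Fit the model to the simulated data. If R-hats are fine here but bad on
# the real data, that points at a model–data mismatch rather than the model alone.
chain = sample(mymodel(y_sim), NUTS(), MCMCThreads(), 1_000, 4)
```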
As @dlakelan said, multimodality is best addressed through reparameterization. This is model-specific and not always possible, but it's worth looking into. The easiest kind to fix is where the modes exist due to a permutation/translation symmetry between parameters, e.g. adding x to one parameter and subtracting it from another produces the same log-density, or swapping two parameters does. Multimodality causes problems for sampling, yes, but it also makes interpretation of the posterior challenging due to non-identifiability. If reparameterization and all else fails, you could try replica exchange (parallel tempering). MCMCTempering.jl might be useful, but this won't help with interpretation.
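To make the translation-symmetry case concrete, here's a toy sketch (hypothetical model, not yours): only the sum a + b enters the likelihood, so the posterior has a flat ridge along a + b = const, and sampling the identified sum directly removes it.

```julia
using Turing

# Non-identified: only a + b appears in the likelihood, so adding x to a and
# subtracting x from b leaves the log-density unchanged.
@model function ridge(y)
    a ~ Normal(0, 10)
    b ~ Normal(0, 10)
    for i in eachindex(y)
        y[i] ~ Normal(a + b, 1)
    end
end

# Reparameterized: sample the identified quantity s = a + b directly.
# Its implied prior is Normal(0, 10√2), the sum of the two priors above.
@model function reparameterized(y)
    s ~ Normal(0, 10 * sqrt(2))
    for i in eachindex(y)
        y[i] ~ Normal(s, 1)
    end
end
```

For the permutation case (label switching, e.g. in mixture models), the usual fix is an ordering constraint on the exchangeable parameters.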
To check for high correlations, use pair/corner plots. High correlations can likewise often be addressed through reparameterization. Sometimes using a dense metric with NUTS can help here, as it not only attempts to equalize the scales of the parameters but also tries to decorrelate them for sampling.
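A sketch of both, reusing the toy model from above. I'm assuming ArviZ.jl's plot_pair here, and the metricT keyword is, as far as I recall, how Turing selects a dense mass matrix, so double-check it against your Turing version:

```julia
using Turing, AdvancedHMC, ArviZ

# Corner/pair plot of the posterior draws to eyeball correlations.
idata = from_mcmcchains(chain)
plot_pair(idata)

# NUTS with a dense Euclidean metric: adapts a full mass matrix
# (scales *and* correlations) instead of the default diagonal one.
dense_nuts = NUTS(1_000, 0.8; metricT = AdvancedHMC.DenseEuclideanMetric)
chain_dense = sample(mymodel(y_sim), dense_nuts, 1_000)
```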
Heavy tails are a bit more subtle. One way to check for them is to use the ArviZ.plot_energy function and look for an E-BFMI below 0.3 (bad). This is often accompanied by saturated tree depths (some trajectories never U-turn because they're stuck in the tails). Fixing it requires thinking some more about whether the priors and likelihood are sensible, though this isn't necessarily a model problem.
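Something like this, again via ArviZ.jl (the tree_depth internal and the max depth of 10 are Turing/NUTS defaults as far as I know; verify for your setup):

```julia
using ArviZ, Statistics

idata = from_mcmcchains(chain)
plot_energy(idata)  # the legend reports BFMI per chain; below ~0.3 is bad

# Fraction of transitions that saturated the tree (default max_depth is 10);
# a large fraction suggests trajectories wandering in heavy tails.
mean(chain[:tree_depth] .>= 10)
```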
An improper posterior is one whose density can't be normalized, i.e. it has infinite probability mass, so the chains will never converge. It happens, but I haven't encountered it in practice, so I can't give advice from experience here.