Hamiltonian Monte Carlo gets stuck at tiny local maxima

Hello! This is my first time trying to perform Bayesian inference, and I have found some problematic behavior when using Hamiltonian Monte Carlo. I’d like some help in figuring out what may be going wrong.

I have a likelihood \mathcal{L}(m \mid \theta, \phi), where m is a (discrete) possible outcome of an experiment that is parameterized by spherical angles \theta \in [0,\pi] and \phi \in [0,2\pi]. I hand-coded the transformations that implement this domain restriction: for \theta I use \theta \mapsto \arccos(\cos\theta), while for \phi it is simply \phi \mapsto \phi \bmod 2\pi. These transformations are piecewise linear, so there is no need to worry about the Jacobian. For my prior I use the uniform measure on the sphere, with density \sin(\theta)/(4\pi).

Given a series of observations, which I assume to be independent, I need to sample from the posterior in order to obtain information about it. Note that my observations are simulated from a categorical distribution using Distributions.jl.
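To make this concrete, here is a stripped-down toy version of the setup (the real likelihood is more involved; `outcome_probs` below is just a placeholder that returns a valid probability vector for every (θ, ϕ), and the numbers are made up):

```julia
using Distributions

# Placeholder for my actual outcome probabilities; it only needs to return a
# valid probability vector for every (θ, ϕ).
outcome_probs(θ, ϕ) = [sin(θ / 2)^2 * cos(ϕ / 2)^2,
                       sin(θ / 2)^2 * sin(ϕ / 2)^2,
                       cos(θ / 2)^2]

# Hand-coded domain restriction: a triangle wave for θ, mod 2π for ϕ.
to_constrained(θ_raw, ϕ_raw) = (acos(cos(θ_raw)), mod(ϕ_raw, 2π))

# Uniform-on-the-sphere prior, up to an additive constant.
logprior(θ, ϕ) = log(sin(θ))

# Simulated observations and the log-likelihood of independent observations.
# log(p[m]) is the same as logpdf(Categorical(p), m), just written out.
θ_true, ϕ_true = 1.0, 2.0
data = rand(Categorical(outcome_probs(θ_true, ϕ_true)), 500)
function loglik(θ, ϕ, data)
    p = outcome_probs(θ, ϕ)
    return sum(log(p[m]) for m in data)
end
```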

First, I tried Metropolis-Hastings (AdvancedMH.jl). It worked great for this case, but I’m also interested in higher-dimensional versions of this problem, and this algorithm didn’t generalize too well (it needed too many samples to reach reasonable convergence).
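For reference, I call AdvancedMH roughly like this (simplified to the toy definitions above; the proposal scale is just an example value):

```julia
using AdvancedMH, Distributions, LinearAlgebra

# Log-posterior on the unconstrained parameters, reusing the toy
# to_constrained / logprior / loglik / data from the snippet above.
function logpost(x)
    θ, ϕ = to_constrained(x[1], x[2])
    return loglik(θ, ϕ, data) + logprior(θ, ϕ)
end

model   = DensityModel(logpost)
sampler = RWMH(MvNormal(zeros(2), 0.1 * I))  # Gaussian random-walk proposal
chain   = sample(model, sampler, 100_000)    # vector of transitions
```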

Then I went on to try Hamiltonian Monte Carlo (AdvancedHMC.jl and DynamicHMC.jl), which, in theory, circumvents many of the problems of Metropolis-Hastings, but now I see some strange behavior. Most of the time, I get the expected result, which is shown in the figure below:

In the background we have the rescaled posterior calculated over a grid of parameters. The black dots are samples from the distribution and the yellow cross is the true value of my parameters \theta,\phi.

Then, with the same set of observations, I run another chain and might get this sort of result:

or even this

It is as if HMC “saw” my posterior as multimodal, which, as the figure shows, isn’t the case.

As I mentioned, this only shows up in HMC (both AdvancedHMC.jl and DynamicHMC.jl show the same behavior). When using AdvancedMH.jl, everything works great. I’m pretty sure that the same posterior is being used in all cases, as I’m using the LogDensityProblems.jl interface.
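Concretely, the HMC runs follow the pattern below, with the same toy definitions as above wrapped in the LogDensityProblems.jl interface (`SpherePosterior` is just a placeholder name; the AdvancedHMC calls mirror its README example):

```julia
using AdvancedHMC, LogDensityProblems, ForwardDiff

# The log-posterior, wrapped in the LogDensityProblems.jl interface so the
# same definition is shared by every sampler.
struct SpherePosterior{D}
    data::D
end

function LogDensityProblems.logdensity(p::SpherePosterior, x)
    θ, ϕ = to_constrained(x[1], x[2])
    return loglik(θ, ϕ, p.data) + logprior(θ, ϕ)
end
LogDensityProblems.dimension(::SpherePosterior) = 2
LogDensityProblems.capabilities(::Type{<:SpherePosterior}) =
    LogDensityProblems.LogDensityOrder{0}()

# NUTS-style sampling with AdvancedHMC, gradients via ForwardDiff.
ℓ = SpherePosterior(data)
n_samples, n_adapts = 2_000, 1_000
metric      = DiagEuclideanMetric(2)
hamiltonian = Hamiltonian(metric, ℓ, ForwardDiff)
x0          = randn(2)
integrator  = Leapfrog(find_good_stepsize(hamiltonian, x0))
kernel      = HMCKernel(Trajectory{MultinomialTS}(integrator, GeneralisedNoUTurn()))
adaptor     = StanHMCAdaptor(MassMatrixAdaptor(metric), StepSizeAdaptor(0.8, integrator))
samples, stats = sample(hamiltonian, kernel, x0, n_samples, adaptor, n_adapts)
```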

I’d greatly appreciate it if someone could help me figure out what is going wrong. As already mentioned, this is my first contact with Bayesian inference, so there may be some basic facts I’m not familiar with. Is there at least some sort of diagnostic to check whether the sampling was indeed correct? As the setup is reasonably complex, I’m a little reluctant to put together an MWE, but let me know if you think you need more information.

Thank you very much!

The second image looks like it’s random-walking because the chain got initialized in a flat region of the posterior, i.e., there is little or no gradient to follow up there. The last image looks like it started out in the flat region but eventually jumped to the center.

What I found a little odd is that, in both images 2 and 3, the samples seem to lie along the same wrong upper region. If it were simply a question of a vanishing gradient, there would be no preference for this particular region, right?

I wondered that too, but I didn’t see points anywhere else, so I figured you must have initialized the chain at the same spot.

Can you run it with multiple chains initialized randomly?

That is already the case for the images shown. Each image is a different chain, initialized randomly.

I just found a possible cause: instead of showing the posterior, let’s look at the log posterior:

There seem to be these tiny local maxima where the samples get stuck… Any way to deal with that?
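For reference, the grid in the figure is just the log-posterior evaluated pointwise (continuing the toy sketch from my first post; the grid endpoints are nudged inward to avoid log(0) at the poles):

```julia
using Plots

θs = range(1e-3, π - 1e-3; length = 200)
ϕs = range(0, 2π; length = 200)
logpost_grid = [loglik(θ, ϕ, data) + logprior(θ, ϕ) for θ in θs, ϕ in ϕs]

heatmap(ϕs, θs, logpost_grid; xlabel = "ϕ", ylabel = "θ", title = "log posterior")
```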

I’d guess it must have randomly ended up in the same area then. Or perhaps it’s an issue with the RNG (i.e., reusing the same seed for initialization).

But it also isn’t clear what exactly the points in the plot represent. E.g., are these only accepted samples, or does the plot also show rejected ones? Does it exclude points from the adaptation and/or burn-in stages?

From what I can see, that behavior doesn’t look too strange, but if it keeps ending up in that same region up there no matter where you initialize, that would be odd. It would also be odd if it only gets stuck right there and not at comparable spots all around the posterior maximum.
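E.g., something along these lines (an untested sketch; it assumes the post-warm-up draws of each chain are stored as a vector of 2-element parameter vectors, here called `chain_list`, and uses MCMCDiagnosticTools.jl): a chain stuck in that side region should show up as R̂ noticeably above 1 and a low effective sample size.

```julia
using MCMCDiagnosticTools

# chain_list: vector of chains, each a vector of 2-element parameter vectors
# (post warm-up draws only, equal length per chain). Stack into the
# (draws, chains, parameters) layout that MCMCDiagnosticTools expects.
function stack_chains(chain_list)
    mats = [reduce(hcat, ch) for ch in chain_list]  # each: parameters × draws
    arr  = cat(mats...; dims = 3)                   # parameters × draws × chains
    return permutedims(arr, (2, 3, 1))              # draws × chains × parameters
end

draws = stack_chains(chain_list)
rhat(draws)  # values well above ~1.01 flag disagreement between chains
ess(draws)   # a stuck chain also drags the effective sample size down
```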

Ah, check out the last animation here:

still, HMC has problems with sampling from distributions with isolated local minima. Investigate the last distribution at low temperatures: the ‘puck’ doesn’t have enough energy to jump over the energy barrier from the first minimum to the second.
Hamiltonian Monte Carlo explained

I also wonder if you actually care about the posterior, or is what you really want an optimization algorithm?

The points are only the accepted samples after the adaptation/burn-in stage.

So a possible solution would be to initialize many chains with different initial guesses and check which one of them has the lowest average energy, I suppose.

Is the tempering approach mentioned in your link already implemented in Julia?

My task is to guess the true parameters that produced the observed outcomes, so I think I need the posterior. I’m using the posterior average as an estimator.

I actually haven’t tried any of the Julia MCMC packages yet. There should be some option like increasing the step size (or something equivalent). But if you just want that central mode, it’s more of an optimization problem, and MCMC is probably not the most efficient way to find it.

The underlying problem is that the data looks like it was generated from a process different from your model.

Afaict, your choice of likelihood assumes a unimodal data-generating process determined by exactly two parameters. Are you sure that is something you want to assume? If some other kind of process generated the data, do your parameter estimates mean anything? I.e., whatever parameter estimate you get is conditional on the assumption that you have specified the correct model.

Of course, it is possible that your model is correct and you just happened to get strange data. But in general, finding multiple modes like this should prompt you to explore other models.

It’s really hard to say what’s going on here without an MWE.
In any case, circular parameters can be tricky, and the posterior on the unconstrained space is actually multimodal: e.g., 0.1, 0.1 + 2π, … all map to the same constrained value under mod 2π. Furthermore, without a proper prior on the unconstrained space, the posterior might not even be defined.
While this might not be the problem in your example, it could further complicate the situation. Hard to say without an MWE…
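To make that concrete with the toy log-posterior sketched earlier in the thread: every 2π shift of an unconstrained coordinate maps to the same constrained point and hence the same density, so the unconstrained target has infinitely many copies of each mode.

```julia
# Reuses the toy to_constrained / loglik / logprior / data from the first post.
function logpost_unc(x)
    θ, ϕ = to_constrained(x[1], x[2])
    return loglik(θ, ϕ, data) + logprior(θ, ϕ)
end

logpost_unc([1.0, 0.1]) ≈ logpost_unc([1.0, 0.1 + 2π])  # true: same constrained point
logpost_unc([1.0, 0.1]) ≈ logpost_unc([1.0 + 2π, 0.1])  # true as well
```

One common workaround is to put a proper prior directly on the unconstrained coordinates (something Normal-ish centred in the fundamental domain), accepting that the induced prior on the sphere is then only approximately uniform.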


This looks like periodicity-induced side lobes. Increase the concentration of your prior around π a little and they will go away. This is basically using your prior information that the periodicity is irrelevant…

Just a heads-up: this is in the works at MCMCTempering.jl. I had to put it on hold for a bit, but I’m about to return to it. Unfortunately it’s not quite ready for “public consumption” just yet, but it will be before too long :)

And I’d expect tempering to be quite useful in this scenario!

There is also Pigeons.jl, which is slightly more restrictive (though it has fantastic support for distributed workloads) and should also work with Turing.jl: Turing.jl model · Pigeons.jl. Maybe give this a try? :slight_smile:
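A rough sketch of what that would look like, from memory of the Pigeons docs (so double-check the linked page); the Turing model is a toy version of the problem in this thread, with `outcome_probs` standing in for the real likelihood and flat priors instead of the exact uniform-on-the-sphere prior:

```julia
using Turing, Pigeons, Distributions

# Toy Turing version of the model: flat priors on (θ, ϕ) and a categorical
# likelihood. (A truly uniform-on-the-sphere prior would need an extra
# sin(θ) weight; omitted here to keep the sketch short.)
@model function sphere_model(data, outcome_probs)
    θ ~ Uniform(0, π)
    ϕ ~ Uniform(0, 2π)
    p = outcome_probs(θ, ϕ)
    for i in eachindex(data)
        data[i] ~ Categorical(p)
    end
end

# Parallel tempering via Pigeons; record = [traces] keeps the draws so they
# can be inspected afterwards.
pt = pigeons(
    target = TuringLogPotential(sphere_model(data, outcome_probs)),
    record = [traces],
)
```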


Just in case: Siddharth and I recently developed repelling-attracting HMC to better explore multimodal distributions, since traditional HMC is not designed for them. It comes with a Julia implementation here.
