ANN: A non-reversible rejection-free HMC sampler

Hej!

Have you ever wondered if momentum flips/refreshments are really needed in HMC, or if we could somehow avoid losing our sense of direction after each proposal step? Or even wondered if we could get rid of rejections in MCMC?

These questions and more get new answers in our paper on creating non-reversible, rejection-free, continuous-time Monte Carlo samplers.

Read it on arXiv: [2504.12190] Creating non-reversible rejection-free samplers by rebalancing skew-balanced Markov jump processes … or just try it out at

Let’s take, for example, the evil banana target as a Turing.jl model:

using Turing

@model function banana(μ, a, b)
    X ~ Normal(μ, 1/√(2a))
    Y ~ Normal(X^2, 1/√(2b))
    return (X, Y)
end
model = banana(1., 1/10, 100/10)

With a bit of glue code powered by LogDensityProblems.jl and DynamicPPL.jl:

using FFFSampler

# from paper
ϵ = 0.035 # stepsize
L = 20 # number of leapfrog steps 
λ = 0.04 # rate of momentum refreshments

fff = FFF(ϵ, L, λ)

chain_fff = @time sample(model, externalsampler(DiscreteFFF(fff,0.1)), 200_000)
chain_nuts = @time sample(model, NUTS(), 200_000)
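
If you want a quick side-by-side look at the two runs, something like this should do (a sketch using MCMCChains, which provides the summary tooling for Turing chains; nothing here is specific to FFFSampler):

using MCMCChains

# Posterior summary statistics for both chains (same tooling, so directly comparable)
summarystats(chain_fff)
summarystats(chain_nuts)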

So what does this sampler do?

  • It moves along level curves of the Hamiltonian like HMC, but it visits more than one point on each level curve, continuing in the same direction.
  • Internally it uses a continuous notion of time and spends a random amount of time at each point, which removes the need for rejections (here we call a wrapper DiscreteFFF to hide this from Turing); see the sketch right after this list.
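
To make the second point a bit more concrete, here is a minimal illustrative sketch (not the actual FFFSampler internals): every visited point comes with a random holding time, and expectations are time-weighted averages, so nothing ever has to be rejected.

# Illustrative only, not the package's code: estimate E[f(X)] from visited
# points xs and their random holding times Ļ„s by weighting each point with
# the time spent there.
timeweighted_mean(f, xs, Ļ„s) = sum(Ļ„s .* f.(xs)) / sum(Ļ„s)

# e.g. timeweighted_mean(identity, points, holding_times) for a posterior mean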

Try it, we are curious about your experience!

Erik, Moritz, Ruben, Akash

30 Likes

Ok, I will definitely try it! Any suggestion on how to tune the hyperparameters?

1 Like

If λ is large (at least on the order of 1) you are essentially back to HMC. So you can start with ϵ, L as for HMC and then try to decrease both L and λ by a factor. Our heuristic is that λ should roughly be the frequency of flips: you can’t keep going in the same direction forever, and at some point you have to turn back, so then you might as well try going somewhere else.
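
As a concrete starting point (the numbers below are just assumptions to illustrate the heuristic, not recommendations), reusing the FFF constructor from the example above:

# Sketch of the heuristic: take ϵ, L that work for HMC, start λ near the
# HMC-like regime (order 1), then shrink L and λ by some factor.
ϵ_hmc, L_hmc = 0.035, 20      # step size / number of leapfrog steps as for HMC
shrink = 5                    # hypothetical factor to decrease L and λ by
fff_tuned = FFF(ϵ_hmc, L_hmc ÷ shrink, 1.0 / shrink)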

You might now object that it’s quite hard to find ϵ, L that work in HMC… but our sampler works just fine if you happen to set these too small; it will just be slower :slight_smile:

PS: This is what the picture looks like for

ϵ, L, λ  = 0.035/5, 20/5, 0.04/25

(taking more samples to compensate for the smaller steps).

If you zoom in you can see that even with smaller and fewer leapfrog steps the trace doesn’t become diffusive.

You can actually see the level curves of the Hamiltonian in this picture.
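
If you want to produce this kind of trace plot yourself, here is a sketch (Plots.jl is an assumption on my side; chain_fff is the chain from the first post, and for the small-step picture you would re-run sample with the values above):

using Plots

# Plot the sampled (X, Y) trace; the level-curve-like excursions show up when you zoom in.
xs = vec(Array(chain_fff[:X]))
ys = vec(Array(chain_fff[:Y]))
plot(xs, ys; alpha = 0.3, label = "FFF trace", xlabel = "X", ylabel = "Y")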

2 Likes

I read the paper, but I’m not sure I understood much, since I’m not particularly knowledgeable about the theory of HMC.

Could someone explain to me, as a novice Bayesian who uses MCMC for model estimation: is this sampler better than NUTS? Is it faster? Do its samples have lower autocorrelation? Can it sample more difficult geometries? Does it generally provide higher ESS (which is related to lower autocorrelation)? I did notice the paper says ESS isn’t particularly meaningful in continuous time, but AFAIK I’m supposed to report ESS and \hat{R} in my research to demonstrate convergence, so perhaps ESS is useful in this context anyway.

The plots show the new method visits more of the target density compared to NUTS, which I guess is good. Is that a general property of the method?

6 Likes

I think this is hard to answer fairly. NUTS is amazing and does a lot of extra work for you, such as tuning the step size and adapting the metric… This extra work has an additional computational cost, but it is presumably worth it for practical purposes, since tuning HMC is not easy IMHO. Not to say you couldn’t build that on top of our work, but it doesn’t exist (yet).
That’s why we are comparing to HMC and not NUTS.

I would describe it as higher autocorrelation at low lags, but a faster decay at higher lags. It is definitely possible that the resulting integrated autocorrelation time (and so the actual asymptotic variance of your estimates) becomes lower, but it’s hard to generalize uniformly over hyperparameters, especially because you can more or less recover the behavior of HMC inside our sampler.
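
For readers less familiar with the jargon: the integrated autocorrelation time is Ļ„_int = 1 + 2 Ī£_k ρ(k), and the asymptotic variance of a chain average is roughly Var(f) · Ļ„_int / N. A naive sketch of an estimator (truncating the sum at a fixed lag is my simplification; proper estimators pick the truncation adaptively):

using Statistics

# Naive estimator of the integrated autocorrelation time of a scalar chain x.
function iact(x::AbstractVector{<:Real}; maxlag::Int = 200)
    n, x̄ = length(x), mean(x)
    c0 = sum(abs2, x .- x̄) / n                                  # lag-0 autocovariance
    ρ(k) = sum((x[1:n-k] .- x̄) .* (x[k+1:n] .- x̄)) / (n * c0)   # lag-k autocorrelation
    return 1 + 2 * sum(ρ(k) for k in 1:maxlag)
end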

For reversible samplers like HMC the goal you strive for is always uncorrelated samples (it’s just a lot of tuning to get there). Here, being correlated over short timescales doesn’t ruin your results, and it can even help you get out of tricky geometrical situations, because the non-reversibility allows you to combine many cheap, small steps over larger timescales.

In particular, comparing a continuous-time notion of ESS with a discrete one (e.g. directly to NUTS/HMC) seems impossible to do fairly. I think one can make sense of ESS in continuous time, and I know of papers that do define such metrics, but the estimators are AFAIK not as well studied as in the discrete case. If you want to use standard discrete ESS tooling you can discretize, but you would lose information and get an additional “fake” hyperparameter to mess around with.
On the other hand, you could definitely work out a sensible version of \hat R (at least with multiple trajectories).
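
For concreteness, the discretization I have in mind could look like this (a sketch under my own conventions; Ī”t is exactly the extra “fake” hyperparameter mentioned above):

# Turn a continuous-time trajectory (points xs with holding times Ļ„s) into
# samples on a regular time grid, so standard discrete ESS / R̂ tooling applies.
function discretize(xs::AbstractVector, Ļ„s::AbstractVector{<:Real}, Ī”t::Real)
    t_exit = cumsum(Ļ„s)            # time at which each point is left
    samples = eltype(xs)[]
    t, i = zero(Ī”t), 1
    while t < t_exit[end]
        while t ≄ t_exit[i]        # advance to the point occupied at time t
            i += 1
        end
        push!(samples, xs[i])
        t += Ī”t
    end
    return samples
end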

I realize I’m just giving the same caveat over and over, but it really is up to tuning. If an oracle magically told me the optimal hyperparameters you’d do approximately equally well with both NUTS and FFF. Now, NUTS out of the box doesn’t necessarily get there on tricky geometry, and our method gives you more leeway in picking hyperparameters.

In conclusion, we’d love it if you tried this “for fun” and told us how it worked for you. I’m not claiming that we magically beat established, battle-tested methodology with years of effort honing it.
I think a lot of “cool new sampler” papers hide how much effort can go into tuning, and I think it’s an important part of the story here that, while the asymptotic benefits of non-reversibility in MCMC might sometimes be overstated, there is definitely a case to be made for its robustness benefits.

5 Likes

Totally agree!