MCMC Benchmark

In MCMC there is the perennial issue that there are no established benchmarks; see for example "Performance benchmarks for MCMC" on Cross Validated.

There are some related efforts in the community, e.g. StatisticalRethinkingJulia/MCMCBenchmarks.jl on GitHub (comparing performance and results of MCMC options using Julia), and internal tests in packages. I think the only way to approach this is to learn from the ML community and add a bit of a competitive element, where it's not enough to use your approach to beat your own half-assed efforts with an alternative method you have no stakes in:
there are established data sets and metrics, and you have to beat previous records set on those by people trying to do the best with their own methods.

I also happen to get the impression that https://discourse.julialang.org loves to optimise stuff…

Now I have this downright evil banana-shaped density (a 5-dimensional banana, to be precise), the hybrid Rosenbrock (https://arxiv.org/abs/1903.09556):

using Distributions

# Hybrid Rosenbrock density (Pagani, Wiegand & Nadarajah, 2019):
# one shared coordinate x₁ plus n₂ blocks of n₁ coordinates each,
# so the dimension is n = n₁·n₂ + 1; `b` is an n₂ × n₁ weight matrix.
struct HybridRosenbrock{Tμ,Ta,Tb}
    μ::Tμ
    a::Ta
    b::Tb
end

function Distributions.logpdf(D::HybridRosenbrock, x)
    n2, n1 = size(D.b)
    n = length(x)
    @assert n == n1*n2 + 1
    X = reshape(@view(x[2:end]), n2, n1)  # X[j, i]: block j, position i
    y = -D.a*(x[1] - D.μ)^2
    for j in 1:n2
        y -= D.b[j,1]*(X[j,1] - x[1]^2)^2           # each block hangs off x₁
        for i in 2:n1
            y -= D.b[j,i]*(X[j,i] - X[j,i-1]^2)^2   # chain within block j
        end
    end
    # log normalising constant: each quadratic factor integrates to √(π/coeff)
    C = 0.5log(D.a) + 0.5sum(log, D.b) - n/2*log(π)
    return y + C
end
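
For instance, a 5-dimensional instance (the shared coordinate plus n₂ = 2 blocks of n₁ = 2; the parameter values here are illustrative, not taken from the paper):

D = HybridRosenbrock(1.0, 0.05, fill(5.0, 2, 2))  # μ, a, b (2×2)
logpdf(D, randn(5))                               # log density at a random point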

I’d like to propose that as a benchmark problem.

I think the first thing to try is NUTS from TuringLang/AdvancedHMC.jl ("Robust, modular and efficient implementation of advanced Hamiltonian Monte Carlo algorithms"); I made a simple implementation (see the gist "Evil Rosenbrock density with nuts" on GitHub).
With that and some 3 minutes of effort:

[Figure: trace plots of the 5 dimensions and scatter plots of samples for pairs of dimensions.]
In black are the NUTS samples; the yellow points are samples from my Python sampler, which correctly manages to visit the tails of the distribution. Can I do better? I already tried some things, like rescaling the marginals by their standard deviations σ₀.

Gist: Evil Rosenbrock density with nuts · GitHub
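
For anyone who wants to reproduce this without opening the gist, here is a minimal sketch in the style of the AdvancedHMC README; the exact API has changed between versions, and the parameter values are illustrative, not necessarily those used in the gist:

using AdvancedHMC, ForwardDiff

# Target: the 5-d hybrid Rosenbrock from above (illustrative parameters).
D = HybridRosenbrock(1.0, 0.05, fill(5.0, 2, 2))
ℓπ(θ) = logpdf(D, θ)

n_dim = 5
initial_θ = randn(n_dim)
n_samples, n_adapts = 2_000, 1_000

metric = DiagEuclideanMetric(n_dim)
hamiltonian = Hamiltonian(metric, ℓπ, ForwardDiff)    # gradients via ForwardDiff
initial_ϵ = find_good_stepsize(hamiltonian, initial_θ)
integrator = Leapfrog(initial_ϵ)
proposal = NUTS{MultinomialTS,GeneralisedNoUTurn}(integrator)
adaptor = StanHMCAdaptor(MassMatrixAdaptor(metric), StepSizeAdaptor(0.8, integrator))

samples, stats = sample(hamiltonian, proposal, initial_θ, n_samples, adaptor, n_adapts)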

Also, input on how to quantify success here would be valuable.


See also https://github.com/stan-dev/posteriordb

I guess a good benchmark problem would provide an explicit prior and likelihood. Otherwise one rules out algorithms that are based not on the unnormalized joint distribution but on the prior and likelihood (such as elliptical slice sampling), or one at least introduces an undesired degree of arbitrariness by allowing different decompositions, which might make it difficult to compare results.

You could even factorise f(x) = l(x) p(x) with l(x) = f(x)/p(x) and try elliptical slice sampling.
Realistically, wouldn't that need a prior really close to the density f to work here? It's a different problem.
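
For reference, a single transition of standard elliptical slice sampling (Murray, Adams & MacKay, 2010) under such a factorisation could look like the following sketch; the function and variable names are mine, and loglik(x) = log f(x) − log p(x) for a zero-mean Gaussian prior p = N(0, Σ):

# One elliptical slice sampling transition for target f(x) = l(x)·p(x),
# where p = N(0, Σ) and loglik(x) = log l(x).
# Σchol is cholesky(Σ) from LinearAlgebra.
function ess_step(x, loglik, Σchol)
    ν = Σchol.L * randn(length(x))      # auxiliary draw from the prior
    logy = loglik(x) + log(rand())      # slice threshold
    θ = 2π * rand()
    θmin, θmax = θ - 2π, θ
    while true
        x′ = x * cos(θ) + ν * sin(θ)    # proposal on the ellipse through x and ν
        loglik(x′) > logy && return x′
        θ < 0 ? (θmin = θ) : (θmax = θ) # shrink the bracket towards θ = 0
        θ = θmin + (θmax - θmin) * rand()
    end
end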

This is at first sight an unrealistically hard example, but then it's not far from the situation we find ourselves in with high-dimensional problems, where the posterior also lives very close to a lower-dimensional subset (like here: a curve through 5-d space).

I plan to do a problem with a high-dimensional Gaussian reference measure next, though.

This is possible, but if it is not specified by the benchmark problem then it is a choice of the user, and one might pick a decomposition that is advantageous for the algorithm at hand. One could consider it an additional tuning "parameter" of the algorithm, but I assume it might make it more difficult to compare results of different algorithms.

BTW this construction is discussed in the paper about generalized elliptical slice sampling (https://jmlr.org/papers/volume15/nishihara14a/nishihara14a.pdf). The naive approach can lead to slow mixing if “the target distribution has much heavier tails than does the approximation”. Therefore in generalized elliptical slice sampling one uses a different construction with a scale-location mixture of Gaussians such that the target distribution is the marginal of an augmented distribution.


Ah, we need to choose one, you are right. For this example I suggest the following: you may use any prior that is independent across the 5 coordinates.

Maybe with lower bounds on marginal variance?

Yeah, that forces you to properly learn the scale. I would also have to think about whether one can cheat with an unbiased prior (exactly matching the means of prior and posterior), but in general such constraints may suffice.


I love this, and thanks for all the hard work! Do you have any similar comparisons of AdvancedHMC with Stan, DynamicHMC, and some other sampler like a boomerang sampler?

Maybe you should make a pull request implementing whatever you did differently in the Python sampler, if it makes such a huge difference on highly curved posteriors.

You are on point here: I also want to encourage others to playfully compete on a small set of well-defined problems, so that I have something to compare my own Boomerang/Bouncy/Zig-Zag sampler implementations against.


I guess there should be a pass/fail element, e.g. if you don't get the tails of the distribution it's an instant fail. Maybe a cut-off for acceptable R̂ (rhat) values as well.

Once you've demonstrated that you found the true posterior, speed should be the main measure. But I don't know if it's worth having a set number of samples, since some entries may be able to get results with a small number of samples. So maybe ESS per second.
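
To make that concrete, here is a sketch of such a scoring rule using MCMCChains; I'm assuming `samples` is an iterations × parameters × chains array and `runtime` holds wall-clock seconds, and the function and column names follow my reading of current MCMCChains, so treat the details as assumptions:

using MCMCChains

chn = Chains(samples)                        # iterations × parameters × chains
diags = ess_rhat(chn)                        # per-parameter ESS and R̂
pass = all(diags[:, :rhat] .< 1.01)          # example R̂ cut-off for pass/fail
score = minimum(diags[:, :ess]) / runtime    # worst-case ESS per second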

I have another suggestion: let all the samplers run for the same amount of time, then compare the K-L divergence from the true distribution to the samples. Because we know the Rosenbrock density exactly, we can calculate this easily enough, I think. You can vary this time, so we can plot a curve of time spent against K-L divergence for each sampler. ESS/second doesn't take goodness of fit into account, and "converged to the true distribution vs. didn't converge" is a dichotomy that might break down in some cases. Using a measure like the above lets you compare methods even in situations like "this sampler has a higher ESS, but spends less time in the tails than it should, so my estimates are biased but have less variance."
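
One way to make this concrete for at least one coordinate: for the hybrid Rosenbrock, the first marginal is exactly Gaussian, x₁ ~ N(μ, 1/(2a)), because every other factor integrates to a constant. A crude sketch comparing a kernel density estimate of the draws against it (KernelDensity.jl; `xs1`, `μ`, and `a` are assumed given, and this computes the divergence from the KDE to the exact marginal):

using KernelDensity, Distributions

k = kde(xs1)                               # KDE of the first-coordinate draws
p = Normal(μ, sqrt(1 / (2a)))              # exact marginal of x₁
dx = step(k.x)
# KL(q̂ ‖ p) ≈ Σ q̂(x)·(log q̂(x) − log p(x))·dx over the KDE grid:
kl = sum(q * (log(q) - logpdf(p, x)) * dx
         for (x, q) in zip(k.x, k.density) if q > 0)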

I have no idea how hard it would be to calculate a power of the Rosenbrock distribution exactly, but if it turns out not to be too difficult, that might be a decent measure for testing how accuracy varies for each method as curvature increases. You can plot a curve similar to the time one I described above using a variable α, where the distribution being sampled is f(x)^α for f(x) the Rosenbrock density. For instance, I expect NUTS to break down for fairly small α, while something like SMC should work even for large values of α (but will instead break down for larger numbers of dimensions).
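
Incidentally, for the hybrid Rosenbrock this power is available in closed form: raising the density to the α-th power multiplies every coefficient in the exponent by α, so f^α (renormalised) is again a hybrid Rosenbrock, and the normalising constant from the logpdf above still applies. A one-line sketch with the struct from above:

# f^α, renormalised, is the hybrid Rosenbrock with parameters α·a and α·b.
tempered(D::HybridRosenbrock, α) = HybridRosenbrock(D.μ, α * D.a, α .* D.b)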

My knowledge of MCMC and Bayesian methods is pretty basic, so I'll take your word for it that such a dichotomy doesn't always exist. I suppose it depends on who the target audience is here. I'd guess that a lot of people like me just want to know: "Does the thing work?" "Cool. How fast is it? I've got work to do." Maybe different measures will be useful for different problems?

Yeah, you're right that different measures will be useful for different problems. Specifically, the measure you use should depend on your utility function: in your particular use case, maybe overestimating the mean is a lot worse than underestimating it, for example. K-L divergence from the true distribution is just the most generic, well-rounded measure I could think of; it'll probably be "good enough" for almost any use case.

One more package to be aware of is PosteriorDB.jl. It provides a Julia interface for working with posteriordb, which includes a large number of Stan models, associated data, and gold standard reference posterior draws. Since the BridgeStan submodule in StanSample gives access to the log density and gradient of a Stan model, StanSample and PosteriorDB can be used together to test new inference methods on many models. The draws obtained by the inference methods can then be compared with the gold standard draws or with draws from another inference method using a preferred metric.
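
Basic usage looks roughly like this (method names from my reading of the PosteriorDB.jl README; treat the details as assumptions):

using PosteriorDB

pdb = PosteriorDB.database()            # bundled posteriordb snapshot
post = PosteriorDB.posterior(pdb, "eight_schools-eight_schools_centered")
code = PosteriorDB.implementation(PosteriorDB.model(post), "stan")  # Stan code
data = PosteriorDB.load(PosteriorDB.dataset(post))                  # model data
ref  = PosteriorDB.load(PosteriorDB.reference_posterior(post))      # gold-standard draws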

PosteriorDB isn’t registered yet, but I’d welcome input regarding the design and documentation of the package.
