MCMC landscape

So I think we are kind of mixing things up a little in this thread. Turing has 3 main components:

  1. A ~-based API
  2. Type-stable, AD-enabling book-keeping of parameters and their metadata
  3. Inference algorithms

Turing started with these 3 intertwined, but the “modern” Turing approach involves making standalone inference algorithms in modules and wrapping them to be compatible with the Turing book-keeping approach. This is the case, for example, with AdvancedHMC.jl. Some samplers are still fairly intertwined though, like the AdvancedSMC sub-module of Turing. 1 and 2 above are still heavily intertwined in the sense that 1 expands into all the boilerplate needed for book-keeping. There are no plans to separate them as of now.

However, there is a plan to abstract away all of the algorithms in Turing in terms of their requirements. The requirements can be defined as input arguments to the algorithm, like the log joint probability for HMC and RAM, or as functions to be overloaded for some front-end Model type. It was pointed out to us in https://github.com/TuringLang/Turing.jl/issues/788 that this is similar to the approach that POMDPs.jl takes. This is not a short-term goal though (i.e. a few weeks), more like medium-term.
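As a purely illustrative sketch of that “requirements” idea (none of these names are actual Turing or POMDPs.jl APIs), an algorithm would only ask for, say, a log joint density and an initial point, and a front-end would overload those for its model type:

abstract type LogDensityModel end

struct ToyModel <: LogDensityModel end

# The two "requirements" an algorithm such as HMC or RAM might declare:
logjoint(::ToyModel, θ) = -0.5 * sum(abs2, θ) # standard normal toy target
initial_point(::ToyModel) = zeros(2)

# A sampler written against this interface never touches the ~-based front-end:
m = ToyModel()
θ0 = initial_point(m)
logjoint(m, θ0) # the only calls a generic algorithm needs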

So for your specific use case, you want to call a single algorithm directly, so you have no use for the Turing front-end and book-keeping. You can implement the algorithm in Turing or as a standalone package (a question of maintenance), and then we can wrap it to be compatible with the Turing front-end, much like we do for AdvancedHMC. Other packages like Soss can then do the same. I hope this clarifies things a little.

4 Likes

As people mentioned, Turing is very hackable. So if you really want to do this now, you can hack a Turing model with a customized distribution to achieve it.

using Turing, Distributions

@model mymodel() = begin
  d = 1 # the actual dimension of your problem
  p ~ DummyPrior(d)
  1.0 ~ Target(p) # dummy observation; its logpdf carries the actual target
end

struct DummyPrior <: ContinuousMultivariateDistribution
  d # the dimension of your problem
end
Base.length(dp::DummyPrior) = dp.d
Distributions.logpdf(::DummyPrior, _::Vector{T} where {T}) = 0.0 # flat "prior"
Turing.init(::DummyPrior) = [0., 1.] # your initialization
Distributions.rand(dp::DummyPrior) = randn(dp.d)

struct Target <: ContinuousMultivariateDistribution
  p
end
Distributions.logpdf(t::Target, _) = our_log_target(t.p) # the log target itself

mf = mymodel()
alg = MH(1_000) # MH for 1_000 samples; you can also customize your proposal
chn = sample(mf, alg)
# If our_log_target is differentiable, you can also use HMC or NUTS, e.g.
# alg = HMC(1_000, 0.2, 3) # HMC for 1_000 samples with step size 0.2 and 3 leapfrog steps

The DummyPrior just plays the role of initialization and allows Turing to book-keep the parameter space, while Target carries out the computation of our_log_target. However, the problem with working with our_log_target directly is that you need to take care of the transformed space yourself. But I think that, in general, any model you could implement by hand as our_log_target could also be implemented using @model with Turing.
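For concreteness, a minimal stand-in for our_log_target could be an unnormalized standard normal log density (purely illustrative; substitute your own target):

our_log_target(p) = -0.5 * sum(abs2, p) # replace with your actual unnormalized log target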

5 Likes

There is another implementation of the affine invariant sampler here: https://github.com/mauro3/KissMCMC.jl (and also an MH sampler). I’m pretty happy with the affine invariant sampler, and it usually requires zero tuning. Although by the sounds of it, RAM might be better for multi-modal distributions.

1 Like

As several people were talking about ensemble-based methods, I implemented the Differential Evolution Markov Chain approach of ter Braak and Vrugt (https://github.com/chrished/DEMC.jl). There is basically no tuning needed; just provide an initial matrix of samples that is somewhat dispersed.

If you have many dimensions it might not do well (similar to Goodman & Weare’s algorithm), see Typical Sets and the Curse of Dimensionality

As some of you seem to use ensemble-based methods in practice: are you using them because you cannot provide gradients, which rules out HMC? Or is there a reason why ensemble methods are better than the mc-stan post I linked lets on?

1 Like

AFAIK the main advantage is no tuning, compared to MH & variants. This is great when the posterior is “friendly”; it may not be when it isn’t. But in the latter case tuning may be difficult anyway, so in practice it is rarely done. If one is willing to accept a 20–80% efficiency drop, a mixture of proposal matrices sometimes helps. Textbooks from the late 1990s are full of these tricks.
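As a rough illustration of the mixture-of-proposals trick (a generic sketch, not any package’s API; the target, covariances, and weights below are all placeholders):

using LinearAlgebra

# Random-walk Metropolis where each step draws its proposal from a mixture of
# covariance matrices, e.g. a "small" and a "large" one.
function mh_mixture(logtarget, x0, covs, weights; iters = 10_000)
    chols = [cholesky(Symmetric(Matrix(C))).L for C in covs]
    x, lx = copy(x0), logtarget(x0)
    chain = Vector{typeof(x0)}(undef, iters)
    for i in 1:iters
        k = findfirst(cumsum(weights) .>= rand())  # pick a proposal component
        xp = x .+ chols[k] * randn(length(x))      # propose
        lp = logtarget(xp)
        if log(rand()) < lp - lx                   # accept/reject
            x, lx = xp, lp
        end
        chain[i] = copy(x)
    end
    return chain
end

# Toy usage: standard normal target, 80% small steps, 20% large steps
chain = mh_mixture(x -> -0.5 * sum(abs2, x), zeros(2),
                   [Matrix(0.1I, 2, 2), Matrix(2.0I, 2, 2)], [0.8, 0.2])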

IMO the bottom line is that for small parameter spaces you don’t care, for larger ones HMC/NUTS or approximations like variational inference are the only viable options.

1 Like

https://github.com/mcreel/Econometrics/blob/master/src/Bayesian/mcmc.jl does what you mention. It’s written mostly for teaching purposes, but I also find it useful in my research.

3 Likes

I’m a bit late to this question, but I was wondering, for any of the packages for HMC, can we mix metrics, i.e. diagonal for some parts of the model, dense for others? This seems like a major drawback in Stan, for example, when working with big heterogeneous models where a full dense metric just kills performance and we’d want something like 98% diagonal.

1 Like

AFAIK Stan uses diagonal metrics by default.

If you want to manually specify the adaptation, I can add this feature to DynamicHMC, just please open an issue.

1 Like

Hm, it’s not the adaptation I’d like to specify, but rather choosing diagonal and dense metrics for different parts of the parameter space. If in Stan I have

parameters {
  real alpha[5];
  real states[5, 1000];
}

where each alpha parametrizes transitions within the corresponding row of the states matrix, I’d put a diagonal metric on states but a dense one on alpha.
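Concretely, the kind of metric I have in mind is just block-structured; a toy sketch (all numbers and names here are placeholders, not any package’s API):

using LinearAlgebra

dense_block  = Matrix(1.0I, 5, 5)   # 5x5 dense (co)variance estimate for alpha
state_scales = ones(5 * 1000)       # per-coordinate variances for states

# Block-diagonal mass matrix: dense for alpha, diagonal for states.
# (In practice you would exploit the structure instead of storing it densely.)
M = cat(dense_block, Matrix(Diagonal(state_scales)); dims = (1, 2))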

Yes, but metrics come from adaptation, so it would need to know how to do that for custom ones.

This should be fairly easy to do, I would just need to think about the interface.

1 Like

ah, I knew I shouldn’t try to do it myself :slight_smile: Before opening an issue, I’ll come up with a toy problem that works better with a (full) dense metric than with a diagonal one, which would be a good test for a mixed-metric algorithm.

I believe AdvancedHMC (and Turing, by extension) was working towards that, but I don’t know the status of it. @Kai_Xu might know.

I am under the impression that for nontrivial models (say, > 50 parameters) with non-uniform curvature (i.e. basically everything except the multivariate normal :wink:), off-diagonal entries/correlations are more trouble than they are worth, since they can easily adapt to something accidental and then just hinder efficiency.

OTOH, not having a dense metric leads to a minor efficiency loss even in the theoretically ideal cases.

YMMV, and counterexamples would be interesting.

2 Likes

As with everything in statistics, yes and no :slight_smile: .

I don’t think it’s as much an issue of “accidental adaptation”, but rather that for very non-Gaussian distributions there is no good adaptation. Any Gaussian metric is equivalent to assuming constant curvature, and if the curvature is highly non-constant (as an extreme example, the target is discontinuous!), you can run into issues quickly.

But it’s worth noting that many of the parameters are often approximately Gaussian in the posterior. After all, maximum likelihood theory tells us that as the sample size gets large relative to the parameter space, the log likelihood approaches a quadratic function, i.e., a Gaussian distribution. So it is often reasonable to think that at least some subsets of the parameters are approximately Gaussian. If they are approximately Gaussian and highly correlated, you generally want to include the off-diagonals (assuming the Hessian doesn’t get too large).

For trickier problems, it can be a good idea to divide up your sampler into different chunks; for example, concurrently update all the Gaussian-like parameters in one sampler and all the non-Gaussian-like parameters in another sampler (or samplers). Doing this properly is hard in the abstract (i.e., without hand-tuning the sampler with prior knowledge), but there’s been at least one stab at the problem that I’m aware of. Admission of bias: I’m a coauthor on that paper. Also, note that the linked paper is not about HMC, but rather about adaptive partitioning of the parameter space for MH with block updates.

1 Like

Certainly, this is well known. The diagonal metric can fix some scale issues though, and NUTS does not appear to be very sensitive to the rest (when you can’t do anything about non-constant curvature), so I think that the Stan folks made the right call here.

By “accidental”, I meant adapting to the initially encountered region. NUTS usually recovers from this robustly though, so it matters little.

It was an interesting read, but it remains to be seen whether something like this is relevant for HMC/NUTS. Personally, I think that after you fix scaling, there are diminishing returns to improving a constant metric. I think that approaches like Betancourt (2013) are more promising.

3 Likes

Is there anything in Julia doing RHMC? It seems like it’s been years in the making in Stan, with difficulties around higher-order derivatives.

There is nothing for using RHMC in Julia right now.
I have been contemplating using the new Stan OCaml parser as a frontend for Julia. Then one could use the Stan language for model definition while running everything in Julia, essentially bypassing the Stan math library. What’s holding me back is the significant development effort, of course, and I don’t know how happy the Stan team would be about a project like this.

It just seems like Julia is perfectly suited for Bayesian inference and MCMC sampling, with nice AD, the new threading mechanism, and distributed parallelism, but there are probably unforeseen issues that you don’t have in C++.

I think Julia’s macro system would be better than relying on an OCaml parser. Why not do everything in Julia?
What advantage would that provide over Turing, Gen, Soss, or ProbabilityModels?

1 Like

The advantage would be that Stan is pretty much the standard for model definition. Many (I would guess even most) people doing Bayesian inference know how to write their models in Stan and already have many working models.

Of course it’s easy to translate models from one DSL to another (or even to just write them out without any DSL), but the field of Bayesian statistics is somewhat special in that the people using and writing these models have a vastly different background from those writing the tools, and I think having a somewhat common language, with the other ones as a nice addition, is beneficial for bringing Bayesian statistics to as many scientists as possible.

1 Like

I had a toy implementation based on DynamicHMC, but obtaining the Hessian was the bottleneck (I was using ForwardDiff). The current refactor of DynamicHMC & related libraries will keep open the possibility of adding an RHMC kernel, but the idea will become practical only when we can do Hessians fast, possibly with something like Zygote. The SoftAbs metric itself is relatively easy to implement.
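For reference, a rough sketch of the SoftAbs map itself (following Betancourt’s construction; ForwardDiff for the Hessian is just one option, and the names here are illustrative, not DynamicHMC API):

using LinearAlgebra, ForwardDiff

# Regularize the Hessian's eigenvalues with λ * coth(α * λ) so the metric is
# positive definite; α controls how "hard" the absolute value is.
function softabs_metric(logdensity, q; α = 1e6)
    H = ForwardDiff.hessian(logdensity, q)
    λ, Q = eigen(Symmetric(H))
    λ_soft = @. λ * coth(α * λ)   # near-zero eigenvalues need care (limit is 1/α)
    return Q * Diagonal(λ_soft) * Q'
end

# Toy usage on a standard normal log density
softabs_metric(q -> -0.5 * sum(abs2, q), randn(3))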