MCMC landscape

Any plans to also provide a simple MH sampler for this interface? And maybe this RAM sampler I linked to above?

I think it is great that there is so much advanced work going on in this space, but at the same time it would be great if it were simple to just write a log likelihood, pass it to a simple MH sampler, and get some results.

The MH sampler exists already as part of Turing’s core, and it’ll be incorporated as part of the interface refactoring. RAM probably won’t be implemented by the core team (at least for now), as we’re hoping to get more community involvement in sampler development.

Yeah, I’d agree – I don’t really know how you’d implement this using our planned extensions; perhaps @yebai has a better vision of how passing a simple likelihood function might work. Ultimately Turing’s models are just functions, so I suspect it’s entirely possible to do something like this, but I think I’m the wrong person to say how complex it might be for us to get this working performantly.

Yes, I can confirm this.

If your model is AD’able, you can get gradients using ForwardDiff, ReverseDiff, Flux, or Zygote (ordered from robustness to speed). There are examples in
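For example, getting a gradient with ForwardDiff is a one-liner; here is a minimal sketch with a stand-in log density (the name logdensity is just illustrative):

using ForwardDiff

# stand-in log density: an isotropic normal in three dimensions
logdensity(x) = -0.5 * sum(abs2, x)

g = ForwardDiff.gradient(logdensity, [1.0, 2.0, 3.0])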

Also, I would recommend looking at

where @goedman implemented a lot of models very nicely using CmdStan, DynamicHMC, TuringLang, and Mamba.

Note that MH is trivial to implement, but for nontrivial models it is tricky to tune, and then orders of magnitude less efficient than HMC methods. Unless getting gradients is actually impossible after all tricks are exhausted (e.g. integrating out discrete parameters), I would recommend that you invest in using a NUTS/HMC variant with any of the frameworks above.
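For the record, “trivial to implement” really does mean a handful of lines; a bare random-walk MH sketch (untested, names are just illustrative, and the proposal scale σ is exactly the thing you would have to tune by hand):

# Bare-bones random-walk Metropolis–Hastings. `logp` is the log target,
# `x0` the starting point, `σ` the (hand-tuned) proposal scale.
function rwmh(logp, x0; n_samples = 10_000, σ = 0.1)
    x, lp = copy(x0), logp(x0)
    chain = Vector{typeof(x)}(undef, n_samples)
    for i in 1:n_samples
        y = x .+ σ .* randn(length(x))   # random-walk proposal
        lpy = logp(y)
        if log(rand()) < lpy - lp        # Metropolis acceptance step
            x, lp = y, lpy
        end
        chain[i] = copy(x)
    end
    return chain
end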

It is not.

While it’s trivial to implement, I might make a mistake (much easier for that to go unnoticed in one-off code), it won’t have tests if I do it quickly myself, I won’t have any diagnostics, and so on. I think having a simple, robust, tested MH implementation available as a package would have many benefits.

Apart from the fact that we can’t use AD in our situation, we know that a sampler like RAM works well for our use-case. We have everything running with Klara.jl on 0.6, but we need to move things over to 1.0, and ideally we don’t want to revisit things like algorithmic choice. Even if things could be done more efficiently, we don’t need that at this point.

There is also AffineInvariantMCMC.jl, which has the log-density-function-in, chain-out behavior that you seek.

DiffEqBayes.jl is a library that lets people use Bayesian inference on ODEs without knowledge of those libraries.

What we’ve found so far is that Turing’s NUTS is the most stable, even more stable than the auto-generated Stan kernels. DynamicHMC diverges for us sometimes. Turing is much faster than it used to be. These tests show that it’s the thing to use, at least for now.

Soss.jl aims to provide a clean way to manipulate and combine models. While Turing gives more attention to the flexibility needed for “universal” probabilistic programming, I’m focusing on models with more structure: dimensionality (and supports, at least for now) should be known statically at inference time.

Given this, Soss can do things like

  • Condition on a variable (yielding a different model)
  • Find the prior or posterior predictive distribution, or the likelihood (all models)
  • Evaluate the log density
  • “Forward sample” from a model
  • Sample from the posterior (currently using @Tamas_Papp’s DynamicHMC.jl)
  • “Particle importance sampling” that can be used for variational inference, using @baggepinnen’s MonteCarloMeasurements.jl
  • Progress toward symbolic simplification using SymPy.jl, arriving at a form of the log density that can be evaluated more efficiently
  • Model dependencies, using @scheinerman’s SimplePosets.jl

Generally “inference” in Soss means generating source code specialized for a given model. But it’s all “normal” Julia code. This has some tradeoffs, but inspecting the generated code is very easy. One of my hopes is that this will be good for building code a user can hand-tune (if they want), and that the approach might also be helpful pedagogically.

Benefits:

  • The approach is relatively simple
  • Because a Model is very close to the user-specified code, inference is extremely flexible
  • It can already do some useful things
  • In principle, inference can be very fast
  • Very open to collaboration

Drawbacks:

  • Code generation is currently via @eval and invokelatest. Still hoping to find better alternatives for this.
  • It’s still in early stages, and is currently at the tail end of a complete refactor (the refactor branch)
  • Development has been in fits and starts. Hoping to find ways to improve this and move forward more consistently
  • The recent refactoring pass was nearly a “scrap it and start over”, kind of brutal and certainly uglied-up my git logs

I’m really selling it here :wink:

Oh, and if you’re after performance, @Elrod’s ProbabilityModels.jl is looking pretty great. Hoping to find ways to connect with his work soon as well :slight_smile:

The RAM algorithm looks like a relatively straightforward extension of MH. Perhaps take a look at how MH is implemented in Turing https://github.com/TuringLang/Turing.jl/blob/master/src/inference/mh.jl, and make something similar for RAM? There is a fair bit of boilerplate in there, but hopefully @cpfiffer’s PR will make this much more readable. If you can wait until JuliaCon, this PR may be merged by then. You can also ask questions on the #turing Slack channel or ask Cameron directly at JuliaCon :slight_smile:

I agree. Here’s the algorithm, from https://arxiv.org/pdf/1011.4381.pdf

All it requires is a density function on \mathbb{R}^n. There’s no reason for this to be specialized to a given library. Making it available as a standalone function would allow anyone to use it, and then @davidanthoff could have a choice of front-ends :slight_smile:
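To make that concrete, here is a rough, untested sketch of how I read the update in the paper (the function name, the η schedule, and the 0.234 target acceptance rate are choices of mine):

using LinearAlgebra

# Rough sketch of a Robust Adaptive Metropolis sampler.
# `logp` is the log density on R^n, `x0` the starting point.
function ram_sample(logp, x0; n_samples = 10_000, target_accept = 0.234, γ = 2/3)
    d = length(x0)
    x, lp = copy(x0), logp(x0)
    S = Matrix{Float64}(I, d, d)               # proposal shape factor
    chain = Vector{typeof(x)}(undef, n_samples)
    for n in 1:n_samples
        u = randn(d)
        y = x .+ S * u                         # propose
        lpy = logp(y)
        α = min(1.0, exp(lpy - lp))
        if rand() < α
            x, lp = y, lpy
        end
        # adapt S so the acceptance rate is coerced towards target_accept
        η = min(1.0, d * n^(-γ))
        M = S * (I + η * (α - target_accept) * (u * u') / sum(abs2, u)) * S'
        S = cholesky(Symmetric(M)).L
        chain[n] = copy(x)
    end
    return chain
end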

This is all incredibly helpful, thanks everyone!

It seems to me that at one level we need the equivalent of https://github.com/JuliaOpt/MathProgBase.jl for MCMC samplers, right? And then different modeling environments (Turing.jl, Mamba.jl, etc.) could all target that. And maybe one additional, new “modeling” package could simply provide an API where I can pass a log-target callback function.

Having said that, a standalone package with just the RAM sampler would probably help us in the short term already. Heck, one could probably extract it somehow from https://github.com/JuliaStats/Klara.jl/blob/master/src/samplers/RAM.jl

All the interface needed for this kind of sampler is a mapping from a location to a log-density value. LogDensityProblems adds some frills, such as a way to query the dimension.

IMO one could simply implement RAM to take a function, a starting point, and return a bunch of vectors. Not much value is added by some unified interface.

The two key statistics you want to check about mixing are \hat{R} and effective sample size. I could find neither in the RAM article linked above, so I am not sure it is a big improvement over naively tuned, jittered MH.
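If a sampler doesn’t report them, split-\hat{R} at least is quick to compute yourself; a from-scratch sketch (not taken from any of the packages mentioned), applied per scalar parameter:

using Statistics

# Split-R̂ for several equal-length chains of draws of a single parameter.
# Values much above 1 indicate poor mixing.
function split_rhat(chains::Vector{<:Vector{<:Real}})
    # split each chain in half, so within-chain drift also shows up
    halves = [c[1:div(end, 2)] for c in chains]
    append!(halves, [c[div(end, 2)+1:end] for c in chains])
    n = minimum(length.(halves))
    halves = [h[1:n] for h in halves]
    W = mean(var.(halves))         # within-chain variance
    B = n * var(mean.(halves))     # between-chain variance
    var_plus = (n - 1) / n * W + B / n
    return sqrt(var_plus / W)
end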

There are a couple of other reference implementations as well. Here’s one in R, and another in MATLAB, from the author of the original paper.

Turing.jl has that and a ton of samplers. You can even use DynamicHMC as the back end for Turing.

Alright, I looked around a bit more, and I really like the Turing.jl approach to samplers etc.!

So I think the one thing we would need to make all of this usable for us is an API like this:

function our_log_target(x)
    # Run whatever we need to run
    return res
end

chain = sample(MH(), our_log_target, initial_x = [1., 3., 5.])

# or

chain = sample(RAM(), our_log_target, initial_x = [1., 3., 5.])

I looked through the Turing.jl code, but it didn’t seem like that could be easily done, or at least I didn’t see how :slight_smile:

So I think we are kind of mixing things up a little in this thread. Turing has 3 main components:

  1. A ~-based API
  2. Type-stable, AD-enabling book-keeping of parameters and their metadata
  3. Inference algorithms

Turing started with these 3 intertwined, but the “modern” Turing approach involves making standalone inference algorithms in modules and wrapping them to be compatible with the Turing book-keeping approach. This is the case, for example, with AdvancedHMC.jl. Some samplers are still fairly intertwined though, like the AdvancedSMC sub-module of Turing. 1 and 2 above are still heavily intertwined, in the sense that 1 expands into all the boilerplate needed for book-keeping; there are no plans to break them apart as of now.

However, there is a plan to abstract away all of the algorithms in Turing in terms of their requirements. The requirements can be defined as input arguments to the algorithm, like the log joint probability for HMC and RAM, or as functions to be overloaded for some front-end Model type. It was pointed out to us in https://github.com/TuringLang/Turing.jl/issues/788 that this is similar to the approach that POMDPs.jl takes. This is not a short-term goal though (i.e. a few weeks), more like medium-term.

So for your specific use-case, you want to directly call a single algorithm, so you have no use for the Turing front-end and book-keeping. You can implement the algorithm in Turing or as a standalone package (a question of maintenance), and then we can wrap it to be compatible with the Turing front-end much like we do for AdvancedHMC. Other packages like Soss can then do the same. I hope this clarifies things a little.

As people mentioned, Turing is very hackable. So if you really want to do it now, you can hack a Turing model with a customized distribution to achieve this.

using Turing, Distributions

# A "prior" that is flat everywhere; it only exists so Turing can book-keep
# and initialize the parameter space.
struct DummyPrior <: ContinuousMultivariateDistribution
  d # the dimension of your problem
end
Base.length(dp::DummyPrior) = dp.d
Distributions.logpdf(::DummyPrior, ::AbstractVector) = 0.0
Turing.init(::DummyPrior) = [0., 1.] # your initialization
Distributions.rand(dp::DummyPrior) = randn(dp.d)

# A "likelihood" whose logpdf is simply your log target; the observed value is ignored.
struct Target <: ContinuousMultivariateDistribution
  p
end
Distributions.logpdf(t::Target, _) = our_log_target(t.p)

@model mymodel() = begin
  d = 1 # the actual dimension of your problem
  p ~ DummyPrior(d)
  1.0 ~ Target(p) # the observed value is arbitrary since logpdf ignores it
end

mf = mymodel()
alg = MH(1_000) # MH for 1_000 samples; you can also customize your proposal
chn = sample(mf, alg)
# If our_log_target is differentiable, you can also use HMC or NUTS, e.g.
# alg = HMC(1_000, 0.2, 3) # HMC for 1_000 samples with 0.2 as step size and 3 as step number

The DummyPrior just plays a role in initialization and allows Turing to book-keep the parameter space, and the Target triggers the computation of our_log_target. However, the problem of working with our_log_target directly is that you need to take care of the transformed space yourself. But I think in general, all models that you can implement as our_log_target by hand could also be implemented using @model with Turing.
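To illustrate what “taking care of the transformed space yourself” means, here is a toy our_log_target (my own example, not from Turing) with one positive parameter σ, sampled on the unconstrained scale θ = log(σ) with the log-Jacobian added by hand:

data = randn(100) # some toy observations

function our_log_target(θ)
    σ = exp(θ[1])                         # back-transform to the constrained scale
    logprior = -σ                         # Exponential(1) prior on σ, up to a constant
    loglik = -length(data) * log(σ) - 0.5 * sum(abs2, data ./ σ)  # Normal(0, σ) likelihood
    return logprior + loglik + θ[1]       # + log|dσ/dθ| = θ[1]
end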

There is another implementation of the affine invariant sampler here: https://github.com/mauro3/KissMCMC.jl (and also an MH). I’m pretty happy with the affine invariant sampler, and it usually requires zero tuning. Although by the sounds of it, RAM might be better for multi-modal distributions.

As several people were talking about ensemble-based methods: I implemented the differential evolution Markov chain approach of ter Braak and Vrugt (https://github.com/chrished/DEMC.jl). There is basically no tuning needed; just provide an initial matrix of samples that is somewhat dispersed.
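For reference, the core of one DE-MC sweep looks roughly like this (a sketch following ter Braak & Vrugt’s description, not the DEMC.jl API; names are mine):

# One DE-MC update sweep. `logp` is the log target and `X` is a d × N matrix
# of current states, one column per chain.
function demc_sweep!(logp, X; γ = 2.38 / sqrt(2 * size(X, 1)), ϵ = 1e-4)
    d, N = size(X)
    for i in 1:N
        a = rand(setdiff(1:N, i))              # two other, distinct chains
        b = rand(setdiff(1:N, (i, a)))
        y = X[:, i] .+ γ .* (X[:, a] .- X[:, b]) .+ ϵ .* randn(d)  # DE proposal
        if log(rand()) < logp(y) - logp(X[:, i])
            X[:, i] = y
        end
    end
    return X
end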

If you have many dimensions it might not do well (similar to Goodman & Weare’s algorithm), see https://mc-stan.org/users/documentation/case-studies/curse-dims.html

As some of you seem to use ensemble-based methods in practice: are you using those because you cannot provide gradients, which prohibits HMC? Or is there a reason why ensemble methods are better than the mc-stan post I linked lets on?

AFAIK the main advantage is no tuning, compared to MH and variants. This is great when the posterior is “friendly”; it may not be when it isn’t. But in the latter case tuning may be difficult anyway, so in practice it is rarely done. If one is willing to accept a 20–80% efficiency drop, a mixture of proposal matrices sometimes helps. Textbooks from the late 1990s are full of these tricks.

IMO the bottom line is that for small parameter spaces you don’t care, for larger ones HMC/NUTS or approximations like variational inference are the only viable options.
