"Keep it simple, stupid" vs PPLs like Turing

Hello everyone.
I come from a general Machine Learning and Neural Networks background and learned about probabilistic programming and Bayesian Inference in the last year. Naturally I wanted to see what was available in the Julia ecosystem and was not disappointed.

But then, a thought crossed my mind, and I wanted to hear opinions from more experienced people.

Why do we actually need a Probabilistic Programming Language (PPL)? In Julia, it seems like little more than syntactic sugar for specifying a distribution/sampler using tools from Distributions.jl.

The advantages I envision for a direct approach are easier composition of packages, samplers, etc., in line with the Julia ecosystem at large, and perhaps less PPL overhead and shorter compilation times.

What would be the disadvantages? Is it mere inconvenience, or is there something more? Also, what do the different PPLs offer relative to one another?

Keen to read your thoughts,
Lior

========

To clarify, let's take an example from the Turing tutorial: coin flipping. The Turing model is

using Turing
@model function coinflip(; N::Int)
    # Our prior belief about the probability of heads in a coin toss.
    p ~ Beta(1, 1)

    # Heads or tails of a coin are drawn from `N` independent and identically
    # distributed Bernoulli distributions with success rate `p`.
    y ~ filldist(Bernoulli(p), N)

    return y
end;

Now, conditioning in Turing is

coinflip(y::AbstractVector{<:Real}) = coinflip(; N=length(y)) | (; y)

and we sample with, e.g.

model = coinflip(data)
chain = sample(model, NUTS(), 2_000, progress=false);
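
Here `data` would be the observed flips; for a runnable end-to-end example it might be simulated as in the Turing tutorial (the seed and the true success probability below are illustrative choices, not from the tutorial verbatim):

using Random
Random.seed!(12)
data = rand(Bernoulli(0.7), 100)   # 100 simulated coin flips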

An alternative that reuses Turing's inference code would be (please excuse the Multinomial workaround used with product_distribution):

using Distributions
# `data` is the vector of coin-flip observations from above
prior = Beta(1, 1)                           # matches the Turing model's prior on p
likelihood(p) = Multinomial(1, [p, 1 - p])   # a one-trial Multinomial stands in for Bernoulli(p)
coinflip_joint(p) = product_distribution([likelihood(p) for _ in eachindex(data)])
# log joint density, up to an additive constant; each observation y is
# encoded as the count vector [y, 1 - y]
coinflip_logpdf(p) = logpdf(prior, p) +
                     logpdf(coinflip_joint(p),
                            stack(map(y -> [y, 1 - y], data)))

and we could reuse Turing's inference libraries by defining it as a LogDensityProblem (I didn't completely understand the docs, though):

# CoinFlipProblem <: LogDensityProblem ?
chain = sample(CoinFlipProblem, NUTS(), 2_000, progress=false);
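
To make this concrete, here is a minimal sketch of what I imagine CoinFlipProblem could look like, assuming the LogDensityProblems.jl interface and the low-level AdvancedHMC.jl API from its README (the struct name, the logit reparameterisation, and the sampler setup are illustrative guesses, not tested):

using Distributions, LogDensityProblems
using AdvancedHMC, ForwardDiff

# Hypothetical problem type holding the observed flips.
struct CoinFlipProblem
    data::Vector{Bool}
end

function LogDensityProblems.logdensity(prob::CoinFlipProblem, θ)
    # Sample in unconstrained (logit) space so HMC can move freely;
    # the log-Jacobian of the inverse-logit transform is log(p) + log(1 - p).
    z = θ[1]
    p = 1 / (1 + exp(-z))
    logjac = log(p) + log1p(-p)
    logpdf(Beta(1, 1), p) + sum(y -> logpdf(Bernoulli(p), y), prob.data) + logjac
end

LogDensityProblems.dimension(::CoinFlipProblem) = 1
LogDensityProblems.capabilities(::Type{CoinFlipProblem}) =
    LogDensityProblems.LogDensityOrder{0}()

# Low-level AdvancedHMC setup, following its README.
ℓ = CoinFlipProblem(data)
initial_θ = [0.0]
metric = DiagEuclideanMetric(1)
hamiltonian = Hamiltonian(metric, ℓ, ForwardDiff)
integrator = Leapfrog(find_good_stepsize(hamiltonian, initial_θ))
kernel = HMCKernel(Trajectory{MultinomialTS}(integrator, GeneralisedNoUTurn()))
adaptor = StanHMCAdaptor(MassMatrixAdaptor(metric), StepSizeAdaptor(0.8, integrator))
samples, stats = sample(hamiltonian, kernel, initial_θ, 2_000, adaptor, 1_000; progress=false)

# Map the unconstrained draws back to probabilities.
p_samples = [1 / (1 + exp(-θ[1])) for θ in samples]

The logit reparameterisation here is one of the things a PPL like Turing would otherwise do automatically (via Bijectors.jl).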

With the disclaimer that I'm not very experienced, the biggest advantage could be that a PPL makes it easier for more people to build, fit, and use more kinds of models. If you know statistics and inference algorithms very well, you might not care about this as much and could work at a slightly or much lower level. I know next to nothing about statistics or math, but I can learn and use, say, DynamicPPL + NUTS (much as with Stan). If sampling is really infeasible, I think I can similarly learn and use GraphPPL.jl with RxInfer.jl to do message passing instead.


My 2 cents is that PPLs are nice because they help you work at a higher level of abstraction (arguably, the right one, where you focus on the statistical model rather than on the details of Julia, or whatever language you're writing in). In an ideal world, when working at a higher level of abstraction, composing models and general mathematical objects together to form more complex structures becomes easier.

(Disclaimer: Turing developer here.) Sure, it’s syntactic sugar, in the same way that Julia is a layer of syntactic sugar over machine code :)

You are right that you can define ‘models’ quite easily with the LogDensityProblems.jl interface, and sample from them. (I’m fairly sure you can’t use Turing.NUTS() for non-Turing models – that should run into a MethodError somewhere – but you should definitely be able to use AdvancedHMC.NUTS, which the Turing version is a thin wrapper around.)

In fact, I’d expect this to be somewhat more performant than Turing, because you cut out a lot of the extra code we have for tracking the state of models, etc. Also, there is a lot to be said for being ‘in control’ of your own code and knowing exactly what it’s doing at each stage. But in return you lose a lot of functionality – you’ll have to hand-roll your own chains, figure out which parameter is which, make your model work with AD, and do basically everything else that the modelling bits of Turing give you.
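
To give a flavour of the AD part, here is a minimal sketch using LogDensityProblemsAD.jl (reusing the hypothetical CoinFlipProblem from earlier in the thread):

using LogDensityProblems, LogDensityProblemsAD, ForwardDiff

# Wrap the order-0 problem so it can also return gradients.
∇ℓ = ADgradient(:ForwardDiff, CoinFlipProblem(data))
LogDensityProblems.logdensity_and_gradient(∇ℓ, [0.0])  # returns (logp, gradient)

That covers the AD bit; chains, parameter naming, and the rest still have to be hand-rolled.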

Whether this is worth it to you is for you to judge – much like how one chooses whether to use a high-level vs a low-level programming language. In my PhD I hand-rolled code to do some quantum mechanical simulations, which was orders of magnitude faster than the best library out there. That was fine because I only needed to simulate one specific experiment and I used a lot of tricks that were specific to that experiment. It wouldn’t have worked if I needed to write something more general.

If you want to hand-roll your own models, and you think there are ways to compose them better with e.g. samplers, feel free to open an issue – we’re actually very keen to decouple the sampling and the modelling sides of Turing as much as possible, the idea being that you should be able to use your own models with Turing’s samplers, and your own samplers with Turing’s models.

# CoinFlipProblem <: LogDensityProblem ?

LogDensityProblems doesn’t provide an abstract type, so whether x implements the LDP interface is determined entirely by calling LogDensityProblems.capabilities(x).

There are pros and cons to this – I have actually quite often wished for an abstract type to dispatch on (for example, NUTS could then be restricted to inputs that know how to calculate their own gradients). But, as it happens, the way Turing uses LogDensityProblems wouldn’t quite work with an abstract type: Turing (technically, DynamicPPL.jl) has a type called LogDensityFunction{T}, whose type parameter T determines whether it ‘knows’ how to perform AD – and there’s no way to subtype a different abstract type depending on T.
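
To illustrate, the trait query for the hypothetical CoinFlipProblem from earlier in the thread would look like:

using LogDensityProblems

# Trait query instead of a subtype check:
LogDensityProblems.capabilities(CoinFlipProblem)
# LogDensityProblems.LogDensityOrder{0}() – can evaluate the log density but
# not its gradient; an ADgradient-wrapped problem reports LogDensityOrder{1}()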


Thank you, everyone, for your thoughtful comments.
It makes sense, then, to view a PPL as a prototyping tool, or even just as the faster/better solution in some cases. Once the common inference algorithms have been tried, we can decide whether we need other samplers or a more performant version without the PPL.

I would be interested to hear from anyone who wants to share their experience using Turing or other PPLs versus other approaches.

In my experience, using Turing’s “syntactic sugar” is usually worth it. As @slwu89 said, it lets you think at the right level of abstraction and avoid a lot of boilerplate that, while not really that onerous, still adds up to an avoidable cognitive burden and source of potential bugs. There are times when I’ve decided to hand-code models, though: when they are large and/or complex, it can be easier to debug and performance-optimize your own code than Turing’s macro-expanded models.

If you’re just starting out with Bayesian inference, I’d recommend using Turing – though ultimately, as with anything like this, it’s a matter of preference and what works for you.
