PPL without observations as function parameters

cscherrer · August 22, 2019, 2:48am

Hi!

I’m working on some changes in Soss, and they’re different than the usual way of going about PPL. Thought I’d share them here in case other are curious or have any thoughts blah blah (the usual stuff on that front)

To this point, my approach in Soss.jl has been in line with previous probabilistic programming systems I’ve used. For example, here’s a very simple model:

m₀ = @model σ,x begin
    μ ~ Cauchy()
    N = length(x)
    x ~ Normal(μ,σ) |> iid(N)
end

There’s a lot going on in these few lines:

σ is a free variable: everything else in the join distribution is conditional on it.
x is observed: its is assumed to come from the joint distribution described in the body of the model, but its value is known.
μ is a parameter; our typical goal in PPL is to find its distribution, conditional on σ and x.
N is deterministic, and is a function of x

Model manipulation is one goal of Soss that sets it apart from most PPLs. We’d like to turn a generative model into a conditional one, or a model with data into one with only a prior. But the way we write models makes this very weird.

What are the dependencies in the above model? Does x come first, or last? It’s a mess.

As one small step forward, I’m looking into an alternate approach for observed values. Rather than being passed as parameters, they are attached to the model struct.

In this approach, the above model is written


m = @model σ begin
    μ ~ Cauchy()
    N = length(x)
    x ~ Normal(μ,σ) |> iid(N)
end

Data are then observed like this:

julia> m_obs = m(x=randn(10))
@model σ begin
        μ ~ Cauchy()
        N = length(x)
        x ⩪ Normal(μ, σ) |> iid(N)
    end

This creates a new model with everything the same, except that m_obs.data is now a NamedTuple with field x.

Some other aspects:

No inference code has been generated yet (that’s next)
The type of the data is known statically
The ⩪ is a hint to the user that data is attached. Data is expected to be too large to easily view inline.
“Statements” (the body of the function) are unordered, and acquire an ordering at printing or inference time. There’s still some of this to work out.
For running a model on different data, one option is to encode is an a NamedTuple of Refs.

What do you think? See any big problems or opportunities?

trappmartin · August 22, 2019, 7:01am

First, I think using a special symbol in the modelling syntax is not a good idea. I had several talks about that with non Julia users (related to a different project) and they all agreed that it makes the syntax pretty but too difficult to use. Maybe you can achieve the same without using the special tilde operator.

Second, I think this would very interesting especially if you can provide the data in a lazy or batched way. But I guess you can also achieve that with a custom Array type.

Tamas_Papp · August 22, 2019, 7:38am

Personally I find PPL DSLs constraining without adding a lot of value, but that may just be an idiosyncratic preference.

That said, I think that you should consider separating the construction of the DAG from its application (generating data or producing a log density). Eg (I will be making up syntax here) in your example

m = @dag begin
    N = length(x)
    x ~ Normal(μ,σ) |> iid(N)
end

could just describe the DAG, and you could generate a posterior with

log_posterior(m, data = (x = randn(),),
              priors = (μ = Cauchy(), σ = Uniform(0, 10)))

by passing data and priors and named tuples and just transforming whatever representation m has into a callable. Or generate data with

data = generate_data(m, observables = (:x, ),
                     parameters = (μ = 0, σ = 1, N = 40))
data.x # has x

cscherrer · August 22, 2019, 1:50pm

I agree! My first way of writing models required users to write ⩪ for observation, but I worried it would be to awkward to type so I got rid of it.

In what I’m doing now, the user always uses ~. But since observations aren’t passed as arguments, that leaves no way to tell what’s observed. So I use ⩪ in the pretty printing to make this more clear.

That’s an interesting idea. It would be easy to make m.data an iterator for this case.

I think the PPL arguments are the same as the argument for any programming abstraction:

There’s common structure or behavior we’d like to describe without having to type it each time
We want a separation of concerns (here, modeling vs inference)
We want higher-order reasoning by making the objects of study (here, probabilistic models) first class objects, in order to easily compose them
We’d like high performance not to require deep understanding

In particular for HMC, I don’t think it’s unreasonable to expect we can find a way to allow writing any HMC-friendly model in a way that is composable, simpler than Base-Julia, and makes no performance sacrifice. This has been exactly Stan’s goal relative to C++, and I think they’re succeeding. The difference here is that we have a better language to start from

When you write the inference by hand, you’re committed to the inference algorithm, and you’re also committed to which variables are sampled vs observed. Want to do posterior predictive checks? You’ll need to write a fair amount of new code. Writing new code takes time, and is an opportunity to make mistakes. It also divides one’s attention between the domain of interest and the implementation.

Your @dag approach is interesting, but it’s not clear to me what it would expand to. Writing priors as a named tuple is very limiting, since it does not allow dependencies. Usually a prior does take a form like this, but I don’t see a reason to impose this as a constraint.

I’d also like to get away from the idea of having one special place for parameters, and another for observations. In my experience, this gets in the way of model composition, and prevents things like decomposition for Gibbs sampling. So I’ve been favoring an approach with a proper generative (or conditional on free variables) model, with observations imposed after model definition.

I really appreciate the feedback and discussion. Working on this in isolation would almost certainly lead to cases where I fail to question some assumptions, or miss some opportunity or coming difficulty. Thank you!

Tamas_Papp · August 22, 2019, 2:07pm

As much as I admire Stan, I don’t think it is succeeding when it comes to writing general models in a composable way. A lot of Stan’s syntax is repetitive and constraining. Which is a fair trade-off, since they compile it to a highly optimized runtime, but nevertheless it results in very complex, monolithic code.

My main point was that I would separate the DAG specification from the applications, which would take extra information to turn the DAG into a log density or a generative model.

The tension you highlight here between PPLs and expressiveness is precisely my concern: you can make it simple, but then there is always something more general that will not fit. I think that in the context of Julia, a fair solution would be allowing the user to break out of the PPL (or extend it) if necessary. Eg if you have interdependence in your prior \alpha and \beta, it should be easy to code up a log prior for that and use it directly.

cscherrer · August 22, 2019, 2:27pm

I don’t think Stan was targeting composability. My point is that Stan has succeeded relative to C++. They’ve gotten C++ performance, with much better syntax (than C++). I want the Julia version of that, but with composability.

This is very much like my approach, but even the DAG is delayed. The goal is to compose models as needed (which may add or remove dependencies), find the DAG (if it’s a DAG-oriented model), and then produce a log-density, random sampler, etc as needed.

Yes! A major goal is to make extension very easy. And user-defined distributions are no problem

scheidan · August 22, 2019, 2:55pm

This proposal looks very reasonable to me.

I’ve opened once an issue for Turing with a similar proposal, maybe it can serve as inspiration
https://github.com/TuringLang/Turing.jl/issues/792

Tamas_Papp · August 22, 2019, 3:30pm

They certainly did, but my understanding that this was orthogonal to the syntax: the key was the NUTS algorithm, with an efficient AD implementation (in C++). The syntax has pretty much evolved from the various BUGS incarnations that came before.

In any case, I am following your PPL work with interest. I think that the key to efficiency is an efficient algorithm using a robust source-to-source AD implementation (and I think Zygote is getting there), but if this is exposed with a very nice model syntax to the user, we would get the best of both worlds.

hessammehr · August 22, 2019, 3:40pm

For what it’s worth as an extra datapoint, I’ve been coding moderately complicated models using Tamas’ DynamicHMC library and don’t find myself missing the PPL syntax of Turing/Stan, nor does the code look any more complicated. Because the log joint probability is just a normal function it’s also very easy to reason about perfomance or any bugs. If things like sampling in parallel threads and performant reverse mode AD using Zygote were to be implemented/facilitated I can absolutely see myself leaving PyMC3 for DynamicHMC.

cscherrer · August 22, 2019, 3:44pm

Great point.

I agree this is important, and the ability t o specify efficient adjoints of logpdfs will be especially helpful. Also on the critical path to efficiency is fast logpdf evaluation. I think the key to this is finding ways of simplifying summations, there’s lots of ground to gain in this respect.

This is great to hear, thank you!

I completely agree. Optimizing this is one of the goals of posting questions here. I’d be very interested in models you think might be difficult to express in my current approach.

Yes, for Stan-like problems, I’d much rather use DynamicHMC than Stan. My question is, how much more can we have? Again, the big potential I see is model composability, and high efficiency without a requirement for hand-tuning and deep understanding of Julia. This can hopefully help to bring new users to Bayesian inference, and to Julia in general.

Tamas_Papp · August 22, 2019, 5:44pm

My main motivation for writing DynamicHMC was to be able to formulate problems which are not feasible or tedious using Stan.

cscherrer · August 22, 2019, 6:28pm

Do you have examples? A side by side comparison of Stan corner cases would be really useful

Tamas_Papp · August 23, 2019, 4:54am

Any sufficiently complicated model. Eg consider a DSGE model in economics, where the solution itself is found numerically.

Since I don’t think these are feasible in Stan, I cannot provide a side-by-side comparison.

trappmartin · August 23, 2019, 1:44pm

Thanks, this proposal is indeed interesting. The key challenge here is the one you pointed out in the issue yourself. But it is worth looking into it which is why it is on the roadmap of Turing.

trappmartin · August 23, 2019, 1:50pm

This is probably quite true. People at StanCon where very surprised to hear that the NUTS sampler in AdvancedHMC is on par with the NUTS sampler in Stan. There seems to be an understanding that the NUTS sampler in Stan is somewhat magical.

I think the benefits of a PPL (DSL) are in part that you can define models more efficiently and it is often more flexible (in terms of model definition) if various inference schemes are supported. This is of course only one motivation. Specifically, in Turing this is relevant as we support models with stochastic control flow, discrete RVs and model composition. But if you have a specific problem and rather implement the log joint yourself it is of course unnecessary and probably easier to use AdvancedHMC or DynamicHMC instead.

Topic		Replies	Views
Julia PPL survey paper? Probabilistic Programming	15	2492	September 22, 2020
PPL collaboration Probabilistic Programming	33	5172	April 8, 2020
JuliaCon 2020 Birds of a Feather Probabilistic Programming	24	3271	August 28, 2020
Current state of Julia Probabilistic Programming Languages and functionalities General Usage	9	2365	April 22, 2022
PPL connection to MLJ.jl Probabilistic Programming mlj	12	1649	September 22, 2019

PPL without observations as function parameters

Related topics