Probabilistic programming with source transformations

I have the very beginnings of a probabilistic programming language:
Here’s a link to the gist

The high-level idea is to represent a model by a quoted expression, and to express inference in terms of macros to transform that expression. My hope is that this will give a lot of flexibility and speed.

The gist includes a very rough Stan-like example. For optimization and posterior sampling I’m expecting @dfdx’s XGrad or @jrevels’s ReverseDiff to be really useful. I’m still getting my head around the tradeoffs of different design decisions for autodiff.

Anyway, I’m very interested in questions, design feedback, or ideas for next steps. Thank you!

EDIT: Here’s the latest


Looks exciting! I have never used probabilistic programming or Stan myself, but have been looking for a convenient interface to get started. Please let me know if I can help with anything related to the AD part. I’m also open to learning specific inference methods (e.g. ADVI), but I may need some guidance there.

As for your example, I’m curious why you use macros. In my experience it’s easier to work with expressions at run time and not mess with macro-expansion-time magic. At least debugging is much easier that way.

Yeah, reasoning about macros is kind of a pain. The idea is to get through the painful part once, and then make it easy and really fast to apply it to different models. Whether you’re optimizing or sampling, model evaluation is in the inner loop, so it needs to be fast.

I modified the nutsinjulia used by Turing to take a gradient argument. They use ForwardDiff, but the edit makes it a bit more flexible for XGrad, etc.

I didn’t see a way to do the rewrites I need without macros, and I really don’t understand the approach you use for your AD work. Do you see a better way to express it for easier reasoning without sacrificing speed?

You don’t need macros to transform and compile expressions in runtime. For example, using Espresso.jl:

using Espresso

ex = :(mu ~ Normal(0, 5))
pat = :(_v ~ _dist)
new_pat = :(ℓ += logpdf(_dist, _v))

rewrite(ex, pat, new_pat)
# ==> :(ℓ += logpdf(Normal(0, 5), mu))

I use this approach in XGrad to analyze algebraic expressions and find derivatives with respect to their arguments according to a set of rules. I then compile and cache the resulting expressions, so the runtime performance of the main code (whatever you have in the loop) isn’t affected at all.
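
To make this concrete, here is a rough sketch of the rewrite-then-compile idea, continuing the example above. It is not XGrad’s actual internals, and logprior is just a made-up name; the point is only that the rewritten expression is eval’d into a function once, and that compiled function is what runs in the inner loop.

using Espresso, Distributions

ex   = :(mu ~ Normal(0, 5))
body = rewrite(ex, :(_v ~ _dist), :(ℓ += logpdf(_dist, _v)))

# Splice the rewritten expression into a function definition and compile it once;
# only this compiled function is called inside the sampling / optimization loop.
logprior = eval(:(mu -> begin
    ℓ = 0.0
    $body
    ℓ
end))

logprior(1.2)  # same value as logpdf(Normal(0, 5), 1.2)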

Macros do the same transformation, but at a slightly earlier stage, so the best you can get from them is a bit faster initialization.

I modified the nutsinjulia used by Turing to take a gradient argument. They use ForwardDiff, but the edit makes it a bit more flexible for XGrad, etc.

Can you show this code? I was under the impression that Turing.jl provides only sampling-based inference, but the use of ForwardDiff indicates the opposite.

I’ve been thinking about a front end like the one you’ve described for a long-term project.

I’m not a salesman, and my efforts to produce a notebook demonstrating the idea have been slow.
But a brief summary:
Given a D-dimensional probability distribution that we evaluate, along with its gradient, at N points, we have N*(D+1) values for interpolation (the density value plus D partial derivatives at each of the N points).
We could be extremely conservative, throw away the gradients, and approximate the CDF with a step function (a degree-0 spline). Or we could interpolate some function of the N*(D+1) values of the PDF using either polynomials (iffy) or higher-degree simplex splines, which are easy to integrate, marginalize, etc.
At least for modest D, this should lead to dramatically faster convergence in N.

Furthermore, MCMC can struggle with high degrees of correlation between parameters, such as in tall hierarchical models. You can see this with Stan: sampling can slow down dramatically as it cranks up the leapfrog steps / the work put into solving the Hamiltonian differential equations.

Many hierarchical models are easy to factor into DAGs. Components that can’t be factored / represented well by a DAG can exist as a large many-parameter node of a graph, but in many cases other components can still be factored.
Each node can be evaluated independently, fit with a spline, and pieces quickly marginalized to give updated interpolated values to pass on as messages to connected nodes. For example, if nodes A[params: w, x, y, z] and B[t, u, v, w, x] are connected, y and z can be marginalized out of A, and a message sent containing the density of (w, x) at each of the N points where B is evaluated, updating the interpolation conditions of the spline fit to node B.
Thus, large models can readily be factored into small pieces that can be solved independently, and this should overcome the problems associated with high correlations.

If one constructs grids a priori (e.g., quasi-Monte Carlo), this can also all be run on a GPU, which should really help speed up extended simulations. But I’ve also wondered if some adaptation of HMC could be used to find a representative region of the posterior.
Right now my best idea involves first completing a maximization step, and using the maximum and Hessian at the maximum to center and scale the grid.
Higher-order derivatives would also provide additional values to interpolate.

If you’re mildly interested, I’d like to ask if anyone working on your project can make things modular enough that

  • You can easily switch back ends.
  • Possibly build the factoring of probability densities into graphical models into the model interpretation, or keep everything modular enough that it won’t break too much for someone to go in and change things like that in the future. BUGS/JAGS use graphical models, but Stan does not, because it doesn’t need to and graphs were restrictive. My proposed solution to the latter is to leave anything that can’t be broken down as one large node, so that “non-graphical” models are simply a one-node subset.

Hamiltonian Monte Carlo requires gradients.
Stan (written in C++) uses reverse-mode AD, and is about 10x faster according to Turing’s benchmarks when running otherwise the same algorithms.

Out of curiosity, your Dict{Any,Any}s for caching don’t cause any problems with type inference?

IMO Stan’s abstraction model is more trouble than it is worth. It all boils down to incrementing the (log) posterior, which is pretty easy to declare, and that’s what ~ statements in Stan do. Which is neat, except that you need to escape this all the time (Jacobians of transformed parameters, early rejection of the sample by aborting computation).

Coding up a log likelihood as a sum is in fact very easy and conceptually clean. I think that the whole DAG approach is a leftover from the era of BUGS, where you had to know how the graph was built to do Gibbs sampling. But, at least for continuous parameter spaces, that need is gone; modern methods like NUTS are much more efficient and scalable.

I just find it easier to code the likelihood directly in Julia, as a function that maps \mathbb{R}^n to \mathbb{R}, then AD it. I have a modular collection of tools for this that basically replicates Stan’s default HMC functionality as described in Betancourt (2017), see DynamicHMC.jl, ContinuousTransformations.jl, DiffWrappers.jl, MCMCDiagnostics.jl. I will clean up and announce these after v0.7 (named tuples make the surface syntax much nicer).
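
To illustrate the style (this is only a sketch made up for this post, not code from the packages above, and logposterior and the toy model are arbitrary), the log posterior is just an ordinary Julia function of the parameters and the data:

using Distributions

# Toy model: location μ with a Normal(0, 5) prior, positive scale σ with a
# truncated Cauchy prior, and Normal(μ, σ) observations. Handing this to a
# sampler would still require mapping an unconstrained vector to (μ, σ > 0).
function logposterior(θ::AbstractVector, data::AbstractVector)
    μ, σ = θ
    logpdf(Normal(0, 5), μ) +
        logpdf(Truncated(Cauchy(0, 3), 0, Inf), σ) +
        sum(x -> logpdf(Normal(μ, σ), x), data)
end

logposterior([0.3, 1.2], randn(50))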

@Elrod: approximations are indeed useful, but splines are more trouble than they are worth IMO (I think that as the dimensionality increases, you have to pick between bad and slow/complex approximations; think 1000-dimensional parameter spaces). A successful approximation strategy is variational inference.


Forward-mode AD is always slow for \mathbb{R}^m \rightarrow \mathbb{R}^n when m is large (its cost scales with the input dimension m, whereas reverse mode scales with the output dimension n); reverse-mode AD should make it much better.

Out of curiosity, your Dict{Any,Any}s for caching don’t cause any problems with type inference?

I’m pretty sure it does, but even with medium-sized data arrays it’s far from being a bottleneck. However, if you find a case where it matters, please don’t hesitate to report it so I can look for possible solutions.

Note that XGrad allows you to generate the derivative expressions and then manually optimize them to find bottlenecks or check the influence of each individual part of the code (e.g. the use of Dict{Any,Any}).

Oh, I think this approach sounds much better. Thank you!

Here’s the NUTS code used by Turing:
https://github.com/xukai92/nutsinjulia

I had been waiting for a response to my issue about the license before posting my changes, but maybe at some point I’ll just have to make an assumption about it. I’m not sure yet what that assumption should be.

From your comment, it sounds like you might think of gradients as mostly used for optimization. And maybe that’s generally true, but there are sampling algorithms like NUTS and Hamiltonian Monte Carlo that use them.

Hi @Elrod,

I’m not sure about the spline-fitting part, but what you describe sounds a bit like variational message passing, or the sum-product algorithm for factor graphs. Have you looked into either of these?

I agree with the idea of switching back ends. I think a model should be expressed abstractly, in a way that makes it easy to try out and compare results for different back ends.


Hi @Tamas_Papp,

For me the biggest added value of Stan’s approach is automatic reparameterization. It requires a continuous parameter space, and automatically turns it into a distribution over \mathbb{R}^n. This is easy in principle, but it’s so error-prone it makes sense to pass this off to a compiler. Beyond this, I’d like to have something that can be passed to multiple back ends.
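
As a hedged sketch of what that reparameterization amounts to for a single positive parameter (the function name and the particular prior here are just for illustration, not anyone’s actual API): work on an unconstrained ζ, map it to σ = exp(ζ), and remember the Jacobian term.

using Distributions

# ζ ∈ ℝ is the unconstrained parameter; σ = exp(ζ) ∈ (0, ∞) is the constrained one.
# The extra + ζ is log|dσ/dζ|, so the density on ζ matches the intended prior on σ.
function logprior_unconstrained(ζ::Real)
    σ = exp(ζ)
    logpdf(Truncated(Cauchy(0, 3), 0, Inf), σ) + ζ
end

logprior_unconstrained(-0.7)

Each individual transform like this is easy, but doing it by hand for every constrained parameter is exactly the error-prone bookkeeping I’d rather hand off to a compiler.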

Thanks for the pointers to your tools. Oh, and I see the transformations thing is covered as well, that could be really handy. I really like the modular approach. This possibility is one thing that drew me to using Julia for this in the first place. I can do transformations without learning a separate template metaprogramming language, tie into an existing AD package, plotting facilities, etc.

Note that this seems to be the original NUTS code, not what Stan uses (by default) these days. For that, see the Betancourt paper that I linked; I have it programmed with lots of unit tests (~90% coverage) in DynamicHMC.jl (that’s kind of the only thing that package does :smile:)

My motivation for the approach that I linked above is the following: when I code the likelihood function, I can use the standard tools (@code_warntype, Profile, BenchmarkTools etc) for optimization, have access to @simd and @inbounds, and all kinds of tricks. Whereas if I am dealing with a log posterior built up by some other tool, I may not have access to it in the same way, and even if I identify a bottleneck, the construction is indirect.

That said, a DAG could be transformed to a log posterior with something like your code. I just don’t think that starting from DAGs should be a requirement for MCMC, because then it becomes tricky to construct a log posterior that you can’t formulate in terms of the existing language. For Turing.jl, you can define a custom distribution, and for Mamba.jl I was helped out by the package author with a nice trick, but repeating this for multiple iterations of a model is tedious.

In any case, my own packages are kind of underdocumented now, but if you want to use them, feel free to ask questions.


Oh interesting, I didn’t know the algorithm had evolved since the first paper.

On your point about the standard tools vs DAG representation, it’s common in probabilistic programming to express a model as a “normal” program, with some added primitives, usually called sample and observe. These can be interpreted in different ways, depending on the inference method used.

I haven’t yet made the distinction between sample and observe in my code, but maybe I should. Other than these (and possibly declarations), the code will be standard Julia, and should allow the use of tools as you describe. But there’s no DAG-oriented intermediate representation necessary.

What’s not clear to me is whether “optimization macros” might sometimes need to change depending on the inference algorithm. For example, there could be one that’s really helpful for NUTS, but it’s not thread-safe. Leaving it in place for an SMC-based approach would then be a mess.

I haven’t looked at your packages in detail yet, but from the sound of it they should be really helpful. Thanks again for letting me know about them.

Will it work with black boxes, like putting an ODE solver in there? Usually there’s a tradeoff between source transformations for extra performance vs compatibility with a larger domain of problems, and it sounds like this is taking the route of being hyper-performant on a small (but standard and very widely used) set of problems. Is that correct?


Hi Chris,

This approach isn’t quite at the extreme of “hyper-performant”; Hakaru is more in that direction. They aim for aggressive code transformations; it’s much more along the lines of compiler development, but specifically for sampling from distributions. It’s exciting research work and has really great potential, but I think there are currently still a lot of corner cases. Still, where their transformations do apply, they can help a lot, and it can seem like magic.

It’s possible to build lots of inference algorithms for this approach, but the logdensity I have in the gist is along the lines of Stan. Stan has a few constraints:

  • Parameters must be continuous (but observed data can be discrete).
  • The parameter space must be a fixed dimension, known statically in advance.

This covers a lot of cases, including some that might not seem to fit at first. For example, you can’t have a parameter to indicate whether an unobserved coin is heads or tails (discrete parameter), but there’s no problem at all having a parameter for the probability of the coin having been heads.
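
As a sketch of how that plays out (the observation model here is made up purely for illustration): if x is observed from Normal(1, 1) when the coin is heads and from Normal(-1, 1) when it is tails, the discrete outcome can be summed out, leaving only the continuous probability p as a parameter.

using Distributions

# Marginalize the unobserved coin z: p(x | p) = p * p(x | heads) + (1 - p) * p(x | tails).
logmarginal(x, p) = log(p * pdf(Normal(1, 1), x) + (1 - p) * pdf(Normal(-1, 1), x))

logmarginal(0.4, 0.3)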

Still, there are plenty of cases it misses. Sometimes we really want discrete parameters. Other times the number of parameters is itself random.

To answer your black box question… it depends. For some inference algorithms, you really need the gradient. If it’s really a black box, autodiff isn’t an option. Still, there are other inference algorithms that would work just fine. Or you could do numerical differentiation instead of AD.

Also, since you mentioned ODE solvers, here’s one in Stan.

For our ODE solvers, you can autodiff through the native Julia ones, or use sensitivity analysis, so that’s not a problem. We just need inference algorithms which are generic enough to accept them.

And their existence is about the best thing that can be said about them, which is why we are putting work into this: there really isn’t a good solution out there for Bayesian parameter estimation right now. We have things set up with Turing.jl (and with a source transformation to Stan, with the limitation that we have to use its internal solvers), and would like to try out more backends as well.


Thanks for all the great discussion. I think I can simplify this even further. The model from the gist,

myModel = quote
    real(μ)
    μ ~ Normal(0,5)

    positiveReal(σ)
    σ ~ Cauchy(0,3)
    
    for x in data
        x ~ Normal(μ,σ)
    end
end

, can be rewritten as

myModel = quote
    μ ~ Normal(0,5)
    σ ~ Truncated(Cauchy(0,3), 0, Inf)
    for x in DATA
        x <~ Normal(μ,σ)
    end
end

A couple of points here:

  • Parameters can be declared with ~, with the support obtained using Distributions.support. This will need to be extended to cover multivariate cases, but in principle this should be simple.
  • The name of the observed data can be changed to DATA. This seems unlikely to be otherwise in use.
  • Distributions of observed data can be specified with <~.

For different ways of running the model, we choose replacements for ~ and <~, possibly wrapped in some extra code (a rough sketch of the first case follows the list). In addition to the usual inference methods, here are some simple ideas that should be useful:

  • Sample parameters from the prior
    • ~ calls rand
    • <~ is ignored
    • return parameters
  • Generate fake data
    • ~ calls rand
    • <~ calls rand
    • return DATA
  • Posterior predictive check
    • Take parameters as argument
    • ~ ignored
    • <~ calls rand
    • return DATA
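
Here’s a rough sketch of what the first of these (sampling parameters from the prior) could look like as a plain runtime expression rewrite. The prior_sampler name is made up, and the <~ case is left out, since (as it turns out below) <~ doesn’t currently parse.

using Distributions

# Replace every `v ~ Dist(...)` statement with `v = rand(Dist(...))`.
# A fuller version would also handle <~ statements and return the parameters.
function prior_sampler(model::Expr)
    newargs = map(model.args) do line
        if line isa Expr && line.head == :call && line.args[1] == :~
            v, dist = line.args[2], line.args[3]
            :($v = rand($dist))
        else
            line
        end
    end
    Expr(:block, newargs...)
end

priorModel = prior_sampler(quote
    μ ~ Normal(0, 5)
    σ ~ Truncated(Cauchy(0, 3), 0, Inf)
end)

eval(priorModel)  # draws μ and σ from their priors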

At least for now, my plan for the Stan-like approach is to use @dfdx’s XGrad and some of @Tamas_Papp’s libraries. Once all of this is working, I’d like to get variational inference going as well.

Next steps are to get some very minimal code structure in place and upload to Github.

Hmm, I took it for granted that <~ would parse sensibly. But

Main> ex = :(μ <~ Normal(0,1))
:(μ < ~(Normal(0, 1)))

Maybe it’ll be ok after all:

@jeff.bezanson:

We could potentially require spaces in a < ~b; we’ve added rules like that before. Then we could add ← and <~ as infix operators.

Thanks Jeff!

So what exactly are the limitations of this design? Can you put any Julia function in there for the likelihood? If so, I’d be willing to try out any beta with DiffEqBayes.jl and see how it works out.

That would be great, thanks Chris! I’ll let you know when I have NUTS working on a toy problem.

So far I’ve focused on a Stan-like approach, so the limitations should be exactly those of Stan. They’ll be stricter than that for a while – some of the \mathbb{R}^n bijections will take some work.

BTW, this works now:

myModel = quote
    μ ~ Normal(0,5)
    σ ~ Truncated(Cauchy(0,3), 0, Inf)
    for x in DATA
        x <~ Normal(μ,σ)
    end
end

Main> parameters(myModel)
2-element Array{Symbol,1}:
 :μ
 :σ

Main> supports(myModel)
Dict{Symbol,Any} with 2 entries:
  :μ => Distributions.RealInterval(-Inf, Inf)
  :σ => Distributions.RealInterval(0.0, Inf)

I’m hoping to get the bare-bones structure to Github over the weekend.
