I have also copied the contents into this comment and the next comment (it is too long for a single Discourse comment), in case anything ever happens to the Google Doc.

Click on “Click to expand” to view the contents.

##
Click to expand (part 1 of 2)

Sivaram Yesterday at 6:44 PM

What’s the difference between Gen and Turing for probabilistic programming?

1

Miles Lucas:star: 6 hours ago

Here’s one comparison: Turing is much closer to PyMC3 and STAN than Gen is. Those 3 all use a DSL for priors, likelihoods, inference, and analysis. (edited)

Sivaram 4 hours ago

Gen looks like it’s more general purpose and lower level

McCoy Becker 3 hours ago

Turing works with a restricted model class - (@trappmartin correct me if I’m wrong) all models which can be expressed by probabilistic programs where the support of the program (i.e. the set of random choices) are known at compile time. This model class is equivalent to anything you can express with a Bayesian network. Furthermore, Turing provides easy-to use APIs to advanced sampling tools (+ a very robust implementation of HMC), as well as the ability to compose pre-defined sampling algorithms. This already allows you to express any classical statistical model.

Gen can express any stochastic computable function, including those where the support is not known at compile time. This includes programs where control flow depends on randomness (like an if construct where the condition depends on a random choice, and the success branch introduces a new random choice while the fail branch does not). Gen is able to support these programs because it operates by tracing - this technique is similar to gradient tape tracing in automatic differentiation, the key difference is that to do this for a probabilistic program, you want to keep track of an auxiliary address map for all random choices when you trace forward, as well as the log probabilities for random choices that you make. The probabilistic program which you express represents a distribution over this choice map (map from addresses of random choices to sampled values). When you sample from the program you wrote (by using the generate function in Gen), you get a sample choice map.

For inference, Gen focuses on sampling but gives you a large amount of flexibility to write your own proposal distributions - the key technique to understand to unravel how this works in Gen is importance sampling. Given a probabilistic program with distribution over choice maps, you can write another program (in the Gen lingo, these are generative functions) which represents a proposal distribution over the same set of choice maps, and use it for sampling based inference. You can also write custom MCMC kernels in a similar way. This is called programmable inference because, once you write a probabilistic program, you have great control over proposals, which can greatly improve the efficiency of inference.

Summary:

Turing represents a restricted model class, but this is already sufficient for almost any use case in classical data analysis. Convenient APIs are provided which allow access to state-of-the-art sampling algorithms, and you are allowed to compose these algorithms across random variables in your program.

Gen can represent any stochastic computable function and has highly customizable inference (including all the sampling algorithms from above, as well as anything you can dream up in proposal form). I would typically recommend Gen if you are doing something “crazy” like hard robotics research, or large scale generative modeling in a new space where contemporary tools will need to be specialized.

Also, I think both have some support for learning probabilistic programs via automatic differentiation but you have to check me on this.

2

:clapping:

3

Jesse Perla 2 hours ago

Here is my long-run dream (1) I want to take what is close to an arbitrarily complicated julia function calculating the solution to some sort of complicated structure model as a bunch of julia functions (constrained only by things that are mathematically required such as invertibility, etc) and where I can write adjoint rules for it/etc.; (2) call it with a random distribution instead of fixed functions; (3) have that geenrate a way to draw from the likelihood.

Is that the longrun goal of something like Gen? Where you need to annotate the inside of your function less and less over time?

McCoy Becker 2 hours ago

That’s actually the goal of a research project I’m working on called Jaynes. It’s not usable yet, but the goal is to do something similar for probabilistic programming which https://arxiv.org/abs/1810.07951 did for AD in Julia with Zygote. In short, you should be able to write an arbitrary stochastic program with rand calls and transform it into a probabilistic program with the right APIs for inference. @Jarred Barber is working on another project in parallel which has a different perspective on the same idea using category theoretic ideas (specifically, compiling to categories alla Conal Elliot).

Gen is not meant to turn arbitrary Julia programs into a “probabilistic interpretation” automatically - the goal is actually to facilitate a highly convenient API for programmable inference. I think the annotations are useful in this regard, because when you explicitly write @trace you are structuring your program in a particular way and you know where the “probabilistic interpretation” begins and ends. From this perspective, you shouldn’t be surprised by anything in the trace

The problem with automatic interpretation is that you run into problems with certain language features. The question is always: what do you support in your interpreter? Do I support control flow constructs, recursion, mutation? If you’re working with IR-based manipulation tools like IRTools and Cassette to try and automatically do what I’ve stated above, it starts to become very complicated very quickly. Trace-based approaches like Gen handle this naturally - the cost now is that you do have to explicitly annotate things. So trace-based approaches are ripe for experimentation for automatic interpretation. The primary design question is then - what API do you give to users to do inference when everything is automatic like this? This is an open question.

arXiv.orgarXiv.org

Don’t Unroll Adjoint: Differentiating SSA-Form Programs

This paper presents reverse-mode algorithmic differentiation (AD) based on source code transformation, in particular of the Static Single Assignment (SSA) form used by modern compilers. The…

McCoy Becker 2 hours ago

Also, I shouldn’t be speaking for directly for Gen, because @marcoct and @Alex Lew are the two members of the community who work on Gen actively in ProbComp. They are better, and likely more accurate, sources of information than myself. These are just observations I’ve made from using the system.

Also sent to the channel

Martin Trapp 2 hours ago

Nope, Turing is a universal PPL.

Turing allows you to express

classical Bayesian models like in Stan (see for example: https://turing.ml/dev/tutorials/2-logisticregression/)

dynamic models (unknown number of parameters) through random measures (see for example: https://turing.ml/dev/tutorials/6-infinitemixturemodel/)

Bayesian neural networks (see for example: https://turing.ml/dev/tutorials/3-bayesnn/)

Compositional models that integrate with GPs PPLs (see for example: https://github.com/willtebbutt/Stheno.jl/tree/master/examples/turing_integration)

And more…

Similar to Gen, Turing uses a tracing approach. Actually various tracing approaches, specifically designed for efficient inference in dynamic model. We have a few papers on this.

Last but not least, Turing uses programmable inference through contexts. Which is tailored to allow efficient implementation of new algos. Gen is more focusing on automatizing inference in my understanding.

2

McCoy Becker 2 hours ago

@trappmartin Thank you for correcting me!

1

McCoy Becker 2 hours ago

One question: are infinite mixture models equivalent to probabilistic programs with stochastic control flow? That’s what I think of when I think of universal PPL.

Jarred Barber 2 hours ago

i would think that they form a subset?

Martin Trapp 2 hours ago

Turing also supports stochastic control flow.

McCoy Becker 2 hours ago

Cool - I’ve been spending time in Gen land for awhile, I should investigate Turing again.

I am very excited for the IRTracker.jl work

Martin Trapp 2 hours ago

I don’t think infinite mixtures are strictly a program with stochastic control flow. This largely depends on the representation that is used. (edited)

Martin Trapp 2 hours ago

Good to hear. Me too. @phipsgabler is doing some cool work and we should hopefully soon have a JAGS style Gibbs sampler realized through IRTracker. This will be a first try run to utilize IRTracker in Turing,

McCoy Becker 2 hours ago

I’m not familiar with nonparametrics beyond GPs, actually recently I’ve been interested in representation for these models in PPLs, which I’m wholly naive about.

Jesse Perla 2 hours ago

Can anyone tell me how the whole “normalizing flows” fits into this world? I ran into it and it roughly seemed what I was interested in: basically take a model written with the least knowledge about probability necessary, shove a distribution throw it, and get something you can sample on the other side to use as a likelihood? Am I missing the point entirely?

McCoy Becker 2 hours ago

@Jarred Barber ^. Sorry putting on the spot .

Jarred can correct me: but from what I understand, you start with a simple distribution (i.e. a diagonal multivariate normal) and seek to learn an invertible transformation (differentiable in a set of parameters) which transforms the distribution into a more complicated one which fits the data generating distribution. The loss function can be computed by using the Jacobian transformation formula for distributions, so you can compute the log PDF of the distribution you want to represent by taking the original log PDF and multiplying by the Jacobian determinant. The goal of efficient flow structures is to make the computation of the determinant computationally minimal.

This model class can likely be represented in either Turing (in Bijectors) or Gen, but you’d need to also use Flux to represent the parametrized transformations with something with a large capacity (i.e. a neural network). I think Bijectors uses Flux under the hood @trappmartin (edited)

Martin Trapp 2 hours ago

@Tor implemented Bijectors with allows you to express normalizing flows in Julia. The gist is that you apply multiple invertible transformations on a simple base density like Gaussian. This allows you to express more complex distributions (trough transforming a simple one) while still being able to evaluate the density exactly.

McCoy Becker 2 hours ago

Words out of my mouth again

McCoy Becker 2 hours ago

Okay I have to actually do work, this has been a good thread see ya

Jesse Perla 2 hours ago

Thanks. Is the longrun vision of this that you could register inertibility/etc. for all sorts of functions in addition to registering adjoints for your functions, and that you should eventually be able to pump out densities from very complicated programs (that fulfill the mathematical requirements of normalizing flows)? Bascially to get a density/likelihood from arbitrary programs?

Martin Trapp 2 hours ago

Me too. One last note, there is a really good review paper on normalising flow. I just can’t seem to find it…

Martin Trapp 2 hours ago

Yes (I think) it is. (edited)

Martin Trapp 2 hours ago

But better ask @Tor.

Jesse Perla 2 hours ago

Thanks to all, now get back to work!

Jesse Perla 2 hours ago

… hopefully work = coding stuff I can free-ride off of in the future.

Philipp Gabler 2 hours ago

http://arxiv.org/abs/1505.05770?

arXiv.orgarXiv.org

Variational Inference with Normalizing Flows

The choice of approximate posterior distribution is one of the core problems in variational inference. Most applications of variational inference employ simple families of posterior approximations…

3

Tor Fjelde 1 hour ago

Soooo normalizing flow doesn’t necessarily have anything to do with Bayesian or variational inference, but it’s an “independent” concept which has a lot of use-cases in Bayesian and variational inference (e.g. the paper @phipsgabler points to).

A normalizing flow is basically a base Distribution combined with an invertible & differentiable transformation (with differentiable inverse), which is sometimes referred to as a Bijector. The reason why you impose these extra conditions on the transformation used, is so that we can compute the logpdf of the output when the input are samples from the base Distribution. A super-simple example is to use a Shift operation (i.e. addition) by some quantity μ to transform the samples from a Normal(0, 1). This results in a Normal(μ, 1), right? And Shift is indeed a Bijector , so if you follow the standard procedure of computing the logpdf of the output of Shift when the input is samples from a Normal(0, 1), you end up with the logpdf of a Norma(μ, 1) And in the same way as you would MLE estimation to find the mean μ of a Normal distribution given observations, you can instead view this as finding the optimal parameter μ of the Shift operation which maximizes the logpdf of the Normal(0, 1) transformed by Shift. This far we haven’t mentioned anything that relates to Bayesian inference or anything of the sort, but it indeed has its uses. For example in VI you can use more “expressive” transformations (but still they need to be a “bijector”) to improve the variational approximation, or in Bayesian inference you can use such transformations to reparameterize your model to make it easier to sample from. In Turing it’s mainly used to transform a Distribution to have support on the entire real line, as this circumvents HMC stepping outside of the domain/support of the logpdf of the Distribution.

Bijectors.jl provides a pretty neat and efficient way of defining these types of transformations (independently of any distributions, as it should be), in addition to simple interaction with Distribution allowing you to create a new distribution using a Bijector from Bijectors.jl combined with a Distribution from Distributions.jl:) So you can use this to construct new distributions because you just need a strange distribution that can be expressed as a transformation of some other distribuiton, construct variational approximation for use in Turing.jl, or just do MLE estimation.

I did a informal presentation on basically this in a Julia meetup last year, and the slides are available at: https://torfjelde.github.io/presentations/cambridge-julia-meetup/

I particularly like the very last bit of the transformation where we use a normalizing flow that looks like a banana as a prior inside a Turing model:)

In addition, the README in Bijectors.jl (https://github.com/TuringLang/Bijectors.jl) explains pretty nicely what a Bijector is and how it fits in with Distribution, and there’s a section on normalizing flows too:)

Jesse Perla 1 hour ago

OK. But from that it doesn’t sound like Normalizing Flows are intended to evolve “differentiable programming” to be something like “differntiable and invertible programming”, which was my hope? . i.e. it has more down-to-earth goals of transforming distributions with a limited number of operations, and not transforming distributions written for an entire program (that is invertible)?

Tor Fjelde 1 hour ago

Thanks. Is the longrun vision of this that you could register inertibility/etc. for all sorts of functions in addition to registering adjoints for your functions, and that you should eventually be able to pump out densities from very complicated programs (that fulfill the mathematical requirements of normalizing flows)? Bascially to get a density/likelihood from arbitrary programs?

@Jesse Perla It’s not quite like that. You could do that, but in the case of bijectors you really want to exploit the fact that a composition of two bijectors is also a bijector. Using this property, it’s so much easier to create new transformations. So if you did something like registering inverses for existing functions and whatnot, composition becomes difficult, in some cases it’s ambiguous whether or not to treat an input as a single input or a batch of inputs, etc. So instead you have to define a subtype of a Bijector{N} where N denotes the dimensionality of a single input ,e.g. if you’re transforming a Real it should be 0, if it’s a Vector it should be 1, etc. Then you implement evaluation, evaluation of the inverse, logabsdetjac and, optionally, forward which computes evaluation and logabsdetjac in one pass (useful in the case where you can share a lot of computation). You can use AD to compute logabsdetjac automatically if you want but it quickly becomes impractical, which is why a lot of research goes into finding new normalizing flows with easy-to-compute logabsdetjac (edited)

Jesse Perla 1 hour ago

Thanks for the reply. To give context, when economists are writing “structural models”, they tend to be far more crazy than something you can express in a PPL. Models to us are mostly elaborate programs. All sorts of nested madness and fixed points. However, each of the components can usually be written to have an adjoint which are nested, which makes AD possible.

The hardest part of estimation when it comes to these structural models is figuring out the appropriate likelihood for these sorts of calculations. I am trying to get a sense on what technologies in the PPL-universe can help.

McCoy Becker 1 hour ago

Theoretically, they can be expressed in any universal PPL

Jesse Perla 1 hour ago

sure, and I can express my julia programs in any turing complete language

McCoy Becker 1 hour ago

I’m trying to tease apart the issue here

Jesse Perla 1 hour ago

But maybe what I am missing is that the universal PPLs with Gen and Turing are actually getting pretty far down the universal side.

McCoy Becker 1 hour ago

When you say likelihood, what do you mean? The probabilistic program is the likelihood

McCoy Becker 1 hour ago

Do you mean you need access to the log probability for the program?

Jesse Perla 1 hour ago

Yeah, I guess the whole program is a likelihood, just a very complicated one.

Tor Fjelde 1 hour ago

That sounds very interesting!

So if each of these components of the SM could be expressed as a Distribution together with a Bijector, it would just be a matter of accumulating the resulting logpdf and you’re done This is also why an AD-like framework is probably overkill: the compositional rule is always the same, and it’s always

(x ↦ f(x), prev_logabsdetjac ↦ prev_logabsdetjac + logabsdetjac(f, x))

McCoy Becker 1 hour ago

@Tor hit it right on the head. Although my main question here is whether or not you want to differentiate the log PDF

Jesse Perla 1 hour ago

Yes, it would be nice to differentiate the log PDF if I wanted to run maximum likelihood or a HMC, but if I can just get out a logpdf of a complicated model and not have it differentiable, that is still useful.

Jesse Perla 1 hour ago

Let me see if I am understanding this alittle:

With turing, I can write my complicated model given the “shocks” and randomness as an input. If I am able to put AD for my function on all of this, then I can get out a density as well as differentiate it for HMC

With something like Gen, I would draw random shocks in the middle of my code where I needed randomness in the algorithm?

(edited)

McCoy Becker 1 hour ago

Slack deleted my message for some reason

Alex Lew 1 hour ago

@Jesse Perla In Gen, the randomness can come at the beginning (and be passed through a deterministic function) or anywhere in your model. (I believe the same is true in Turing.) The question is what you want to do with the output of the deterministic function. Do you want to constrain it to equal a certain observed output value, and then sample plausible inputs? (edited)

Martin Trapp 1 hour ago

What @Alex Lew said.

Jesse Perla 1 hour ago

Hmm… I need to think about the structure here to see if it is fundamentally different from any other PPL (just with uglier detmerinistic functions).

Jesse Perla 1 hour ago

You guys have given me more than enough help to do my homework and think this through more abstractly before I bug you again.

Philipp Gabler 1 hour ago

The usual generative models are of the form

π ~ Prior

Θ ~ ComplicatedStuff(π)

X ~ ObservationDist(θ)

where we want to do things with p(Θ | X = x) for some observations x (and maybe even only parts of \Theta). It sounds like in your case, ComplicatedStuff is just complicated, but determinstic? (edited)

Cameron Pfiffer:turing: 1 hour ago

@Jesse Perla I’m happy to help with your issue specifically (over DM if you prefer not to share your research topic broadly), from a joint econ/PPL standpoint. I’m a big fan of structural modeling. From what I’ve seen of your model, it shouldn’t be different than any other model expressable in a PPL, but perhaps you have some domain issue that’s complicating things.

Alex Lew 1 hour ago

@Jesse Perla

The pseudocode might look something like

x ~ input_distribution()

y = f(x)

But this won’t work if you want to constrain y to equal some observed data value, and then sample likely values of x. In both Turing and Gen, observations must also come at random-choice-points (~ lines in the model).

One option is to add random noise to y:

x ~ input_distribution()

y ~ normal(f(x), noise)

(The noise could also be random, noise ~ noise_prior().

If f is invertible has an easy-to-compute log abs det jacobian, you have no need for inference: just run f_inv on y to get your x value back out. However, it may be the case that f is a function of x and some additional randomness omega, where omega has the same dimension as y:

x ~ input_dist()

omega ~ some_other_dist()

y = f(x, omega)

If the function f(x, •) is invertible and has a tractable log-abs-det-jacobian (when considered as a function of omega), then you can create a distribution f_dist(x) with a known density, that internally "samples omega and pushes it through f(x, •).

Then your model would be:

x ~ input_dist()

y ~ f_dist(x)

This now allows you to constrain y and sample x using SMC, MCMC, etc. (The “normal-noise” model did as well, but this one doesn’t add new noise variables.) Constructing f_dist from Julia code for f is quite tricky, because it requires automatically constructing f’s inverse. I don’t believe there’s a package that does this right now, but if you build f(x, •) as a Bijector (either compositionally, or by hand-coding its inverse), you should be good to go.

Philipp Gabler 43 minutes ago

Quite what I meant, but expressed much better!

Jesse Perla 41 minutes ago

Amazing, let me think about that. At this point I am trying to get a sense for how PPL can fundamentally change how we could our models. But the basic model given above

π ~ Prior

Θ ~ ComplicatedStuff(π)

X ~ ObservationDist(θ)

Fits a lot of problems (and I think I understand that case reasonably well within a PPL setup). I am trying to get a sense for whether there are an expanding set of alternative ways to code that this stuff opens up.

For example, writing randomness in the middle of your code, transforming an entire set of code to spit out a distribution (instead of just sampling from one), etc.

Jesse Perla 40 minutes ago

But you have done more than enough to get me started on thinking about this stuff. For now, my goals are more mundane and fit into the standard setup.

1

McCoy Becker 36 minutes ago

In your for example section, the first one is do-able in any PPL.

Your second example is harder. I’ve done a few experiments here, but I’ve only been able to automatically transform a program into a functional log PDF with restricted models, without control flow. I know it’s possible for control flow where certain information is known at compile time but I haven’t written the correct pass yet.

In general, I don’t know if it’s possible for an arbitrary program. But it’s a good idea - if you give me the functional log PDF, I can differentiate it, do all sorts of stuff

Philipp Gabler 34 minutes ago

@Alex Lew about the package constructing inverses: I think https://github.com/GiggleLiu/NiLang.jl kind of does this.

1

Alex Lew 32 minutes ago

@McCoy Note that joint logpdfs of all random variables are available for any probabilistic program, including dynamically for models with arbitrary control flow. Marginal densities of outputs are harder (and don’t necessarily exist without a carefully defined, program-dependent reference measure).

Philipp Gabler 31 minutes ago

Also: https://github.com/MikeInnes/Poirot.jl, although that is just a draft (imagination?) currently.

McCoy Becker 28 minutes ago

@Alex Lew I think you’re right and I’m just being dumb. In Gen, the joint log PDF for any Trace is acquired by taking the choice map, and constraining the execution with the map. This is what you’re saying right?

Alex Lew 28 minutes ago

If you have a trace in Gen, you can even do get_score(trace) and get its joint logpdf. Or you can get the logpdf of any complete choicemap using Gen.assess (edited)

1

Alex Lew 27 minutes ago

(This assumes your model has no “untraced randomness” using rand. If it does, you’ll get unbiased estimates of the joint logpdf on the traced variables.)

McCoy Becker 26 minutes ago

Now I’m wondering why I thought it was so hard before. Maybe it’s because of the AD paradigm? I wanted the functional form so I could get the gradient program and pass in any trace.

McCoy Becker 25 minutes ago

I really need to look at your AD stuff in Gen

Alex Lew 23 minutes ago

You could try running AD on Gen.assess (which is deterministic, given a complete choicemap), but Gen also has its own AD support, currently based on ReverseDiff.

Gen’s AD works not just through Julia code, but also through calls to Tensorflow and Torch – and you can add custom adjoints for any generative function (which is how we get Tensorflow and Torch support). Also, parameters of generative functions can be optimized using any of the Flux optimizers.

Alex Lew 22 minutes ago

(This uses the packages GenTF, GenPyTorch, and GenFluxOptimizers).

McCoy Becker 20 minutes ago

Can I write proposals using gradients of other generative functions? Or does that hit the nesting/measure issue that your work on MetaPPL is solving?

Alex Lew 20 minutes ago

Gradients of other generative functions are deterministic, so none of those hairy issues arise Totally fine to use gradients inside a proposal.

McCoy Becker 19 minutes ago

Okay ideal - thanks for enlightening me

McCoy Becker 16 minutes ago

oh yes, ReverseDiff is a gradient tape.

The reason I was interested in source to source is because we’ve been having issues expressing (in a performance way) a set of ideas in deep RL with gradient tape based implementations. This is an edge case where the shape of the computation graph changes on each run (very similar to a probabilistic Trace).

If I wanted to batch Traces in training, for example, it’s not so simple to do. But this is a case which hasn’t been explored yet because we aren’t typically trying to offload learning of probabilistic programs to GPUs (edited)

Alex Lew 3 minutes ago

I see! Gen has support for Zygote in a branch, but I’m not even sure it’s source-to-source that you’re looking for: you want batched probabilistic programming. (Note that an adjoint acquired via a source-to-source transformation may still have all the control-flow weirdness of the original program.)

Alex Lew 3 minutes ago

You might find this paper interesting: https://arxiv.org/abs/1910.11141

arXiv.orgarXiv.org

Automatically Batching Control-Intensive Programs for Modern Accelerators

We present a general approach to batching arbitrary computations for accelerators such as GPUs. We show orders-of-magnitude speedups using our method on the No U-Turn Sampler (NUTS), a workhorse…

Alex Lew 1 minute ago

(In Gen, when we want to do batched training of, e.g., neural proposals, we write the model in a vectorized fashion. So we have vectors inside our traces, rather than vectors of traces.)

McCoy Becker < 1 minute ago

That paper is literally exactly what I’m looking for