Probabilistic programming with source transformations

Can you expand on this a bit? Do you think there is a problem with

  1. the ODE engine of Stan,
  2. the HMC implementation,
  3. something else,

or is it simply poor mixing?

I think it’s a few things related to the ODE integration. I’ll take it for granted that their HMC implementation is good.

First off, yes, Stan has two ODE solvers: a wrapper for dopri and a wrapper for CVODE. There are a few issues there. First of all, the pile of benchmarks at DiffEqBenchmarks.jl shows that the problems people normally solve in Bayesian parameter estimation shouldn’t be using these algorithms; high order Rosenbrock or higher order Verner methods instead give almost an order of magnitude speedup. Also on the algorithmic side, note that this choice of stiff integrator means there isn’t an L-stable integrator available (without limiting CVODE’s order, of course), and since it’s a multistep method it won’t do well with heavy event handling problems like those from PK/PD models (each event sends it back to Implicit Euler, and our tests on event handling after adding it to Sundials.jl show it takes quite a few steps after CVodeReInit before CVODE gets back to normal step sizes). So one thing I want to quantify is just how much of the issue is due to not using good choices for the integration method.
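As a rough sketch of the kind of comparison meant here (using the classic ROBER stiff problem as a stand-in, with placeholder tolerances rather than the actual DiffEqBenchmarks.jl setup): the same problem solved with the CVODE wrapper from Sundials.jl and with a high-order, L-stable Rosenbrock method.

using OrdinaryDiffEq, Sundials

function rober!(du, u, p, t)
    y1, y2, y3 = u
    du[1] = -0.04y1 + 1.0e4*y2*y3
    du[2] =  0.04y1 - 1.0e4*y2*y3 - 3.0e7*y2^2
    du[3] =  3.0e7*y2^2
end
prob = ODEProblem(rober!, [1.0, 0.0, 0.0], (0.0, 1.0e5))

# Multistep BDF (CVODE) vs. an L-stable Rosenbrock method
@time solve(prob, CVODE_BDF(), abstol = 1e-8, reltol = 1e-8)
@time solve(prob, Rodas4(),    abstol = 1e-8, reltol = 1e-8)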

But there are other things as well. For example, the Lotka-Volterra model is a nice choice for a parameter estimation problem since for many of the parameters it diverges to infinity. It also qualifies as a simple problem because it’s just two non-stiff ODEs: how much simpler of an ODE system can you get? But if the birth rate is too high, the population explodes! This means it needs a strategy for handling the case where integration stops early, and for our own optimization functions we just filled it in with Infs. Stan seems to have trouble with this, so we’re testing things out like truncated priors, but the strategy may need to be more refined than that. Additionally, “early exit” is something that can speed up the process tremendously. For example, if the Lotka-Volterra solution ever gets up to 100, I know it’s very bad anyway (whether it diverges or not), so a DiscreteCallback which just calls terminate!(integrator) is a nice way to discard it in the DiffEq framework, but we don’t have access to that kind of tooling if we use Stan’s two integrators.
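As a minimal sketch of that early-exit idea (the parameter values and the cutoff of 100 are just illustrative), a DiscreteCallback that terminates a Lotka-Volterra solve as soon as either population crosses the threshold:

using OrdinaryDiffEq

function lotka_volterra!(du, u, p, t)
    du[1] =  p[1]*u[1] - p[2]*u[1]*u[2]
    du[2] = -p[3]*u[2] + p[4]*u[1]*u[2]
end

# Stop integrating as soon as either population exceeds 100; such a draw
# would be rejected (filled in with Inf) anyway.
condition(u, t, integrator) = maximum(u) > 100
affect!(integrator) = terminate!(integrator)
bail_out = DiscreteCallback(condition, affect!)

prob = ODEProblem(lotka_volterra!, [1.0, 1.0], (0.0, 10.0), [1.5, 1.0, 3.0, 1.0])
sol  = solve(prob, Tsit5(), callback = bail_out)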

There are other things I want to play with, like replacing the ODE with additive-noise SDEs as a form of “model regularization”, and stuff like that. And I want to quantify the differences between the methods as well. I am not sure that playing with these ideas is possible with Stan, so I don’t think it’s the future for us, but it’s a necessary benchmark target and a system to learn from.

(oh, and I want to do SDEs, DDEs and DAEs)

1 Like

Thanks for the details. These echo my problems with Stan: a great HMC engine, but a limiting DSL. That was the motivation for turning the whole problem “inside out”: write an HMC engine in Julia, and code the posterior in Julia too.

If you have a simple ODE problem coded in Julia that works with, say, ForwardDiff.jl, I would be happy to run it through DynamicHMC.jl — I am curious how it performs.

1 Like

Let’s get to work: DynamicHMC.jl backend · Issue #16 · SciML/DiffEqBayes.jl · GitHub

1 Like

This is more along the lines of what I was asking about. Seems like that functionality isn’t there, though.

To expand a bit (and respond to remarks from @ChrisRackauckas and @Tamas_Papp), I’m developing some big hierarchical spatiotemporal models with complicated covariance structures that happen to have ODEs embedded within them as one small part of the hierarchy. Stan has put some effort into getting the features I use, like Gaussian processes, LKJ priors, etc., both tested and implemented efficiently. I’m not sure yet whether I could lob a full model of this form through one of the Julia HMC samplers with AD and have it work (or rather, work as well as it would in Stan), given the level of maturity. Is this even possible right now? It seems even the ODE part in DiffEqBayes.jl and its integration with DynamicHMC.jl is still a work in progress, and ODEs are only one piece of what I need.

I’m willing to try pure Julia eventually, but my original post was just wondering whether I could start by replacing just the ODE part of my model with callbacks to the much nicer DifferentialEquations.jl solvers, without moving off Stan completely. I’m not too interested in doing model rewriting, though.

Longer term, I have no great love for Stan’s DSL, multi-language problem, and lack of flexibility, and would much rather work in pure Julia. I don’t need to be sold on that!

Nope. We are not there yet but hopefully we are by the end of the summer. We’ll keep you posted.

And as a side note, I think @Tamas_Papp already bumped into what I was discussing as to the difficulties on seemingly simple problems.

On a problem like this I’m not sure the integrators will really make much of a difference, but it would be interesting to investigate more domain-specific HMC tools. Note that this is the same issue that Stan has on this ODE. I want the algorithms in Julia so I can more easily tackle this as a research project and push things around until I get something that works well.

One seemingly quite popular alternative to HMC for inference in probabilistic programs is Sequential Monte Carlo, specifically an iterated conditional SMC algorithm. I have recently implemented the generic algorithm in Julia with an emphasis on multi-threaded performance.

https://github.com/awllee/SequentialMonteCarlo.jl

with documentation here:

https://awllee.github.io/SequentialMonteCarlo.jl/latest/

I plan to make the main release after 0.7/1.0 so that it is relatively stable.

If that is helpful, I would be very interested to know.

As an aside, I wanted to use a JuliaDiffEq example but wasn’t able to find a way to reduce the overhead (since one would solve, e.g., an ODE many times in parallel over short time periods).

Oh, that looks great, thanks for the link! The Turing folks might be interested in this as well. I think they’ve implemented this in Turing, but it might not be as cleanly abstracted as yours. Also, if you’re interested in variations of SMC, you should check out Anglican. The site has links to their papers, and also implementations of the algorithms in Clojure.

Some context, in case others aren’t familiar with this area

SMC-based systems are sometimes called “universal” probabilistic programming. This is a historical artifact originating in an analogy with Turing completeness (though the analogy is inaccurate, as Stan is Turing complete), reinforced by the lack of a better term.

Historically, BUGS and Church are the Fortran and Lisp of probabilistic programming. Stan descends from the former, working in terms of a known fixed-dimension parameter space with a computable density function. In exchange for this constraint, posterior conditional densities are available, usually leading to better performance (both statistical and computational).

Church-like languages are more flexible. Instead of a joint density, they work in terms of a randomized simulation process. SMC-like algorithms use a collection of “particles”, each representing one run of the simulation. Cleverly manipulating weighted particles (e.g., choosing to interrupt some and clone others) leads to some different possibilities for parameter inference.

2 Likes

The “right” way to do it is to use the integrator interface, so that you build the integrator once and then re-solve using the same cache variables each time.

http://docs.juliadiffeq.org/latest/basics/integrator.html#Reinit-1

Of course, I’m ignoring that right now because the algorithm issue is much more important to get correct first. But if you want to optimize it, there you go.
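A small sketch of that pattern (the toy problem and parameter values are just for illustration): build the integrator once with init, then reinit! and solve! for each new parameter draw so the cache variables are reused.

using OrdinaryDiffEq

decay!(du, u, p, t) = (du[1] = p[1] * u[1])
prob = ODEProblem(decay!, [1.0], (0.0, 1.0), [-0.5])

integrator = init(prob, Tsit5())      # allocate caches once

for p in ([-0.4], [-0.6], [-0.8])
    integrator.p = p                  # swap in the new parameters
    reinit!(integrator)               # reset u and t without reallocating
    solve!(integrator)                # integrate to the end of the tspan
    @show integrator.sol.u[end]
end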

Or for small problems you can use static arrays. But Julia currently has a few inference problems with some of our algorithms (an inference issue with statements containing many operations on static arrays), so that path won’t be the best until a 1.x release fixes it.

1 Like

Hold on, I found out how to fix static array stuff. I’ll get that in by tonight and these tests should then be using static arrays.

Edit: It’s done. You can use the static array parameterized function setup for this now quite performantly without reiniting.
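For reference, a small sketch of what that static-array setup looks like (parameter values are placeholders): an out-of-place right-hand side returning an SVector, which avoids per-step allocations for small systems.

using OrdinaryDiffEq, StaticArrays

lv(u, p, t) = SVector(p[1]*u[1] - p[2]*u[1]*u[2],
                      -p[3]*u[2] + p[4]*u[1]*u[2])

u0 = SVector(1.0, 1.0)
p  = (1.5, 1.0, 3.0, 1.0)
sol = solve(ODEProblem(lv, u0, (0.0, 10.0), p), Tsit5())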

2 Likes

As @dfdx mentioned, there’s some trouble with differentiating expressions like logpdf(Normal(μ,σ),x): “differentiating a data type” doesn’t really make sense.

I looked a bit more into the implementation. For functions like logpdf, Distributions just calls StatsFuns, which leads to a nice algebraic expression after a couple of steps. The following are all equivalent, because they are successive simplifications of the first line:

Main> logpdf(Normal(μ,σ),x)
-2.548376445638773

Main> normlogpdf(μ,σ,x)
-2.548376445638773

Main> normlogpdf((x-μ)/σ) - log(σ)
-2.548376445638773

Main> -(abs2((x-μ)/σ) + log2π)/2 - log(σ)
-2.548376445638773

Distributions manages this with a macro _delegate_statsfuns.

What’s really needed, I think, is a way to easily get from the first of these to the last. I could code this in Soss for this specific case, but it seems like a generally useful thing anyway. Most generally, I could imagine a function that takes an expression and returns an iterator over successive evaluation steps. I’m guessing the most efficient path might be somewhere between a quick hack and the “best possible world” evaluation iterator.

Any thoughts/suggestions?

ForwardDiff.jl works through these just fine. IMO AD is the “best possible” solution for these situations; as you have seen, the symbolic approach does not scale well.
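For instance, a quick sketch of that (the values are arbitrary): ForwardDiff differentiates straight through Distributions’ logpdf with no rewriting needed.

using Distributions, ForwardDiff

f(t) = logpdf(Normal(t[1], t[2]), t[3])

# gradient with respect to (μ, σ, x) at an arbitrary point
ForwardDiff.gradient(f, [0.0, 1.0, 1.5])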

Took longer than I expected, but I have started a set of worked examples as Jupyter notebooks in
https://github.com/tpapp/DynamicHMCExamples.jl

Currently it only contains a single problem: estimation of an unknown covariance matrix (decomposing it into correlation and variance, using the LKJ prior on the correlation). This is a WIP, but I made it as clear as I could; note that the API can of course change (I will of course update these examples then).

5 Likes

I agree. I wasn’t after symbolic diff, but AD via source rewriting, maybe similar to what Tangent does for Python. I think XGrad is along these lines, but maybe I’ve misunderstood.

That looks really helpful. Thank you!

I think I need to clarify it a bit. In math, derivatives are defined for functions. In this piece of code Normal represents a distribution, and there is no such thing as a derivative of, or w.r.t., a distribution. We can treat Normal as a struct (i.e. a collection of its parameters) and find derivatives w.r.t. each field. In general this should work and I’m open to this route. However, this way we will be bound to the internal structure of the data types, making the code more fragile. (Actually, Distributions.jl defines the function params() to encapsulate implementation details, but from reading the source code I don’t think any AD tool will understand it and find derivatives only w.r.t. these parameters.)

Given the source rewriting approach that @cscherrer is using, I think the simpler and more robust approach is to expand things like logpdf(Normal(...)) into normlogpdf() and differentiate that directly. It’s actually quite easy to do; XGrad already does it internally. But other ideas are welcome too.

I agree. I wasn’t after symbolic diff, but AD via source rewriting, maybe similar to what Tangent does for Python. I think XGrad is along these lines, but maybe I’ve misunderstood.

There are two main types of AD: forward-mode and reverse-mode. Reverse-mode AD can be dynamic or static. Static reverse-mode AD usually builds and optimizes a computation graph; it can build this graph using either function overloading or source transformation.

Symbolic differentiation rewrites a symbolic expression (e.g. code) into the corresponding derivatives. In most cases the rewriting is done using the same two-step procedure as in reverse-mode AD.

XGrad (as well as Theano, TensorFlow or Tangent) falls under the definition of both AD and SD.

A short guide (a small sketch follows the list):

  • forward- vs. reverse-mode depends on whether you have a function of the form \mathbb{R} \to \mathbb{R}^n or \mathbb{R}^n \to \mathbb{R}
  • static vs. dynamic depends on whether you can define the computation graph beforehand
  • function overloading vs. source rewriting depends on the context
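A tiny sketch of the first point (a toy function for illustration; real workloads are where the difference matters): the same \mathbb{R}^n \to \mathbb{R} gradient computed with a forward-mode package and a reverse-mode package.

using ForwardDiff, ReverseDiff

f(x) = sum(x .^ 2) / 2         # R^n -> R; the gradient is just x

x = randn(5)
ForwardDiff.gradient(f, x)     # forward-mode: cost grows with the input dimension
ReverseDiff.gradient(f, x)     # reverse-mode: one forward sweep plus one reverse sweep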
3 Likes

Ideally yes, but my impression is that ReverseDiff.jl is waiting for Cassette.jl to mature, so given the current state of packages, ForwardDiff.jl is a reasonable (but of course not ideal) choice for \mathbb{R}^n \to \mathbb{R} functions too, out of necessity.

1 Like

Thanks for the details @dfdx. I guess my impression is that for sufficiently simple code, statically rewriting the source will always be better. For me the relevant questions are

  • Which approach is most appropriate in principle?
  • How much risk is there that a given implementation will become obsolete?

I can see that XGrad still has a way to go before maturity, but it seems like a good approach to me, and I think there will always be a place for source transformation autodiff. I was really kind of shocked to see that ReverseDiffSource seems to have been abandoned, and I’m excited to see new work in its place.

Wait, now I’m confused. XGrad already does this? Maybe I’m calling it wrong. I was thinking I’d have to reconstruct the name-mangling Distributions does to call StatsFuns. Can you show a minimal example, like computing the gradient of this?

t -> logpdf(Normal(t[1],t[2]),t[3])

Note that the Cassette-based version of ReverseDiff moved to Capstan.jl. Another battle-tested reverse-mode AD package using function overloading is AutoGrad.jl.

I mostly use AD/SD in machine learning tasks where the size of the inputs can easily exceed 10^6. Using ForwardDiff.jl for such data isn’t an option at all :slightly_smiling_face:

2 Likes

I always try to isolate my code from other packages’ implementation details. Normally that means fixing the derivatives: finding them once for each function and defining corresponding primitives. This way any changes to the internals of those packages won’t affect you as long as the outer interfaces stay the same. Note that source rewriting is advantageous here since you can literally generate the code for the derivatives you need and copy-paste it into your source files.
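As a hand-written sketch of that idea (this is just the pattern, not any particular package’s rule-registration API): wrap the foreign function once and keep an analytically derived gradient next to it, so upstream internals can change without breaking your derivatives.

using StatsFuns

# Our own primitive: a thin wrapper over the function we don't own.
normal_logpdf(μ, σ, x) = normlogpdf(μ, σ, x)

# Its gradient, derived once by hand and kept in our source tree.
function normal_logpdf_grad(μ, σ, x)
    z = (x - μ) / σ
    (z / σ,              # ∂/∂μ
     (z^2 - 1) / σ,      # ∂/∂σ
     -z / σ)             # ∂/∂x
end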

How much risk is there that a given implementation will become obsolete?

There’s a high risk that one of the functions you don’t own will acquire a part that is non-parsable (for source transformation), not generic (for function overloading), or just somehow non-differentiable (e.g. calling into C code). If this happens, your code will just break even though all the interfaces stay the same.

Most libraries I’ve seen define a number of primitives that are guaranteed to work, and additionally support other code that is expected to work in most cases.

Wait, now I’m confused.

It’s a bit of a longer story, so I’ll answer this tomorrow :slight_smile: