Probabilistic programming with source transformations

Cool, we’ll keep in touch then. No reason to rush. I hope to get a GSoC project devoted to Bayesian parameter estimation, in which case it would be nice to get setup with a few Julia tools and compare between them, but that would be months off (and requires finding a student :smile:)


I have a working version of the LKJ prior for correlation matrices on my computer that I will commit within a few days to ContinuousTransformations.jl. Then we just need stick breaking and unit vectors, and we will have all the transformations Stan supports out of the box. If you need any of them quickly, I am happy to add them; just open an issue.


It seems to be a breaking change. Is it possible to have it in 1.0? That issue is speculative, without any milestone!

Nice! I think a natural way to connect this is for the support of a distribution to determine the transformation. Distributions.support is currently only defined for univariate continuous distributions, but in principle it should be easy to extend.
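
Roughly like this (a sketch assuming Distributions.jl, where support is defined for univariate continuous distributions; the transform choices are just illustrative):

using Distributions

support(Normal(0, 1))   # RealInterval(-Inf, Inf) → identity transform
support(Gamma(2, 3))    # RealInterval(0.0, Inf)  → exp / softplus transform
support(Beta(2, 2))     # RealInterval(0.0, 1.0)  → logistic transform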

Your DynamicHMC work seems to be set up specifically for ReverseDiff. From what I’ve seen, I think XGrad would be a little simpler and faster, and makes it pretty easy to specify optimized derivatives. For example, instead of examining the code for a given pdf, we know to call the cdf directly.

It will be great to get even a simple example running end-to-end (see the sketch after this list):

  • Find the supports
  • Determine and apply the appropriate transformations
  • Adjust for the log-Jacobian
  • Compute the gradient
  • Call NUTS
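
A very rough sketch of how those pieces could fit together for a toy model, assuming Distributions.jl and ForwardDiff.jl; the nuts_sample call at the end is a placeholder, not a real API:

using Distributions, ForwardDiff

data = randn(100) .* 2.0 .+ 1.0              # simulated observations

# unconstrained θ = (μ, log σ); the support σ > 0 is what forces a transform
function logdensity(θ)
    μ, logσ = θ
    σ = exp(logσ)                            # apply the transformation
    ℓ = logσ                                 # adjust for the log-Jacobian of exp
    ℓ += logpdf(Normal(0, 5), μ)             # priors
    ℓ += logpdf(Truncated(Cauchy(0, 3), 0, Inf), σ)
    for x in data                            # likelihood
        ℓ += logpdf(Normal(μ, σ), x)
    end
    return ℓ
end

∇logdensity(θ) = ForwardDiff.gradient(logdensity, θ)

# last step (placeholder): hand the log density and gradient to a NUTS sampler
# chain = nuts_sample(logdensity, ∇logdensity, zeros(2))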

I’m sure I’ll be asking for help with a lot of this :slight_smile:

Sorry, I don’t understand what this is referring to. Did you maybe post in the wrong thread?

It is set up to expect the log density and its gradient in a DiffResult structure. I usually get that using ForwardDiff.jl, but ReverseDiff.jl is an option, too.
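
Something like this, assuming ForwardDiff.jl and DiffResults.jl (logdensity here is just a stand-in for an actual model):

using ForwardDiff, DiffResults

logdensity(θ) = -0.5 * sum(abs2, θ)          # toy log density (standard normal)

θ = randn(3)
result = DiffResults.GradientResult(θ)       # holds the value and gradient together
ForwardDiff.gradient!(result, logdensity, θ)

DiffResults.value(result)                    # log density at θ
DiffResults.gradient(result)                 # its gradient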

I plan to do a few of those once v0.7 lands, because it is really much nicer with named tuples. But if I have some time, I will do a Julia notebook next week for a simple problem.

Curious why you use ForwardDiff – just a legacy thing? Have you tried XGrad or AutoGrad? I still don’t understand all of the autodiff tradeoffs, but for gradients reverse-mode is generally better. And I really prefer to be able to specify gradients manually if possible.

That would be really great, even something very simple with all of the pieces in place would be really helpful!

You are right, for gradients reverse mode is preferable; I mainly use ForwardDiff because it is very robust and fast enough for my purposes. I tried ReverseDiff a while ago and ran into problems, but I should look at it again because many of those issues have been fixed in the meantime. I have not tried AutoGrad. XGrad is out for me because I don’t have closed-form likelihoods (I am using indirect inference).

Interesting, this sounds a lot like the more general “universal” probabilistic programming approaches, like what Turing does. I don’t yet understand the exact boundary of where XGrad will and won’t work, but for the Stan-like approach I think it should be a good fit. Other inference algorithms can just call something else :slight_smile:

Can you provide an example of this? XGrad has its limitations, but they aren’t much different from those of ReverseDiff, so perhaps I’m missing some important use case.

Some more progress:
https://github.com/cscherrer/Soss.jl

@Tamas_Papp: I played with your ContinuousTransformations library a bit. I really like it, but I need the source code for a transformation and its Jacobian, so it looks like the package would need to be extended to expose this. For now I just hand-coded a couple of cases, and there’s no inverse-transform support yet. Oh, and I also uploaded my updated NUTS code. But as you pointed out, it follows the original paper; it will be much better to get the updated version going.

@ChrisRackauckas: This notation is really similar to Turing’s, so it should be really easy to set up Turing as a back-end and vice versa. I would have worked with Turing directly, but the source code transformation approach seemed pretty different from what you guys are doing.

Just curious - how are the scope and goals of this package distinct from Turing?

Goals are very similar. But the approach is very different, which I expect will lead to different strengths and ideal use cases.

Turing expresses a model as a function, in terms of an overloaded ~ operator. Changing inference methods means changing the way this is evaluated at runtime.

Soss focuses instead on source rewriting. For example, using a Stan-like approach, it can transform this model:

normalModel = quote
    μ ~ Normal(0,5)
    σ ~ Truncated(Cauchy(0,3), 0, Inf)
    for x in DATA
        x <~ Normal(μ,σ)
    end
end

into

:(function (θ, DATA)
        ℓ = 0.0
        μ = θ[1]
        ℓ += logpdf(Normal(0, 5), μ)
        σ = softplus(θ[2])
        ℓ += abs(σ - θ[2])
        ℓ += logpdf(Truncated(Cauchy(0, 3), 0, Inf), σ)
        for x = DATA
            ℓ += logpdf(Normal(μ, σ), x)
        end
        return ℓ
    end)

In sampling or optimization, this function becomes part of an inner loop, so performance is critical. It’s not obvious here, but parameters are always rewritten as a vector. There’s no unboxing overhead – instead we can just index into θ. In principle, we can add arbitrarily complex transformations along similar lines.

I guess my focus is on avoiding the “really neat, but too slow” critique common in probabilistic programming. I know the biggest overhead in Stan is the compile time. RStan is great, but uses one language (R) for data prep, another (Stan) for modeling, and a third (C++) for implementation under the hood. Results of a model are then analyzed back in R. Doing all of this in one language should help, and the language properties and tools available in Julia make it an obvious choice.

Oh, also… Stan gets some of its performance from the constraints it imposes on models (fixed-dimension continuous parameter space). This makes it a natural place to start. But in principle, Soss can be a lot more flexible. I’ve been thinking about this for a long time, but the implementation itself is less than a week old, so I’m still feeling out where in the design space it will land.


Probably should be static vectors then?

Good point! Yes, definitely.
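
Concretely, something like this (a sketch assuming StaticArrays.jl; the hand-written softplus is just for illustration):

using StaticArrays

θ = SVector(0.3, 1.2)      # (μ, unconstrained σ) packed into a static vector
μ = θ[1]                   # indexing is allocation-free and type-stable
σ = log1p(exp(θ[2]))       # softplus, written out by hand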

Thanks, sounds very interesting!


Is the limitation of using Stan only through a source transformation going to persist for the foreseeable future? That is, could Stan.jl eventually support calling out to arbitrary user-defined Julia functions that return gradient information via AD, or is the Stan engine not flexible enough to support that? I’d like to mix Stan with custom Julia functions, including ODE solvers; I’m not sure I trust Turing.jl or other backends for the statistical models I’m developing.

I’m not sure it can, at least not easily. I don’t see an FFI in the Stan manual, and even if there were one, calling into Julia would still be non-trivial.

Not sure what trust has to do with it. But in DiffEqBayes.jl we’ll be developing the same interface over a bunch of different backends, so by testing on the same problems we’ll have a more quantitative form of trust.


I’m a little confused about what you mean by this. Soss does source-code transformations and will implement methods similar to Stan’s, but it does not call Stan.

My understanding is that Stan.jl and PyStan both call the command-line version of Stan. RStan is a little more integrated, since it uses a bridge to the underlying C++ code Stan is written in. But all three pass Stan a model in the form of a string.

So Stan.jl calls (command-line) Stan, which compiles to C++. I guess in principle one could have the C++ Stan codebase call back to Julia.

I can understand putting more trust in Stan, just because it’s more mature and more widely-used. But things can still be very fiddly, depending on parameterizations etc. Oh, and mixture models are very awkward, and easy to mess up. And sometimes the log-Jacobian needs to be updated by hand.

I’m not saying Turing or Soss are (or will be) better. Parameterization for some models can just be tricky. Model checking tools can help do some sanity checks, so there’s some great potential there. And Soss can more easily express mixtures, since there’s a nice combinator for those in Julia. In principle Stan could add this, but it would take a huge amount of work for anyone who’s not a core developer. Compared to a single-language approach, Stan’s extensibility isn’t so great.
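
The kind of combinator I have in mind is something like MixtureModel in Distributions.jl (components and weights here are made up):

using Distributions

m = MixtureModel([Normal(-2, 1), Normal(3, 0.5)], [0.4, 0.6])

logpdf(m, 1.7)   # log density of the mixture at a point
rand(m, 10)      # sample from it like any other distribution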

Since Stan supports function definitions, one could write a Julia → Stan function rewriting engine for a limited subset of Julia. Personally, for reasons explained above, I don’t think this is the way to go, but you can experiment.

If you are doing anything nontrivial, I assume you are already validating your inference with simulated data, so I am not sure what your concern is. Also, if you like a framework but you are concerned about test coverage, you could contribute tests.

I think that the Stan engines for inference are quite mature and robust, but the language itself can easily lead to error-prone code. Debugging is difficult (with print statements most of the time), and unit testing components of your models is out (I have not seen a feasible approach for this, and it is a practical concern for models with > 100 LOC). Also, it is easy to make conceptual/arithmetic mistakes in manually coded Jacobians.
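
One cheap sanity check for those hand-coded Jacobians (a sketch assuming ForwardDiff.jl) is to compare the hand-written log-Jacobian of a transform against AD at a few points:

using ForwardDiff

transform(y) = exp(y)                  # unconstrained y → positive parameter
logjac_hand(y) = y                     # hand-coded log |d exp(y)/dy|

y = 0.73
logjac_ad = log(abs(ForwardDiff.derivative(transform, y)))
isapprox(logjac_hand(y), logjac_ad)    # should be true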


DiffEqBayes.jl does this. It works only for functions defined via ParameterizedFunctions.jl, because we need the information in a DSL to know how to do the transformation. So yeah, limited rewriting engines work, but you do need a good enough DSL. I wouldn’t see this as necessary unless the DSL already exists, though, since if it doesn’t exist then you might as well just use Stan’s.
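
For context, a ParameterizedFunctions.jl definition looks roughly like this (Lotka-Volterra, with made-up parameter names); the DSL is what gives DiffEqBayes the structure it needs for the rewrite:

using ParameterizedFunctions

f = @ode_def LotkaVolterra begin
    dx = a*x - b*x*y
    dy = -c*y + d*x*y
end a b c d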

@nurban mentioned he was going to be using the ODE solvers, and this is probably the easiest thing to check, because you can just plug the parameters in and see whether the solution goes sufficiently close to the data points. Then you can also check against the result given by maximum likelihood estimation and see if the peak is close. You should do this even with Stan. It’s not difficult to come up with a simple problem where Stan will have problems that would be caught by these checks (along with the standard checks that the chains converged and so on).