MCMC landscape

The Stan devs clearly invested a lot of effort in their AD-enabled math library, and in some ways, the compile-time nature of their AD system resembles what’s happening in Julia (as far as I understand it).

The Stan compiler is more like a lowering tool from the Stan language to C++ using the Stan C++ library. Moving to OCaml would in principle provide a nice compiler-oriented language that will hopefully make it easier to support a richer Stan language, including user-defined types (having to pack everything into three flat arrays is a major pain in the neck for complex models).

Is this because source-to-source AD can handle sparsity instead of moving around lots of zeros? I don't understand AD optimizations for higher-order stuff well, but it seems like sparsity would be a big deal.

Implicitly, yes, this is an important factor, but the same applies to the Jacobian to a certain extent. Equally importantly, Zygote allows the compiler to generate very performant code.
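
As a minimal illustration (the function `f` here is made up for demonstration): Zygote performs source-to-source AD on ordinary Julia code, control flow included, so the generated adjoint code goes through the same compiler optimizations as any other Julia code:

```julia
using Zygote

# An arbitrary Julia function with a branch; Zygote rewrites the code
# itself rather than taping operations at run time.
f(x) = sum(abs2, x) + (x[1] > 0 ? sin(x[1]) : zero(x[1]))

# `gradient` returns a tuple, one entry per argument.
g = Zygote.gradient(f, [1.0, 2.0, 3.0])[1]
```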

But AFAIK using curvature information automatically for MCMC is itself at an experimental stage, not something one would do as a matter of course. So if I add it to DynamicHMC, it will be something one would enable for research/exploration.

This is definitely true to some extent.

However, the more I talk with statisticians who are used to Stan about our work on Turing.jl, the more I get the feeling that many people are willing to move away from Stan, mainly because of the difficulties that Stan's ecosystem brings with it.

And regarding the vastly different backgrounds of developers and users: this is again only partially true. At least in Turing.jl, we have a pretty diverse team, including @cpfiffer, who is an economist and helps us a lot to develop Turing in a way that is intuitive for end users and can be used to analyse inference results. Besides, there is a growing user base that is already applying Turing.jl in their applications.


I think a lot of end users put up with all the headaches that come with using Stan because it's basically been the only game in town for a while (for non-specialists who aren't used to writing their own samplers). My students routinely struggle to get Stan installed and working, and often the trouble comes from having to debug C++ compiler problems, since updates in that ecosystem break things in mysterious ways.

So in my field (psychology/linguistics) there is a lot of pent-up demand for a system that's Stan-like, maybe with a more expressive, graphical-model-oriented syntax, but mostly just to get away from the headaches associated with C++ being involved at any point.


My experience has been very different: I find the build process for Stan very smooth. After setting up a few libraries, it is literally just a make command, and wrappers like rstan automate this even further.

I work a lot with Stan, and as a user I never had to deal with C++.

IMO a NUTS/HMC implementation in Julia can have the following comparative advantages:

  1. coding likelihoods directly in Julia, for arbitrary functions involving control flow etc.; and unit testing small pieces (see the sketch after this list)
  2. ability to dissect and debug the adaptation, explore mixing and convergence issues directly for difficult posteriors
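
To make point 1 concrete, here is a minimal sketch (the censored-normal model and the `loglik` function are made up for illustration): a log-likelihood with a branch per observation, written as a plain Julia function and unit-tested in isolation before any sampler sees it:

```julia
using Distributions

# Right-censored normal observations: per-observation control flow,
# which is awkward to express in a restricted modeling language.
function loglik(μ, σ, y, censored)
    d = Normal(μ, σ)
    ℓ = 0.0
    for (yᵢ, cᵢ) in zip(y, censored)
        ℓ += cᵢ ? logccdf(d, yᵢ) : logpdf(d, yᵢ)
    end
    return ℓ
end

# Unit test one small piece against a known value.
@assert loglik(0.0, 1.0, [0.5], [false]) ≈ logpdf(Normal(), 0.5)
```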

The difficulty is that beating Stan’s AD engine is very, very hard. Stan as a language is quite limiting, but the Stan developers managed to optimize gradient calculations to an amazing extent.


I only used Stan a couple of times before Julia, but this was similar to my experience. However, I did stop using it as soon as I wanted to hook something directly into the C++, and it seemed like a nightmare.

Yes, the famous two-language problem strikes again :wink:

I am not saying that a Julia solution could not improve on Stan (especially in this respect), just that it will not be trivial, and in the meantime Stan is a reasonable solution for a lot of problems.

Mine too, until very recently. I'd never had a problem installing it until literally this year; historically I've used it on a Mac but now run Linux mostly. But my students running macOS have had a real headache getting it installed via rstan, which was a surprise to me, and I wasn't able to immediately debug it. It's sensitive to things like linker paths etc. that can be messed with by other stuff, and if you don't know what you're doing it's just black magic.

You should probably know that a make call is quite a scary thing for a lot of people, especially if you've never worked with any compiled languages, which is true of a ton of people.


It's only scary if typing make doesn't work.


So sad no one mentions OpenBUGS. I have 3 textbooks that use BUGS.

I still use it occasionally! I actually rather dislike the Stan modeling language (although I suspect I'm in a minority). Specifically, I despise having to deal with all the different transformation blocks, specifying reals, etc. Basically anything beyond the data block and model block (the model block is pretty close to BUGS-style syntax) just sets my teeth on edge.

It may have some fantastic AD backend (and I do use Stan occasionally), but in practice I actually think Stan is (very marginally) less usable than trying to write WinBUGS/OpenBUGS. It's actually about as fast for me to just prototype everything manually using Distributions and DynamicHMC (which says good things about those packages, I guess, but perhaps not such wonderful things about Stan as a PPL).

I'd prefer to switch over to Turing or something else for most things, but the lack of support for the LKJ distribution (for correlations) is a big problem given most of my current projects. Although @tpapp has an AltDistributions package that has it… hey, any chance of getting that particular distribution added to the main Distributions.jl line? Once that happens, I could probably get away from Stan almost entirely.


Thank you for your kind words. You may be pleased to learn that DynamicHMC is nearing a 2.0 version this summer, with some long-awaited improvements.

Regarding xBUGS: it is of course possible that you find the syntax nicer, but the Gibbs sampling algorithm is much less efficient than HMC in general.

I am open to contributing it; I just needed it very quickly, which is why I put it in AltDistributions.jl. Is there anything that prevents you from using that package alongside Distributions.jl? There should be no conflicts.


Oh, Stan is far faster than BUGS and family, of course. I just find the amount of boilerplate code required for even the simplest examples rather tedious, and took the opportunity to complain pointlessly into the ether. :wink:

Regarding AltDistributions, I can certainly use it myself fine… I was just thinking that since so many packages (like Turing) use the Distributions package to provide that functionality, it would be nice to have something like the LKJ distribution more widely available for people who use those things "by default," that's all.


IIUC, if a distribution subtypes Distributions.Distribution, it may be usable in Turing. Any reason why this is not the case in AltDistributions? Even better would be to subtype the "smallest" supertype in Distributions.

No strong reason; I am just a bit confused by what is implied by the interfaces of the various subtypes. Note that I only programmed e.g. logpdf for some distributions, especially LKJL; no moments etc.

If making LKJL <: Distribution{Matrixvariate,Continuous} or something like that would help someone, I am happy to accept a PR.
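
For concreteness, here is a minimal sketch of what that wiring might look like. The struct name MyLKJL is hypothetical (not the actual AltDistributions API), and the density shown is the standard unnormalized LKJ log density on a correlation Cholesky factor:

```julia
using Distributions
import Distributions: _logpdf

# ContinuousMatrixDistribution is the Distributions.jl alias for
# Distribution{Matrixvariate,Continuous}, so packages that dispatch on
# Distribution (e.g. Turing) can accept this type.
struct MyLKJL <: ContinuousMatrixDistribution
    n::Int       # dimension of the correlation matrix
    η::Float64   # LKJ shape parameter
end

Base.size(d::MyLKJL) = (d.n, d.n)

# Unnormalized log density of a lower-triangular Cholesky factor L of
# a correlation matrix: ∑ᵢ (n - i + 2η - 2) * log(L[i, i]).
_logpdf(d::MyLKJL, L::AbstractMatrix{<:Real}) =
    sum(i -> (d.n - i + 2 * d.η - 2) * log(L[i, i]), 1:d.n)
```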


OK, in the end, what is the fastest solution?
What about something like INLA (integrated nested Laplace approximations) for Julia?

This mostly depends on your problem (dimensionality, curvature, how "nice" it is), what kind of approximation you consider acceptable, whether you need custom functions in your log likelihood or it has only standard parts, etc.

I think at the moment your best solution is R-INLA via RCall.
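
For anyone who wants to try that route, a hypothetical sketch with RCall (this assumes the INLA R package is already installed; the model and data are made up for illustration):

```julia
using RCall

# Run an R-INLA fit from Julia; everything inside R"""...""" is
# ordinary R code evaluated in the embedded R session.
R"""
library(INLA)
df  <- data.frame(y = rnorm(100), x = rnorm(100))
fit <- inla(y ~ x, family = "gaussian", data = df)
summary(fit)
"""
```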


I've never had a problem with Stan on Linux.
When I was in grad school, I remember one student on a Mac who could never figure out his Xcode problems (although I question the authenticity of his efforts). I was also given a Windows 7 machine a few years ago and couldn't figure out CmdStan, but thankfully I could request that Ubuntu be installed on it.

I started working last week, and again I've been given Windows (10), now without the option of swapping it for Linux, because of a mountain of required corporate software. My team members all develop directly on the local Linux cluster, because that is easier than trying to get Windows to work.
Neither Julia + ESS nor julia-repl works in Emacs on Windows.

Have your students never encountered problems?

The difficulty is that beating Stan’s AD engine is very, very hard. Stan as a language is quite limiting, but the Stan developers managed to optimize gradient calculations to an amazing extent.

My ProbabilityModels.jl is still severely limited, but I’ve seen large improvements over Stan in terms of performance, owing at least in part to the fact that Stan’s AD isn’t very amenable to SIMD. For a problem in my dissertation, which is similar to a model used at my work, I saw about a 20x improvement in ESS/second with ProbabilityModels + DynamicHMC over Stan.

I plan to add support for missing data in the next few days, and then broadcasting after that. However, real control flow (loops, if) is much further down the line.

In the meantime, as a hack, control flow can be wrapped in a function for which an analytical Jacobian is defined. The API for that is currently awful, though, so I'll have to come up with a better way of doing it.
I'll likely add ForwardDiff- or Zygote-based fallbacks eventually.

It is not mature or well-tested enough for me to recommend that anyone else use it; I fear anyone who tries will spend a frustrating amount of time dealing with opaque errors and find that whatever they're trying to do isn't supported yet.
I just point to that as an example of the speed that is eventually possible in Julia. Once it and all of its dependencies are solid enough, I’ll get them registered and announce them as reasonable solutions.

It would be neat if a more robust library like Turing could take advantage of some of the performance optimizations. But I think many of them depend a lot on my stack of unregistered packages.

I know a lot of professors who use it! Some because of its support for spatial statistics.
Where I work, JAGS is the most popular choice. While I don't share @opera_malenky's opinion, it is (in my experience) the majority view.


I think everyone takes this as gospel, but it's not as simple as "Stan is always fast". There are tradeoffs for HMC, and on many problems all the leapfrog steps are not worth the extra computational cost. For example, see this benchmark (note of bias: I was on the nimble project for a while). On this particular problem, Stan's current implementation of HMC actually does the worst in terms of ESS of the four algorithms compared (JAGS, two nimble algorithms, and Stan).

Note that I’m not saying HMC is a bad algorithm! In theory, it should scale up much better for large problems with heavy parameter dependencies (the example in the post is only 24 parameters, many of which I believe are conditionally independent). But HMC’s dominance is definitely problem specific.

As we wrote in the paper, BUGS is best considered a language for defining statistical models (although the name no longer makes much sense), while WinBUGS, OpenBUGS, JAGS, nimble, and Stan are software packages for attaching various algorithms to a BUGS-defined model.

Another comment on Stan, and definitely on nimble as well: there's been talk about whether these programs are easy to set up. In my experience working with nimble users, the answer is that if the fresh install works, it's of course super easy. But when things don't work for some reason, it can be a huge hassle. So experiences with the level of headache involved in setting up either of these programs should be expected to vary greatly.

Since this is the Julia Discourse, I'll circle back to Julia. One thing that really struck me when I started to learn Julia is that Stan/nimble should have been written in Julia! The reason I say this is that there is a lot of compiling of code that happens: a user writes their BUGS model, this is compiled into a DAG, an algorithm is attached to it and used to write C++ code, which is then compiled, wrapped, and queried from R. Clearly, a Julia implementation of this would be greatly simplified!

Note that the "two-language problem" is deeply embedded in these types of programs; it's really hard to debug an issue caused by computer-generated C++ code. If everything were one language, the user could query every step of the process. In fact, that's one of the features of nimble: it first compiles everything to a native-R model/algorithm, so the advanced user can examine all parts of the process should something behave in an unexpected way. Of course, this runs like molasses, so if something is behaving strangely around iteration 10,000… good luck.
