Julia calling Stan C++ directly


I was thinking a bit about ways to connect with Stan. @goedman’s CmdStan.jl (thanks Rob!) takes the well-established approach used by RStan and PyStan, in which a Stan model is expressed as a string.

But Julia has an advantage over Python and R, namely speed, and this discussion seems to indicate a callback to Julia isn’t too difficult.

The goal of the front end of Stan is, essentially, to express and compile a C++ function for the log-posterior. It seems to me a Julia function can play the role much more effectively, with comparable performance, and jit overhead probably much less than that of Stan’s compilation to C++. So maybe the information flow could be something like

  • Take a user-defined Julia function for a log-density (usually a posterior)
  • Pass this to a Julia library that calls some C++ glue
  • The C++ glue plays the role of the “Stan compiled function”, and is passed to various inference methods
  • Along the way, there will lots of callbacks (calls back?) to Julia

Stan does the autodiff thing, and may not be Julia-friendly, so it would probably require doing that on the Julia side, but there are plenty of options for that here :slight_smile:

I’ve done a little C before, but very little C++, so there could be some weird aspects of it I’m completely overlooking. But it would seem this has lots of potential benefits:

  • Probably less compile time than Stan, and better front-ends.
  • Direct access to a mature and well-maintained back end
  • Great benchmarking opportunity for Julia-native development like Turing.jl or DynamicHMC.jl

Is this feasible? What would be involved in such a development?

1 Like


I am not sure about this — the way I understand it, the model would need to be compiled anyway. Parsing the stan code and generating the cpp has rather trivial cost in cmstand & friends, it is the compilation.



After some digging, I think you’re right about that:

Still, I think this would be worthwhile. There a huge amount of effort behind making Stan better, but the front end and connectivity with other tools leave a lot to be desired, IMO. In the link above, Bob Carpenter indicates compilation creates a C++ class. I’d think having a “template” (in the generic, not C++ sense) and hooking into such a class directly could allow building models in a more expressive way, that are better than “stringly-typed”, Frustration with Stan’s front end is one thing that got me interested in building something like Soss :slight_smile:

1 Like


Hi Chad,

A few additional comments.

I was thinking a bit about ways to connect with Stan. @goedman’s CmdStan.jl (thanks Rob!) takes the well-established approach used by RStan and PyStan, in which a Stan model is expressed as a string.

As far as I know all Stan interfaces take a “stringly-like” definition of the Stan model. Even the function section is just textual C++. And proposed enhancements such as SlicStan seem to follow this pattern. Below mentioned quap() and ulam() do not and neither do R’s RStanARM and brms packages and as a consequence these are not portable across Stan interfaces.

RStan and PyStan do have access to Stan’s (C++) API, CmdStan.jl doesn’t. Not all data available inside Stan is dumped into the .csv files (which CmdStan reads in and turns into an MCMCChains.Chains object).

Over the years several folks have asked for a feature where functions in the function section could be defined in Julia but I have always felt those development cycles would be better devoted to a full Julia mcmc implementation. But I know of at least one user that is switching from CmdStan to Turing for this reason.

As we discussed earlier, my initial interest in Soss.jl was (and still is!) to use it as quap() and ulam() like interfaces to generate Stan model code, as in the StatisticalRethinking book. In the draft 2nd edition the author does point out that even ulam() cannot generate all (useful) Stan models and readers are encouraged to learn the Stan language. Ulam() is template based.

Turing has a nice modeling language and if the authors can get the performance comparable to Mamba and DynamicHMC it would be very high on my list.

But today, at least in my opinion, a killer combination would be DynamicHMC + a Soss frontend (as you already included in the initial version of Soss). Such a frontend would likely have similar constraints as the above mentioned quap() and ulam() and at that point a user would need to switch to pure DynamicHMC (or Turing or Mamba, I just don’t think switching PPLs at that point is too problematic).

I wonder (and hope) the same is true for Turing’s AdvancedHMC + a Soss based frontend. But I haven’t been able to play with Turing’s AdvancedHMC yet.



I think we may be talking about different things. Say we consider Stan as some integrated components:

  1. Parse a model written in the Stan language
  2. Compile the resulting IR into a C++ function
  3. Do some autodiff on that function
  4. Use the function and gradient in routines for HMC, VI, etc
  5. Output the result (csv, chains object, or whatever)

[Ok, it’s really a C++ class, but let’s set that aside]

Every problem I’ve had with Stan comes from (1). The front-end is verbose, C+±specific (“vector” and “array” are different), and impossible to work with interactively (as is any C++).

I would agree, is it’s a major undertaking. But abstractly, “Make Julia call an existing C++ library, with a Julia callback” seems like a week or less. Then again, I don’t understand it very well, so this could be a naive assessment.

I’ve heard good things about SR, but I haven’t read it yet. But doesn’t this contradict your earlier point about a full Julia mcmc implementation? Not trying to nit-pick, it’s just not clear to me where your stand on this.

I agree! I’ve never used Mamba, but I think Turing has a DynamicHMC backend now. If there’s extra overhead I’m not sure where it’s coming from.

I may be missing something, but I’m having trouble imagining DynamicHMC models that can’t be expressed in a high-level language like Soss. I think the biggest limitation is just developer time to get it there.

AdvancedHMC? Haven’t heard anything of this. Any insights beyond what I can get from my upcoming Googling?

BTW, I’m in the process of connecting SymPy.jl to manipulate log densities. The idea is to manipulate the posterior as an abstract summation to do things like pulling index-independent terms outside of sums, and common subexpression elimination (SymPy.jl has built-in cse). I expect this to really speed things up. I have a couple of conference presentations coming up, hoping to have some new results to show off :slight_smile:



You are talking about the Stan language, not the frontend. In that language, vectors, matrices, and arrays are indeed different and behave differently (roughly, whether operators broadcast or not).

Stan is indeed not a nice language by modern standards. But since it is so tightly intertwined with the implementation, I don’t think you can pick parts of the latter it and expect them to be independently useful in Julia.

I think that AD is 95% of the challenge of using HMC-family methods in practice for nontrivial models. NUTS itself is a pretty simple algorithm, and once you have it debugged, verified the relevant invariants, and dealt with all the crazy corner cases, you can pretty much apply it to anything. I think that Stan as a language is difficult to extend or modify because that always entails extending the whole AD stack, and that is very tricky.

I agree with @goedman: the way to go is a robust native NUTS implementation. I am working on rewriting DynamicHMC at the moment to address all the issues that piled up, and it should be ready in a month or two. The biggest constraint used to be reverse-mode AD, but that is also improving a lot, I find Zygote especially friendly.



Yes that’s right, I was being sloppy about the terminology.

I agree with your points about HMC, and I think your implementation is the right approach to this. I was thinking more about variational methods. A lot of tricks come into play to reduce the variance of the gradient estimates, and building something robust is a large investment.

Maybe something like Pyro or TensorFlow Probability is a better target for this?



My favorite things about Stan are the helpful diagnostics and plotting functions. If you’re using rstan, you get a lot of methods for stanfit objects, like stan_ac, stan_trace, pairs, etc. There are also other libraries, like ShinyStan is really neat.

Another advantage may be that Stan is likely to be the first to see many developments. I’d expect Stan to be the first to see improvements to rhat, for example.

If you’re doing MCMC, I don’t think there’s a real advantage to using Stan as the backend instead of DynamicHMC. I think the latter is a lot easier to dive into.
If you want optimization, I think Optim.jl is a lot better than Stan’s optimization.

Variational methods are something which AFAIK hasn’t been implemented in Julia yet. Someone ought to make a Julia library for that at some point. I haven’t had the need, but I am interested, so perhaps someday. For what it’s worth, Celeste used variational inference.

For front ends, I’ve had a little more time to work on ProbabilityModels. No documentation or README yet. But, I have a couple proof of concepts:
5 parameter Logistic Regression: 7x faster than Stan
94 parameter Hiearchical Model: 150x faster than Stan.

The basic approach is: write optimized functions that return the log density and gradient / Jacobian object, and stitch these together with source to source AD. It uses DynamicHMC as a backend.
My plan would be to give the library a fairly narrow focus that I can optimize heavily, and catch everything else with Zygote (or some other AD library) so that it retains flexibility / is easy to specify more complex models than BUGS/JAGS.

Stan seems poorly optimized. My computer with avx512 isn’t any faster at running Stan than my computer with half-throughput avx2 (Ryzen), even though it is 4x faster at operations like matrix multiplication.

The example with more parameters also benefited from Julia’s type system, however.



Wow @Elrod, this looks great! I’m getting ready to call it a night, but I’d love to hear more about this. Details of the approach, design choices you made, how its goals compare to those of Soss and Turing, etc.

1 Like


I may be missing something, but I’m having trouble imagining DynamicHMC models that can’t be expressed in a high-level language like Soss. I think the biggest limitation is just developer time to get it there.

Models like lda (in the README of Soss.jl) and expressing similar multilevel models as in chapters 10 to 14 of SR in Soss, absolutely! Soss and SR as languages are in fact pretty close and both are considerably less verbose…

How much developer time would mapping those onto a platform like DynamicHMC for execution take? And could that be done in a portable way. i.e. to subsequently map it onto the Turing language, would it take only 30% of the initial developer time?

I’ve heard good things about SR, but I haven’t read it yet. But doesn’t this contradict your earlier point about a full Julia mcmc implementation? Not trying to nit-pick, it’s just not clear to me where your stand on this.

My priorities: #1 a few full Julia solutions, #2 promote educational tools, #3 maintain CmdStan. As you correctly point out, this is not a clean partition.

As far as priority #1, today we have at least 1 complete solution (Mamba.jl, possibly with MCMCChains.jl if Gadfly.jl is not your preference) and several others that are pretty close. My ideal platform (already) is as I outlined in my previous email: A less verbose frontend mapped on a logpdf formulation as in DynamicHMC. At the same time MCMCChains can develop into a great backend tool where we can include improvements such as https://github.com/TuringLang/MCMCChains.jl/issues/86#issue-433293691 .

Quap() and ulam() are great educational tools to bring users gradually up to speed and enable them to link mcmc methods back to previously learned approaches. I don’t see them as end points but fitting in priority #2.

At the same time, today at least, I am responsible for Cmdstan.jl. I have seen CmdStan 1) as a stopgap until we have comparable Julia solutions and 2) as a bridge to a very capable and committed team. Item 2) has overtaken item 1) and for the foreseeable future I expect it will continue to act as a stepping stone for mcmc folks switching to Julia and for comparisons of inference results.

AdvancedHMC? Haven’t heard anything of this. Any insights beyond what I can get from my upcoming Googling?


1 Like


I’d like to spend the time to write about all that sooner or later. I’m still unsure about what approaches to take on a variety of issues. For example:

  • What sort of interface should I have for specifying derivatives. I think what I’m doing now can be improved upon a lot.
  • Expression templates and broadcasting. I think expression templates give a lot of potential for optimization.
    -Finding ways to make it easier to actually implement optimized functions. There are a lot of patterns I’m not taking advantage of. Which means it takes longer, more potential bugs, etc.

Much of the performance comes from simply optimizing the building blocks and derivatives.
For example:

boiler plate to generate data
using DistributionParameters, ProbabilityDistributions
using LoopVectorization, DynamicHMC, LogDensityProblems, SLEEFPirates, SIMDPirates
using LinearAlgebra, StructuredMatrices, ScatteredArrays, PaddedMatrices
using ProbabilityModels
using DistributionParameters: LKJ_Correlation_Cholesky
using ProbabilityModels: Domains, HierarchicalCentering, ∂HierarchicalCentering, ITPExpectedValue, ∂ITPExpectedValue

domains = ProbabilityModels.Domains(2,2,2,3);
# domains = ProbabilityModels.Domains(2,2,3);
T = 24; K = sum(domains); D = length(domains);

κ = (1/32) * reduce(+, (@Constant randexp(K)) for i ∈ 1:8) # κ ~ Gamma(8, 32)
σd = sum(@Constant randexp(4)) / 16 # σd ~ Gamma(4,16)
θ = 2.0 * (@Constant randn(K)) # θ ~ Normal(0,2)
S = (@Constant randn(K,4K)) |> x -> x * x'
S *= (1/16)
pS = StructuredMatrices.SymmetricMatrixL(S)
L = PaddedMatrices.chol(S); U = PaddedMatrices.invchol(pS)
μₕ₁, μₕ₂ = -3.0, 9.0
μᵦ₁ = μₕ₁ + @Constant randn(D); # placebo
μᵦ₂ = μₕ₂ + @Constant randn(D); #treatment
β₁ = HierarchicalCentering((@Constant randn(K)), μᵦ₁, σd, domains); # placebo
β₂ = HierarchicalCentering((@Constant randn(K)), μᵦ₂, σd, domains); # treatment

# rand generates uniform(0,1); we take the cumulate sum for the times.
T = 24; δₜ = (1/16) * reduce(+, (@Constant randexp(T-1)) for i ∈ 1:8)
times = vcat(zero(ConstantFixedSizePaddedVector{1,Float64}), cumsum(δₜ));

μ₁ = ITPExpectedValue(times, β₁, κ, θ)
μ₂ = ITPExpectedValue(times, β₂, κ, θ)

ρ = 0.7
ARmat = StructuredMatrices.AutoregressiveMatrix(ρ, δₜ)
ARcholesky = PaddedMatrices.chol(ConstantFixedSizePaddedMatrix(ARmat))
# Create an Array of matrix-normal entries.
Y1 = [ARcholesky * (@Constant randn(T, K)) * L' + μ₁ for n in 1:120]
Y2 = [ARcholesky * (@Constant randn(T, K)) * L' + μ₂ for n in 1:120]
Y1c = ChunkedArray(Y1) # Rearranges how the data is stored under the hood.
Y2c = ChunkedArray(Y2) # This often allows for better vectorization.

More typical data structures (base Julia arrays):

Y1r = reshape(reinterpret(Float64, Y1), (T, K, length(Y1)));
Y1p = permutedims(Y1r, (1,3,2));
Y1p2 = Array{Float64,3}(undef, T,length(Y1),K);
Y1p3 = Array{Float64,3}(undef, T,length(Y1),K);
mu1a2 = reshape(Array( μ₁), (T,1,K));
Ua = Array(U);
ARmata = Bidiagonal(vcat(1.0,Array(ARmat.spacing.rinvOmρ²ᵗ)), -1.0 .* ARmat.spacing.rinvOmρ²ᵗ .* ARmat.spacing.ρᵗ, :L)

We might write a matrix normal lpdf function like this:

function matrix_normal_lpdf(Y1p3::AbstractArray{Tf,3}, Y1p2::AbstractArray{Tf,3}, Y1p, mu1, ARmat, U) where {Tf}
    T,N,K = size(Y1p3)
    @inbounds Y1p3 .= Y1p .- mu1
    mul!(reshape(Y1p2,(T,N*K)), ARmat, reshape(Y1p3,(T,N*K)))
    mul!(reshape(Y1p3,(T*N,K)), reshape(Y1p2,(T*N,K)), U)
    N*(T*logdet(UpperTriangular(U)) + K*sum(log, ARmat.dv)) - 0.5dot(Y1p3, Y1p3)

Benchmarking it, versus the matrix normal function plus gradients ProbabilityModels uses:

julia> BLAS.set_num_threads(1);

julia> BLAS.vendor()

julia> @btime matrix_normal_lpdf($Y1p3, $Y1p2, $Y1p, $mu1a2, $ARmata, $Ua)
  22.322 μs (10 allocations: 672 bytes)

julia> @btime ProbabilityDistributions.∂Normal($Y1c, $mu1, $ARmat, $U, Val((false,true,true,true)))
  10.828 μs (0 allocations: 0 bytes)
(21091.33480913283, [92.12497022370084 -261.4974146549056 … 582.2118972474755 -11.522563188918166; 179.12977820511855 181.29576993232826 … -450.73096448944 111.9215118815924; … ; -297.0684630208436 371.93425285528326 … -734.958870848374 -184.31469734964307; 135.55918654442254 -141.00144820012636 … 582.7996244970207 84.55177511664405], -162.86568080703364, [3.873067478593928 -8.222693563326287 … 11.031109528140325 -10.42774614031186; 0.0 -17.471181102460612 … 3.4649162870660817 -13.588646652411438; … ; 0.0 0.0 … 1.6948774806619102 -0.11083031605590454; 0.0 0.0 … 0.0 9.729467158235364])

Took about half as long to get the log density and gradients with respect to the mean and both (inverses of Cholesky factors of covariance) matrices as the straightforward implementation did to only return the log density.

The straightforward implementation was a broadcast, two mul!s, and a dot – one could be forgiven for expecting it to be reasonably close to optimal. Autodiffing it would make it much slower.



I have said this before, but perhaps it is worth repeating: the purpose of DynamicHMC is primarily being a backend. One can of course use it directly if the user is comfortable coding posteriors, but it was designed to be very modular and easy to integrate to a frontend (if it can be improved in this respect for some application, I am happy to fix that, just open an issue).

In fact, all of my MCMC & related packages are designed to be used in backends. I am glad to see people experimenting with/working on frontends, and I am confident that the Julia package ecosystem will come up with nice, user friendly tools for MCMC in the medium run.



As @Tamas_Papp points out below, DynamicHMC is intended as a back-end, in particular it works in terms of a user-provided log-density. One capability of Soss is to build such a function, using a high-level representation as a starting point. A nice benefit that comes with this approach is the ability to combine and manipulate models at a high level. Mapping to DynamicHMC takes zero additional developer time; this functionality is built in already.

Soss models are already very close to Turing models, and with a little work it should be straightforward to map a Soss model to a Turing model.

Oh, that’s interesting, StanNUTSAdaptor makes this sound like it may be using Stan as a back end (?)

Wow, this is really nice! I had been looking into using SymPy.jl to do things like rewriting summations (e.g., for the common iid case) and eliminating common subexpressions over the entire log-likelihood. I had planned to use Zygote on the result to build gradients.

Do you expect your approach would be faster than this? Are ∂Normal etc in a form to be used by other libraries?


1 Like


One capability of Soss is to build such a function, using a high-level representation as a starting point. A nice benefit that comes with this approach is the ability to combine and manipulate models at a high level. Mapping to DynamicHMC takes zero additional developer time; this functionality is built in already.

Thanks Chad. I have seen and played around with the hello model example. No problem.

But do examples like linReg1D also work in this way? E.g I have tried:

using Soss

linReg1D = @model N begin
    # Priors chosen following Gelman(2008)
    α ~ Cauchy(0,10)
    β ~ Cauchy(0,2.5)
    σ ~ Truncated(Cauchy(0,3), 0, Inf)
    x ~ For(1:N) do n 
    ŷ = α + β .* x
    y ~ For(1:N) do n 
        Normal(ŷ[n], σ)

data = (N=8, )

nuts(linReg1D, data=data).samples

or is the Soss syntax changed?



tl;dr: Yes, but you may need to condition first on observed variables.

The idea with this example is that only N is passed in as an argument, so you could use it to generate fake data. Like so:

julia> makeRand(logReg1D)
          val = NamedTuple()
          N = kwargs.N
          α = rand(Cauchy(0, 10))
          val = merge(val, (α = α,))
          β = rand(Cauchy(0, 2.5))
          val = merge(val, (β = β,))
          σ = rand(Truncated(Cauchy(0.3), 0, Inf))
          val = merge(val, (σ = σ,))
          x = rand(For(1:N) do n
          val = merge(val, (x = x,))
          ŷ = α + β .* x
          y = rand(For(1:N) do n
          val = merge(val, (y = y,))

[codegen with For seems a bit buggy, I’ll need to have a look at that]

To use this for “typical” linear modeling, you need to condition on x and y:

julia> m = linReg1D(:x, :y)
@model (N, x, y) begin
    α ~ Cauchy(0, 10)
    β ~ Cauchy(0, 2.5)
    σ ~ Truncated(Cauchy(0.3), 0, Inf)
    x ~ For(1:N) do n
            Cauchy(0, 100)
    ŷ = α + β .* x
    y ~ For(1:N) do n
            Normal(ŷ[n], σ)

[Note that the x ~ ... line is still there. This will change soon.]

From here you can get the prior:

julia> prior(m)
@model x begin
    α ~ Cauchy(0, 10)
    β ~ Cauchy(0, 2.5)
    σ ~ HalfCauchy(3)
    ŷ = α .+ β .* x
    N = length(x)

or the likelihood:

julia> likelihood(m)
@model (x, α, β, σ) begin
    ŷ = α .+ β .* x
    N = length(x)
    y ~ For(1:N) do n
            Normal(ŷ[n], σ)

Both of these are models (not raw code) so you can further manipulate things. You can also build code for the log density, as is needed for HMC:

julia> logdensity(m)
:(function (par, data)
      ℓ = 0.0
      y = data.y
      x = data.x
      α = par.α
      ℓ += logpdf(Cauchy(0, 10), α)
      β = par.β
      ℓ += logpdf(Cauchy(0, 2.5), β)
      σ = par.σ
      ℓ += logpdf(HalfCauchy(3), σ)
      ŷ = α .+ β .* x
      N = length(x)
      ℓ += logpdf(For(1:N) do n
              end, y)
      return ℓ


I had asked one of the Stan developers about this, and his suggestion was:

If you want to avoid the autodiff entirely and just plug into the algorithms then
you can create a C++ derivation of
with appropriate overloads for the gradient function in the same directory, which should be possible using externs to Julia functions.

I imagine that what you would end up with would be a backend, roughly equivalent in terms of functionality to what is provided by DynamicHMC.jl.

1 Like


Have you looked at Reduce.jl?
It’s a CAS library that works on Julia expressions.

Besides the fact that operating on Julia expressions makes it a natural choice for macros and generated functions, the Revise backend itself may already be well suited.
Here is some Reduce documentation.
Chapter 42 (page 351) is about “GENTRAN”, which is functionality to generate Fortran programs. More than simple one-line math expressions, it can handle multi-line expressions with variable assignments and (“do”) loops.
Probably worth taking a look at.

Does this also mean that you intend to manually inline everything? Eg, replace pdfs with an expression?

Yes, I expect my approach to be faster.
Look at my matrix_normal_lpdf function. Do you think SymPy or any computer algebra system can improve on it?
Yet it already takes about twice as long as my function which not only returns the log density, but the gradients as well.
Now, if we want to use Zygote to get the gradients…

julia> function matrix_normal_lpdf(Y1p, mu1, ARmat, U) where {Tf}
           T,N,K = size(Y1p)
           @inbounds Y1p2 = Y1p .- mu1
           Y1p3 = ARmat * reshape(Y1p2,(T,N*K))
           Y1p4 = reshape(Y1p3,(T*N,K)) * U
           N*(T*logdet(U) + K*logdet(ARmat)) - 0.5dot(Y1p4, Y1p4)
matrix_normal_lpdf (generic function with 2 methods)

julia> ARmata2 = Array(ARmata);

julia> @benchmark Zygote.gradient((m,ar,u) -> matrix_normal_lpdf($Y1p, m, ar, u), $mu1a2, $ARmata2, $Ua)
  memory estimate:  2.71 MiB
  allocs estimate:  60620
  minimum time:     831.813 μs (0.00% GC)
  median time:      864.588 μs (0.00% GC)
  mean time:        1.258 ms (31.10% GC)
  maximum time:     4.266 ms (78.04% GC)
  samples:          3967
  evals/sample:     1

About 80x slower, although this was also with OpenBLAS rather than MKL (but on the same computer; I used a newer version of Julia that I built with MKL in the other tests; Zygote only works on Julia 1.1).

There’s a lot more to optimizing code than CSE.
Something else to consider…

julia> B = randn(80,200); a = randn(80); c = randn(80);

julia> D = similar(B);

julia> using BenchmarkTools, PaddedMatrices

julia> pD = PaddedMatrices.PaddedArray(D);

julia> pa = PaddedMatrices.PaddedArray(a);

julia> pB = PaddedMatrices.PaddedArray(B);

julia> pc = PaddedMatrices.PaddedArray(c);

julia> @benchmark $D .= $a .* $B .+ $c
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     4.088 μs (0.00% GC)
  median time:      4.114 μs (0.00% GC)
  mean time:        4.121 μs (0.00% GC)
  maximum time:     8.106 μs (0.00% GC)
  samples:          10000
  evals/sample:     7

julia> @benchmark PaddedMatrices.muladd!($pD, $pa, $pB, $pc)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     2.219 μs (0.00% GC)
  median time:      2.251 μs (0.00% GC)
  mean time:        2.254 μs (0.00% GC)
  maximum time:     5.344 μs (0.00% GC)
  samples:          10000
  evals/sample:     9

I am skeptical of SymPy being able to turn the broadcast into PaddedMatrices.muladd!, which is close to twice as fast. The same holds true for simpler functions, such as sum(B, dims = 2).
In this case, my function is faster because instead of two for loops to fill in the answer-matrix D, it uses 3 for loops.
It fills 4 CPU registers with elements from a and c, and then loops over B, filling in D. Then it loads 4 more registers-worth, loops again, etc.
It is faster because it only loads each element of a and c into registers once. The broadcasted version loads each element once per column B and D. In this example, that is 200 times the loads.
To optimize code, you not only need to make sure it is vectorized, but to maximize the ratio of time it spends doing useful work (that is, the actual math) to the time it spends doing anything else, like loading memory.)

Polly might help, once it is enabled.
For now it isn’t. Trying to build with it gives me an error saying I’d need LLVM_VER=svn to do it. Haven’t bothered trying. LLVM=7.0.1 already results in regular segfaults, so I’m sticking with the default.
(EDIT: Seems polly doesn’t help clang 8.0.)

However, CSE is something I’m worried about with expression templates.

Perhaps, with some work.
Many of my functions are written to use custom array types.
That will probably not change. It’s much easier to optimize code if you can control memory layouts, write methods without committing type piracy, etc. I also make use of a lot of arrays typed by size (similar to StaticArrays), although my muladd! vs broadcast example above used dynamically sized arrays that simply wrap base Julia arrays.

The interface for array functions is as in the ∂Normal example:

ProbabilityDistributions.∂Normal($Y1c, $mu1, $ARmat, $U, Val((false,true,true,true)))

Returns a tuple of the density, and gradients with respect to the 2nd, 3rd, and 4th parameter (as indicated by the Val).
Removing the will not return the gradients. I should probably do that via a second Val argument instead.
The function also drops constant terms. Val((false,true,false,false)) would, for example, not calculate the log determinant with respect to the covariance matrices; true means that argument is treated as a parameter.



Yes! I really like that Reduce can work directly with Julia expressions, and @chakravala has been really helpful . I had been exploring it when I hit a bit of a wall with type constraints in Distributions.jl. Reduce didn’t allow them (it does now), and in the mean time I started exploring SymPy.

I think I’ve figured out the right way to use this, and have partially implemented it. So now you can give it a “~ expression” and tell what variables are in scope, and it will generate the pdf. For example, here’s a normal approximation to a Poisson:

julia> ℓ = @getlogpdf(x ~ Normal(λ,sqrt(λ)), [x, λ, μ])
  log(λ)   log(π)   log(2)   (x - λ) 
- ────── - ────── - ────── - ────────
    2        2        2        2⋅λ   

There are a few things motivating the symbolic approach:

  • Cases like the above where the expression can be easily simplified
  • CSE in summations, especially in hierarchical models.

For a summation, suppose you had 1000 iid observations of the example above. Then we could replace x with y(j) (really y[j], but the symbolics don’t care). So it’s something like

julia> ℓj = subs(ℓ,Sym("x"), SymFunction("y")(Sym("j")))
  log(λ)   log(π)   log(2)   (-λ + y(j)) 
- ────── - ────── - ────── - ────────────
    2        2        2          2⋅λ     

julia> ℓj.args
(-log(2)/2, -log(pi)/2, -log(λ)/2, -(-λ + y(j))^2/(2*λ))

julia> sum([simplify(sympy.Sum(t,(Sym("j"),1,Sym("N")))).doit() for t in ℓj.args])
                                           ╲   ⎛        2   ⎞
  N⋅λ   N⋅log(λ)   N⋅log(π)   N⋅log(2)      ╲  ⎜       y (j)⎟
- ─── - ──────── - ──────── - ──────── +    ╱  ⎜y(j) - ─────⎟
   2       2          2          2         ╱   ⎝        2⋅λ ⎠
                                         j = 1               

I expect this could be pretty fast. SymPy also has a cse function that’s very easy to use:

julia> ans |> cse
(Tuple{Sym,Sym}[(x0, N/2), (x1, y(j))], Sym[-x0*λ - x0*log(λ) - x0*log(pi) - x0*log(2) + Sum(-x1^2/(2*λ) + x1, (j, 1, N))])

In this case it’s a bit weird to substitute x1=y(j); clearly some more exploration to do.

I’d be very interested to see how Reduce or you approach compare to this.

If by “manually” you mean “automatically” :wink:. None of the above was specific to the particular problem.

I had considered building separate implementations of logpdf functions. It’s potentially error-prone, but it’s easy to build tests comparing to Distributions.jl. But I wanted to focus on higher-level manipulations, and I wasn’t thinking at the level of register allocation, etc.

Yeah, me too. To what extent can your approach apply to an arbitrary model, with no manual tuning of the algorithm?

I think that should be fine. My model representation is very high-level, so it should be able to map to anything.

This looks really nice.

I’m curious, there’s potential for a lot of data-parallelism here, especially with larger data sets. And I’ve seen some work on HMC on a GPU. It’s also easier in Julia than most languages.

Any thoughts on this? I know there are lots of subtleties around memory layout, but I don’t have a sense how this carries over to GPUs, or even across CPUs (e.g., does my Ryzen benefit as much as your Intel?)



All of you might not be aware, but I am also developing a package called Grassmann.jl which will work tightly together with Reduce.jl for automatic differentiation. Currently, I am working on implementing some new features for Forward Automatic Differentiation using generalized hyper-dual numbers by means of degenerate tensor products and geometric algebra and source transformation with Reduce. These features are nearly ready, but require a bit more tweaking and modifications.



Ideally, this problem will be solved.
However, I’m @inline-ing all the functions in SLEEFPirates, so the compiler can figure this out and do this transformation itself:

julia> using SLEEFPirates, Distributions, BenchmarkTools

julia> function poisson_normal_approx_lpdf(λ::T, x::AbstractVector) where {T}
           target = zero(T)
           for xᵢ ∈ x
               target -= SLEEFPirates.log(λ)/2 + SLEEFPirates.log(T(π))/2 + SLEEFPirates.log(T(2))/2 + (xᵢ - λ)^2/(2λ)
poisson_normal_approx_lpdf (generic function with 1 method)

julia> function poisson_normal_approx_cse_lpdf(λ::T, x::AbstractVector) where {T}
           N = length(x)
           target = zero(T)
           for xᵢ ∈ x
               target += xᵢ - xᵢ^2/(2λ)
           target - N*λ / 2 - N*log(λ) / 2 - N*log(π)/2 - N*log(2)/2
poisson_normal_approx_cse_lpdf (generic function with 1 method)

julia> p = Poisson(25.0)

julia> x = rand(p, 500);

julia> @btime sum(y -> logpdf($p, y), $x)
  22.479 μs (0 allocations: 0 bytes)

julia> @btime poisson_normal_approx_lpdf(25.0, $x)
  463.797 ns (0 allocations: 0 bytes)

julia> @btime poisson_normal_approx_cse_lpdf(25.0, $x)
  452.102 ns (0 allocations: 0 bytes)

julia> @btime sum(y -> logpdf(Normal(25.0, sqrt(25.0)), y), $x)
  3.818 μs (0 allocations: 0 bytes)

The speed difference is because Base.log is faster than SLEEFPirates.log for scalars:

julia> @benchmark SLEEFPirates.log($(rx)[])
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     9.811 ns (0.00% GC)
  median time:      9.916 ns (0.00% GC)
  mean time:        9.921 ns (0.00% GC)
  maximum time:     20.202 ns (0.00% GC)
  samples:          10000
  evals/sample:     999

julia> @benchmark log($(rx)[])
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     5.069 ns (0.00% GC)
  median time:      5.094 ns (0.00% GC)
  mean time:        5.104 ns (0.00% GC)
  maximum time:     18.715 ns (0.00% GC)
  samples:          10000
  evals/sample:     1000

Meaning the only difference between the “cse” version and the naive version are that the cse-version got to use a faster (scalar) log function (SLEEFPirates.log's advantage is that the compiler can vectorize it).

Besides the compiler already being able to handle some basic transformations, because my approach involves defining functions, it can take advantage of multiple dispatch.
That is, I can define different methods for PoissonNormalApprox(x::AbstractVector, ::Number), PoissonNormalApprox(x::Integer, ::Number), and PoissonNormalApprox(x::AbstractVector, ::AbstractVector, taking care to optimize each version.

If by manual tuning you’re referring to the register-allocating, VectorizationBase (a dependency of PaddedMatrices) provides host-specific CPU information that the algorithms use (in particular, register-size and register-count).

You need to have defined the expressions of each probability distribution as well.
So, at some point, both of us need to have predefined the building blocks we’re using.

Although hand-optimizing all these building blocks will of course take a long time.

While I might have to figure out some form of cse between pdf-calls, for now I will rely entirely on dispatch, as in defining ::AbstractVector and ::Number methods.

I do think there is an advantage to hiding code behind the veil of dispatch. Having the modelling language work on the level of pdf-families (with multiple dispatch determining the code getting run), is also high level.

But I recall you saying one of the goals of your project is performing model manipulations and transformations. With that goal, it’s probably a lot better to have access to the pdf’s expressions, instead of hiding them behind functions.

I was planning on eventually implementing the edge pushing Hessian algorithm, for use with third order asymptotic approximations (note: I wrote that library before learning much about writing optimized code, but I think the slowest part is still probably going to be calculating the Hessians, where I’d expect Soss to do much better than ForwardDiff if you can use SymPy for that).

Correction: I meant “The interface for pdf functions”, not “The interface for array functions”.
I could define these as adjoints for Zygote, but they’ll probably have to work by calculating all the gradients, and then just discard the ones that aren’t needed. This can still be much faster than the alternatives, as in the matrix normal example, but is less than ideal.

[quote=“cscherrer, post:18, topic:23212”]I’m curious, there’s potential for a lot of data-parallelism here, especially with larger data sets. And I’ve seen some work on HMC on a GPU. It’s also easier in Julia than most languages.

Any thoughts on this? I know there are lots of subtleties around memory layout, but I don’t have a sense how this carries over to GPUs[/quote]

I believe Stan will support GPUs eventually, but I believe the algorithms need double precision.
Excitingly, there is an AMDGPUnative.jl in development, so Julia users will eventually have an alternative to CUDAnative.
One reason I find this especially exciting is that AMD’s Radeon VII is affordable and has good double precision performance.

All the other GPUs with strong double precision performance cost many times as much money.

Sadly, while GPUs are something I really want to experiment with more, I haven’t really gotten around to it yet.
Every time I mention my love of avx-512 on hacker news, people say I should just use a GPU instead. I’m not convinced that’s really possible or profitable for the types of MCMC workloads I (and many other statisticians fitting Bayesian models) do, but the potential is definitely worth looking into.

I have a Vega64, which can be thought of as having 256 cores with vectors that hold 64 numbers. I think that view is easier than imagining every core doing the exact same thing like it’s normally presented. I don’t want to run tens of thousands of chains, and even if I did I bet memory would be come a problem.
So, I think the question is, how and where can a model be split up into separate chains and vectorizable chunks within chains.

Ryzen and most recent Intel processors have avx2, while the specific Intel cpus I was using have avx512. This means that these have vectors twice as wide, and benefit roughly twice as much from vectorization.
Ryzen generations 1 and 2 also only have 128 bit fma units, so they only have a quarter of the fused-multiply-add throughput of the avx-512 cpus (which have two 512 bit units).
Other operations, like the loads and stores are full speed, however. Generating less heat also means they may run at a higher clock speed, but I haven’t really found that to be an issue.
(Note that the Ryzen cpus coming out this year will have two 256 bit units, the same as Intel desktop processors,)

Regarding memory layout, PaddedMatrices pads arrays so that the length of the first dimension is a multiple of one of these vector widths. That means if A is an 11x11 matrix, stride(A,2) == 12 on an avx2 platform, and stride(A,2) == 16 with avx512.
The advantage of this is that to load a column of A, it only requires 3 and 2 vector loads for avx2 and avx512, respectively. Otherwise, it would take 2 or 1 vector load(s) + 3 scalar loads. As shown here, the advantage this padding provides over StaticArrays.jl can be fairly substantial.

An example that I’ve been doing the same across CPUs: I have been storing some NxN triangular and symmetric matrices as N diagonal elements followed by the N*(N-1)/2 elements from the lower or upper triangle. This is N*(N+1)/2 elements total, instead of the N^2 elements.
Separating out the diagonals helps mostly for the sake of things like calculating determinants of the triangular matrices, or the LKJ-correlation-cholesky pdf.
I’m not totally sold on that approach. At large sizes, a standard layout to use BLAS/LAPACK will be much faster.

@chakravala Will it also do reverse auto diff, or create symbolic expressions via Reduce?