Julia calling Stan C++ directly

Ideally, this problem will be solved.
However, I’m @inline-ing all the functions in SLEEFPirates, so the compiler can figure this out and do this transformation itself:

julia> using SLEEFPirates, Distributions, BenchmarkTools

julia> function poisson_normal_approx_lpdf(λ::T, x::AbstractVector) where {T}
           target = zero(T)
           for xᵢ ∈ x
               target -= SLEEFPirates.log(λ)/2 + SLEEFPirates.log(T(π))/2 + SLEEFPirates.log(T(2))/2 + (xᵢ - λ)^2/(2λ)
           end
           target
       end
poisson_normal_approx_lpdf (generic function with 1 method)

julia> function poisson_normal_approx_cse_lpdf(λ::T, x::AbstractVector) where {T}
           N = length(x)
           target = zero(T)
           for xᵢ ∈ x
               target += xᵢ - xᵢ^2/(2λ)
           end
           target - N*λ / 2 - N*log(λ) / 2 - N*log(π)/2 - N*log(2)/2
       end
poisson_normal_approx_cse_lpdf (generic function with 1 method)

julia> p = Poisson(25.0)
Poisson{Float64}(λ=25.0)

julia> x = rand(p, 500);

julia> @btime sum(y -> logpdf($p, y), $x)
  22.479 μs (0 allocations: 0 bytes)
-1512.3198814595462

julia> @btime poisson_normal_approx_lpdf(25.0, $x)
  463.797 ns (0 allocations: 0 bytes)
-1513.5282228193805

julia> @btime poisson_normal_approx_cse_lpdf(25.0, $x)
  452.102 ns (0 allocations: 0 bytes)
-1513.5282228193912

julia> @btime sum(y -> logpdf(Normal(25.0, sqrt(25.0)), y), $x)
  3.818 μs (0 allocations: 0 bytes)
-1513.5282228193805

The speed difference is because Base.log is faster than SLEEFPirates.log for scalars:

julia> @benchmark SLEEFPirates.log($(rx)[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.811 ns (0.00% GC)
  median time:      9.916 ns (0.00% GC)
  mean time:        9.921 ns (0.00% GC)
  maximum time:     20.202 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark log($(rx)[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.069 ns (0.00% GC)
  median time:      5.094 ns (0.00% GC)
  mean time:        5.104 ns (0.00% GC)
  maximum time:     18.715 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

This means the only difference between the “cse” version and the naive version is that the cse version got to use a faster (scalar) log function (SLEEFPirates.log’s advantage is that the compiler can vectorize it).

Besides the compiler already being able to handle some basic transformations, because my approach involves defining functions, it can take advantage of multiple dispatch.
That is, I can define different methods for PoissonNormalApprox(x::AbstractVector, ::Number), PoissonNormalApprox(x::Integer, ::Number), and PoissonNormalApprox(x::AbstractVector, ::AbstractVector), taking care to optimize each version; a sketch follows below.
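
To make that concrete, here is a minimal sketch of what such methods could look like (the bodies are illustrative, not the actual implementation):

# Scalar count, scalar rate: the normal approximation's logpdf evaluated directly.
PoissonNormalApprox(x::Integer, λ::Number) = -(log(2π*λ) + (x - λ)^2/λ)/2

# Vector of counts, scalar rate: hoist everything that doesn't depend on xᵢ
# out of the loop, as in the "cse" version above.
function PoissonNormalApprox(x::AbstractVector, λ::Number)
    N = length(x)
    s = zero(float(λ))
    @inbounds @simd for i ∈ eachindex(x)
        s += x[i] - x[i]^2/(2λ)
    end
    s - N*(λ + log(2π*λ))/2
end

# Vector of counts, vector of rates: nothing can be hoisted, so vectorize
# the full expression instead.
function PoissonNormalApprox(x::AbstractVector, λ::AbstractVector)
    s = zero(float(eltype(λ)))
    @inbounds @simd for i ∈ eachindex(x, λ)
        s -= (log(2π*λ[i]) + (x[i] - λ[i])^2/λ[i])/2
    end
    s
end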

If by manual tuning you’re referring to the register allocation, VectorizationBase (a dependency of PaddedMatrices) provides host-specific CPU information that the algorithms use (in particular, register size and register count).
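
For example, something along these lines (the exact name is an assumption on my part, since the VectorizationBase API has shifted between versions):

using VectorizationBase

# How many Float64 elements fit in one SIMD register on this host:
# 4 with avx2, 8 with avx512. The register count is exposed similarly.
W = VectorizationBase.pick_vector_width(Float64)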

You need to have defined the expressions of each probability distribution as well.
So, at some point, both of us need to have predefined the building blocks we’re using.

Hand-optimizing all these building blocks will of course take a long time, though.

While I might have to figure out some form of cse between pdf-calls, for now I will rely entirely on dispatch, as in defining ::AbstractVector and ::Number methods.

I do think there is an advantage to hiding code behind the veil of dispatch. Having the modelling language work at the level of pdf families (with multiple dispatch determining the code that gets run) is also high level.

But I recall you saying one of the goals of your project is performing model manipulations and transformations. With that goal, it’s probably a lot better to have access to the pdfs’ expressions instead of hiding them behind functions.

I was planning on eventually implementing the edge-pushing Hessian algorithm for use with third-order asymptotic approximations. (Note: I wrote that library before learning much about writing optimized code, but I think the slowest part is still probably going to be calculating the Hessians, where I’d expect Soss to do much better than ForwardDiff if you can use SymPy for that.)

Correction: I meant “The interface for pdf functions”, not “The interface for array functions”.
I could define these as adjoints for Zygote, but they’ll probably have to work by calculating all the gradients and then just discarding the ones that aren’t needed. This can still be much faster than the alternatives, as in the matrix normal example, but it is less than ideal.
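
For illustration, an adjoint for the Poisson-normal approximation above might look roughly like this (a hand-derived pullback that always computes both gradients; the function names are hypothetical):

using Zygote

# Same density as the "cse" version above, as a single function.
pois_normal_lpdf(λ, x) = sum(xᵢ - xᵢ^2/(2λ) for xᵢ ∈ x) - length(x)*(λ + log(2π*λ))/2

Zygote.@adjoint function pois_normal_lpdf(λ, x)
    val = pois_normal_lpdf(λ, x)
    function pois_normal_lpdf_pullback(Δ)
        N = length(x)
        # Gradients with respect to *both* arguments are computed,
        # even when the caller only needs one of them.
        ∂λ = Δ * (sum(abs2, x)/(2λ^2) - N/2 - N/(2λ))
        ∂x = Δ .* (1 .- x ./ λ)
        (∂λ, ∂x)
    end
    return val, pois_normal_lpdf_pullback
end

# Zygote.gradient(λ -> pois_normal_lpdf(λ, x), 25.0) then simply discards ∂x.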

[quote=“cscherrer, post:18, topic:23212”]I’m curious, there’s potential for a lot of data-parallelism here, especially with larger data sets. And I’ve seen some work on HMC on a GPU. It’s also easier in Julia than most languages.

Any thoughts on this? I know there are lots of subtleties around memory layout, but I don’t have a sense how this carries over to GPUs[/quote]

I believe Stan will eventually support GPUs, but I think the algorithms need double precision.
Excitingly, there is an AMDGPUnative.jl in development, so Julia users will eventually have an alternative to CUDAnative.
One reason I find this especially exciting is that AMD’s Radeon VII is affordable and has good double precision performance.

All the other GPUs with strong double precision performance cost many times as much money.

Sadly, while GPUs are something I really want to experiment with more, I haven’t really gotten around to it yet.
Every time I mention my love of avx-512 on hacker news, people say I should just use a GPU instead. I’m not convinced that’s really possible or profitable for the types of MCMC workloads I (and many other statisticians fitting Bayesian models) do, but the potential is definitely worth looking into.

I have a Vega64, which can be thought of as having 256 cores with vectors that hold 64 numbers. I find that view easier than imagining every core doing the exact same thing, the way it’s normally presented. I don’t want to run tens of thousands of chains, and even if I did, I bet memory would become a problem.
So I think the question is how and where a model can be split up into separate chains, and into vectorizable chunks within chains.

Ryzen and most recent Intel processors have avx2, while the specific Intel CPUs I was using have avx512; the latter have vectors twice as wide and benefit roughly twice as much from vectorization.
Ryzen generations 1 and 2 also have only 128-bit FMA units, so they have a quarter of the fused-multiply-add throughput of the avx-512 CPUs (which have two 512-bit units).
Other operations, like loads and stores, run at full speed, however. Generating less heat also means they may run at a higher clock speed, but I haven’t really found that to be an issue.
(Note that the Ryzen CPUs coming out this year will have two 256-bit units, the same as Intel desktop processors.)
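
As a back-of-the-envelope check on that factor of four (counting 2 flops per FMA per 64-bit lane):

# Peak double-precision flops per cycle from the FMA units alone.
fma_flops_per_cycle(vector_bits, fma_units) = 2 * fma_units * (vector_bits ÷ 64)

fma_flops_per_cycle(512, 2)  # avx-512, two 512-bit units: 32
fma_flops_per_cycle(256, 2)  # avx2, two 256-bit units:    16
fma_flops_per_cycle(128, 2)  # Ryzen 1000/2000, two 128-bit units: 8 (a quarter of 32)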

Regarding memory layout, PaddedMatrices pads arrays so that the length of the first dimension is a multiple of one of these vector widths. That means if A is an 11x11 matrix, stride(A,2) == 12 on an avx2 platform, and stride(A,2) == 16 with avx512.
The advantage is that loading a column of A then requires only 3 or 2 vector loads for avx2 and avx512, respectively; without padding it would take 2 or 1 vector load(s) plus 3 scalar loads. As shown here, the advantage this padding provides over StaticArrays.jl can be fairly substantial.
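
The padded stride is just the leading dimension rounded up to a multiple of the SIMD width (a sketch, not the PaddedMatrices internals):

# W = number of Float64 per register: 4 with avx2, 8 with avx512.
padded_stride(N, W) = W * cld(N, W)

padded_stride(11, 4)  # 12 on an avx2 platform
padded_stride(11, 8)  # 16 with avx512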

An example of something I’ve been doing the same way across CPUs: I store some NxN triangular and symmetric matrices as the N diagonal elements followed by the N*(N-1)/2 elements from the lower or upper triangle. This is N*(N+1)/2 elements total, instead of N^2.
Separating out the diagonals helps mostly for the sake of things like calculating determinants of the triangular matrices, or the LKJ-correlation-cholesky pdf.
I’m not totally sold on that approach; at large sizes, a standard layout that can be handed to BLAS/LAPACK will be much faster.
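
In code, the packing might look something like this (a sketch of the layout only, not the actual implementation):

# Diagonal first, then the strict lower triangle in column-major order:
# N + N*(N-1)/2 = N*(N+1)/2 numbers instead of N^2.
function pack_symmetric(S::AbstractMatrix)
    N = size(S, 1)
    packed = Vector{eltype(S)}(undef, N*(N+1) ÷ 2)
    for i ∈ 1:N
        packed[i] = S[i, i]
    end
    k = N
    for j ∈ 1:N-1, i ∈ j+1:N
        k += 1
        packed[k] = S[i, j]
    end
    packed
end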

@chakravala Will it also do reverse auto diff, or create symbolic expressions via Reduce?