[ANN] ZigZagBoomerang.jl

Dear all!

We would like to announce our package together with our new paper “Sticky PDMP samplers for sparse and local inference problems” (Arxiv link)

ZigZagBoomerang.jl

Sleek implementations of the ZigZag, Boomerang and other assorted piecewise deterministic Markov processes for Markov Chain Monte Carlo including Sticky PDMPs for variable selection

1. What is this about

ZigZagBoomerang.jl contains user-friendly and clean implementations of the most prominent piecewise deterministic Monte Carlo methods. These Monte Carlo methods are fast (non-reversible, with momentum), allow subsampling of the data without introducing bias, and can be used for sparse and local inference problems.

2. What can it do

Within a few lines of code, you can run challenging problems with large data sets and in high-dimensional spaces. The only ingredients the user has to provide are (the gradient of) the target log-density and perhaps a rough estimate of the posterior mean.

For example, the gradient of a Gaussian log-density is linear and can be given by a sparse precision matrix Γ.
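To spell it out (assuming, as the snippet below suggests, that ϕ denotes the negative log-density, i.e. the potential): for a centred Gaussian with precision matrix Γ one has

ϕ(x) = ½ xᵀΓx,  so that  ∂ᵢϕ(x) = (Γx)ᵢ = Σⱼ Γᵢⱼ xⱼ,

which for symmetric Γ equals dot(Γ[:, i], x); since Γ is sparse, each partial derivative only touches a handful of entries, and that is what idot exploits in the code below.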

using ZigZagBoomerang, SparseArrays

n = 50 # example value; the state below has n*n coordinates
Γ = ... # sparse precision matrix

# Define ∇ϕ(x, i, Γ) giving the partial derivative of ϕ(x) with respect to x[i]
∇ϕ(x, i, Γ) = ZigZagBoomerang.idot(Γ, i, x) # more efficient than dot(Γ[:, i], x)

# Random initial values
t0 = 0.0
x0 = randn(n*n)
θ0 = rand([-1.0,1.0], n*n)

# Rejection bounds
c = 1.0*ones(length(x0))

# Define ZigZag
Z = ZigZag(Γ, x0*0)

# Run sparse ZigZag for T time units and collect trajectory
T = 20.0
@time trace, (tT, xT, θT), (acc, num) = spdmp(∇ϕ, t0, x0, θ0, T, c, Z, Γ; adapt=true)
@time traj = collect(discretize(trace, 0.1))

(Figure: Zig-Zag trajectory for the sparse Gaussian example)

3. Using ZigZag and friends as backends in your probabilistic programming environment

We have written ZigZagBoomerang.jl to be used as a sampling engine for probabilistic programming languages. Let’s demo this with Soss.jl, where work on Zig-Zag integration is already under way:

using Soss
m = @model x begin
    α ~ Normal()
    β ~ Normal()
    yhat = α .+ β .* x
    y ~ For(eachindex(x)) do j
        Normal(yhat[j], 2.0)
    end
end

# generate data from model
x = randn(20);
obs = -0.1 .+ 2x + 1randn(20); 
posterior = m(x=x) | (y=obs,)
using ZigZagBoomerang

include("zigzag.jl") # glue code defining `zigzag(posterior, T)` for Soss models (see zigzag.jl below)

T = 100.0
trace, final, (num, acc) = @time zigzag(posterior, T) # sample Soss posterior

# trace is a continuous-time object; discretize it to obtain samples
ts, xs = ZigZagBoomerang.sep(discretize(trace, 0.1))

ts # time points
xs # vector of points

(see also zigzag.jl)
Let’s plot

using Plots
p = plot(ts, getindex.(xs, 1))
plot!(ts, getindex.(xs, 2), color=:red)
p2 = plot(first.(xs), last.(xs))
p

(Figures: Zig-Zag traces of the two coordinates over time, and the joint trace)

24 Likes

Cool! The arxiv link is not working

So what about sparsity?

The Zig-Zag allows (that is what our paper is about) spike-and-slab priors for variable selection. Let’s demo this with Soss.

Let’s use MeasureTheory.jl’s SpikeMixture (a very spiky spike and slab with a Dirac measure as the spike) to indicate that we have grounds to believe the coefficients are actually zero:

m = @model x begin
    α ~ SpikeMixture(Normal(), 0.2) # 0.2*Normal() + 0.8*Dirac(0)
    β ~ SpikeMixture(Normal(), 0.2)
    yhat = α .+ β .* x
    y ~ For(eachindex(x)) do j
        Normal(yhat[j], 2.0)
    end
end


x = randn(20);
obs = -0.1 .+ 2x + 1randn(20); 
T = 100.0
posterior = m(x=x) | (y=obs,)

Sampling with the sticky ZigZag:


include("sparsezigzag.jl")
trace, final, (num, acc) = @time sparse_zigzag(posterior, T, c=50)

ts, xs = ZigZagBoomerang.sep(discretize(trace, 0.1))
xs = xform(posterior).(xs) # make named tuples `NamedTuple{(:β, :α)}`

Now we can answer, for example, what the posterior probability is that the parameters are zero:

julia> mean(getindex.(xs, :α) .== 0)
0.7211394302848576

julia> mean(getindex.(xs, :β) .== 0)
0.028985507246376812

(Figure: trace of the sticky Zig-Zag)

The nice thing is that this scales very well…

4 Likes

For example, a cloud of SDE-driven boids.

Each boid is in love with another one and follows it… but whom? It also hates one and avoids it… which one we don’t know. That is a sparse estimation problem, and without a sparsity-inducing prior there would not be enough signal in the noise to estimate the n*(n-1)/2 love or hate relationships. But with a spike-and-slab prior


Figure: Thin vertical lines indicate the distance to the truth. True zeros are plotted with the symbol ×, others are plotted as points with a color gradient corresponding to the ground truth.

we estimate the love-and-hate matrix with something like 2500 unknowns reasonably well (compared to thresholding).

3 Likes

Nice package :slight_smile: At the end of my doctorate I wrote a package for PDMP samplers, PDSampler.jl, which is now pretty outdated and which I unfortunately don’t have the energy to maintain anymore. It also implemented the ZZ and BPS samplers as well as funkier stuff on factor graphs. It’s nice to see a fresher and better package take over! Good luck!

5 Likes

Thank you! I see that I starred your package and then must have forgotten about it; sorry for not keeping you in the loop. Do you have something written on the fancy factor graph stuff? That’s very interesting for us; we thought quite a bit about PDMPs on factor graphs. We should also check whether there is something to port from your package. @SebaGraz

1 Like

No worries, I didn’t expect you to (especially given the package is not actively maintained anymore). Re factor graphs, you could check this part of the docs; some references are [1510.02451] The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method and maybe [1701.04244] Piecewise Deterministic Markov Processes for Scalable Monte Carlo on Restricted Domains :slight_smile:

2 Likes

We haven’t yet done the BPS on a factor graph, so that goes onto the to-do list.

What we can easily add is sampling processes constrained to be positive! Because we already compute the time it takes to hit the coordinate axes and put it into the priority queue, we can also just do a boundary reflection as in your https://arxiv.org/pdf/1701.04244.pdf instead of freezing to the axis as in the sticky version. That’ll be useful for sampling parameters supported on [0, ∞) without reparametrising.
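To make the idea concrete, here is a minimal hypothetical sketch (not the ZigZagBoomerang.jl API; hitting_time and reflect_at_axis! are made-up names) of what this means for a constant-velocity sampler like the Zig-Zag, keeping x[i] ≥ 0 by reflecting instead of freezing:

# hypothetical sketch, not the package API
# time until coordinate i hits the boundary x[i] = 0 under constant velocity θ
hitting_time(x, θ, i) = θ[i] < 0 ? -x[i] / θ[i] : Inf

# at the axis-aligned boundary x[i] = 0, the specular reflection simply flips θ[i]
reflect_at_axis!(θ, i) = (θ[i] = -θ[i]; θ)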

2 Likes

This is really interesting. Any thoughts on how far this boundary reflection can be taken? Could this be used to sample from arbitrary convex domains?

1 Like

The only restriction mentioned in https://arxiv.org/pdf/1701.04244.pdf is to have an open, pathwise-connected domain with a Lipschitz boundary. From a computational point of view, you need to be able to evaluate the boundary and its angle in order to compute hitting times and reflections at the boundary. One last thing: the process might have some funny recurrent behaviour (I am thinking of the Zig-Zag targeting a uniform density on a square), but this can easily be tackled by adding refreshments.
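For concreteness, the reflection itself is just the usual specular reflection of the velocity in the boundary normal; a minimal sketch, assuming you can evaluate a (not necessarily unit) normal n at the hitting point (reflect is an illustrative helper, not from the paper or the package):

using LinearAlgebra

# specular reflection of the velocity θ at a boundary point with normal n
reflect(θ, n) = θ - 2 * (dot(θ, n) / dot(n, n)) * n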

5 Likes

Yeah, so @SebaGraz nailed it. IIRC from coding it, you can indeed “in theory” make this work for any domain that meets these conditions, but there’s a bit of a hidden trick: you have to be able to compute the normal at the incidence point to perform the reflection quickly and efficiently (if you do this with a numerical approximation, you potentially lose unbiasedness). @SebaGraz said exactly that; I’m just stressing that it’s important.
For simple surfaces (basically polygons with few faces) this is easy; for other boundaries it can be pretty hard.

Polygonal boundaries are useful though, e.g. for GLMs with things like positivity constraints, as mentioned above.

4 Likes


Fresh from the YouTube press:

ZigZagBoomerang.jl - parallel inference and variable selection | Moritz Schauer | JuliaCon2021

11 Likes

Is there code for the boundary reflection somewhere? I’d really like to constrain some parameters to be positive. Thanks

I’ve got a sampler I call the White Box sampler. It goes in straight lines until it hits the boundary where the log probability density falls below some threshold, then it tries random directions on the hypersphere until it succeeds in stepping back into the high-probability region. It samples essentially uniformly from the high-probability-density region. By using this as an umbrella sampler together with a simple Metropolis-Hastings diffusive sampler, you can get quite effective sampling from a true posterior with zero need for gradient calculations.
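To make the description concrete, a rough hypothetical sketch of one step (whitebox_step and its arguments are placeholder names, not code from any package):

using LinearAlgebra

# walk straight while the log-density stays above the threshold; at the
# "boundary", try random directions on the hypersphere until one steps back
# into the high-probability region
function whitebox_step(logp, x, dir, stepsize, threshold)
    xnew = x + stepsize * dir
    logp(xnew) >= threshold && return (xnew, dir)  # still inside, keep going straight
    while true
        dir = normalize(randn(length(x)))          # uniform direction on the hypersphere
        xtry = x + stepsize * dir
        logp(xtry) >= threshold && return (xtry, dir)
    end
end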

However, I haven’t figured out how to plug this concept into the ecosystem, nor have I written a paper about it. If someone is interested I’d be happy to collaborate. I think the scheme has a lot going for it, especially its gradient-free nature.

1 Like

Sounds a little like Hit-and-Run with a slice sampler. Is that right?

Haha, I’d have to read about hit-and-run to be able to answer, but… sounds plausible. It was motivated by the fact that in higher dimensions the log probability density is nearly constant and all the volume is near the surface, so sampling uniformly inside the log-probability threshold is already nearly exact.

Also, “white box sampler” is motivated by the idea of a photon bouncing inside a perfectly reflective white-painted box. At each reflection it goes back towards the inside, but in a random direction.

1 Like

Hi @EvoArt, do you have only the positivity constraint? That’s easy to achieve; we’ll prepare something. @SebaGraz

Thanks! Yes, just a positivity constraint for a subset of parameters.

Almost there; do you have a shareable example?