[ANN] ZigZagBoomerang.jl

Here is a MWE. I would want x[2] to always be positive. Thanks again.

using ZigZagBoomerang, Distributions, ForwardDiff, LinearAlgebra, SparseArrays

data = rand(Normal(3,10),20)
g(x) =sum(logpdf.(Normal(x[1],x[2]),data))

function ∇ϕ(x, i) 
   d = -ForwardDiff.gradient(g,x)
   d[i]
end
n= 2
t0 = 0.0
x0 = abs.(randn(n))
θ0 = rand([-1.0,1.0], n)
c = 1.0*ones(length(x0))
Z = ZigZag(sparse(I(n)), x0*0);
T = 200.0

trace, (tT, xT, θT), (acc, num) = pdmp(∇ϕ, t0, x0, θ0, T, c, Z; adapt=true)

  • List item
1 Like

We are going to a redesign to allow mix real valued parameters, spike and slab priors and intervals of the form [a, ∞], [a, b], [-∞, b]

using ZigZagBoomerang, Distributions, ForwardDiff, LinearAlgebra, SparseArrays, StructArrays
const ZZB = ZigZagBoomerang
## Problem
d = 2
n = 10
xtrue = [-3.0, 3.0]
data = rand(Normal(xtrue[1], xtrue[2]), n)
g(x) = sum(logpdf(Normal(x[1], x[2]), dt) for dt in data) 

## Negative partial derivative maker
function negpartiali(f, d)
   id = collect(I(d))
   ith = [id[:,i] for i in 1:d]
   function (x, i, args...)
       sa = StructArray{ForwardDiff.Dual{}}((x, ith[i]))
       δ = -f(sa, args...).partials[]
       return δ
   end
end

## Sampler

# Starting point
t0 = 0.0
x0 = [2.0, 5.0]
θ0 = rand([-1.0,1.0], d)
u0 = ZZB.stickystate(x0)

# Dynamics
Z = ZigZag(sparse(I(n)), x0*0);
flow = ZZB.StickyFlow(Z)

# Duration
T = 2000.0
end_time = ZZB.EndTime(T)


# Target 
G = [i=>collect(1:d) for i in 1:d] # Sparsity target
target = ZZB.StructuredTarget(G, negpartiali(g, d))

# Barriers
c = 1.0*ones(length(x0))
κ = Inf # Inverse waiting time
barriers = [ZZB.StickyBarriers(), # No barrier
            ZZB.StickyBarriers((2.5, Inf), (:reflect, :reflect), (κ, κ)) # instantaneously reflect at 2.5 and at "infinity"
   ]

# Rejection bounds
strong = false
c = 20*[1.0 for i in 1:d]
adapt = true # adapt bounds
factor = 1.5
G1 = [i => [i] for i in 1:d] # Sparsity pattern bounds
G2 = [i => setdiff(union((G1[j].second for j in G1[i].second)...), G[i].second) for i in eachindex(G1)]
upper_bounds = ZZB.StickyUpperBounds(G1, G2, 1.0sparse(I(d)), strong, adapt, c, factor)

# Sample
trace, _, _, acc = @time ZZB.stickyzz(u0, target, flow, upper_bounds, barriers, end_time)
println("acc ", acc.acc/acc.num)

# Discretize on dynamic grid for plotting
global ts1, xs1 = ZZB.sep(collect(trace))

# Discretize on fixed grid for means
dt = 0.5
ts, xs = ZZB.sep(collect(discretize(trace, dt)))
@show mean(xs)


# Visualize
using GLMakie
fig1 = fig = Figure()
r = 1:length(ts1)
ax = Axis(fig[1,1], title = "trace 1")
lines!(ax, ts1[r], getindex.(xs1[r], 1))
ax = Axis(fig[2,1], title = "trace 2")
lines!(ax, ts1[r], getindex.(xs1[r], 2))

ax = Axis(fig[1:2,2], title = "phase")
lines!(ax, getindex.(xs1[r], 1), getindex.(xs1[r], 2))

save(joinpath(@__DIR__, "positivity.png"), fig1)
display(fig1)

5 Likes

New version 0.11 is out! This version brings a practical implementation of the Bouncy Particle sampler.

The sampler is defined and has essentially three tuning parameters: how often and how much the search direction/momentum is randomised, and possible a root of the mass matrix (e.g. Cholesky)

using LinearAlgebra
using ZigZagBoomerang
M = I # mass matrix

BPS = BouncyParticle(∅, ∅, # ignored
    10.0, # momentum refreshment rate 
    0.95, # momentum correlation ρ ∈ [0,1] / only gradually change momentum in refreshment/momentum update
    0.0, # ignored
    M # root of momentum precision/mass matrix, or identity
) 

So now let’s say we want to sample a multivariate density target p(x) .

We’ll need to build two functions: dneglogp, ∇neglogp!

  • d1, d2 = dneglogp(_, x, v) computes for f(t) = -log(p(x + t*v)) the first directional derivative d1 = f'(0) of negative target log-likelihood p at x
    in direction v and the corresponding second derivative d2 = f''(0).

  • ∇neglogp!(y, _, x) computes the gradient of negative target log-likelihood -log(p(x)) at x, saving it in y (e.g. using ForwardDiff.gradient!).

If you have a Turing model model, ./turing/lr.jl has a convenient function creating these two:

dneglogp, ∇neglogp! = make_derivatives_neglogp(model)

We need both! This is in fact very nice: computing a directional derivative is much cheaper than a gradient, and we are getting away with calling dneglop 90%-99% of the time!
And essentially we are doing line search: we move along direction v, probing the slope of -log(p(x + t*v)) until we need to change direction. Only then we need to compute the gradient to find a new direction to explore.

From here calling the sampler is it’s straight forward. Let’s define initial state and

d = 25 # number of parameters 
t0 = 0.0 # nominal starting time of sampler
x0 = zeros(d) # starting guess sampler
θ0 = randn(d) # starting search direction/momentum sampler
T = 200. # end time (similar to number of samples in MCMC)
c = 5.0 # initial guess for the curvature of the target

Thus, we can call the sampler, the signature is as follows:

trace, final, (acc, num), cs = @time pdmp(
        dneglogp, # return first two directional derivatives of negative target log-likelihood in direction v
        ∇neglogp!, # return gradient of negative target log-likelihood
        t0, x0, θ0, T, # initial state and duration
        ZigZagBoomerang.LocalBound(c), # use Hessian information 
        BPS; # sampler
        adapt=true, # adapt bound c
        progress=true, # show progress bar
        subsample=true # keep only samples at refreshment times
)

Voila, some samples in trace (a vector of pairs t => x) where x is a sample and t a nominal time that can be ignored when subsample=true:

julia> typeof(last.(trace))
Vector{Vector{Float64}} (alias for Array{Array{Float64, 1}, 1})

Check out ZigZagBoomerang.jl/turing for an example using a Turing.jl model and a comparison with NUTS!

9 Likes

By the way, using a unit mass matrix is not really doing the sampler justice, doing a preliminary run getting some samples x_prelim and setting

M = Diagonal(1 ./ std(x_prelim))
BPS = BouncyParticle(∅, ∅, # ignored
    10.0, # momentum refreshment rate 
    0.95, # momentum correlation ρ ∈ [0,1] / only gradually change momentum in refreshment/momentum update
    0.0, # ignored
    M # root of momentum precision/mass matrix, or identity
) 
θ0 = M\randn(d) # scaled starting direction sampler

will reduce the run-time down to seconds

2 Likes

Oh wow, that’s amazing! Hmm, maybe that should be a default? Possibly even defaulting to a mass matrix from Pathfinder.jl? @sethaxen ?

2 Likes

Thanks for pointing that out. I saw the package before but misunderstood what it was good for! Now having a mass matrix and a MAP is great, the map can be used by us as control variate in computing stochastic gradients instead of going over the full data

2 Likes

Whoohoo, with Pathfinder.jl!

julia> nuts_chain
Chains MCMC chain (1000×37×1 Array{Float64, 3}):

Iterations        = 501:1:1500
Number of chains  = 1
Samples per chain = 1000
Compute duration  = 367.67 seconds
julia> bps_chain
Chains MCMC chain (1233×25×1 Array{Float64, 3}):

Iterations        = 1:1:1233
Number of chains  = 1
Samples per chain = 1233
Compute duration  = 6.67 seconds (+  6.33 seconds for Pathfinder)

@sethaxen ZigZagBoomerang.jl/lr.jl at master · mschauer/ZigZagBoomerang.jl · GitHub

3 Likes

Nice! Depending on how you use the mass matrix, you may find Pathfinder.WoodburyLeftInvFactor useful. It’s a matrix A that wraps the estimated covariance matrix \Sigma (i.e. inverse metric) such that A A^\top = \Sigma^{-1}, i.e. can be used in place of the lower triangular Cholesky factor of the metric. Currently it only exists for compatibility with DynamicHMC and is not part of the API, but if it’s useful for you, I can move it into the main source and make it part of the API.

2 Likes

Pathfinder does compute the MAP (when it can), but this is not the result.fit_distribution it returns. Instead, the fit distribution is the distribution along the optimization trajectory that maximizes the ELBO, which as shown in the Pathfinder paper does a better job of approximating the target distribution. But if optimization succeeded (i.e. converged on a mode; this is not checked), you could access the MAP estimate with result.optim_trace.points[end] or the estimated distribution at the MAP (a “low rank” Laplace approximation) with result.fit_distributions[end].

1 Like

Yeah, I compute reflection angles with respect to a metric

function reflect!(∇ϕx, x, θ, F::BouncyParticle)
    θ .-= (2*dot(∇ϕx, θ)/normsq(F.L\∇ϕx))*(F.L'\(F.L\∇ϕx))
    θ
end

where L triangular (your A?). I could also use a factor of Σ instead

Both are straightforward to implement, and I could add them to Pathfinder. I’ve also been thinking that it would be nice to have a lightweight package that, given positive semidefinite A could compute the following factors:

  • some L such that \langle L x, L y\rangle=\langle x, A y \rangle
  • some U such that \langle U^\top x, U^\top y\rangle=\langle x, A y \rangle
  • some \tilde{L} such that \langle \tilde{L} x, \tilde{L} y\rangle=\langle x, z \rangle where A z = y
  • some \tilde{U} such that \langle \tilde{U}^\top x, \tilde{U}^\top y\rangle=\langle x, z \rangle where A z = y

The returned factor would be required to implement * by matrices and vectors, as well as the in-place mul!, lmul!, and ldiv!. The default would be to compute cholesky and return the requested factor or fall back to svd for a PSD matrix.

The real purpose of the package would be to implement the API that packages that implement specialized PSD matrices can overload to deliver these factors when they are available. It would basically be an alternative to PDMats that doesn’t require one implement a custom PDMat wrapper for every possible structured matrix one might want. Instead, one just needs to focus on implementing factor types like these. What do you think?

This sounds a lot like the way we do affine transforms in MeasureTheory.jl. I generally avoid PDMats, but PositiveFactorizations.jl is pretty great for the semidefinite case:

Ah yes, PositiveFactorizations seems better than SVD. Either way, it seems there may be a place for a package like this that doesn’t depend on Distributions, MeasureTheory, or PDMats, and which can be extended by packages that implement custom matrices.

1 Like

Maybe a silly question but why are the sizes of the MC chains different in the two cases?

That’s a great question:

Left is how the samples of a MCMC sampler look like. Right is the trajectory of a BPS (in one coordinate) looks like and how it is saved. The line itself is the sample. As you perhaps don’t want an entire line, I transform it into ordinary samples. If you would just pick the points of direction changes you would get a bias (direction changes happen in the corners V and Λ, the corners are not representative for the entire line - they are farer away from zero than the line on average). To avoid that, we sample the trajectory at random points… but then there is a bit of uncertainty how many points you get exactly. In any case, what matters is the effective sample size and not the absolute number of samples

3 Likes

Got it. Thanks for the clarification. :pray:t2: Looking forward to playing with this. The benchmark against NUTS is really impressive.

1 Like

@mschauer Can you comment on the ESS/second of the two methods for this example? Also, taking NUTS draws as ground truth, do you see any bias in the ZigZagBoomerang draws? For me these would be more informative comparisons than raw runtime.

2 Likes

I estimate BPS to have 3 times as many samples per second, see how the BPS running averages (green) converge faster, but the NUTS running averages (red) eventually agree (x-axis = computing time in seconds) and both approach the posterior mean

6 Likes