[ANN] Vecchia.jl: linear-cost differentiable approximations for Gaussian log-likelihoods

Hi all. I’ve just registered Vecchia.jl, which provides functionality for a very convenient class of approximations for the Gaussian log-likelihood (as of now, only mean-zero, although for some technical reasons this is a bit less of a limitation than you might think). My preferred reference for those who like to read papers is Stein, Chi, and Welty (2004), but the idea can be described very simply. Any joint likelihood for a multivariate \textbf{z} can of course be decomposed into a product of conditional likelihoods as

\ell_{\theta}(\textbf{z}) = \ell_{\theta}(z_1) \prod_{j=2}^n \ell_{\theta}(z_j \; | \; \textbf{z}_{1:(j-1)}).

The point of Vecchia approximations is that this quantity can in some circumstances be very well-approximated with

\ell_{\theta}(\textbf{z}) \approx \ell_{\theta}(z_1) \prod_{j=2}^n \ell_{\theta}(z_j \; | \; \textbf{z}_{\sigma(j)}),

where \sigma(j) \subseteq \{1, \dots, j-1\} is typically O(1) in size. From this formulation it is reasonably direct to see that you end up evaluating O(n) conditional Gaussian likelihoods, each of O(1) size, giving a very attractive linear complexity in the data size. There are plenty of dials to turn, like taking “chunks” \textbf{z}_j instead of scalar measurements z_j, but for the sake of brevity I won’t discuss all of that in this initial announcement post.
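To make that structure concrete, here is a from-scratch toy version of the product of small conditional likelihoods (mean zero, scalar z_j, and \sigma(j) taken to be the m most recent indices). To be clear, this is just an illustrative sketch and not the package’s implementation, which works on chunks, picks conditioning sets more cleverly, and is threaded:

using LinearAlgebra

# Toy negative log-likelihood from the Vecchia decomposition above.
# kfn(x, y, p) is a covariance function, pts the locations, z the data.
function toy_vecchia_nll(kfn, pts, z, p, m)
  nll = 0.0
  for j in eachindex(z)
    cond = max(1, j - m):(j - 1)              # conditioning set sigma(j)
    ix   = vcat(collect(cond), j)
    S    = [kfn(pts[a], pts[b], p) for a in ix, b in ix]
    if isempty(cond)
      mu, v = 0.0, S[1, 1]
    else
      k  = length(cond)
      w  = S[1:k, 1:k] \ S[1:k, end]          # Sigma_cc^{-1} Sigma_cj
      mu = dot(w, z[cond])                    # conditional mean
      v  = S[end, end] - dot(S[1:k, end], w)  # conditional variance
    end
    nll += 0.5*(log(2*pi*v) + abs2(z[j] - mu)/v)
  end
  nll
end

Each iteration costs O(m^3), so the whole thing really is linear in n.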

As far as why this is sometimes a sensible thing to do: in the simplest case you could imagine \textbf{z} being a time series, and there is an obvious intuition that conditioning on the recent past (say, \sigma(j) = \{j-3, j-2, j-1\}) makes the present and the far past nearly independent. For those who would like references or a more concrete discussion of when exactly that is reasonably close to the truth, let me know and I can provide plenty of both here. But for the moment I won’t introduce any more technical or mathematical language unless somebody asks.

The package is pretty easy to use. The README gives a complete demo, but here is a slightly shortened version:

using LinearAlgebra, StaticArrays, Vecchia

# VERY IMPORTANT FOR MULTITHREADING, since this is many small BLAS/LAPACK calls:
BLAS.set_num_threads(1)

# Covariance function, in this case a Matérn with ν = 3/2:
kfn(x,y,p) = p[1]*exp(-norm(x-y)/p[2])*(1.0+norm(x-y)/p[2])

# Locations for fake measurements, in this case 2048 of them, and fake data
# (data NOT from the correct distribution; this is just a maximally simple demo):
const pts = [SVector{2, Float64}(randn(2)) for _ in 1:2048]
const dat = randn(length(pts))

# Create the VecchiaConfig:
const chunksize = 64
const num_conditioning_chunks = 3
const vecc = Vecchia.kdtreeconfig(dat, pts, chunksize, num_conditioning_chunks, kfn)

# Now you can evaluate the likelihood:
const sample_p = ones(2)
Vecchia.nll(vecc, sample_p)

Thanks to the beautiful Folds ecosystem of @tkf, this code thread-parallelizes very well. And as yet another bonus, it is fully AD-compatible, so you can use ForwardDiff.hessian and so on to perform higher-order optimization, which in my opinion is a must for GP estimation. The example files provide a full demonstration of using Ipopt directly to perform parameter estimation. I acknowledge that I’m an Ipopt cultist, but I really don’t think it gets any better than that.
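To give a taste of the AD compatibility, continuing from the demo above, this is all it takes to get the exact derivatives you would hand to a second-order optimizer:

using ForwardDiff

# Gradient and Hessian of the Vecchia negative log-likelihood from the
# demo above, e.g. for Newton-type optimization with Ipopt:
obj(p) = Vecchia.nll(vecc, p)
grad = ForwardDiff.gradient(obj, sample_p)
hess = ForwardDiff.hessian(obj, sample_p)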

Not demonstrated here but shown in the README, the package also provides functionality for the induced sparse approximation to the precision matrix. This depends even more on @tkf’s ecosystem, and now also on @elrod’s LoopVectorization, but with all those fancy tools, assembling the sparse approximation costs only about 8x the time of directly evaluating the many small likelihoods. All things considered, I think that’s pretty cool.

Also, I’m releasing this under the MIT license instead of the GPLv2 I’ve always used in the past, so if that was ever a deterrent to using code I’ve released, no need for that concern here. I’d really love to see more of the spatial statistics community adopt Julia, because I’m sick of writing R bindings so that R users can access all of the incredible functionality of Julia. So I’m crossing my fingers that this helps…

Once the three-day waiting period is up, you should be able to ]add Vecchia. If you try it and have any issues or comments or questions, please do reach out. This is my first time registering a package and dealing with things like compat bounds, so please also tell me if I’ve done something stupid there.


This looks great, I’m excited to check it out! @ElOceanografo I think you’ll find this interesting as well.


Yes! This looks great @cgeoga, can’t wait to play around with it.


Excellent! I’ll be very curious to hear from you and/or @jkbest2 if you do use it.

I should also clarify to anybody reading this thread that adding support for mean functions would be just a few lines of code. Since all the derivatives are AD-based, it would be sufficient to evaluate the likelihood on data - meanfun.(locations) instead of just data (see the sketch below). I primarily haven’t added it because it isn’t obvious what the “best” interface for it is. But if you or anybody else would be more interested if it had mean function support built in, I can easily plop that in and add it to the demo.
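For example, something like this sketch, where meanfun is a hypothetical user-supplied mean function of the location, would do it with the demo objects from the announcement:

# Hypothetical mean function of the spatial location:
meanfun(x) = 1.0 + 0.1*x[1]

# Center the data and proceed exactly as before:
resid  = dat .- meanfun.(pts)
vecc_c = Vecchia.kdtreeconfig(resid, pts, chunksize, num_conditioning_chunks, kfn)
Vecchia.nll(vecc_c, sample_p)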


Hey all, just an update that I’ve released a new version that now also offers the option of LoopVectorization-powered SIMD for the likelihood function. Particularly if you want sequential execution instead of threading, the SIMD easily buys you a factor of two or three speedup. As of now it’s not completely stable for autodiff, but in that case it just falls back to an @inbounds loop, which doesn’t come at a speed cost in the sequential case (I’m not sure exactly how this plays out on many-core systems). But in some applications people just want a fast likelihood (right?), so maybe even in its slightly incomplete form this will be of interest.

The usage is pretty simple once you’ve created the VecchiaConfig object like in the above code example or README:

# Rewrite the kernel function to accept scalar components of the two
# spatial coordinates:
function kfn_scalar(x1, x2, y1, y2, p)
  nrm = sqrt((x1-y1)^2 + (x2-y2)^2)
  p[1]*exp(-nrm/p[2])*(1+nrm/p[2])
end

# Change some internals so that looping over points is LV-approved:
const vecc_s = Vecchia.scalarize(vecc, kfn_scalar) 

# Enjoy the AVX:
Vecchia.nll(vecc_s, sample_p)

Huge thanks to @elrod for help with the @generated functions that make this extension just as dimension-agnostic as the standard implementation that uses SVectors for coordinates.


Hey all, just another quick update. I’ve added a basic implementation of the “reverse” Cholesky factor for the sparse precision matrix. In brief, the typical Vecchia likelihood is written as a bunch of small likelihoods (see the top post). But it can also be written in terms of a sparse precision matrix \Omega \approx \Sigma^{-1} such that

-\log \ell_{\text{Vecchia}}(\theta) = \frac{1}{2} \left( -\log |\Omega(\theta)| + \textbf{z}^T \Omega(\theta) \textbf{z} \right) + \frac{n}{2} \log(2\pi)

for (suitably re-permuted!!) data \textbf{z}. The functionality to compute this \Omega has been in the package for a while as Vecchia.precisionmatrix(vecc, params) (see above for information about the vecc configuration object).
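Spelled out with the demo objects from the top post, the \Omega-based evaluation looks something like the sketch below. Two hedges: I’m assuming dat is already in the config’s ordering (per the re-permutation caveat above), and I wrap \Omega in Hermitian for the sparse Cholesky:

using LinearAlgebra, SparseArrays

# Sparse precision approximation and the corresponding negative
# log-likelihood via a sparse Cholesky factorization (CHOLMOD):
Omega = Vecchia.precisionmatrix(vecc, sample_p)
F     = cholesky(Hermitian(Omega))
n     = length(dat)
0.5*(-logdet(F) + dot(dat, Omega*dat)) + 0.5*n*log(2*pi)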

Further, though, there are some really nice ways to directly compute a symmetric factor of \Omega, so that \Omega = U U^T with U upper triangular. The code for creating U is now available in the package as Vecchia.rchol(vecc, params), and it is a decent bit faster than building \Omega itself. Not only is this pretty convenient for likelihoods, it also makes things like simulation much easier. Probably the most important benefit, though, is that the U-based likelihood method does not require any sparse matrix factorizations (or any SparseMatrixCSC structures at all), so the AD compatibility is excellent. The constructor for U does a lot of mutation, so I can’t promise that it will work for all AD systems, but with ForwardDiff.jl at least it’s great. Finally, I’m hoping that this helps make the package composable with other things in the GP ecosystem: you can call this function, get a matrix factor out, and then do whatever other crazy MCMC thing you want with it.

By default, U=Vecchia.rchol(vecc, params) gives you back a struct that has custom application methods (U*x, U'*x, and U*U'*x with some in-place variants) and a log-determinant. If you use big conditioning sets, these custom methods can be a lot faster than the sparse methods, I would guess because working with the dense chunks means getting to use BLAS. But if you want the actual sparse matrix, you can just call sparse(U) and you’re good to go.
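Here is a quick sketch of what that enables, again with the demo objects from the top post (and again assuming the data ordering matches the config). The only package calls are rchol and sparse(U); the rest is standard sparse linear algebra:

using LinearAlgebra, SparseArrays

# Reverse Cholesky factor of the precision approximation, Omega = U*U':
U  = Vecchia.rchol(vecc, sample_p)
Us = sparse(U)   # the actual sparse upper triangular factor

# Factorization-free likelihood: log|Omega| = 2*sum(log, diag(Us)) and
# z'*Omega*z = ||Us'*z||^2, so no sparse factorization is needed:
n = length(dat)
0.5*(-2*sum(log, diag(Us)) + sum(abs2, Us'*dat)) + 0.5*n*log(2*pi)

# Simulation: for w ~ N(0, I), solving Us'*x = w gives x with covariance
# (U*U')^{-1}, which is approximately Sigma:
x = LowerTriangular(sparse(Us')) \ randn(n)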

As a quick reminder: this is all built on top of the beautiful JuliaFolds ecosystem. So you can really enjoy some good multi-threading speedup. BUT, remember to BLAS.set_num_threads(1) if you want to use multiple Julia threads (as in, julia -t5 or whatever)!

Happy to provide more information and/or references if anybody is interested, but this post is getting long enough for now.
