Implementing sparse Gaussian process model with Turing and Stheno

I’ve built a Turing model for some non-Gaussian data which contains a latent Gaussian process for spatial autocorrelation. I’ve run it on a small simulated dataset to confirm that it works. Now I want to run it on my real data–which includes about 11,000 data points, so impractical for a dense GP.

A simple MWE is below (the real model more complicated, but the essence is the same, i.e. GP => transformation => parameter of data distribution). Is there a simple(ish) way to modify this to use the sparse inducing-point approximations available in Stheno instead of the full GP?

using Turing, Stheno, Distributions, Random

Random.seed!(1)
x = 1.0:10.0
y = rand(GP(Matern32(), GPC())(x))
λ = exp.(y)
z = rand.(Poisson.(λ))

@model function example_model(x, z)
    y ~ GP(Matern32(), GPC())(x)
    λ = exp.(y)
    z .~ Poisson.(λ)
end

mod = example_model(x, z)
chn = sample(mod, NUTS(), 100)

@willtebbutt might know.

As an example of something I’ve tried that doesn’t work:

@model function example_model_sparse(x, z, xu)
    f = GP(Matern32(), GPC())
    y ~ MvNormal(zeros(length(x)), 2.0)
    Turing.@addlogprob! elbo(f(x, 1e-9), y, f(xu, 1e-9))
    λ = exp.(y)
    z .~ Poisson.(λ)
end

xu = 0:2:10 # Inducing points
mod_sparse = example_model_sparse(x, z, xu)
chn_sparse = sample(mod_sparse, NUTS(), 100)
ypost_sparse = Array(group(chn_sparse, :y))'

using Plots
plot(x, ypost_sparse, color=:black, alpha=0.5, label="")
plot!(x,  y, color=:red, linewidth=5, label="True y")

Which produces this:
image

In contrast, making the same plot with the chain generated using example_model above looks like this:

image

so the dense model appears to be generating a sensible posterior that tracks the true value of the latent field y.

I may be (probably am) misunderstanding what elbo is calculating, but the latent y variable doesn’t seem to be connected to the observed data here…

Hmmm this is a bit tricky. You’ve used the elbo calculation roughly correctly, but I don’t think that it’s doing what you want overall.

The basic issue is that you’ll need to somehow avoid computing the logpdf and instead compute the elbo. One approach / hack would be to introduce a wrapper type that eats a Stheno.AbstractGP and a FiniteGP representing the pseudo-points, and is itself another Stheno.AbstractGP which, when contained in a FiniteGP implements logpdf by calling the elbo of the GP that it wraps at the pseudo points that it wraps.

Does that suggestion makes sense?

Yes, that generally makes sense. So you’re thinking of something like this:

struct SparseFiniteGP{T1<:Stheno.FiniteGP, T2<:Stheno.FiniteGP} <: Stheno.AbstractGP
    fobs::T1
    finducing::T2
end

function Distributions.logpdf(f::T, y) where {T <: SparseFiniteGP}
    return elbo(f.fobs, y, f.finducing)
end

I tried using this new type in my Turing model as y ~ SparseFiniteGP(f(x), f(xu)). However, it complains that “ArgumentError: Right-hand side of a ~ must be subtype of Distribution or a vector of Distributions”.

Also, doesn’t Turing require that custom Distributions have a rand method?

What you’ve written is almost the thing that I had in mind. I had specifically imagined something along the lines of the following (modulo bugs, I’ve not tested this):


struct HackyGP{Tf<:AbstractGP, Tu<:FiniteGP} <: Stheno.AbstractGP
    f::Tf # process to be wrapped
    u::Tu # pseudo-points
end

function Stheno.logpdf(hacky_f::FiniteGP{<:HackyGP}, y::AbstractVector{<:Real})
    f = hacky_f.f.f
    u = hacky_f.f.u
    x = hacky_f.x
    Σy = hacky_f.Σy
    return Stheno.elbo(f(x, Σy), y, u)
end
1 Like

Okay, I came back to this and after some hacking have got the following to run and produce basically sensible results.

struct SparseFiniteGP{T1<:Stheno.FiniteGP, T2<:Stheno.FiniteGP} <: Distributions.AbstractMvNormal
    fobs::T1
    finducing::T2
end

function Stheno.logpdf(f::T, y::AbstractArray{T1, 1}) where {T <: SparseFiniteGP, T1 <: Real}
    return elbo(f.fobs, y, f.finducing)
end

Base.length(f::SparseFiniteGP) = length(f.fobs)
Distributions._rand!(rng::Random._GLOBAL_RNG, f::SparseFiniteGP1, x::Array{T}) where {T<:Real} = rand(f.fobs)

@model function example_model_sparse(x, z, xu)
    f = GP(Matern32(), GPC())
    fsparse = SparseFiniteGP(f(x, 1e-6),  f(xu, 1e-6))
    y ~ fsparse
    λ = exp.(y)
    z .~ Poisson.(λ)
end

chn_sparse = sample(example_model_sparse(x, z, xu), NUTS(), 200)

On the left is the dense model’s posterior for λ, on the right is the sparse model’s posterior (with the locations of the inducing points shown by the vertical lines).

image image

The posteriors do have some differences, but…with 900 observations, drawing 200 NUTS samples from the dense model takes something like 4 hours. Using the sparse model with 19 inducing points, the same sampling run takes about 60 seconds, i.e. a roughly 240-fold speedup.

@willtebbutt, do you think a type like this is worth having in Stheno? If so, I can start work on a PR…

4 Likes

This looks great, a PR would be very much welcome!

1 Like

Great! I’ve opened an issue to continue this discussion at the repo: https://github.com/willtebbutt/Stheno.jl/issues/134

3 Likes

Hope this isn’t too tangential, since it’s a usage question, but I can’t find examples anywhere but here or the docs.

If I understand Titsias and Bauer correctly (big if), then the VFE approximation implemented in Stheno is generally sensitive to the number of inducing points, but not the location of the inducing points.

Is there any reason to choose regularly indexed points from the training set rather than a random sample?

Good question, sorry for the slow response.

then the VFE approximation implemented in Stheno is generally sensitive to the number of inducing points, but not the location of the inducing points.

I suppose that it depends on what you mean by sensitive. If you mean the quality of the posterior approximation and, therefore, the tightness of the ELBO, then it’s sensitive to both the number and location of the pseudo-points. However, the time taken to compute the bound doesn’t generally depend on the number of inducing points.

In terms of how to pick the locations of the pseudo-points – the optimal way to do this is something of an open question, but a couple of rules of thumb are

  1. if you’ve got a time series with regularly-spaced observations, then regularly-spaced pseudo-points is probably fine.
  2. in almost any other situation, you probably want to optimise the locations of the pseudo-points to maximise the ELBO (in addition to e.g. kernel parameters). The choice of initialisation is quite important here I believe (I’m not an expert on this bit). @theogf don’t you have a library of some kind that looks at inducing points initialisations?|

Yeah I started to make https://github.com/JuliaGaussianProcesses/InducingPoints.jl (still not registered), to provide multiple ways to select inducing points given a set of points.
It’s definitely not finished, but you can already do a few things with it. You can pass a matrix of points and it will return an AbstractInducingPoints object which is just a wrapper around a ColVecs object (therefore directly compatible with KernelFunctions.jl).

This is not true, Bauer especially show that having overlapping points does not bring anything and that’s therefore the position matter. More specifically it’s going to highly influence the uncertainty of your model.

Thanks for the clarifications!

  1. in almost any other situation, you probably want to optimise the locations of the pseudo-points to maximise the ELBO (in addition to e.g. kernel parameters). The choice of initialisation is quite important here I believe (I’m not an expert on this bit). @theogf don’t you have a library of some kind that looks at inducing points initialisations?

Let me step back and make sure I understand correctly, then. Assuming I’d still wish to run inference on the hyper parameters of the sparse GP, I would need to:

  1. Choose the number of pseudo-points.
  2. Initialize those points using something “reasonable”
  3. Optimize the location of those pseudo inputs
  4. Use some mcmc or variational methods to optimize the hyperparameters

I suppose a follow-up question if #3 and #4 can occur at the same time. I.E. If I were to add a length scale parameter and prior to it like:

@model function example_model_sparse(x, z, xu)
    len ~ LogNormal(0,1)
    f = GP(stretch(Matern32(),len), GPC())
    fsparse = SparseFiniteGP(f(x, 1e-6),  f(xu, 1e-6))
    y ~ fsparse
    λ = exp.(y)
    z .~ Poisson.(λ)
end

chn_sparse = sample(example_model_sparse(x, z, xu), NUTS(), 200)

Would I actually be optimizing the inducing point locations while sampling? Or would I need to first optimize the locations/elbo, then run mcmc on the hyperparameters once a choice of inducing points is found?
Hopefully this makes sense!

That makes sense!

While you could certainly separate things out as proposed in 1, 2, 3, 4, but it will generally yield sub-optimal results because the optimal location for the pseudo-points depends on the hyperparameters.

If you’re keen to do MCMC though, I think you might be stuck only doing 1, 2, and 4 though as it’s not clear to me how to optimise the inducing point locations in the context of MCMC. Of course, if you do this, you’re not going to be sampling from the exact posterior over the hypers, but some kind of approximation posterior because of the pseudo-point approximation.

So I think that, if you’re keen to use MCMC, you should just to 1, 2 and 4.

I was going to write something about jointly optimising pseudo-point locations and hyperparameters, but I then realised that we don’t really have the infrastructure for that in Turing. @theogf might have something in AugmentedGPs that would let you do that?

Yes that’s exactly what my package does! Although I realized recently that some of my (hard-coded) gradients were wrong, and I proceeded to make a rewrite to make it completely AD. It should be available soon though!
It uses a copy of InducingPoints.jl in the internals

Don’t know how I missed these. Thanks for the replies!

Looking forward to this.

Thanks for the clarification, the literature seems a bit sparse on such a processes as well, aside from a few specific cases. I suppose choosing the number of points based on computation time and then just comparing cross-validated errors on a few sets of fixed positions could be justified pragmatically, even if it’s not a particularly rigorous method. Or finding another justification in the context of the actual data, like regularly spaced time series, etc. Or work on proof, but that’s a bit above my pay grade. Any other tips when dealing with sparse gp’s? All the info has been very helpful so far.