Implementing sparse Gaussian process model with Turing and Stheno

ElOceanografo · September 18, 2020, 11:19pm

I’ve built a Turing model for some non-Gaussian data which contains a latent Gaussian process for spatial autocorrelation. I’ve run it on a small simulated dataset to confirm that it works. Now I want to run it on my real data–which includes about 11,000 data points, so impractical for a dense GP.

A simple MWE is below (the real model more complicated, but the essence is the same, i.e. GP => transformation => parameter of data distribution). Is there a simple(ish) way to modify this to use the sparse inducing-point approximations available in Stheno instead of the full GP?

using Turing, Stheno, Distributions, Random

Random.seed!(1)
x = 1.0:10.0
y = rand(GP(Matern32(), GPC())(x))
λ = exp.(y)
z = rand.(Poisson.(λ))

@model function example_model(x, z)
    y ~ GP(Matern32(), GPC())(x)
    λ = exp.(y)
    z .~ Poisson.(λ)
end

mod = example_model(x, z)
chn = sample(mod, NUTS(), 100)

cpfiffer · September 19, 2020, 4:37pm

@willtebbutt might know.

ElOceanografo · September 21, 2020, 7:03pm

As an example of something I’ve tried that doesn’t work:

@model function example_model_sparse(x, z, xu)
    f = GP(Matern32(), GPC())
    y ~ MvNormal(zeros(length(x)), 2.0)
    Turing.@addlogprob! elbo(f(x, 1e-9), y, f(xu, 1e-9))
    λ = exp.(y)
    z .~ Poisson.(λ)
end

xu = 0:2:10 # Inducing points
mod_sparse = example_model_sparse(x, z, xu)
chn_sparse = sample(mod_sparse, NUTS(), 100)
ypost_sparse = Array(group(chn_sparse, :y))'

using Plots
plot(x, ypost_sparse, color=:black, alpha=0.5, label="")
plot!(x,  y, color=:red, linewidth=5, label="True y")

Which produces this:

In contrast, making the same plot with the chain generated using example_model above looks like this:

so the dense model appears to be generating a sensible posterior that tracks the true value of the latent field y.

I may be (probably am) misunderstanding what elbo is calculating, but the latent y variable doesn’t seem to be connected to the observed data here…

willtebbutt · September 22, 2020, 7:06am

Hmmm this is a bit tricky. You’ve used the elbo calculation roughly correctly, but I don’t think that it’s doing what you want overall.

The basic issue is that you’ll need to somehow avoid computing the logpdf and instead compute the elbo. One approach / hack would be to introduce a wrapper type that eats a Stheno.AbstractGP and a FiniteGP representing the pseudo-points, and is itself another Stheno.AbstractGP which, when contained in a FiniteGP implements logpdf by calling the elbo of the GP that it wraps at the pseudo points that it wraps.

Does that suggestion makes sense?

ElOceanografo · September 22, 2020, 4:58pm

Yes, that generally makes sense. So you’re thinking of something like this:

struct SparseFiniteGP{T1<:Stheno.FiniteGP, T2<:Stheno.FiniteGP} <: Stheno.AbstractGP
    fobs::T1
    finducing::T2
end

function Distributions.logpdf(f::T, y) where {T <: SparseFiniteGP}
    return elbo(f.fobs, y, f.finducing)
end

I tried using this new type in my Turing model as y ~ SparseFiniteGP(f(x), f(xu)). However, it complains that “ArgumentError: Right-hand side of a ~ must be subtype of Distribution or a vector of Distributions”.

Also, doesn’t Turing require that custom Distributions have a rand method?

willtebbutt · September 22, 2020, 7:04pm

What you’ve written is almost the thing that I had in mind. I had specifically imagined something along the lines of the following (modulo bugs, I’ve not tested this):


struct HackyGP{Tf<:AbstractGP, Tu<:FiniteGP} <: Stheno.AbstractGP
    f::Tf # process to be wrapped
    u::Tu # pseudo-points
end

function Stheno.logpdf(hacky_f::FiniteGP{<:HackyGP}, y::AbstractVector{<:Real})
    f = hacky_f.f.f
    u = hacky_f.f.u
    x = hacky_f.x
    Σy = hacky_f.Σy
    return Stheno.elbo(f(x, Σy), y, u)
end

ElOceanografo · September 29, 2020, 3:53am

Okay, I came back to this and after some hacking have got the following to run and produce basically sensible results.

struct SparseFiniteGP{T1<:Stheno.FiniteGP, T2<:Stheno.FiniteGP} <: Distributions.AbstractMvNormal
    fobs::T1
    finducing::T2
end

function Stheno.logpdf(f::T, y::AbstractArray{T1, 1}) where {T <: SparseFiniteGP, T1 <: Real}
    return elbo(f.fobs, y, f.finducing)
end

Base.length(f::SparseFiniteGP) = length(f.fobs)
Distributions._rand!(rng::Random._GLOBAL_RNG, f::SparseFiniteGP1, x::Array{T}) where {T<:Real} = rand(f.fobs)

@model function example_model_sparse(x, z, xu)
    f = GP(Matern32(), GPC())
    fsparse = SparseFiniteGP(f(x, 1e-6),  f(xu, 1e-6))
    y ~ fsparse
    λ = exp.(y)
    z .~ Poisson.(λ)
end

chn_sparse = sample(example_model_sparse(x, z, xu), NUTS(), 200)

On the left is the dense model’s posterior for λ, on the right is the sparse model’s posterior (with the locations of the inducing points shown by the vertical lines).

The posteriors do have some differences, but…with 900 observations, drawing 200 NUTS samples from the dense model takes something like 4 hours. Using the sparse model with 19 inducing points, the same sampling run takes about 60 seconds, i.e. a roughly 240-fold speedup.

@willtebbutt, do you think a type like this is worth having in Stheno? If so, I can start work on a PR…

willtebbutt · September 29, 2020, 8:30am

This looks great, a PR would be very much welcome!

ElOceanografo · September 29, 2020, 4:53pm

Great! I’ve opened an issue to continue this discussion at the repo: https://github.com/willtebbutt/Stheno.jl/issues/134

nucklass · November 23, 2020, 11:39pm

Hope this isn’t too tangential, since it’s a usage question, but I can’t find examples anywhere but here or the docs.

If I understand Titsias and Bauer correctly (big if), then the VFE approximation implemented in Stheno is generally sensitive to the number of inducing points, but not the location of the inducing points.

Is there any reason to choose regularly indexed points from the training set rather than a random sample?

willtebbutt · December 2, 2020, 3:45pm

Good question, sorry for the slow response.

then the VFE approximation implemented in Stheno is generally sensitive to the number of inducing points, but not the location of the inducing points.

I suppose that it depends on what you mean by sensitive. If you mean the quality of the posterior approximation and, therefore, the tightness of the ELBO, then it’s sensitive to both the number and location of the pseudo-points. However, the time taken to compute the bound doesn’t generally depend on the number of inducing points.

In terms of how to pick the locations of the pseudo-points – the optimal way to do this is something of an open question, but a couple of rules of thumb are

if you’ve got a time series with regularly-spaced observations, then regularly-spaced pseudo-points is probably fine.
in almost any other situation, you probably want to optimise the locations of the pseudo-points to maximise the ELBO (in addition to e.g. kernel parameters). The choice of initialisation is quite important here I believe (I’m not an expert on this bit). @theogf don’t you have a library of some kind that looks at inducing points initialisations?|

theogf · December 2, 2020, 3:58pm

Yeah I started to make GitHub - JuliaGaussianProcesses/InducingPoints.jl: Package for different inducing points selection methods (still not registered), to provide multiple ways to select inducing points given a set of points.
It’s definitely not finished, but you can already do a few things with it. You can pass a matrix of points and it will return an AbstractInducingPoints object which is just a wrapper around a ColVecs object (therefore directly compatible with KernelFunctions.jl).

This is not true, Bauer especially show that having overlapping points does not bring anything and that’s therefore the position matter. More specifically it’s going to highly influence the uncertainty of your model.

nucklass · December 5, 2020, 9:39pm

Thanks for the clarifications!

in almost any other situation, you probably want to optimise the locations of the pseudo-points to maximise the ELBO (in addition to e.g. kernel parameters). The choice of initialisation is quite important here I believe (I’m not an expert on this bit). @theogf don’t you have a library of some kind that looks at inducing points initialisations?

Let me step back and make sure I understand correctly, then. Assuming I’d still wish to run inference on the hyper parameters of the sparse GP, I would need to:

Choose the number of pseudo-points.
Initialize those points using something “reasonable”
Optimize the location of those pseudo inputs
Use some mcmc or variational methods to optimize the hyperparameters

I suppose a follow-up question if #3 and #4 can occur at the same time. I.E. If I were to add a length scale parameter and prior to it like:

@model function example_model_sparse(x, z, xu)
    len ~ LogNormal(0,1)
    f = GP(stretch(Matern32(),len), GPC())
    fsparse = SparseFiniteGP(f(x, 1e-6),  f(xu, 1e-6))
    y ~ fsparse
    λ = exp.(y)
    z .~ Poisson.(λ)
end

chn_sparse = sample(example_model_sparse(x, z, xu), NUTS(), 200)

Would I actually be optimizing the inducing point locations while sampling? Or would I need to first optimize the locations/elbo, then run mcmc on the hyperparameters once a choice of inducing points is found?
Hopefully this makes sense!

That makes sense!

willtebbutt · December 6, 2020, 1:44pm

While you could certainly separate things out as proposed in 1, 2, 3, 4, but it will generally yield sub-optimal results because the optimal location for the pseudo-points depends on the hyperparameters.

If you’re keen to do MCMC though, I think you might be stuck only doing 1, 2, and 4 though as it’s not clear to me how to optimise the inducing point locations in the context of MCMC. Of course, if you do this, you’re not going to be sampling from the exact posterior over the hypers, but some kind of approximation posterior because of the pseudo-point approximation.

So I think that, if you’re keen to use MCMC, you should just to 1, 2 and 4.

I was going to write something about jointly optimising pseudo-point locations and hyperparameters, but I then realised that we don’t really have the infrastructure for that in Turing. @theogf might have something in AugmentedGPs that would let you do that?

theogf · December 6, 2020, 5:07pm

Yes that’s exactly what my package does! Although I realized recently that some of my (hard-coded) gradients were wrong, and I proceeded to make a rewrite to make it completely AD. It should be available soon though!
It uses a copy of InducingPoints.jl in the internals

nucklass · December 23, 2020, 12:09am

Don’t know how I missed these. Thanks for the replies!

Looking forward to this.

Thanks for the clarification, the literature seems a bit sparse on such a processes as well, aside from a few specific cases. I suppose choosing the number of points based on computation time and then just comparing cross-validated errors on a few sets of fixed positions could be justified pragmatically, even if it’s not a particularly rigorous method. Or finding another justification in the context of the actual data, like regularly spaced time series, etc. Or work on proof, but that’s a bit above my pay grade. Any other tips when dealing with sparse gp’s? All the info has been very helpful so far.

Topic		Replies	Views
Speed up Gaussian Process model Probabilistic Programming turing , stheno	16	1351	June 9, 2022
Running TemporalGPs/Stheno sampling Probabilistic Programming gaussian-process , stheno	25	1333	May 29, 2021
Conceptual Gaussian Process Questions Probabilistic Programming	15	1768	August 17, 2020
Data stitching Probabilistic Programming turing	30	978	November 7, 2022
Gaussian Process Model with Turing Probabilistic Programming regression , turing , gaussian-process	84	10474	November 28, 2020

Implementing sparse Gaussian process model with Turing and Stheno

Related topics