Speed up Gaussian Process model

Hi all

I am interested in implementing a latent variable Gaussian process regression model with a pre-calculated (phylogenetic) distance matrix. The model below (mostly copied from here and based on this) runs and gets results in finite time (with variational inference at least).

using Turing, Random
using StatsFuns, LinearAlgebra
using Flux, Turing.Variational
Turing.setadbackend(:zygote)

n = 500
d = rand(LKJ(n, 1))              # stand-in for a pre-calculated distance matrix
x = rand(MvNormal(zeros(n), d))  # latent values
y = rand.(BernoulliLogit.(x))    # binary observations

@model function gp_mod(y, Dmat, ::Type{T}=Float64) where {T}

    n = length(y)
    η² ~ Exponential(2)
    ρ² ~ Exponential(0.5)

    # squared-exponential kernel applied element-wise to the distance matrix,
    # plus jitter and η² on the diagonal
    Σ = η² .* exp.(-ρ² .* Dmat .^ 2) + LinearAlgebra.I * (0.01 + η²)
    η ~ MvNormal(zeros(n), Symmetric(Σ))

    y .~ BernoulliLogit.(η)
end

model = gp_mod(y, d)
q0 = Variational.meanfield(model)
advi = ADVI(10, 10_000)
q = vi(model, advi, q0; optimizer=Flux.ADAM())

However, for my real model n = 3600, and η is 3600x70. Even at n = 1000 with η at 1000x3, the ETA to finish inference is several days. I found this post promising: Implementing sparse Gaussian process model with Turing and Stheno, but I couldn’t find anything in the latest Stheno docs about how to use SparseFiniteGP, and all the older code snippets I’ve found no longer work. It would be nice to implement the model in Turing, but I’m not tied to it. Basically, I would really appreciate any advice on how to majorly speed up inference; the constraint is that I still need to supply my own distance matrix and use latent variables within the model.

Thanks in advance!

Did you try AbstractGPs.jl? I know @sharanry changed some code we were working on from GaussianProcesses.jl to AbstractGPs.jl and got a 100x acceleration. I don’t know Stheno well enough to say whether it would give a similar boost, but AbstractGPs is fairly active IIRC.

Can you elaborate on what this means? Does this mean that you have 70 replicates of something with the distribution of y?

Also, I don’t really know anything about the Bayesian GP scene, so maybe the answer to this is obvious, but what exactly is your bottleneck? Is it working with the matrix, like doing the Cholesky factorization or something? If so, you could consider (shameless plug) a package like Vecchia.jl, which you could use to get a sparse symmetric factor of the inverse of Σ that is very structured and where every method is O(n) complexity with an attractively small prefactor and good thread parallelization. I don’t know what new methods you’d need to add to make it all “just work”, but I would think not too much.

If the issue is that the sampling/optimization/whatever isn’t very efficient, maybe you could benefit from a slightly more standard re-parameterization of your covariance matrix. Unless there is some specific reason for doing this, I don’t see why the scale parameter of the squared-exponential covariance should be the same as the diagonal perturbation/nugget/you know what I mean. Also, are you adding 0.01 because of numerical definiteness issues? You could probably completely mitigate that by switching to a covariance function that isn’t analytic at the origin, like a Matérn. If you picked a special smoothness case, your kernel would still just be a call to exp and a polynomial evaluation, so still plenty fast. But I don’t really know anything about the Bayesian stuff, so maybe that isn’t likely to be an issue.
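
To make that concrete, here is a rough sketch of what I mean (untested, and η², ρ, and σ² are just placeholder names, with σ² the separate nugget):

using LinearAlgebra

# Matérn 3/2: still just an exponential times a polynomial in the distance
matern32(d, ρ) = (1 + sqrt(3) * d / ρ) * exp(-sqrt(3) * d / ρ)

# η² scales the process; σ² is a separate diagonal nugget
function matern_cov(Dmat, η², ρ, σ²)
    K = η² .* matern32.(Dmat, ρ)
    return Symmetric(K + σ² * I)
end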

Thanks for the tip. Looks good. Will delve deeper into AbstractGPs in the morning.

Yes exactly this.

I will have a look at Vecchia, though I’ll admit much of what you said went over my head! Lots to learn! My (brief) introduction to GPs was more along the lines of “you have some relatively small data set, plus a distance matrix connecting your data points, so you should be able to fit some sensible function, and inference won’t take too long (because of the small data set)”. I’ve just now got a massive speed-up using Stheno.jl’s SparseFiniteGP, which “just works” with Turing.jl (yay Julia). For a simplified version of the model without latent variables (I’ve just started running the latent variable model now), MCMC sampling with good old Metropolis-Hastings is quick and accurate.
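
For anyone following along, the sampling side is just Turing’s usual interface; a toy model here purely to show the call pattern (the actual SparseFiniteGP model is in my post below):

using Turing

@model function toy()
    θ ~ Normal(0, 1)
end

# good old Metropolis-Hastings; swap in the real model and iteration count
chain = sample(toy(), MH(), 10_000)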


Depending on what exactly you need, at a glance it looks like it’s the type of model that MUSE might be applicable to (theoretically, and also in practice, since there’s a Turing interface). I gave an example here which gave big speedups over HMC; the docs are pretty good too, and show big speedups over VI as well. Please don’t hesitate to get in touch if you try it out and/or run into anything!

Nice. Will have a look in the morning, thanks.


Hey, that’s great that something works! If you’re happy with the speed of the matrix operations then no need to look into the Vecchia stuff. I have no idea what a sparse finite GP is and from googling really don’t see anything clarifying so I can’t really comment with any comparison. Most of the more basic approximation methods like Nystrom/inducing points/collocation points/whatever are actually special cases of Vecchia, and so if you end up needing more fine-scale expressiveness in your process maybe the Vecchia stuff would be worth revisiting. But there is presumably some learning and probably some coding overhead involved, so it seems sensible to not deal with it unless you decide it would be a helpful improvement. Feel free to shoot me a note should that happen.

This is not clear to me yet.

I think maybe Stheno is using AbstractGPs in the background…


The main purpose of Stheno.jl is to build some additional functionality on top of the GP type from AbstractGPs.jl.

If you’re doing GP stuff inside Turing.jl and need to utilise sparse approximations, then you’re probably doing the right thing by utilising the SparseFiniteGP type from Stheno.jl (which is a bit of an anomaly that should probably be moved to AbstractGPs.jl anyway).

However, for n=500 you probably don’t need to use a sparse / pseudo-point / inducing-point approximation. For example, I would imagine that something like

using AbstractGPs, KernelFunctions  # for GP, SEKernel, with_lengthscale

@model function gp_mod(y, x, ::Type{T}=Float64) where {T}

    η² ~ Exponential(2)
    ρ² ~ Exponential(0.5)

    # squared-exponential kernel on the raw inputs x; distances are recomputed internally
    f = GP(η² * with_lengthscale(SEKernel(), ρ²))
    η ~ f(x, 0.01 + η²)
    y .~ BernoulliLogit.(η)
end

might be appropriate. (I’ve not tested this locally, and it will recompute the distance matrix from x at each iteration – not sure if that’s what you want)

edit: apologies, I should have read your original post more closely, @EvoArt. For n=3600 you might want to use an approximation in order to speed things up, but I’d still be inclined to try without one first.


Just that @EvoArt has 70 iid samples of things that are ~BernoulliLogit.(η).

Same eta for all y or a different sample of eta for each y? The dimensions of eta are also unclear to me.

@EvoArt please correct me if some part of this is incorrect, but I would assume this means that they have 70 iid samples of things that are ~BernoulliLogit.(η), which is to say 70 iid samples each with a different η, i.e. 70 samples from the @model function gp_mod above (except with different data sizes). My interpretation of the text here is that each η is length 3600, so the total is 70 iid samples of 3600-length vectors with the distribution given by @model function gp_mod(...).

Again, though, this could be incorrect, so somebody correct me if I’m not reading something correctly.

That is exactly right. I have 70 binary vectors of length 3600. I assume each vector is from the same distribution, and I want to learn about that distribution. A more accurate representation of what I want to do is something like

using Turing, Distributions, AbstractGPs, Stheno

n = 3600
m = 70
c = 0.2
f = GP(LinearKernel(c))
fobs = f(rand(n), 0.1)                 # GP at the observed inputs
finducing = f(0:0.1:1, 0.1)            # GP at the inducing points
fxu = SparseFiniteGP(fobs, finducing)  # sparse (pseudo-point) approximation
x = fobs.x
y = rand(fxu, m)
z = rand.(Bernoulli.(cdf.(Normal(), y)))  # those 70 vectors aggregated into an array

@model function gp_mod(x, z, m)

    c ~ Exponential(1)
    f = GP(LinearKernel(c))
    fobs = f(x, 0.1)
    finducing = f(0:0.1:1, 0.1)
    fxu = SparseFiniteGP(fobs, finducing)
    η ~ filldist(fxu, m)               # n×m matrix of latent function values

    for i in 1:m
        z[:, i] .~ Bernoulli.(cdf.(Normal(), η[:, i]))
    end
end

model = gp_mod(x, z, m)

Choice of kernel function was arbitrary (just to produce some runnable code). In fact I will be doing something like

using KernelFunctions

struct MyKernel <: KernelFunctions.Kernel
    D::Array{Real}
    α::Real
    β::Real
end

(k::MyKernel)(x, y) = k.α + k.β * k.D[x, y]

to index into a pre-calculated distance matrix. But I’m still learning the possibilities/limitations of valid kernel functions.

Keep an eye out for whether you can set things up so that the covariance matrix of the Gaussian process has an (approximately) sparse inverse, through the choice of kernel and distance. Something like this makes sense if you have spatial data and expect the Gaussian field to be spatially somewhat smooth, but not too much.

If this is what your code currently is, it might account for a lot of the slowdown. I don’t really know what KernelFunctions.jl does under the hood, but in general structs with abstract type fields will really hurt performance. If you want it to be a flexible object and not fully specify the type, it would be much better to at least use some type parameters, as in the sketch below.
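
Something along these lines (an untested sketch, keeping your field names) is what I mean:

using KernelFunctions

# same kernel, but the field types become concrete once the struct is constructed
struct MyKernel{M<:AbstractMatrix{<:Real},T<:Real} <: KernelFunctions.Kernel
    D::M   # pre-computed distance matrix
    α::T
    β::T
end

(k::MyKernel)(x, y) = k.α + k.β * k.D[x, y]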

Unrelated to that, if D is your distance matrix and your x, y are integer indices just pulling out distances, then what you’ve written there isn’t a valid covariance function (simplest check: the diagonal isn’t the largest element in each row/column). I would think you’d get an error in a Cholesky factorization to warn you about that, but I’m mentioning it in case it saves you some debugging time.
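
Here’s a tiny two-point illustration with made-up numbers (not your data):

using LinearAlgebra

D = [0.0 1.0; 1.0 0.0]   # distance matrix: zero diagonal
α, β = 0.1, 1.0
K = α .+ β .* D          # "covariance" from k(x, y) = α + β * D[x, y]
isposdef(K)              # false: the eigenvalues are 2α + β and -β, so one is negative for β > 0
# cholesky(K) would accordingly throw a PosDefException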
