Faster gp fitting on million observations?

I have a million-observation regression problem with one outcome and several predictors, including a time variable. I would like to fit the over-time variation with a Gaussian process. I tried using AbstractGPs with DynamicHMC but it didn’t scale to that size: on 400 observations it took 8 minutes (code below).

I see there are a lot of options I haven’t tried: AdvancedHMC, EllipticalSliceSampling, VariationalInference, TemporalGPs, and probably others.

Rather than blindly trying them all, I’d like to ask for some guidance: what would you suggest I do here?

using AbstractGPs
using Distributions
using StatsFuns
using DynamicHMC
using LogDensityProblems
using Random


const n = 400
const x_train = collect(range(-5.0, 5.0; length=n))
const y_train = randn(n) + sin.(x_train .* 2);

function gp_loglikelihood(x, y)
    function loglikelihood(params)
        kernel =
            softplus(params[1]) * (Matern52Kernel() ∘ ScaleTransform(softplus(params[2])))
        f = GP(kernel)
        fx = f(x, 0.1)
        return logpdf(fx, y)
    return loglikelihood

function gp_posterior(x, y, p)
    kernel = softplus(p[1]) * (Matern52Kernel() ∘ ScaleTransform(softplus(p[2])))
    f = GP(kernel)
    return posterior(f(x, 0.1), y)

const loglik_train = gp_loglikelihood(x_train, y_train)
logprior(params) = logpdf(MvNormal(2, 1), params)

const n_samples = 2_000
const n_adapts = 1_000

# Log joint density
function LogDensityProblems.logdensity(ℓ::typeof(loglik_train), params)
    return ℓ(params) + logprior(params)

# The parameter space is two-dimensionaln
LogDensityProblems.dimension(::typeof(loglik_train)) = 2

# `loglik_train` does not allow to evaluate derivatives of
# the log-likelihood function
function LogDensityProblems.capabilities(::Type{<:typeof(loglik_train)})
    return LogDensityProblems.LogDensityOrder{0}()

@time mcmc_result = mcmc_with_warmup(
        ADgradient(:ForwardDiff, loglik_train),
end #481s

A million observations is quite a lot and, unfortunately, I don’t think that any of the options that you mention are likely to solve your problem. If you only had time as a variable I would suggest to use TemporalGPs, but it doesn’t really handle multiple predictors at the minute.

We have a project (hopefully) lined up for the summer that should implement SVGP, which is method that lets you utilise mini-batches of data and could scale to that size data set, but that won’t be availlable for a few months.

@theogf are you aware of anything we currently have that might scale to this size?

I suppose I can run a non-GP regression on the other predictors using another tool (MixedModels.jl?) and then run TemporalGPs on the residuals.

Though I’m not sure how to manage uncertainty in this two-stage process.

Maybe you could try Sparse GPs?

Hey yeah AugmentedGaussianProcesses.jl works fine with huge data. I had experiments with 10 millions points.
However you would need to use inducing points, i.e. SVGP, and use StochasticVI. For the kernel parameters they would be optimised to get the best point estimate using empirical Bayes.
All this is done automatically in the package

1 Like

I suppose I can run a non-GP regression on the other predictors using another tool and then run TemporalGPs on the residuals.

Yup, this could definitely work.

Though I’m not sure how to manage uncertainty in this two-stage process.

You should just be able to continue to use HMC? If, for example, if your non-GP regressor is a linear regressor, just run HMC on its weights etc.

One thing to watch out for with the SVGP stuff is your time variable, because sparse approximations tend not to be partcularly well-suited to time-series problems. The number of pseudo-points needed to maintain a good approximation tends to increase linearly in the number of points in time that you observe.

This isn’t to say that you shouldn’t try SVGP, but rather that it might not be sufficient for your use case.

Are you expecting quite a lot of variation through time in the quantity that you’re modelling?

If you have a million of observations you can afford your Gaussian prior to have only a bit of smoothness. Is your model really representative of the actual problem?

const y_train = rand(n) + sin.(x_train .* 2);

Is the U[0,1] noise model here correct or should this be randn? Then your live is easier.

1 Like

Also, how many predictors do you have? There are some great approximation algorithms based on grids but they are limited to low-dimensional problems…

Are you willing to just do maximum likelihood to estimate parameters? KernelMatrices.jl (disclaimer: my project) implements hierarchical matrices with a particular compression method so that you could get a log-likelihood, gradient, and both expected and observed information matrices in quasilinear complexity. You could probably crank that code to 1M points, at least if you have a reasonable amount of RAM. There are several example files that show interfacing with different optimizers. Although I’m trying to move people away from the trust region method there and instead to the one in GPMaxlik.jl for some technical reasons that we probably don’t need to get into unless you’re interested in that.

The matrix compression method is a pretty simple Nystrom approximation (I now gather also sometimes called “full scale approximation”), so the approximation quality really depends on some assumptions like the kernel being smooth away from the origin and the points being ordered in some coherent way so that off-diagonal blocks at least sort of correspond to the covariance of the random field between well-separated regions of the domain. But a KD-tree sort seems to work pretty well, and I see you’re using a Matern kernel with 2 mean-square derivatives, so that should all be fine.

EDIT: I got curious and checked out what the times and RAM needs would look like for a 1M matrix at KM’s current level of optimization, which is perhaps decent but not great. On an intel core i5-11600K using 5 cores, I can assemble a 1M x 1M matrix with sensible hierarchical settings in about 1 minute, and factorize it in a bit less than that. But you’d probably want at least 24GB of RAM, in my estimation, if you’re going to parallelize over cores like I did, because that really blows up the memory use for some reason. I am having some really strange and unhappy perf issues with this CPU, so perhaps your timings would be better than mine. But in any case, definitely possible and numerically stable enough to be an option if you’re willing to not use fancy Bayesian methods.

1 Like

Not too much, but some. To quantify a bit: There are about 300-1000 observations per day for some years. I think the overall standard deviation is 20x the standard deviation of the daily means. But during some periods there is a marked change in the value for a while, which I would like to capture.

randn is what I meant, right.

In addition to the temporal variable, there are 2-5 continuous predictors and another 2-5 categorical ones.

I have no experience trying something like this, but this approach seemed rather interesting: it uses a simple subsampling approach based on graphons. (It’s especially interesting to me; my background is mainly network analysis, and I was a bit surprised to see graphons used outside of that context.)

So if you can get an unbiased estimate of the partial derivatives of you log-likelihood/log posterior density then you are in business to use ZigZagBoomerang.jl

For example for a model

n = 1_000_000
D = 2
σ2 = 1.0
γ0 = 0.1
d = D + n
β = randn(D)
X = [randn(n) for i in 1:D]
t = range(0.0, 1.0, length=n)
x = sin.(2pi*t)
y = sum(X .* β) + x + sqrt(σ2)*randn(n)

and a Gaussian random walk smoothing prior on the time component x with banded precision matrix

1000000×1000000 SparseMatrixCSC{Float64, Int64} with 4999994 stored entries:

I can estimate unbiasedly this partial derivative of the negative log density of the posterior nicely with

function ∇ϕ(u, i_, Γ, X, y, K = 20)
    β = u[1:D]
    x = @view u[D+1:end]
    if i_ > D
        i = i_ - D
        a = ZigZagBoomerang.idot(Γ, i, x) + (x[i] - y[i])/σ2
        for di in 1:D
            a += β[di]*X[di][i]/σ2  
        return a
        i = i_
        a = γ0*β[i]
        s = n/K/σ2
        for k in 1:K # use an unbiased estimate of the log-likelihood sampling K observations
            j = rand(1:n)
            a += s*X[i][j]*(x[j] + X[1][j]*β[1] + X[2][j]*β[2] - y[j])
        return a  

so you can use the Zig-Zag sampler. A bit of more work is needed to find an pre-estimate of the posterior location and scale to make the sampling efficient. But then

julia> include("million.jl")
Progress:  55%|█████████████████████████████▏                       |  ETA: 0:04:00

and you have the beginning of a nice chain…

You need to sample a bit more to get the precise estimates, just look at the scale of the y-axis to get an idea how concentrated that posterior for the parameters is.

This was brought to you by the department of advertisement of ZigZagBoomerang.jl

I put the code here: ZigZagBoomerang.jl/million.jl at master · mschauer/ZigZagBoomerang.jl · GitHub

You need #master and may want to start julia with 4 threads

julia -t 4 --check-bounds=no

@mschauer This is very cool and it executes quickly on my data.

I am challenged to use it myself because I’m likely to do it wrong. I don’t have confidence that I can correctly build big derivative functions like the one excerpted above, or keep track of so many objects and how to use them as in lines like

spdmp(∇ϕ, t0, x0, θ0, T, c, G, Z, Γ, X, y, (xpost, βpost, bias), K; structured=true, adapt=true, progress=true)

Is it possible to use tools like Soss to build the model or Zygote to differentiate it, to take some of the delicate parts out of my error-prone inexpert hands?


@jzr Thank you, support for Soss.jl and perhaps other things is under construction, so you are essentially working with a back-end directly and I see it feels.

In any case you can use ForwardDiff to get your partial derivatives of the log-likelihood, see ZigZagBoomerang.jl/example1.jl at master · mschauer/ZigZagBoomerang.jl · GitHub

But… in that case I would simplify your model, there is no need to have 1_000_000 latent points, I would change a bit the model, e.g. by a piecewise constant process with “only” 10_000 grid points. Then you can also afford to use autodiff instead of a hand-crafted estimator of the neg-log-likelihood.

But while we are at it: What is what?

∇ϕ(x, i, args...) – partial derivative i of negative log likelihood
t0 – starting time of the process
x0 – starting point of the process
θ0 – starting momenta of the process
T – end time of the process
c – a vector of small numbers just take 1e-7ones(length(x0)), the algorithm with adapt=true increases to the right choice
G – see below
Z – The dynamics of the process (e.g. ZigZag)
args... – the rest are args for ∇ϕ

G is the tricky one: G[i] == i => J tells on which coordinates x[j] for j in J the expression ∇ϕ(x, i) actually depends. That’s known to the probabilistic program generating the density, but doing it by hand is tedious. I’ll think I can compute it.

1 Like

If you’re willing to roll your own solver, there are methods like this or this that handle million data points without variational approximations. However, there are no easy to use implementation of these methods yet. So you’ll have to implement them yourself.

1 Like