Gaussian Process Model with Turing

This is pretty much code gold. I have some regression methods that someone else might like using. Is there a place to put handy snippets of probabilistic programming models?

2 Likes

That’s very kind, thank you :slight_smile:

For Soss models, I’d love to flesh out GitHub - cscherrer/SossExamples.jl . PRs welcome!

3 Likes

Thanks! That does look useful. I’ve been working my way through this one (which has the benefit that it’s free), but the math quickly goes over my head, and I’m finding it somewhat challenging to go from book examples to practical code (though I have to say, the design of these packages does make it easier to fiddle and get something working even if I only have a vague idea why).

Carl’s book is a really good introduction to GPs in my opinion.

1 Like

Yeah, it’s far more comprehensible than many other things I’ve looked at. The deficit is entirely on me, and the fact that the last math class I took was during my freshman year of college… 18 years ago… :scream:

Question for the KernelFunctions/AbstractGPs authors, though if anyone else has any ideas, feel free to chime in too. What’s the usual procedure for diagnosing/fixing PosDef errors during estimation of hyperparameters? I’ve got a fairly large model (~6000 observations), and it seems somewhat random whether I get the error or not, so I’ve just been repeating DynamicHMC sample calls until it doesn’t error. Obviously this is not sustainable and rather uninformed. Is the conventional wisdom just to increase the value added to the diagonal of the covariance matrix until I stop getting errors, or is this an indication that my choice of prior for the hyperparameters is off? Again, awesome work on both packages!
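For context, here is roughly what I mean by “the value added to the diagonal” (toy inputs and a made-up jitter value, not my actual model):

using KernelFunctions, LinearAlgebra

x = rand(100)
K = kernelmatrix(SEKernel(), reshape(x, :, 1), obsdim=1)

# cholesky can throw a PosDefException purely from round-off error;
# adding a small "jitter" to the diagonal is the usual band-aid
jitter = 1e-6
cholesky(Symmetric(K + jitter * I))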

1 Like

It’s not clear to me whether a singular covariance is a problem for the model, or just for the way it’s currently being computed. Maybe PositiveFactorizations.jl could help with this?
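Rough sketch of what I mean (untested, with a toy matrix standing in for the covariance):

using LinearAlgebra, PositiveFactorizations

K = [2.0 1.0; 1.0 0.4999999]  # toy matrix that is just barely not positive definite

# cholesky(Symmetric(K)) throws a PosDefException here, whereas
# PositiveFactorizations nudges it to a nearby positive-definite factorization
F = cholesky(Positive, K)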

1 Like

Ah, actually it looks like the KernelFunctions team already answered my question. Check out https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/blob/6bca6a17284dcd12fc4eca945435697e50a524b3/src/matrix/kernelpdmat.jl if anyone else is running into this issue. It still errors after switching from kernelmatrix to kernelpdmat in the model specification, but at least the error message is much more informative :sweat_smile: . Looks like I might need to fiddle with some priors or reformat the data to make the feature matrix more sparse.
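For reference, the swap was roughly this (simplified from my actual model):

using KernelFunctions, LinearAlgebra

k = SEKernel()
X = rand(100, 2)  # feature matrix, one observation per row

# before: plain covariance matrix, with jitter added by hand
K = kernelmatrix(k, X, obsdim=1) + 1e-6 * I

# after: kernelpdmat adds jitter for you and gives a much more informative
# error if it still can't make the matrix positive definite
K = kernelpdmat(k, X, obsdim=1)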

2 Likes

Hey @nucklass, one of the authors here. I designed kernelpdmat exactly for this purpose, but it may still not be enough in some situations. What you want to look for is whether your kernel matrix is low-rank, i.e. whether you have columns that are numerically identical. This is usually due to two observations being extremely close to each other, which can happen, for example, with very large lengthscales (I imagine this is what is happening in your optimization). To be honest, there is no really good solution for this… Sparse GPs tend to avoid the problem, and given that you have a lot of observations, that might be the way to go.
EDIT: We should probably write a full paragraph about “good GP practices” somewhere in the docs :smiley:
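A quick way to check whether this is what’s happening (toy example with two numerically identical inputs):

using KernelFunctions, LinearAlgebra

x = [0.0, 1.0, 1.0 + 1e-12, 2.0]  # two inputs that are identical up to round-off
K = kernelmatrix(SEKernel(), reshape(x, :, 1), obsdim=1)

@show rank(K), size(K, 1)             # rank < size means the matrix is numerically low-rank
@show minimum(eigvals(Symmetric(K)))  # eigenvalues at ~0 (or slightly negative)
                                      # are what trigger the PosDef errors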

3 Likes

Thanks for the reply!

This is looking more and more like the direction I should go. I’m only familiar with the simplest sparse approximations, i.e. Subsets of Data or Subsets of Regressors. Would it be entirely foolish to try to combine sparse approximations with MCMC sampling of the kernel hyperparameters? That is, try to extend the Soss + Turing programs above with some sparse approximation of the latent function. From a cursory glance it seems like quite a few of the sparse methods attempt to optimize hyperparameters and inducing points simultaneously, and I’m not quite sure what estimation methods are being used in each instance. From a performance standpoint, I run a lot of Bayesian models, so waiting a few days for sampling to finish is nothing new. Running an analysis that someone could reproduce/rerun is rather important, though. Also, if Augmented Gaussian Processes implements any of this already, thanks in advance.
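To check my understanding of the Subset of Regressors idea I’d be building on, the rough shape (toy data and hand-picked inducing points, not any particular package’s API) is:

using KernelFunctions, LinearAlgebra

k = SEKernel()
t = sort(rand(1000))                 # all timepoints
z = collect(range(0, 1; length=20))  # inducing points (a much smaller set)

Kzz = kernelmatrix(k, reshape(z, :, 1), obsdim=1)
Ktz = kernelmatrix(k, reshape(t, :, 1), reshape(z, :, 1), obsdim=1)

# Nystrom / Subset-of-Regressors approximation: K ≈ Ktz * inv(Kzz) * Ktz'
# so only the small 20x20 matrix needs factorizing to build the approximation
Kapprox = Ktz * (cholesky(Symmetric(Kzz + 1e-6 * I)) \ Ktz')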

FYI, there’s a PR to Stheno right now that will let you use sparse GPs in Turing. It’s not battle-tested yet, but I’ve got it to run for some simple problems. And here’s the thread that led to it, if you didn’t see it:

2 Likes

Oh that’s very cool! I’ll definitely give this a try. I’ve heard Stheno’s kernel interface is going to switch over to KernelFunctions.jl fairly soon, so I guess everything’s coming together :smile:.

Alright, I’m back! This is my focus until I get something working. To reiterate the goals here, because this post has meandered a bit (which is great, I’m learning a ton!), the situation is this:

  • I have longitudinal patient biosamples. That is, I’ve taken measurements from multiple people (subjects) over time.
  • Many of the measurements are continuous; in this particular case, I’m looking at the dependent variable bug, and one of the predictor variables is nutrient.
  • bug typically changes over time, but not in a single direction. We expect measurements for a given subject to be highly correlated for samples close in time, less so for samples more distant in time.
  • We expect measurements from different subjects to be completely independent of one another (that is, there’s no relationship between bug1 for subj1 at date1 and bug1 for subj2 at date1), akin to a random effect.
  • There are other categorical and boolean variables that do not tend to change for a particular subject (e.g. gender, obesity).

We would like to model the measurement of bug over time as a Gaussian process, where (in the language of LMs) things like obese are fixed effects, subject is a random effect, and the covariance of nutrient over time is squared exponential.
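Written out (my rough notation, with x standing in for nutrient and t for the sample timepoint), I think that’s:

$$
y \sim \mathcal{N}\left(\mu,\; K + \sigma^2 I\right), \qquad
\mu_i = a_{\mathrm{subject}(i)} + \beta\, x_i, \qquad
K_{ij} = \alpha^2 \exp\left(-\frac{(t_i - t_j)^2}{2\rho^2}\right)
$$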

After doing some additional reading, I think what I really want is to encode at least subject, and possibly obese, as index variables, a la https://github.com/TuringLang/Turing.jl/pull/1422. But I’m having issues figuring this out, and I think it has to do with a misunderstanding of how different things interact. So, first things first, I tried building a slightly extended version of the demo model in #1422, such that the g variable is an index variable and x is a typical continuous predictor (also using the linreg tutorial as inspiration). This seems to work:

using Turing

# Dummy data: three groups with different offsets
x = vcat(randn(10), randn(10).+10, randn(10).-5)
y = vcat(randn(10), randn(10).+10, randn(10).-5) .* 2
g = vcat(ones(Int, 10), ones(Int, 10)*2, ones(Int, 10)*3)

@model function demo(x, y, g)
    ng = length(unique(g))
    # group coefficient priors (one intercept per group, indexed by g)
    a ~ filldist(Normal(), ng)

    # x coefficient prior
    b ~ Normal()

    mu = a[g] + b * x

    y .~ Normal.(mu)
end

model = demo(x,y,g);
sample(model, NUTS(0.8), 1_000)
Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat  
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64  
                                                                            
        a[1]   -0.6291    0.2943     0.0132    0.0117   380.2325    1.0030  
        a[2]    2.6719    0.8799     0.0394    0.0438   174.4973    1.0048  
        a[3]   -2.0860    0.4941     0.0221    0.0229   175.6049    1.0010  
           b    1.7736    0.0921     0.0041    0.0048   167.5996    1.0039

So now imagine I have a tp argument that includes timepoints. How do I incorporate the above specification into a GP? The following code runs, but I’m not 100% sure that it’s actually doing what I’m intending…

using Turing, KernelFunctions, LinearAlgebra

tp = repeat(1:10, 3)  # timepoints 1-10 for each of the three groups

# squared-exponential kernel with amplitude alpha and lengthscale rho
sekernel(alpha, rho) = alpha^2 * KernelFunctions.transform(SEKernel(), sqrt(0.5)/rho)

@model function demo(x, y, g, tp, jitter=1e-6)
    ng = length(unique(g))
    # group coefficient priors
    a ~ filldist(Normal(), ng)

    # x coefficient prior
    b ~ Normal()

    # gp priors
    sig2 ~ LogNormal(0, 1)
    alpha ~ LogNormal(0, 0.1)
    rho ~ LogNormal(0, 1)

    # lm specs
    mu = a[g] + b * x

    # gp specs
    kernel = sekernel(alpha, rho)  # covariance function
    K = kernelmatrix(kernel, reshape(tp, length(tp), 1), obsdim=1)  # cov matrix
    K += LinearAlgebra.I * (sig2 + jitter)  # observation noise + jitter on the diagonal

    y ~ MvNormal(mu, K)  # one multivariate observation, so ~ rather than .~
end

model = demo(x,y,g,tp);
sample(model, NUTS(0.8), 1_000)
Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat  
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64  
                                                                            
        a[1]   -0.4016    0.7114     0.0318    0.0164   746.6413    0.9994  
        a[2]    0.5394    0.9962     0.0446    0.0565   430.2212    0.9982  
        a[3]   -0.6955    0.8198     0.0367    0.0488   503.1865    0.9993  
       alpha    0.9929    0.0944     0.0042    0.0012   653.7770    0.9980  
           b    1.9989    0.1195     0.0053    0.0074   382.8640    0.9982  
         rho    1.7239    2.2199     0.0993    0.0645   376.1499    1.0025  
        sig2    5.9109    1.6832     0.0753    0.0698   497.3395    0.9982  

Does this model make sense / fit what I described above?

This also runs on my real data (no PosDefException, yay!), where the number of unique g values is ~300, but it is suuuuper slow. Any suggestions on this front would be greatly appreciated.

1 Like

Bump. Also some updates.

I’ve been trying to do a bit of benchmarking and optimizing on the model above, following the performance tips. I’ve already avoided loops, and when I do @code_warntype model = demo(x,y,g,tp), everything is blue (though this is my first time using that macro, so I’m not sure that’s right).

Then I experimented with different backends. First thing to note: Zygote gives me an error (something about not being able to iterate nothing). I’m assuming this has to do with this statement:

In case of :tracker and :zygote, it is necessary to avoid loops for now. This is mainly due to the reverse-mode AD backends Tracker and Zygote which are inefficient for such cases. ReverseDiff does better but vectorized operations will still perform better.

I don’t explicitly use a loop; is it possible that the GP is using one implicitly?
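For reference, this is how I’ve been switching backends (I believe Turing.setadbackend is the right call for the Turing version I’m on; apologies if that API has moved):

using Turing

Turing.setadbackend(:forwarddiff)   # the default, forward-mode
Turing.setadbackend(:reversediff)   # reverse-mode, via ReverseDiff
Turing.setadbackend(:zygote)        # this is the one that errors for me

chain = sample(model, NUTS(0.8), 100)  # reusing `model` from the GP version above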

My real data is 921 rows; each subject has 1-4 timepoints, and there are 305 total subjects. When I do the first 50 rows, 100 samples for the chain, and ForwardDiff:

  • NUTS(0.8) takes 17.6 sec
  • HMC(0.1, 10) takes 12.9 sec
  • HMC(0.1, 20) takes 6.7 sec

The first 50 rows contain 19 subjects. When I bump up to 100 rows (36 subjects), that last one takes 51 sec (almost 10 times as long), and using ReverseDiff takes about 10.5 min. At 150 rows (53 subjects), it’s 230 sec. If things keep scaling like this, I’m looking at days for the whole set. Is there a better way to do this?

I also made an attempt to do what was suggested earlier and add a separate subject GP with a really short length scale - it’s possible I did that wrong, but it did not improve things much.

Note that the relevant benchmark is probably not time/sample, but time/effective sample size.
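Concretely, something like this (rough sketch; summarystats prints the ESS column along with the other diagnostics):

using Turing

# wall-clock time for the whole run, reusing `model` from above
t = @elapsed chain = sample(model, NUTS(0.8), 1_000)

# the summary includes an `ess` column; the number to compare across settings
# is roughly minimum(ess) / t rather than 1_000 / t
display(summarystats(chain))
@show t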

2 Likes

Sorry to hijack this thread, but this is moving towards a more general question I’ve had about Bayesian modelling in general and Turing specifically for a while: how can I form a reasonable expectation, in advance, of how long a model will take to estimate? Is it about the number of observations, the number of parameters, or the structure of the model itself? And what is effective sample size in this context?

For someone only trained in classical econometrics, where one estimates mostly linear models that take a fraction of a second to estimate, it’s hard to guess when I could feasibly switch my analysis strategy over to Bayesian.

1 Like

Generally, it comes down to:

  1. the evaluation time of the posterior (with gradients, if you are using those), which does depend on the number of observations etc. (unless you have summary statistics), and
  2. how well-behaved the log posterior is (e.g. non-uniform curvature is difficult to deal with).

See the Stan manual for an explanation of ESS.

3 Likes

For example, in this situation a kernel matrix K of size length(tp) x length(tp) is created and passed to MvNormal. MvNormal then needs to invert (factorize) this matrix, which has complexity O(length(tp)^3) and is usually the bottleneck with GPs.
So basically every sampling step costs at least O(N^3), which for N=300 is not negligible!
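You can see that scaling without any Turing in the way, just by timing the factorization (toy matrices standing in for kernel matrices):

using LinearAlgebra

for N in (100, 200, 400, 800)
    K = Symmetric(rand(N, N)) + N * I  # random symmetric matrix, shifted to be positive definite
    @time cholesky(K)                  # cost grows roughly 8x each time N doubles
end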

2 Likes

Note that I don’t really understand the distinction :joy:

Yeah, this doesn’t seem great :weary:. In this case, I really only care about the GP and linear hyperparameters, not the subject-specific stuff (a[g] in the model). I found a presentation from a former colleague, where he wrote that we want to:

  • approximate the latent posterior and integrate it out.
    • approximation by the Laplace method
    • use saddle-free Newton to avoid saddle points

I did some googling and didn’t find any Julia-specific information about this - any ideas where I should look?
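For anyone following along, my (possibly shaky) understanding of the Laplace idea, mostly from chapter 3 of Rasmussen & Williams, is:

$$
p(\mathbf{f} \mid y, \theta) \approx \mathcal{N}\left(\hat{\mathbf{f}},\; (K^{-1} + W)^{-1}\right), \qquad
\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} \, \log p(y \mid \mathbf{f}) + \log p(\mathbf{f} \mid \theta), \qquad
W = -\nabla\nabla \log p(y \mid \hat{\mathbf{f}})
$$

i.e. find the mode of the latent posterior, make a Gaussian approximation around it, and use that to integrate the latent function out, so that only the kernel hyperparameters need to be sampled.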

@kevbonham Can you restate the actual model you are interested in?

2 Likes