Gaussian Process Model with Turing

Question for the KernelFunctions/AbstractGPs authors, though if anyone else has any ideas feel free to chime in too. What’s the usual procedure for diagnosing/fixing PosDef errors during estimation of hyperparameters? I’ve got a fairly large model (~6000 observations) and it seems somewhat random whether I get the error or not, so I’ve just been repeating DynamicHMC sample calls until it doesn’t error. Obviously this is not sustainable and rather uninformed. Is the conventional wisdom just to increase the value added to the diagonal of the covariance matrix until I stop getting errors, or is this an indication that my choice of prior for the hyperparameters is off? Again, awesome work on both packages!

1 Like

It’s not clear to me whether a singular covariance is a problem for the model, or just for the way it’s currently being computed. Maybe PositiveFactorizations.jl could help with this?
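If the issue is just the factorization itself, a minimal sketch of that idea (my own illustration; K stands for whatever covariance matrix you’ve built) might be:

using LinearAlgebra, PositiveFactorizations

A = randn(5, 3)
K = A * A'                   # rank-deficient, so a plain cholesky(K) may throw PosDefException
F = cholesky(Positive, K)    # PositiveFactorizations forces a usable Cholesky-like factorization

Whether that factorization is the right thing to feed into the rest of the model is a separate question, of course.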

1 Like

Ah actually looks like the KernelFunctions team already answered my question. Check out https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/blob/6bca6a17284dcd12fc4eca945435697e50a524b3/src/matrix/kernelpdmat.jl
if anyone else is running into this issue. It still errors after switching from kernelmatrix to kernelpdmat in the model specification, but at least the error message is much more informative :sweat_smile:. Looks like I might need to fiddle with some priors or reformat the data to make the feature matrix more sparse.
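For anyone following along, the switch itself is a one-line change (sketch; kernel and the input matrix X stand in for whatever is in the model):

using KernelFunctions

# before: K = kernelmatrix(kernel, X, obsdim=1)
K = kernelpdmat(kernel, X, obsdim=1)   # returns a positive-definite PDMat (adding jitter if needed), or fails with a clearer error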

2 Likes

Hey @nucklass, one of the authors here. I designed kernelpdmat exactly for this purpose, but it can still be insufficient in some situations. What you might want to look for is whether your kernel matrix is low-rank, i.e. you have some columns that are numerically identical. This is usually due to having two observations extremely close to each other, which can happen with very large lengthscales (I imagine this is what is happening in your optimization). To be honest there is no really good solution for this… Sparse GPs tend to avoid this problem, and given that you have a lot of observations, this might be the way to go.
EDIT: We should probably write a full paragraph about the “good GP practices” somewhere in some docs :smiley:
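A rough way to check for the low-rank situation (my own sketch; kernel and the input matrix X are assumed to be whatever you use in the model) is to look at the spectrum of the kernel matrix:

using LinearAlgebra, KernelFunctions

K = kernelmatrix(kernel, X, obsdim=1)
eigs = eigvals(Symmetric(K))
@show minimum(eigs)                          # near zero (or negative) => numerically singular
@show count(>(1e-10 * maximum(eigs)), eigs)  # crude numerical rank, compare with size(K, 1)

If the numerical rank is well below the number of observations, that points to near-duplicate inputs (or very large lengthscales), as described above.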

3 Likes

Thanks for the reply!

This is looking more and more like the direction I should go. I’m only familiar with the simplest sparse approximations, i.e. Subset of Data or Subset of Regressors. Would it be entirely foolish to try to combine sparse approximations with MCMC sampling of kernel hyperparameters? I.e., try to extend the Soss + Turing programs above with some sparse approximation of the latent function. From a cursory glance it seems like quite a few sparse methods attempt to optimize hyperparameters and inducing points simultaneously, and I’m not quite sure what estimation methods are being used in each instance. From a performance standpoint, I run a lot of Bayesian models, so waiting a few days for sampling to finish is nothing new. Running an analysis that someone could reproduce/rerun is rather important, though. Also, if Augmented Gaussian Processes already implements any of this, thanks in advance.

FYI, there’s a PR to Stheno right now that will let you use sparse GPs in Turing. It’s not battle-tested yet, but I’ve got it to run for some simple problems. And here’s the thread that led to it, if you didn’t see it:

2 Likes

Oh that’s very cool! I’ll definitely give this a try. I’ve heard Stheno’s kernel interface is going to switch over to KernelFunctions.jl fairly soon, so I guess everything’s coming together :smile:.

Alright, I’m back! This is my focus until I get something working. To reiterate the goals here, because this post has meandered a bit (which is great, I’m learning a ton!), the situation is this:

  • I have longitudinal patient biosamples. That is, I’ve taken measurements from multiple people (subjects) over time.
  • Many of the measurements are continuous; in this particular case, I’m looking at the dependent variable bug, and one of the predictor variables is nutrient
  • bug typically changes over time, but not in a single direction. We expect measurements for a given subject to be highly correlated for samples close in time, less so for samples more distant in time.
  • we expect measurements from different subjects to be completely independent of one another (that is, there’s no relationship between bug1 for subj1 at date1 and bug1 for subj2 at date1). Akin to a random effect.
  • There are other categorical and boolean variables that do not tend to change for a particular subject (eg gender, obesity)

We would like to model the measurement of bug over time as a Gaussian process, where (in the language of LMs) things like obese are fixed effects, subject is a random effect, and the covariance of nutrient over time is squared exponential.

After doing some additional reading, I think what I really want is to encode at least subject, and possibly obese, as index variables, à la https://github.com/TuringLang/Turing.jl/pull/1422. But I’m having issues figuring this out, and I think it has to do with a misunderstanding of how different things interact. So first things first, I tried building a slightly extended version of the demo model in #1422, such that the g variable is an index variable and X is a matrix of typical continuous variables (also using the linreg tutorial as inspiration). This seems to work:

using Turing, Distributions

# Dummy data
x = vcat(randn(10), randn(10).+10, randn(10).-5)
y = vcat(randn(10), randn(10).+10, randn(10).-5) .* 2
g = vcat(ones(Int, 10), ones(Int, 10)*2, ones(Int, 10)*3)

@model function demo(x, y, g)
    ng = length(unique(g))
    # group coefficients priors
    a ~ filldist(Normal(), ng)
    
    # x coefficient prior
    b ~ Normal()

    mu = a[g] + b * x
    
    y .~ Normal.(mu)
end

model = demo(x,y,g);
sample(model, NUTS(0.8), 1_000)
Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat  
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64  
                                                                            
        a[1]   -0.6291    0.2943     0.0132    0.0117   380.2325    1.0030  
        a[2]    2.6719    0.8799     0.0394    0.0438   174.4973    1.0048  
        a[3]   -2.0860    0.4941     0.0221    0.0229   175.6049    1.0010  
           b    1.7736    0.0921     0.0041    0.0048   167.5996    1.0039

So now imagine I have a tp argument that includes timepoints. How do I incorporate the above specification into a GP? The following code runs, but I’m not 100% sure that it’s actually doing what I’m intending…

using KernelFunctions, LinearAlgebra

tp = repeat(1:10, 3)

sekernel(alpha, rho) = 
  alpha^2 * KernelFunctions.transform(SEKernel(), sqrt(0.5)/rho)

@model function demo(x, y, g, tp, jitter=1e-6)
    ng = length(unique(g))
    # group coefficients priors
    a ~ filldist(Normal(), ng)
    
    # x coefficient prior
    b ~ Normal()

    # gp priors
    sig2 ~ LogNormal(0, 1)
    alpha ~ LogNormal(0, 0.1)
    rho ~ LogNormal(0, 1)

    # lm specs
    mu = a[g] + b * x
    
    # gp specs
    kernel = sekernel(alpha, rho)  # covariance function
    K = kernelmatrix(kernel, reshape(tp, length(tp), 1), obsdim=1)  # cov matrix
    K += LinearAlgebra.I * (sig2 + jitter)
    
    y .~ MvNormal(mu, K)
end

model = demo(x,y,g,tp);
sample(model, NUTS(0.8), 1_000)
Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat  
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64  
                                                                            
        a[1]   -0.4016    0.7114     0.0318    0.0164   746.6413    0.9994  
        a[2]    0.5394    0.9962     0.0446    0.0565   430.2212    0.9982  
        a[3]   -0.6955    0.8198     0.0367    0.0488   503.1865    0.9993  
       alpha    0.9929    0.0944     0.0042    0.0012   653.7770    0.9980  
           b    1.9989    0.1195     0.0053    0.0074   382.8640    0.9982  
         rho    1.7239    2.2199     0.0993    0.0645   376.1499    1.0025  
        sig2    5.9109    1.6832     0.0753    0.0698   497.3395    0.9982  

Does this model make sense / fit what I described above?

This also runs on my real data (no PosDefException, yay!), where the number of unique g is ~300, but it is suuuuper slow. Any suggestions on this front would be greatly appreciated.

1 Like

Bump. Also some updates.

I’ve been trying to do a bit of benchmarking and optimizing on the model above, following the performance tips. I’ve already avoided loops, and when I run @code_warntype model = demo(x,y,g,tp), everything is blue (though this is my first time using that macro, so I’m not sure I’m reading it correctly).

Then I experimented with different backends. First thing to note - Zygote gives me an error (something about not being able to iterate nothing). I’m assuming this has to do with this statement:

In case of :tracker and :zygote , it is necessary to avoid loops for now. This is mainly due to the reverse-mode AD backends Tracker and Zygote which are inefficient for such cases. ReverseDiff does better but vectorized operations will still perform better.

I don’t explicitly use a loop, is it possible that the GP is using one implicitly?
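For reference, here is how I have been switching backends (a sketch assuming the Turing.setadbackend interface that was current when this thread was written; the relevant AD package needs to be loaded for its backend to work):

using Turing
# using ReverseDiff, Zygote        # needed for the respective backends

Turing.setadbackend(:forwarddiff)  # default, used for the timings below
# Turing.setadbackend(:reversediff)  # reverse-mode alternative I also tried
# Turing.setadbackend(:zygote)       # this is the one that errors for me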

My real data is 921 rows; each subject has 1-4 timepoints, and there are 305 total subjects. When I run the first 50 rows, with 100 samples for the chain and ForwardDiff:

  • NUTS(0.8) takes 17.6 sec
  • HMC(0.1, 10) takes 12.9 sec
  • HMC(0.1, 20) takes 6.7 sec

The first 50 rows contain 19 subjects. When I bump up to 100 rows (36 subjects), that last configuration takes 51 sec (almost 10 times as long), and using ReverseDiff takes about 10.5 min. At 150 rows (53 subjects), it’s 230 sec. If things keep scaling like this, I’m looking at days for the whole set. Is there a better way to do this?

I also made an attempt to do what was suggested earlier and add a separate subject GP with a really short length scale - it’s possible I did that wrong, but it did not improve things much.

Note that the relevant benchmark is probably not time/sample, but time/effective sample size.

2 Likes

Sorry to hijack this thread, but this is moving towards a more general question I’ve had for a while about Bayesian modelling in general and Turing specifically: how can I form a reasonable expectation, in advance, of how long a model will take to estimate? Is it about the number of observations, the number of parameters, or the structure of the model itself? And what is effective sample size in this context?

For someone only trained in classical econometrics, where one mostly estimates linear models that take a fraction of a second, it’s hard to guess when I could feasibly switch my analysis strategy over to Bayesian.

1 Like

Generally,

  1. evaluation time of the posterior (with gradients, if you are using those), which does depend on the number of observations etc (unless you have summary statistics),
  2. how well-behaved the log posterior is (eg non-uniform curvature is difficult to deal with)

See the Stan manual for an explanation of ESS.
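For the Turing models in this thread, a rough way to get at “effective samples per second” (a hedged sketch, assuming model is one of the models defined above) is:

using Turing, MCMCChains

t = @elapsed chain = sample(model, NUTS(0.8), 1_000)
display(summarystats(chain))   # includes the ess column shown in the tables above
# effective samples per second ≈ ess / t for whichever parameter you care about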

3 Likes

For example, in this situation a kernel matrix K of size length(tp) x length(tp) is created and passed to MvNormal. MvNormal then needs to factorize (and effectively invert) this matrix, with complexity length(tp)^3, which is usually the bottleneck with GPs.
So basically every sampling step has at least complexity N^3, which, for N = 300, is not negligible!
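To see that scaling directly, here is a tiny stand-alone timing sketch (nothing GP-specific, just the Cholesky factorization that dominates the MvNormal log-density):

using LinearAlgebra

for N in (100, 200, 400, 800)
    A = randn(N, N)
    K = Symmetric(A * A' + N * I)   # a well-conditioned positive-definite matrix
    t = @elapsed cholesky(K)        # first iteration also includes compilation time
    println("N = $N: ", round(t * 1000, digits = 2), " ms")
end

Doubling N should make each factorization roughly eight times slower once compilation is out of the way.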

2 Likes

Note that I don’t really understand the distinction :joy:

Yeah, this doesn’t seem great :weary:. In this case, I really only care about the GP and linear hyperparameters, not the subject-specific stuff (a[g] in the model). I found a presentation from a former colleague, where he wrote that we want to

  • approximate the latent posterior and integrate it out.
    • approximation by the Laplace method
    • use saddle-free Newton to avoid saddle points

I did some googling and didn’t find any julia-specific information about this - any ideas where I should look?

@kevbonham Can you restate the actual model you are interested in?

2 Likes

Note that I don’t really understand the distinction :joy:

What you really care about is how many effectively independent samples you get from your sampler. For example, in Metropolis-Hastings, if you rejected every proposal you would get stuck with an effective sample size of one (even though you might have done the calculations for 10 million tries).

I did some googling and didn’t find any julia-specific information about this - any ideas where I should look?

I don’t think there is any Laplace method implemented at the moment, also because Julia AD is pretty bad with Hessians right now…
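For context on why Hessians matter here: a Laplace approximation replaces the latent posterior with a Gaussian whose covariance is the inverse Hessian of the negative log posterior at its mode. A toy sketch with ForwardDiff (a stand-in density, not the GP model above) would be:

using ForwardDiff, LinearAlgebra

# toy negative log posterior standing in for the latent conditional
neglogp(θ) = 0.5 * sum(abs2, θ) + 0.1 * sum(θ .^ 4)

θmode = zeros(3)                          # pretend we already found the mode
H = ForwardDiff.hessian(neglogp, θmode)   # curvature at the mode
Σ = inv(Symmetric(H))                     # covariance of the Gaussian (Laplace) approximation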

2 Likes

This is a fun topic! Here’s a very hand-wavy explanation, just to build on @theogf’s point. :slight_smile:

It’s common for a “number of samples” n to show up in statistical computations, usually with the assumption that the samples are independent. Think of the computation for the standard error of the mean, for example.

In an MCMC context, we can often compute things more directly than depending on formulas like this, which often only hold asymptotically and with some assumptions (iid, etc).

That’s great, but when you plug everything in, the equations often don’t quite work out, because some assumptions are broken. It turns out we can quantify “how broken” they are (again, hand-wavily).

So we pretend we don’t know n, and solve for it, calling it n_{\text{eff}}, the effective number of samples. However many samples we actually took, the result behaves as if we had taken n_{\text{eff}}.
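One standard asymptotic expression for it, in terms of the chain’s lag-k autocorrelations \rho_k, is

n_{\text{eff}} = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho_k},

so the more correlated successive draws are, the smaller n_{\text{eff}} is relative to the raw number of draws.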

Though we often sweep it under the rug, it’s important to remember that there are different ways of computing this. So when you dig into the details, you always need to consider the asymptotic result used, and which assumptions from that result are broken.

4 Likes

Sure! Given that I’m not totally certain that the code I’m writing is what I want, I’ll state it again in words. What I am most interested in is comparing two models, one that contains one variable (in this case, diet) and the other that doesn’t. The components of the model are:

  • bug - the dependent variable (continuous)
  • diet - I want to compare models with and without this variable (continuous), which should have linear relationship with bug
  • x_1 \dots x_n - some number of covariates with a linear relationship to bug
  • subject (categorical) and time - I’ve got multiple samples from each subject over time.

bug varies over time, with samples that are closer to one another in time being more similar to one another. But this relationship only holds within the same subject. That is, time=1 for subject=1 and time=1 for subject=2 have no relationship. Also, the subject number has no numerical meaning; it’s just a UID.

So here’s the model I came up with[^1] in code. Here, I’m ignoring the x_n stuff, just trying to do bug, diet, subject and time:

sekernel(alpha, rho) = 
  alpha^2 * KernelFunctions.transform(SEKernel(), sqrt(0.5)/rho)

@model function GPmodel1(bug, diet, subj, tp, jitter=1e-6)
    ns = length(unique(subj))
    
    # subj coefficients (random effects) priors
    a ~ filldist(Normal(), ns)
    
    # diet coefficient prior
    b ~ Normal()

    # gp priors
    sig2 ~ LogNormal(0, 1)
    alpha ~ LogNormal(0, 0.1)
    rho ~ LogNormal(0, 1)

    # lm specs
    mu = a[subj] + b * bug
    
    # gp specs
    kernel = sekernel(alpha, rho)  # covariance function
    K = kernelpdmat(kernel, reshape(tp, length(tp), 1), obsdim=1)  # cov matrix
    K += LinearAlgebra.I * (sig2 + jitter)
    
    bug .~ MvNormal(mu, K)
end

[^1]: When I say “I came up with,” I really mean “that I mostly copied from examples earlier in this thread from @luiarthur.”

1 Like

I think this line:

mu = a[subj] + b * bug

should be

mu = a[subj] + b * diet


Also, I think it might help to think about the model more abstractly. Something like

y_{s,j} = \beta_0 + d_{s,j} \cdot \beta_d + f_s(t_{s,j}) + \epsilon_{s,j}, for j=1,\dots,J_s, and s=1,\dots,S,

where y_{s,j} represents the dependent variable (bug) for the j-th measurement for subject s, d_{s,j} is the diet for subject s at measurement j, \beta_d is the diet effect, \beta_0 is the intercept, t_{s,j} is the time at which measurement j was taken for subject s, f_s(\cdot) is a (possibly non-linear) function of time for subject s, and \epsilon_{s,j} is the residual. (f could be a number of things, e.g. a linear function, a polynomial, some other non-linear function, etc.)

I would think that the relationship between the dependent variable and time might be different for each subject. So, I included a subscript on f for each subject. Though, this was not accounted for in the Turing model.
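One way it could be accounted for (my own sketch, not something from the packages above; subjectwise_kernelmatrix is a hypothetical helper) is to zero out the covariance between different subjects. The result stays positive semi-definite because the subject-indicator mask is itself a valid Gram matrix:

using KernelFunctions, LinearAlgebra

# SE covariance over time, but zero between observations from different subjects
function subjectwise_kernelmatrix(kernel, tp, subj)
    K = kernelmatrix(kernel, reshape(tp, length(tp), 1), obsdim=1)
    mask = [subj[i] == subj[j] for i in eachindex(subj), j in eachindex(subj)]
    return K .* mask   # block structure: an independent GP per subject
end

# inside the model, roughly:
# K = subjectwise_kernelmatrix(sekernel(alpha, rho), tp, subj) + LinearAlgebra.I * (sig2 + jitter)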

@kevbonham Does this look like the model you intend to fit? If not, what is the high level description of your intended model? For example, do you think f should be indexed by subject? Or is one diet prescribed for all time for each subject and, thus, d does not need to be indexed by j?

Can you show a couple of subjects’ trajectories? Perhaps one could see what the actual correlation structure in time looks like.

2 Likes