Gaussian Process Model with Turing

Full code here

(GPTool) pkg> st
Project GPTool v0.1.0
Status `~/repos/gptool/Project.toml`
  [99985d1d] AbstractGPs v0.2.6
  [c7e460c6] ArgParse v1.1.0
  [336ed68f] CSV v0.7.7
  [324d7699] CategoricalArrays v0.8.1
  [a93c6f00] DataFrames v0.21.7
  [b4f34e82] Distances v0.9.0
  [31c24e10] Distributions v0.23.10
  [91a5bcdd] Plots v1.6.1
  [fce5fe82] Turing v0.14.1
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [de0858da] Printf
  [10745b16] Statistics
using CSV
using DataFrames
using Turing
using Distributions
using Plots
using Distances
using LinearAlgebra
using AbstractGPs

df = DataFrame(
    subject = repeat(1:5, inner=3),
    obese = repeat(rand(Bool, 5), inner=3),
    timepoint = [1,2,3,1,3,4,1,2,5,1,4,5,1,3,5],
    bug = rand(Beta(0.9, 5), 15),
    nutrient = rand(Beta(0.9,5), 15)
)

## Straight Turing

# Squared-exponential covariance from a distance matrix D (note the elementwise
# square D.^2); eps on the diagonal keeps the matrix numerically positive definite.
sqexp_cov_fn(D, phi, eps=1e-3) = exp.(-D.^2 / phi) + LinearAlgebra.I * eps

@model function myGP(y, Z, X, m=0, s=1, s_beta=3, cov_fn=sqexp_cov_fn)
    # Dimensions of GP predictors. Note that if X is a single column 
    # it needs to have dimension Nx1 (a matrix), not an array of 
    # length N, so that it can be processed properly by the distance function.
    N, P = size(X)
    
    # Dimensions of linear model predictors
    J = size(Z, 2)  # Z should be N x J
    
    # Distance matrix.
    D = pairwise(Distances.Euclidean(), X, dims=1)
    
    # Priors.
    mu ~ Normal(m, s)
    sig2 ~ LogNormal(0, 1)
    phi ~ LogNormal(0, 1)
    
    # Realized covariance function
    K = cov_fn(D, phi)
    
    # Prior for linear model coefficients
    beta ~ filldist(Normal(0, s_beta), J)

    # Sampling Distribution.
    y ~ MvNormal(mu * ones(N) + Z * beta, K + sig2 * LinearAlgebra.I(N))
end

gp = myGP(df.bug, Matrix(df[!,[:subject, :obese, :nutrient]]), Matrix(df[!,[:timepoint]]))
chain = sample(gp, HMC(0.01, 100), 200)

Tried both, on their own and together, same deal :-/

Separately, I’m trying to follow this example and running into other issues. Is there a better place to raise those?

This runs, but I haven’t looked at the results closely.

Note that I used KernelFunctions.jl, which implements the squared exponential covariance function previously written as sqexp_cov_fn.

I also cleaned up a few things. You’ll probably want to think about the priors for your application.

using CSV
using DataFrames
using Turing
using Distributions
using LinearAlgebra
using KernelFunctions

df = DataFrame(
    subject = repeat(1:5, inner=3),
    obese = repeat(rand(Bool, 5), inner=3),
    timepoint = [1,2,3,1,3,4,1,2,5,1,4,5,1,3,5],
    bug = rand(Beta(0.9, 5), 15),
    nutrient = rand(Beta(0.9,5), 15)
)

sekernel(alpha, rho) = 
  alpha^2 * KernelFunctions.transform(SEKernel(), sqrt(0.5)/rho)

@model function myGP(y, Z, X, jitter=1e-6)
    # Dimensions of GP predictors. Note that if X is a single column 
    # it needs to have dimension Nx1 (a matrix), not an array of 
    # length N, so that it can be processed properly by the distance function.
    N, P = size(X)
    
    # Dimensions of linear model predictors
    J = size(Z, 2)  # Z should be N x J
    
    # Priors.
    mu ~ Normal(0, 1)
    sig2 ~ LogNormal(0, 1)
    alpha ~ LogNormal(0, 0.1)
    rho ~ LogNormal(0, 1)

    # Prior for linear model coefficients
    beta ~ filldist(Normal(0, 1), J)
    
    # GP Covariance matrix
    kernel = sekernel(alpha, rho)  # covariance function
    K = kernelmatrix(kernel, X, obsdim=1)  # cov matrix
    K += LinearAlgebra.I * (sig2 + jitter)

    # Sampling Distribution.
    y ~ MvNormal(mu .+ Z * beta, K)
end

gp = myGP(df.bug,
          Matrix(df[!,[:subject, :obese, :nutrient]]),
          Matrix{Float64}(df[!,[:timepoint]]))

@time chain = sample(gp, HMC(0.01, 100), 200)

Yeah, perhaps if you have issues with the example in the blog, try asking in the comments section (of the blog) for now.


Confirmed, this now works for toy data, but is failing (PosDefException) on my real data :frowning:

If I’m reading your model correctly here, you are trying to include a GP to account for correlations among observations over time. Right now you have multiple observations at the same time point (check rank(D)), which is why you are not getting a positive definite covariance matrix. You could make it positive definite if you added a large enough variance to the diagonal, but that’s probably not what you want.

It might make more sense to model correlation over time within each subject. This will (implicitly) result in a block-diagonal covariance matrix that should be positive definite.
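One way to make that explicit in a single MvNormal is sketched below. This is a rough, untested sketch reusing the toy df from above; the function and variable names are just illustrative, and you’d want to revisit the priors.

using Turing, Distributions, Distances, LinearAlgebra

# Squared-exponential covariance from a within-subject distance matrix;
# eps keeps each block numerically positive definite.
sqexp_cov(D, phi, eps=1e-6) = exp.(-D.^2 ./ phi) + LinearAlgebra.I * eps

@model function perSubjectGP(y, subject, t)
    mu   ~ Normal(0, 1)
    sig2 ~ LogNormal(0, 1)
    phi  ~ LogNormal(0, 1)

    N = length(y)
    K = zeros(typeof(phi), N, N)          # block-diagonal covariance, filled below
    for s in unique(subject)
        idx = findall(==(s), subject)     # observations belonging to this subject
        D = pairwise(Euclidean(), reshape(t[idx], :, 1), dims=1)
        K[idx, idx] = sqexp_cov(D, phi)   # correlate only within-subject
    end

    # Observation noise on the diagonal; across-subject covariances stay zero.
    y ~ MvNormal(fill(mu, N), K + sig2 * LinearAlgebra.I(N))
end

chain = sample(perSubjectGP(df.bug, df.subject, Float64.(df.timepoint)),
               HMC(0.01, 100), 200)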


Ahh, yes this is true. But also true of toy data, no?

Yes, definitely - time-point 1 from subject 1 has basically no relationship to time-point 1 from subject 2. Is this what the one-hot encoding mentioned earlier is meant to address?

Yeah sorry. It’s fine to have multiple observations at the same time point if:

  1. they are unrelated (different subjects)
  2. there is observation error (adding variance to the diagonal)

The reference to one-hot encoding is separate from the GP. Right now you are passing a Z (design matrix) that looks like:

julia> Matrix(df[!, [:subject, :obese, :nutrient]])
 15×3 Array{Float64,2}:
 1.0  0.0  0.201721
 1.0  0.0  0.358846
 1.0  0.0  0.0457792
 2.0  0.0  0.217184
 2.0  0.0  0.0164159
 2.0  0.0  0.253808
 3.0  1.0  0.145758
 3.0  1.0  0.0136373
 3.0  1.0  0.328215
 4.0  0.0  0.002383
 4.0  0.0  0.334386
 4.0  0.0  0.0121662
 5.0  1.0  0.234525
 5.0  1.0  0.145446
 5.0  1.0  0.251321

Note that your first column is treating subject as continuous. So only one effect is estimated, and subject 5 experiences five times the effect of subject 1. One-hot encoding means that you have one column per subject, and rows have a single 1 value in the column corresponding to their subject id, and are otherwise zero.
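For concreteness, here’s one way to build those columns by hand from the data frame above (illustrative only; the variable names are mine):

subjects = sort(unique(df.subject))                        # [1, 2, 3, 4, 5]
onehot   = Float64.(df.subject .== permutedims(subjects))  # 15×5: one column per subject
Z        = hcat(onehot, df.obese, df.nutrient)             # 15×7 design matrix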

The StatsModels package makes dealing with this easier by providing a formula interface. StatsModels calls one-hot encoding StatsModels.FullDummyCoding; it isn’t exported because other encodings (statisticians call them contrasts) are often more appropriate. If you want to use one-hot encoding, you need to explicitly exclude the intercept term, or your design matrix (Z) will not be full rank. Using your data frame above,

using StatsModels

mf = ModelFrame(@formula(bug ~ -1 + subject + obese + nutrient),
                df,
                contrasts = Dict(:subject => StatsModels.FullDummyCoding()))
Z = modelmatrix(mf)

Then Z looks like

  15×7 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.201721
 1.0  0.0  0.0  0.0  0.0  0.0  0.358846
 1.0  0.0  0.0  0.0  0.0  0.0  0.0457792
 0.0  1.0  0.0  0.0  0.0  0.0  0.217184
 0.0  1.0  0.0  0.0  0.0  0.0  0.0164159
 0.0  1.0  0.0  0.0  0.0  0.0  0.253808
 0.0  0.0  1.0  0.0  0.0  1.0  0.145758
 0.0  0.0  1.0  0.0  0.0  1.0  0.0136373
 0.0  0.0  1.0  0.0  0.0  1.0  0.328215
 0.0  0.0  0.0  1.0  0.0  0.0  0.002383
 0.0  0.0  0.0  1.0  0.0  0.0  0.334386
 0.0  0.0  0.0  1.0  0.0  0.0  0.0121662
 0.0  0.0  0.0  0.0  1.0  1.0  0.234525
 0.0  0.0  0.0  0.0  1.0  1.0  0.145446
 0.0  0.0  0.0  0.0  1.0  1.0  0.251321

So the first five columns encode the subject (and you’ll estimate one parameter for each). However, now there is an identifiability issue because the obesity column is the sum of the subject 3 and subject 5 columns (so the matrix is not full rank). Appropriate priors (maybe estimate one global mean and then put fairly strong priors on each subject’s difference from that mean) can help here.
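A rough sketch of that idea for the linear part only (the GP covariance is left out for brevity, and the prior scales here are purely illustrative):

using Turing, Distributions

@model function linearpart(y, subject_onehot, other_Z)
    S = size(subject_onehot, 2)   # number of subjects (one-hot columns)
    J = size(other_Z, 2)          # remaining predictors (obese, nutrient, ...)

    mu    ~ Normal(0, 1)                  # one global mean
    delta ~ filldist(Normal(0, 0.5), S)   # fairly tight per-subject deviations
    beta  ~ filldist(Normal(0, 1), J)
    sig2  ~ LogNormal(0, 1)

    m = mu .+ subject_onehot * delta .+ other_Z * beta
    y ~ MvNormal(m, sqrt(sig2))           # sqrt(sig2) is the observation std. dev.
end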


This is super helpful, thanks (this is really not my field, so a lot of the jargon is pretty opaque). The original data I fed in did not obey these constraints, but I tried filtering the data down to enforce them, and I’m still having the same error. One point of clarification - are those two conditions OR or AND?


Those are OR. If you model a GP for each subject separately then you’re saying that each subject follows an independent multivariate normal distribution (potentially sharing parameters like correlation length and marginal variance). If there is no observation error (also known as a ‘nugget’ in geostatistics), then the GP will interpolate the observations exactly, which isn’t possible if there are two different observations at the same location.


Hi @luiarthur,
I looked through your blog posts and found them extremely helpful! Do you know of a method to add some “jitter” to the diagonal of the covariance matrix while still using the AbstractGPs interface? I very much like the manual nature of just using Turing + KernelFunctions, but I also like the succinctness of AbstractGPs if I can get my hands on it. Also, thanks for the blog posts! They certainly helped me piece together how GPs fit into a larger Bayesian modelling workflow.

Thanks, @nucklass.

If I understand you correctly…

using KernelFunctions

k = SEKernel()
jitter = 1e-6
# adds jitter to diagonal of kernel matrix 
k +=  jitter * WhiteKernel()  
# `EyeKernel` is an alias for `WhiteKernel`
# So, `k += jitter * EyeKernel()` also works.

Note that jitter needs to be constant. If you instead want to learn a noise term, you’ll have to go about this a different way.

Also, slightly modified from my GP regression post,

using AbstractGPs, KernelFunctions, Distributions

# Define a parameterized kernel.
sqexpkernel(alpha::Real, rho::Real) = 
    alpha^2 * transform(SEKernel(), 1/(rho*sqrt(2)))

# Define model.
@model GPRegression(y, X, jitter=1e-6) = begin
    # Priors.
    alpha ~ LogNormal(0.0, 0.1)
    rho ~ LogNormal(0.0, 1.0)
    sigma ~ LogNormal(0.0, 1.0)
    
    # Covariance function.
    kernel = sqexpkernel(alpha, rho)

    # GP (implicit zero-mean).
    gp = GP(kernel)
    
    # Sampling Distribution
    y ~ gp(X, sigma^2 + jitter)  # add jitter for numerical stability.
end;

This version ends up being somewhat faster than if you were to add jitter * WhiteKernel() to your preferred kernel.


Ah, that’s rather straightforward. I was running into some PosDefExceptions when I wasn’t expecting them, and had read above that adding a small constant to the diagonal could help with numerical stability. I ended up rewriting the model from scratch and circumventing the issue altogether, but figured I should learn how to do it anyway. Thanks for the quick reply! You answered the question exactly! Also, go Banana Slugs!


Nice post! FWIW, this works out of the box in Soss.jl:

# Define model (assumes `sqexpkernel` and the AbstractGPs/KernelFunctions setup from the previous post).
using Soss

m = @model X begin
    # Priors.
    alpha ~ LogNormal(0.0, 0.1)
    rho ~ LogNormal(0.0, 1.0)
    sigma ~ LogNormal(0.0, 1.0)
    
    # Covariance function.
    kernel = sqexpkernel(alpha, rho)

    # GP (implicit zero-mean).
    gp = GP(kernel)
    
    # Sampling Distribution (MvNormal likelihood).
    y ~ gp(X, sigma^2 + 1e-6)  # add 1e-6 for numerical stability.
end;

# Get some fake data
X = randn(20)
truth = rand(m(X=X))

# Sample from the posterior
y = truth.y
post = dynamicHMC(m(X=X), (;y=y));

Then

julia> ppost = particles(post)
(sigma = 0.141 ± 0.065, rho = 1.65 ± 0.49, alpha = 1.01 ± 0.096)

> this works out of the box in Soss.jl:

Just tried this out! It’s really slick and runs very quickly, to my (extremely thankful) surprise. One question, though: I notice that each feature set/observation in the data needs to be a column in the input array, as opposed to the usual one-row-per-observation that Soss usually uses, or else I get a mismatched-dimensions error. I’m assuming this has to do with the format that AbstractGPs/KernelFunctions generates inside the model.
Any idea if this is correct, or am I just generating junk samples?

Great!

Good point. There’s nothing in Soss that’s specific to rows or columns; this is just an artifact of the sampling distribution for the GP, which in turn comes from how MvNormal is defined. It may be more efficient to do things columnwise in this case, since Julia’s memory layout is column-major, so the memory for the [i,j] entry is adjacent to the memory for the [i+1,j] entry.

> I notice that each feature set/observation in the data needs to be a column in the input array as opposed to the usual one-row-per-observation that Soss usually uses, or else I get a mismatched dimensions error. I’m assuming this has to do with the format that AbstractGPs/KernelFunctions generates inside the model.

Concerning the format for AbstractGPs/KernelFunctions: the standard format is a vector of vectors. If you give a Matrix, it will assume one column per observation, but you can pass the keyword obsdim=1 and it will automatically put it in the right format (without allocations). Check out ColVecs and RowVecs from KernelFunctions.jl.
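To make the layout question concrete, a small sketch (the data here is just random):

using KernelFunctions

X = randn(15, 3)    # 15 observations with 3 features, one row per observation
k = SEKernel()

K1 = kernelmatrix(k, X, obsdim=1)   # tell KernelFunctions that rows are observations
K2 = kernelmatrix(k, RowVecs(X))    # equivalent: wrap the matrix explicitly
# Without obsdim, a plain Matrix is treated as one column per observation,
# i.e. kernelmatrix(k, X) is the same as kernelmatrix(k, ColVecs(X)).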

> Nice post! FWIW, this works out of the box in Soss.jl:

It’s really great to hear :smiley:


BTW @cscherrer, we currently have a Turing model in the tests of AbstractGPs to check that we don’t break compatibility! You are welcome to bring a PR to do the same with a Soss model!


Great idea, thanks!


If it seems like I’m only returning to this once a month, maybe in preparation for a meeting with a supervisor… that’s because I am! :laughing:. Thanks everyone who continues to engage on this - I’m learning a ton. Super excited to see Soss is accounted for here too!

Yeah, this seems wrong. As I said before, subject numbers should basically be considered UUIDs; their value has no meaning.

I was able to generate the one-hot encoded matrix with StatsModels as you suggested, which works when I use a small subset of my data, but with the full set it’s a 921-column matrix, and it freezes[^1] trying to sample.

So is there a way to encode this as a categorical variable? Or as suggested above, should I maybe put another SE kernel on this with a tiny length scale (so they’re effectively treated as independent of one another)?

I tried adding this to the mixed model from above, but apparently I’m not adding it in the right spot. I assumed it would take the place of K += LinearAlgebra.I * (sig2 + jitter) in the previous model, but I tried there and a couple of other places and kept getting MethodErrors. Any ideas?

More generally, I’m still not 100% clear on how these models compose with one another. Some open questions:

  1. In the myGP model here, am I understanding correctly that we have two sets of input predictors (stuff that’s modeled as linear and stuff that’s part of the GP), and that the linear model coefficients enter through the mean of the MvNormal while the kernel-transformed timepoint variable enters through the covariance matrix? If I wanted to model the subject ID with an SE kernel with a tiny length scale as suggested above, can I add that to the K = kernelmatrix(... expression, or do I need to do that separately?
  2. If I want to add categorical or boolean components to the linear portion of the model, I need to fit them with different distribution assumptions, right? E.g., Binomial for boolean things. Should I add that as y ~ MvNormal(mu .+ X * beta .+ <boolean stuff> * other_beta...?

[^1]: In the course of writing this, the progress meter updated to ETA 10.39 days… :grimacing: :mantelpiece_clock:

FWIW, this book may be helpful for your problem. The latter part of the book covers (essentially) random effects GPs with this kind of structure. The idea is that you have a separate GP for each person, possibly with some shared/fixed priors, possibly with person-specific priors (but with a hierarchical prior across them).

That isn’t to say that ideas like encoding subject as a categorical variable, and maybe using some sort of categorical kernel, wouldn’t work, but I find the approach discussed in the book somewhat more natural if you are used to GLMMs.
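In Turing terms, that idea might look roughly like the sketch below. It’s only a sketch: the grouping of the data by subject, the priors, and the shared length-scale hierarchy are all illustrative, and sekernel is the helper defined earlier in the thread.

using Turing, Distributions, KernelFunctions, LinearAlgebra

# ys[s] and ts[s] hold the observations and time points for subject s.
@model function subjectGPs(ys, ts)
    S = length(ys)

    # Population-level (shared) hyperpriors.
    log_rho_mean ~ Normal(0, 1)
    log_rho_sd   ~ LogNormal(0, 0.5)
    alpha ~ LogNormal(0, 0.1)
    sigma ~ LogNormal(0, 1)

    # Per-subject length scales, tied together by the shared prior.
    log_rho ~ filldist(Normal(log_rho_mean, log_rho_sd), S)

    for s in 1:S
        kernel = sekernel(alpha, exp(log_rho[s]))
        K = kernelmatrix(kernel, reshape(ts[s], :, 1), obsdim=1)
        ys[s] ~ MvNormal(K + (sigma^2 + 1e-6) * I)   # zero-mean GP per subject
    end
end

ys = [df.bug[df.subject .== s] for s in unique(df.subject)]
ts = [Float64.(df.timepoint[df.subject .== s]) for s in unique(df.subject)]
chain = sample(subjectGPs(ys, ts), HMC(0.01, 100), 200)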
