Gaussian Process Model with Turing

kevbonham · July 2, 2020, 9:10pm

Is it possible to specifiy a Gaussian process (GP) regression model directly in Turing? I’ve looked at Stheno.jl and GaussianProcesses.jl, but neither have an obvious way to specify a multivariate model (GaussianProcesses.jl explicitly says it can’t, and the docs for Stheno are not yet available). I’m also new to GPs and probabilistic programming in general, so it’s possible I’ve missed something obvious, but I’ve looked through all of the tutorials for Turing, and haven’t seen anything that clearly fits what I want. Incidentally, if this doesn’t already exist somewhere and I figure this out, I’d be happy to write it up as a tutorial.

If such a tutorial already exists, feel free to stop reading and point me there.

I figure giving an example is the easiest way to get started / make it clear where the gaps in my knowledge are. I study epidemiology associated with the human gut microbiome. Imagine I have a cohort of human subjects that have each provided samples at multiple time points (separated by variable numbers of days, all starting on an arbitrary day 1). At each of those time points, I can measure the amount of a particular bacterium, (call it bug), and I have diet information from which I can get information about different nutrients (nutrient). I also have some other information that might be relevant, like whether the person is obese or not (binary variable).

We know that the most relevant information for predicting bug is a sample from the same individual, but that effect decreases with time (anyone interested in more detailed info, see here). But we’d also like to model whether the nutrients are having some effect. So the data might look something like this:

code to generate fake data

using DataFrames, Random, Distributions

df = DataFrame(
    subject = repeat(1:5, inner=3),
    obese = repeat(rand(Bool, 5), inner=3),
    timepoint = [1,2,3,1,3,4,1,2,5,1,4,5,1,3,5],
    bug = rand(Beta(0.9, 5), 15),
    nutrient = rand(Beta(0.9,5), 15)
)

15×5 DataFrame
│ Row │ subject │ obese │ timepoint │ bug       │ nutrient   │
│     │ Int64   │ Bool  │ Int64     │ Float64   │ Float64    │
├─────┼─────────┼───────┼───────────┼───────────┼────────────┤
│ 1   │ 1       │ 0     │ 1         │ 0.0244897 │ 0.199635   │
│ 2   │ 1       │ 0     │ 2         │ 0.217759  │ 0.0795724  │
│ 3   │ 1       │ 0     │ 3         │ 0.013862  │ 0.246906   │
│ 4   │ 2       │ 0     │ 1         │ 0.0601671 │ 0.106634   │
⋮
│ 11  │ 4       │ 1     │ 4         │ 0.0369525 │ 0.00247553 │
│ 12  │ 4       │ 1     │ 5         │ 0.125425  │ 0.224936   │
│ 13  │ 5       │ 0     │ 1         │ 0.0334627 │ 0.187304   │
│ 14  │ 5       │ 0     │ 3         │ 0.0809102 │ 0.140293   │
│ 15  │ 5       │ 0     │ 5         │ 0.0770322 │ 0.259843   │

So I’d like to model the relationship between bug and nutrient, given the other variables. I think the Linear Model terminology would be that subject is a random effect and obese and nutrient are fixed effects, but I’m not 100% sure about that.

To model just bug ~ nutrient + obese, and trying to adapt the linear regression tutorial

@model bugmodel1(y, obese, nutrient) = begin
    # Set variance prior.
    σ₂ ~ truncated(Normal(0,100), 0, Inf)
    # Set intercept prior.
    intercept ~ Normal(0, 3)
    
    coef_o ~ Normal(0,10)
    coef_n ~ Normal(0,10)
    μ = intercept .+ (coef_n * nutrient) .+ (coef_o * obese)

    for i in 1:length(y)
        y[i] ~ Normal(μ[i], σ₂)
    end
end

model = bugmodel1(df.bug, df.obese, df.nutrient)

(Incidentally, I’m not sure if it makes sense for the coefficients to follow normal distributions, I don’t totally understand what all the pieces are doing, but with this randomly generated data, the coefficients land ~0, which is what I would expect).

So now, the questions:

Am I totally off-base with what I’m trying to do?
Is it possible to include subject and timepoint with squared exponential covariance as part of the model?
(bonus, this is what I ultimately want to do, possibly makes more sense as a separate post): can I easily compare this model to one that does NOT include nutrient, to see if nutrient is providing additional information?

willtebbutt · July 2, 2020, 10:26pm

Hi Kevin, thanks for bringing this up.

Have I understood correctly that what you want to do is predict the bug column as a function of the other 4 columns using GP?
Assuming I’ve understood 1 correctly, this is a very reasonable thing to do, and should be totally natural
You could absolutely do this in a couple of ways.

If you could let me know about 1 I’ll provide a couple of ways to do this.

kevbonham · July 2, 2020, 11:43pm

I think so? Ultimately what I care about is whether a GP model with nutrient is a better fit (in the language of LMs anyway) than one without it. But presumably, if I know how to predict bug with 4 columns, I should probably be able to figure out how to do it with 3

For certain definitions of “natural” .

L_Adam · July 3, 2020, 3:09am

You want to train a linear regression model, right? Then you need a linear kernel. With that, you could follow this tutorial by switching exponentiated square kernel to linear kernel(maybe set a prior on the anchor) and I think you are good to go.

If I understand correctly, GaussianProcess.jl is only able to model on output that is a scalar for the time being,

Multivariate Gaussian processes, when the output is a vector, which encompasses multi-task GPs, cokriging, multiclass classification

For Turing.jl, I think what you are doing is right. You could of cause remove mean from output and avoid the intercept term.

Correct me if anything is wrong(I’m still learning).

kevbonham · July 3, 2020, 2:11pm

To be honest, I’m not totally sure, but I don’t think so. If I only wanted a linear regression, there’s also this tutorial from Turing - the reason to use a GP (from what I understand) is that the relationship with time is non-linear. Actually, in this case - I would expect time to have zero predictive value, except as it relates to other points. If I take a measurement at the exact same time (or very close), I expect those two measurements to be highly correlated, but as they get further apart, they will be less and less correlated.

L_Adam · July 4, 2020, 3:14am

Sorry, I just read from the model bug ~ nutrient + obese.

You are describing exponentiated square kernel(which is commonly used), or Matern kernel, or general, some stationary kernel that performs relative measurement.

For example, say you chose exponentiated square kernel,

k(t_1, t_2; \sigma, l) = \sigma^2 e^{-\frac{(t_1- t_2)^2}{l^2}}

l(could be interpreted as resolution of time), controls how fast the correlation between two timestamp dies out

Back to the question, the model you want may be bug ~ nut + obese + time. Since you believe first two terms are linear, and the time term is a relative measurement, the kernel may have the form

k(s_1, s_2) = \text{Lin}(s_1[\text{nut, obese}], s_2[\text{nut, obese}] ; \text{diag}(\lambda)) * \text{SE}(s_1[\text{time}], s_2[\text{time}]; \sigma, l)

Here we assume time does not affect nutrient. \text{diag}(\lambda) because nutrient and obese are linear/independent. Multiplication of two terms because two samples need to be closed both in time and other features.

There is no space that is non-linear, as long as you are still talking about Hilbert space. You could always make “non-linear” regression by performing linear regression on features that are properly non-linear transformed. The question is which transformation to choose. And GP is one way to choose(or construct) it by eliciting the kernel(the so called kernel trick)

Still, correct me if anything is wrong(I’m still learning).

kevbonham · July 4, 2020, 2:32pm

As far as I can tell, that looks right (you’re a one-eyed person in the land of the blind here ). Now, do you know how to instantiate such a thing in Turing?

L_Adam · July 5, 2020, 2:29am

You have to construct the kernel matrix by hand after sample from prior on hyperparameter, an example from stan user’s guide.

GaussianProcess.jl also has mcmc using HMC, see the doc

luiarthur · July 16, 2020, 1:34am

Hi @kevbonham, is this what you are looking for?

# Import Libraries
using Turing
using Turing: Variational
using Distributions
using Distances
using PyPlot
using StatsFuns
import Random
using Flux
import LinearAlgebra

# Squared-exponential covariance function
sqexp_cov_fn(D, phi, eps=1e-3) = exp.(-D^2 / phi) + LinearAlgebra.I * eps

# Exponential covariance function
exp_cov_fn(D, phi) = exp.(-D / phi)

@model function GP(y, X, m=0, s=1, cov_fn=exp_cov_fn)
    # Dimensions of predictors .
    N, P = size(X)
    
    # Distance matrix.
    D = pairwise(Distances.Euclidean(), X, dims=1)
    
    # Priors.
    mu ~ Normal(m, s)
    sig2 ~ LogNormal(0, 1)
    phi ~ LogNormal(0, 1)
    
    # Realized covariance function
    K = cov_fn(D, phi)
    
    # Sampling Distribution.
    # The latent variables have been marginalized out here,
    # so there's only one level.
    y ~ MvNormal(mu * ones(N), K + sig2 * LinearAlgebra.I(N))
end

Note that when N is large, this model will be really slow. There are several ways to speed up:

Use a low-rank approximation (e.g. predictive process). This can scale to potentially N=50000. I haven’t tried it myself in Turing, but I believe this should be doable in Turing. It’s not that bad to implement from scratch either.
GP for Big Data: [1309.6835] Gaussian Processes for Big Data

# Generate some non-linear data
N = 150
X = randn(N, 1) * 3
y = sin.(vec(X)) + randn(N) * 0.1
plt.scatter(vec(X), y)

# Fit via ADVI. You can also use HMC.
Random.seed!(0)
m = GP(y, X)
q0 = Variational.meanfield(m)  # initialize variational distribution (optional)
advi = ADVI(1, 2000)  # num_elbo_samples, max_iters
@time q = vi(m, advi, q0, optimizer=Flux.ADAM(1e-1));

# Alternatively, fit via HMC.
# Random.seed!(0)
# chain = sample(GP(y, X), HMC(0.01, 100), 200)

You need to write a bit of code to extract the posterior samples, make predictions, and make the plots; which could be difficult if you haven’t used Turing / implemented GP’s before. Here’s an image of the data I generated (dots), with posterior predictive mean at 100 new locations (red line), and the 95% credible region (red shaded region).

I have a complete example here which shows how to obtain predictions and generate the plot.

EDIT: After reading your post more carefully, I think what you really want, instead of a GP, is a carefully thought out mixed effects model, with random effects for each “subject”, and fixed effects for the obesity indicator, “timepoint”, and “nutrient” variables; the response variable being “bug”. You may need to transform “bug” if the range of values is not amenable to a Normal distribution. You’ll also have to prep the predictors (covariates). e.g. “subject” is categorical, so “1,2,3,4” should not be used directly in your design matrix X; instead you should use indicator variables to refer to each subject. ML people call this one-hot… Anyway, bottom-line, you probably don’t need a GP in this case. You may want to check out lmer4 in R for fitting mixed effects models.

kevbonham · July 16, 2020, 4:02pm

Cool - thanks for all that code/explanation! I’ll dig into it when I return to this problem in a couple of weeks - it definitely looks like it’s going in the right direction. One question if you’ll indulge me -
it looks like all of the parameters are subjected to the squared exponential here. Am I reading that right? And would it be possible to use the kernel only on the subject/time parameter, and leave the other parameters (eg diet) as linear predictors?

I should also note that there’s some additional discussion on zulip, and @theogf wrote an example of doing this using AbstractGPs. With all of the explanations here and there, I’m confident that I’ll get where I need to go. I’ve already learned a ton!

This is done a lot in my field, and there are plenty examples to chose from. Actually, the lab I’m working with has a CLI to do exactly this. The issue (as far as I understand) is that a typical mixed-effects model cannot adequately model the subject/time relationship, where a GP potentially could. For example, in my actual dataset, I have a weird sampling pattern, where there are 2 pairs of samples for each subject, where within each pair are very close together, but the pairs are far apart.

In other words, I might have timepoints at weeks 1, 2, 40, and 41. Here, the values of bug at week 1 and 2 should be tightly coupled, and also between 40 and 41, but between week 2 and 40 I’d expect very little relationship.

In any case, it may be true that a linear model is just as good, but that’s essentially what I’m trying to figure out. And if GPs are better, to develop a tool that can be used easily by researchers without needing to worry much about the internal details.

luiarthur · July 16, 2020, 5:22pm

You can, but I wouldn’t give the GP the “subject” variable, because the GP really shines when the input is continuous; the “subject” variable is categorical. You won’t really gain much, if anything, by putting a GP on categorical variables alone. So it’s probably best to leave it in a linear model component.

The timepoints variable is probably fine. But if it makes sense, if the timepoints in your data is recorded as “t=1, t=2, t=3, t=4, …” but they really refer to weeks “1, 2, 40, 41, …”, I would use “t=1, 2, 40, 41” as opposed to the timepoint labels (1,2,3,4…). This is where the GP shines. The GP will be put more uncertainty in the time intervals between recorded times where there is little/no data. This has great appeal – if you don’t have any data between time 2 and time 40, anything could have happened in between. You want to interpolate in between, but still be honest about your uncertainty between recorded timepoints. The GP does that; whereas vanilla linear models don’t.

I haven’t tried this, but this might work:

@model function GP(y, Z, X, m=0, s=1, s_beta=3, cov_fn=exp_cov_fn)
    # Dimensions of GP predictors. (In your case, I think P=1,
    # and X is the time variable.) Note that if X is a single column 
    # it needs to have dimension Nx1 (a matrix), not an array of 
    # length N, so that it can be processed properly by the distance function.
    N, P = size(X)
    
    # Dimensions of linear model predictors
    J = size(Z, 2)  # Z should be N x J
    
    # Distance matrix.
    D = pairwise(Distances.Euclidean(), X, dims=1)
    
    # Priors.
    mu ~ Normal(m, s)
    sig2 ~ LogNormal(0, 1)
    phi ~ LogNormal(0, 1)
    
    # Realized covariance function
    K = cov_fn(D, phi)
    
    # Prior for linear model coefficients
    beta ~ Normal(zeros(J), s_beta)  # you may need to choose s_beta carefully.

    # Sampling Distribution.
    y ~ MvNormal(mu * ones(N) + Z * beta, K + sig2 * LinearAlgebra.I(N))
end

Come to think of it, if we’re being Bayesian, you probably don’t even need the mixed effects models, because all parameters are assumed random. Make sure your “subject” variable is turned into dummy variables, so that the vector “subject” = (1, 1, 2, 2, 3, 3, 3, 4, 5, 5) is converted to a matrix:

S1	S2	S3	S4	S5
1	0	0	0	0
1	0	0	0	0
0	1	0	0	0
0	1	0	0	0
0	0	1	0	0
0	0	1	0	0
0	0	1	0	0
0	0	0	1	0
0	0	0	0	1

where the columns S1, S2, S3, etc. refer to subject1, subject2, subject3, etc. And similarly for any other categorical variables.

kevbonham · July 28, 2020, 12:26am

Sorry for the delayed response - had to finish up summer course, then went on vacation . Thanks for engaging!

The reason I mentioned subject here is because the temporal link is only within-subject. That is, comparing subject 1 day 1 to subject 1 day 2, I’d expect a strong link, but basically no relationship to subject 2 day 2 (or day 1 for that matter). Is that preserved if the `subject variable isn’t encoded in the GP?

yes for sure

Oh, interesting… I guess this addresses my question above. I think the next step is to just try it and see what happens, I will report back!

willtebbutt · August 2, 2020, 9:39pm

Firstly, many apologies for taking so long to reply. So it looks like the following works just fine:

Turing.jl + Stheno.jl implementation

using DataFrames
using Distributions
using Random
using Stheno
using Turing
using Zygote

# Use Zygote for AD. Be sure to use 0.4.20 rather than 0.5.
Turing.setadbackend(:zygote)

# Generate fake data.
df = DataFrame(
    subject = repeat(1:5, inner=3),
    obese = repeat(rand(Bool, 5), inner=3),
    timepoint = [1,2,3,1,3,4,1,2,5,1,4,5,1,3,5],
    bug = rand(Beta(0.9, 5), 15),
    nutrient = rand(Beta(0.9,5), 15)
)

@model my_model(y, x) = begin

    # Length scales for each input dimension. Longer means dimension is less important.
    ls ~ Product(fill(Gamma(7, 1 / 2), 2))

    # Process variance parameter. Roughly speaking, how much of the variability in the data
    # is explained by the model.
    σ² ~ Gamma(7, 1 / 2)

    # Construct a GP object with the kernel of choice. Don't worry about the GPC thing.
    f = GP(σ² * stretch(Matern52(), 1 ./ ls), GPC())

    # Observation noise variance. Roughly speaking, how much of the data is unexplained by
    # the model / the model believes to be explained by "noise". 1 - σ²_n / (σ²_n + σ²) gives you
    # something a bit like an R^2 score.
    σ²_n ~ Gamma(7, 1 / 2)

    # Observations are `f` at `x` observed under noise `x`. `f(x, σ²_n)` is essentially a
    # multivariate Normal. Add 1e-3 for numerical stability.
    y ~ f(x, σ²_n + 1e-3)
end

# Pack data into a matrix and tell Stheno to treat it as a vector-of-vectors where
# each column is an element. Ideally this would be simpler, but I've to implement `RowVecs`
# in Stheno. I've only used two of the columns of the data here, but you could just as well have used more.
x = ColVecs(collect(hcat(df.obese, df.nutrient)'))

# Construct the model and run HMC for a bit.
chain = sample(my_model(df.bug, x), NUTS(), 1_000)

Make sure that you’re using Zygote 0.4.20 rather than 0.5 – it’s got a weird bug that I need to fix (that I think I may have caused in ChainRules ).

I’ve placed reasonably informative priors over each of the parameters, which may- or may-not be appropriate for your data – they seem reasonable enough to me as defaults if you’ve not got anything crazy going on.

Modeling assumptions

Looks like there’s already been some good discussion about what you want to do in this thread, but I’ll give my 2 cents. In the above example implementation I’ve given example code for pretty much the simplest thing you would think to do with a GP, but it sounds like you somehow want to treat time differently to some of the other inputs.

Given that additive models appear to be on the table, do you believe that a model of the form

f(x, t) = f_t(t) + f_x(x)

is plausible for your data? (where x is all variables other than time). If so we could modify the above example to express that structure.

willtebbutt · August 2, 2020, 9:46pm

As regards @luiarthur’s proposal to 1-hot encode your “subject” variable – I wouldn’t have though that should be necessary as the length scale associated with that dimension can just be made really small, so that the value for each subject can be more-or-less independent from the other (if indeed the fact that they look like they’re ordinal is just a coincidence and there’s really no reason to think 1 and 2 are more closely related than 1 and 5).

kevbonham · August 17, 2020, 2:20pm

Thanks! This is super useful. I suppose no surprise it looks similar to the AbstractGPs solution posted on zulip.

willtebbutt:

Given that additive models appear to be on the table, do you believe that a model of the form
f(x, t) = f_t(t) + f_x(x)
is plausible for your data? (where x is all variables other than time). If so we could modify the above example to express that structure.

Yes, I think that’s plausible, where I think we’d likely assume f_x() would have a linear covariance function.

That is definitely the assumption for the majority of our problems. There actually are some cases where subjects are part of the same household, and then we might hypothesize some connection, but we’d encode that in a separate “household” variable. Subject IDs themselves could just as well be random Strings, they’re strictly identifiers and have no meaning.

kevbonham · August 31, 2020, 6:12pm

This line is causing some issues - model gets constructed, but then when I try to sample:

ERROR: MethodError: no method matching Normal(::Array{Float64,1}, ::Int64)
Closest candidates are:
  Normal(::Integer, ::Integer) at /home/kevin/.julia/packages/Distributions/G4nDk/src/univariate/continuous/normal.jl:43
  Normal(::T, ::T; check_args) where T<:Real at /home/kevin/.julia/packages/Distributions/G4nDk/src/univariate/continuous/normal.jl:36
  Normal(::Real, ::Real) at /home/kevin/.julia/packages/Distributions/G4nDk/src/univariate/continuous/normal.jl:42

Thinking maybe I need and array of normal distributions instead, I tried

beta ~ fill(Normal(0, s_beta), J)

but that gives

ERROR: MethodError: no method matching assume(::Random._GLOBAL_RNG, ::DynamicPPL.SampleFromPrior, ::Array{Normal{Float64},1}, ::DynamicPPL.VarName{:beta,Tuple{}}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64})

And I tried MvNormal(zeros(J), s_beta), but that gave me

ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.

Any ideas? I’m using HMC to sample, just to reduce the complexity - but based on the error messages, it doesn’t seem related to that.

luiarthur · August 31, 2020, 6:35pm

My bad.

Try: beta ~ filldist(Normal(0, s_beta), J)

luiarthur · August 31, 2020, 6:38pm

I recently posted a couple of examples for fitting GPs using Turing + AbstractGPs + KernelFunctions.

They might be relevant to what you are doing:

kevbonham · August 31, 2020, 7:15pm

This is giving me the same PosDefException - might be an issue with my input data, but I get the same error with:

willtebbutt:

# Generate fake data.
df = DataFrame(
    subject = repeat(1:5, inner=3),
    obese = repeat(rand(Bool, 5), inner=3),
    timepoint = [1,2,3,1,3,4,1,2,5,1,4,5,1,3,5],
    bug = rand(Beta(0.9, 5), 15),
    nutrient = rand(Beta(0.9,5), 15)
)

Awesome - will study these in detail. Definitely looks relevant

luiarthur · August 31, 2020, 9:08pm

Could you post the model?

Some things I would try would be:

Add small positive value to covariance matrix diagonal:
- y ~ MvNormal(mu * ones(N) + Z * beta, K + (1e-6 + sig2) * LinearAlgebra.I(N))
Use a more informed prior for beta: beta ~ filldist(Normal(0, 1), J)

Topic		Replies	Views
Running TemporalGPs/Stheno sampling Probabilistic Programming gaussian-process , stheno	25	1356	May 29, 2021
Conceptual Gaussian Process Questions Probabilistic Programming	15	1798	August 17, 2020
Multi-level varying slopes with two clusters, cross classification Probabilistic Programming statistics , turing , distributions , bayesian-inference	16	2173	August 14, 2021
Modeling gaussian process with Turing.jl Probabilistic Programming gaussian-process	6	1238	February 8, 2022
Strange issues fitting GMMs in Turing.jl by extending a fairly simple example Probabilistic Programming turing , bayesian-inference	10	1925	February 15, 2021

Gaussian Process Model with Turing

Turing.jl + Stheno.jl implementation

Modeling assumptions

Related topics