Is it possible to specifiy a Gaussian process (GP) regression model directly in Turing? I’ve looked at Stheno.jl and GaussianProcesses.jl, but neither have an obvious way to specify a multivariate model (GaussianProcesses.jl explicitly says it can’t, and the docs for Stheno are not yet available). I’m also new to GPs and probabilistic programming in general, so it’s possible I’ve missed something obvious, but I’ve looked through all of the tutorials for Turing, and haven’t seen anything that clearly fits what I want. Incidentally, if this doesn’t already exist somewhere and I figure this out, I’d be happy to write it up as a tutorial.

If such a tutorial already exists, feel free to stop reading and point me there.

I figure giving an example is the easiest way to get started / make it clear where the gaps in my knowledge are. I study epidemiology associated with the human gut microbiome. Imagine I have a cohort of human subjects that have each provided samples at multiple time points (separated by variable numbers of days, all starting on an arbitrary day 1). At each of those time points, I can measure the amount of a particular bacterium, (call it `bug`

), and I have diet information from which I can get information about different nutrients (`nutrient`

). I also have some other information that might be relevant, like whether the person is `obese`

or not (binary variable).

We know that the most relevant information for predicting `bug`

is a sample from the same individual, but that effect decreases with time (anyone interested in more detailed info, see here). But we’d also like to model whether the nutrients are having some effect. So the data might look something like this:

## code to generate fake data

```
using DataFrames, Random, Distributions
df = DataFrame(
subject = repeat(1:5, inner=3),
obese = repeat(rand(Bool, 5), inner=3),
timepoint = [1,2,3,1,3,4,1,2,5,1,4,5,1,3,5],
bug = rand(Beta(0.9, 5), 15),
nutrient = rand(Beta(0.9,5), 15)
)
```

```
15×5 DataFrame
│ Row │ subject │ obese │ timepoint │ bug │ nutrient │
│ │ Int64 │ Bool │ Int64 │ Float64 │ Float64 │
├─────┼─────────┼───────┼───────────┼───────────┼────────────┤
│ 1 │ 1 │ 0 │ 1 │ 0.0244897 │ 0.199635 │
│ 2 │ 1 │ 0 │ 2 │ 0.217759 │ 0.0795724 │
│ 3 │ 1 │ 0 │ 3 │ 0.013862 │ 0.246906 │
│ 4 │ 2 │ 0 │ 1 │ 0.0601671 │ 0.106634 │
⋮
│ 11 │ 4 │ 1 │ 4 │ 0.0369525 │ 0.00247553 │
│ 12 │ 4 │ 1 │ 5 │ 0.125425 │ 0.224936 │
│ 13 │ 5 │ 0 │ 1 │ 0.0334627 │ 0.187304 │
│ 14 │ 5 │ 0 │ 3 │ 0.0809102 │ 0.140293 │
│ 15 │ 5 │ 0 │ 5 │ 0.0770322 │ 0.259843 │
```

So I’d like to model the relationship between `bug`

and `nutrient`

, given the other variables. I think the Linear Model terminology would be that `subject`

is a random effect and `obese`

and `nutrient`

are fixed effects, but I’m not 100% sure about that.

To model just `bug ~ nutrient + obese`

, and trying to adapt the linear regression tutorial

```
@model bugmodel1(y, obese, nutrient) = begin
# Set variance prior.
σ₂ ~ truncated(Normal(0,100), 0, Inf)
# Set intercept prior.
intercept ~ Normal(0, 3)
coef_o ~ Normal(0,10)
coef_n ~ Normal(0,10)
μ = intercept .+ (coef_n * nutrient) .+ (coef_o * obese)
for i in 1:length(y)
y[i] ~ Normal(μ[i], σ₂)
end
end
model = bugmodel1(df.bug, df.obese, df.nutrient)
```

(Incidentally, I’m not sure if it makes sense for the coefficients to follow normal distributions, I don’t totally understand what all the pieces are doing, but with this randomly generated data, the coefficients land ~0, which is what I would expect).

So now, the questions:

- Am I totally off-base with what I’m trying to do?
- Is it possible to include
`subject`

and`timepoint`

with squared exponential covariance as part of the model? - (bonus, this is what I ultimately want to do, possibly makes more sense as a separate post): can I easily compare this model to one that does NOT include
`nutrient`

, to see if`nutrient`

is providing additional information?