Gaussian Process Model with Turing

Note that I don’t really understand the distinction :joy:

What you really care about is how many unique samples you get from your sampler. For example, in Metropolis-Hastings, if you rejected every proposal you would get stuck with an effective sample size of one (even though you might have done calculations for 10 million tries).

I did some googling and didn't find any Julia-specific information about this - any ideas where I should look?

I don’t think there is any Laplace method implemented at the moment, partly because Julia AD is pretty bad with Hessians right now…


This is a fun topic! Here’s a very hand-wavy explanation, just to build on @theogf’s point. :slight_smile:

It’s common for the “number of samples” n to show up in statistical computations, usually with the assumption that the samples are independent. Think of the formula for the standard error of the mean, \sigma/\sqrt{n}, for example.

In an MCMC context, we can often compute things more directly than by relying on formulas like this, which often hold only asymptotically and under some assumptions (iid, etc.).

That’s great, but when you plug everything in, the equations often don’t quite work out, because some assumptions are broken. It turns out we can quantify “how broken” they are (again, hand-wavily).

So we pretend we don’t know n, and solve for it, calling it n_{\text{eff}}, the effective number of samples. However many samples we actually took, the result behaves as if we had taken n_{\text{eff}}.

Though we often sweep it under the rug, it’s important to remember that there are different ways of computing this. So when you dig into the details, you always need to consider which asymptotic result was used, and which of its assumptions are broken.
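For concreteness, one common (asymptotic) estimate, in terms of the chain’s lag-k autocorrelations \rho_k, is

n_{\text{eff}} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho_k}

so a strongly autocorrelated chain has n_{\text{eff}} \ll n, while an iid sequence recovers n_{\text{eff}} = n.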


Sure! Given that I’m not totally certain that the code I’m writing is what I want, I’ll state it again in words. What I am most interested in is comparing two models, one that contains one variable (in this case, diet) and the other that doesn’t. The components of the model are:

  • bug - the dependent variable (continuous)
  • diet - I want to compare models with and without this variable (continuous), which should have linear relationship with bug
  • x_1 \dots x_n - some number of covariates with a linear relationship to bug
  • subject (categorical) and time - I’ve got multiple samples from each subject over time.

bug varies over time, with samples that are closer to one another in time being more similar to one another. But this relationship only holds within the same subject. That is, time=1 for subject=1 and time=1 for subject=2 have no relationship. Also, the subject number has no numerical meaning; it’s just a UID.

So here’s the model I came up with[1] in code. Here, I’m ignoring the x_n stuff, just trying to do bug, diet, subject and time:

using Turing, KernelFunctions, LinearAlgebra  # kernelpdmat comes from KernelFunctions (via PDMats)

sekernel(alpha, rho) =
    alpha^2 * KernelFunctions.transform(SEKernel(), sqrt(0.5) / rho)

@model function GPmodel1(bug, diet, subj, tp, jitter=1e-6)
    ns = length(unique(subj))
    
    # subj coefficients (random effects) priors
    a ~ filldist(Normal(), ns)
    
    # diet coefficient prior
    b ~ Normal()

    # gp priors
    sig2 ~ LogNormal(0, 1)
    alpha ~ LogNormal(0, 0.1)
    rho ~ LogNormal(0, 1)

    # lm specs
    mu = a[subj] + b * bug
    
    # gp specs
    kernel = sekernel(alpha, rho)  # covariance function
    K = kernelpdmat(kernel, reshape(tp, length(tp), 1), obsdim=1)  # cov matrix
    K += LinearAlgebra.I * (sig2 + jitter)
    
    bug ~ MvNormal(mu, K)  # multivariate observation, so no broadcasting dot
end
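A quick way to sanity-check the covariance construction outside the model (toy numbers, untested; sekernel as defined above):

tp_toy = [1.0, 2.0, 35.0, 36.0]
k = sekernel(1.0, 5.0)
K_toy = kernelpdmat(k, reshape(tp_toy, length(tp_toy), 1), obsdim=1)
# K_toy[1, 2] should be near 1 (nearby times are highly correlated),
# K_toy[1, 3] near 0 (distant times are nearly uncorrelated)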

  1. When I say “I came up with,” I really mean “that I mostly copied from examples earlier in this thread from @luiarthur.” ↩︎


I think this line:

mu = a[subj] + b * bug

should be

mu = a[subj] + b * diet


Also, I think it might help to think about the model more abstractly. Something like

y_{s,j} = \beta_0 + d_{s,j} \cdot \beta_d + f_s(t_{s,j}) + \epsilon_{s,j}, for j=1,\dots,J_s, and s=1,\dots,S,

where y_{s,j} represents the dependent variable (bug) for the j-th measurement for subject s, d_{s,j} is the diet for subject s at measurement j, \beta_d is the diet effect, \beta_0 is the intercept, t_{s,j} is the time at which measurement j was taken for subject s, f_s(\cdot) is a (possibly non-linear) function of time for subject s, and \epsilon_{s,j} is the residual. (f could be a number of things, e.g. a linear function, a polynomial, some other non-linear function, etc.)

I would think that the relationship between the dependent variable and time might be different for each subject. So, I included a subscript on f for each subject. Though, this was not accounted for in the Turing model.

@kevbonham Does this look like the model you intend to fit? If not, what is the high level description of your intended model? For example, do you think f should be indexed by subject? Or is one diet prescribed for all time for each subject and, thus, d does not need to be indexed by j?

Can you show a couple of subjects’ trajectories? Perhaps one can see what the actual correlation structure in time is.


Totally right - I used to have this as x and y but changed them for this post and inserted the wrong thing there. It was correct when I was running it.

Yes, or at least that’s definitely close - thanks for writing it out! In terms of model complexity, I was planning to assume that diet effects are the same for all subjects (though that’s certainly wrong biologically). I’m not 100% sure what you mean by “do you think f should be indexed by subject?” - in the GP context, we certainly expect to use the same covariance function (SE). As with diet, I was thinking that we’d just be modeling the same hyperparameters, though this is certainly wrong biologically.

If it’s possible to fit a model where everything is subject-specific in an efficient way, that’s more plausible biologically, but I think it’s safe to assume that the diet effect and the length-scale of the GP are approximately the same if that speeds things up substantially.

And to restate, the thing we’re primarily interested in is whether including the diet term improves the model fit. Beyond that, it would be nice to get an estimate of the magnitude of the diet effect, though that’s not essential. The only other parameter that might be of interest is the GP length-scale, but that’s even less necessary.

Yeah, for sure, this is something I need to do anyway. Were you thinking a 3D axis with time, diet, and bug, and separate lines for different subjects?

Or just 2D. The important thing is the response over time for different subjects, perhaps with the explanatory variables plotted alongside. That’s where my visual brain kicks in.


I started a package, but it’s definitely not ready yet. It sounds like higher-order autodiff is being addressed by Keno’s Diffractor.jl, so I’m hopeful that that piece will improve.

This looks like you’re creating a single covariance matrix for all patients at all times? For P patients each with T observations, this scales like O((PT)^3), which is not great. If you instead form one covariance matrix per patient, this scales like O(P T^3). There is probably also a package that will do this as a block-diagonal matrix.
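For illustration, a per-patient version might look something like this (a rough, untested sketch that reuses sekernel and the priors from earlier and ignores the diet/intercept terms for brevity; kernelmatrix is the plain dense-matrix counterpart of kernelpdmat):

@model function GPmodel_per_subject(bug, subj, tp, jitter=1e-6)
    # hyperparameters shared across subjects
    sig2 ~ LogNormal(0, 1)
    alpha ~ LogNormal(0, 0.1)
    rho ~ LogNormal(0, 1)
    kernel = sekernel(alpha, rho)
    for s in unique(subj)
        idx = findall(==(s), subj)  # this subject's observations
        # one T×T covariance per subject instead of one (PT)×(PT) matrix
        K = kernelmatrix(kernel, tp[idx]) + LinearAlgebra.I * (sig2 + jitter)
        bug[idx] ~ MvNormal(zeros(length(idx)), K)
    end
end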


@kevbonham Let me clarify my questions.

Is it the case that for each subject s, the diet is the same over time? Or is it the case that the diet changes over time for a subject? (If the diet is the same over time for a given subject, then d does not need to be indexed by j. In terms of modeling, it really doesn’t matter, but it might help in thinking about the problem.) It also seems that you want to model one global diet effect, so \beta_d doesn’t need to be indexed by subject.

f is a function. If it is indexed by subject, then the relationship between the dependent variable and time is possibly different for each subject. In an extreme case, it is possible that for one subject, “bug” increases with time; and for another subject, “bug” decreases with time. Do you believe this to be the case? If so, each subject should get their own f (in other words, f should be indexed by s).

Also, what is the interpretation of “bug” and “subject”?

Ahh, I see. Yes, we get a different diet observation at each time point.

It can definitely be the case that bug increases over time for one subject and decreases over time for another. But in terms of the effect of time on bug, we expect it to be random with a mean of zero for all subjects. I’m not totally sure if this is a complete answer to your question.

“subject” is an individual person in the study. “bug” is a measurement of the relative abundance of some species of bacteria in the gut (well, really the poop) of a subject.

Thanks for the clarification.

It can definitely be the case that bug increases over time for one subject and decreases over time for another. But in terms of the effect of time on bug, we expect it to be random with a mean of zero for all subjects. I’m not totally sure if this is a complete answer to your question.

This is great. To be more specific, do you expect the effect of time on bug to be random with mean zero across all subjects, or within each subject?

Also, what is the range for bug? (e.g. is bug some number between 0 and 1? Is it a positive number? This may affect your choice of likelihood.)

I think I’ve said this before, but I think it might help to see what a simple mixed effects model gives you.

Something like this:

y_{s,j} = \beta_0 + d_{s,j} \cdot \beta_d + \gamma_s + \epsilon_{s,j}

where \gamma_s is a random intercept for each subject, and \epsilon_{s,j} is zero-mean Gaussian error. If you want to go Bayesian, then in Turing, you would do something like this (you should consider better priors):

# NOTE: This code has not been tested.
"""
y: dependent variable for each subject and time point (length: N)
diet: diet of subject and time point (length: N)
subject_indicator: a vector containing the subject id (a number between 1 and S, where S is the number of subjects). Length of `subject_indicator` should be N. 
"""
@model function bugmodel(y, diet, subject_indicator)
  num_subjects = length(unique(subject_indicator))  # number of subjects
  
  intercept ~ Normal(0, 1)
  diet_effect ~ Normal(0, 1)
  subject_effects ~ filldist(Normal(0, 1), num_subjects)
  noise_sd ~ LogNormal(0, 1)  # eps_{s,j} ~ Normal(0, noise_sd)  

  m = intercept .+ diet * diet_effect + subject_effects[subject_indicator]
  y .~ Normal.(m, noise_sd)  # you may need a different likelihood depending on the range of "bug".
end
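For what it’s worth, fitting and inspecting this would look something like the following (the sampler choice and draw count are arbitrary):

chain = sample(bugmodel(y, diet, subject_indicator), NUTS(), 1_000)
describe(chain)  # summary statistics and quantiles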

Nice! Looking forward to it (and if you need a guinea pig…)

Ah, excellent point. So is changing that as simple as doing

 K = kernelpdmat(kernel, reshape(tp[subj], length(tp[subj]), 1), obsdim=1)  # cov matrix

?

I’m not super familiar with kernelpdmat, but I think this will still give you a single joint Gaussian process over all patients’ time points, rather than P separate GPs. I think the most straightforward solution would be to loop over subj and construct a separate K for each.

kernelpdmat only guarantees that the output is a PDMat matrix (which you can pass directly to an MvNormal); it automatically adds noise on the diagonal until positive definiteness is achieved. I think what @jkbest2 is saying is that you should have a different GP for each patient (but using the same hyperparameters). The scaling would be linear in the number of patients instead of cubic, but you would lose the direct correlation between patients.


2 posts incoming: this one responds to a bunch of previous stuff, and the next one is a new GP specification that I think is closer to what I want. I managed to have a meeting with the guy who started this whole thing, and he cleared up some misconceptions I had.

At a higher level, I feel like maybe I should wind this thread down and kick off other threads that are a bit less meandering. I hugely appreciate everyone who’s put in the effort to knock down some of my ignorance - this has been enormously helpful!

As requested, some plots:

[three plots omitted: bug and the explanatory variables over time for each subject]

You may notice that a weird result of sample collection is that we have 4 timepoints per subject, but in pairs that are quite close together.

I think both? If it’s true within each subject, it would be true across all subjects as well, right? I suppose we would expect to observe bug appearing to increase or decrease over time for some subjects, but that this wouldn’t be because of time per se.

When we measure, it’s between 0 and 1, but in this case it’s been normalized using an inverse normal transformation (which I’ve been assured is appropriate here). So in principle the values I’m feeding in should be zero-centered, although you can see above that they’re not, actually.

Thanks for the example - that seems to work, though the output seems off:

Summary Statistics
           parameters      mean       std   naive_se      mcse       ess      rhat  
               Symbol   Float64   Float64    Float64   Missing   Float64   Float64  
                                                                                    
          diet_effect   -0.1687    0.0000     0.0000   missing       NaN       NaN  
            intercept   -1.8538    0.0000     0.0000   missing    2.0408    0.9899  
             noise_sd    1.1454    0.0000     0.0000   missing       NaN       NaN  
   subject_effects[1]   -0.6571    0.0000     0.0000   missing       NaN       NaN  
   subject_effects[2]   -1.1052    0.0000     0.0000   missing    2.0408    0.9899  
   subject_effects[3]    0.3499    0.0000     0.0000   missing       NaN       NaN  
# ... etc

And all the quantiles for all parameters are the same

Quantiles
           parameters      2.5%     25.0%     50.0%     75.0%     97.5%  
               Symbol   Float64   Float64   Float64   Float64   Float64  
                                                                         
          diet_effect   -0.1687   -0.1687   -0.1687   -0.1687   -0.1687  
            intercept   -1.8538   -1.8538   -1.8538   -1.8538   -1.8538  
             noise_sd    1.1454    1.1454    1.1454    1.1454    1.1454  
   subject_effects[1]   -0.6571   -0.6571   -0.6571   -0.6571   -0.6571  
   subject_effects[2]   -1.1052   -1.1052   -1.1052   -1.1052   -1.1052 
# ... etc

I tried to plot these chains and I got a bunch of warnings and no actual plots (though I’m able to run tutorials with plots of chains just fine), e.g.

GKS: Rectangle definition is invalid in routine SET_VIEWPORT
┌ Warning: No strict ticks found
└ @ PlotUtils ~/.julia/packages/PlotUtils/3Ttrk/src/ticks.jl:283
GKS: Rectangle definition is invalid in routine SET_VIEWPORT
GKS: Possible loss of precision in routine SET_WINDOW
GKS: Rectangle definition is invalid in routine SET_VIEWPORT
GKS: Rectangle definition is invalid in routine SET_VIEWPORT
GKS: Rectangle definition is invalid in routine SET_VIEWPORT
GKS: Rectangle definition is invalid in routine SET_VIEWPORT
GKS: Rectangle definition is invalid in routine SET_VIEWPORT
┌ Warning: No strict ticks found
└ @ PlotUtils ~/.julia/packages/PlotUtils/3Ttrk/src/ticks.jl:283
GKS: Rectangle definition is invalid in routine SET_VIEWPORT

Anyway, debugging this code is not necessarily top priority. It’s nice to see how that’s done, and it was super fast.

So here’s a new GP model specification. As I said, I had a meeting with someone who hand-coded the GPs that I’m trying to reproduce in Turing. He cleared up some misconceptions I had, including things that I’m pretty sure others in this thread have explained, but that I didn’t quite get.

Here, I put everything into the GP covariance matrix. Starting with the SE kernel for time, I multiply by a matrix that removes all time-based correlation between different subjects. Then I add a linear covariance term based on the diet data.
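In equation form (my reading of the code below), the intended covariance between observations i and j is roughly

K_{ij} = \alpha^2 \, k_{\text{SE}}(t_i, t_j; \rho) \cdot \mathbb{1}[s_i = s_j] + b \, d_i d_j + (\sigma^2 + \text{jitter}) \, \delta_{ij}

where s_i, t_i, and d_i are the subject, time point, and diet value for observation i.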

At least, that’s the intention:

function subjectcorrmat(subjects, ::Type{T}=Float64) where T
    n = length(subjects)
    scov = zeros(T, n, n)
    # if same person, 1.0; otherwise, 0.0
    for i in 1:n
        for j in 1:n
            subjects[i] == subjects[j] && (scov[i, j] = one(T))
        end
    end
    return scov
end
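For example, this gives a block-structured 0/1 mask (worked out by hand from the function above):

julia> subjectcorrmat([1, 1, 2])
3×3 Matrix{Float64}:
 1.0  1.0  0.0
 1.0  1.0  0.0
 0.0  0.0  1.0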

@model function GPmodel1(bug, diet, subj, tp, jitter=1e-6, T=Float64)
    nobs = length(bug)
    # assume zero mean
    mu = zeros(T, nobs)

    # subj-specific cov
    scov = subjectcorrmat(subj, T)

    # diet covariance
    dietcov = diet * diet'
    
    # variance contributed by diet slope
    b ~ Normal()
    dietcov *= b
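    # note: b ~ Normal() can be negative, and a negative multiple of diet*diet'
    # can make K non-positive-definite (possibly related to the PosDefException below)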

    # gp priors
    sig2 ~ LogNormal(0, 1)
    alpha ~ LogNormal(0, 0.1)
    rho ~ LogNormal(0, 1)
    
    # gp specs
    kernel = sekernel(alpha, rho)  # covariance function
    K = kernelpdmat(kernel, reshape(tp, length(tp), 1), obsdim=1)  # cov matrix for time
    
    # remove cross-person covariance
    K = K .* scov
    # add outer product of diet * "slope" (cov caused by slope)
    K = K .+ dietcov
    
    # jitter for numerical stability
    K += LinearAlgebra.I * (sig2 + jitter)
    
    bug ~ MvNormal(mu, K)
end

Unfortunately, I’m back to getting a PosDefException :woman_facepalming:, so I can’t benchmark, but I’m guessing the vastly reduced number of parameters should help some, and I think this is closer to what I’m going for anyway.

These are the explanatory variables over time, not the response?

bug is the response.

Thank you, now I get it! What are the funny large last-minute changes in bug before the terminal time?

That’s just the nature of this stuff - sometimes there can be wild swings. But I think it mostly looks weird because of the long interval between time 2 and time 3. The sampling is like:

1--2-----------------------------------3---4