Gaussian Process Model with Turing

Hi all. I’m a little worried that this has already been said in the thread and I just missed it in my scrolling, and I see that some of the PosDef exceptions have been from repeated observations leading to true rank deficiencies. So this might not be a helpful comment. Sorry in advance if not.

But: for completeness of troubleshooting, it’s worth pointing out that the squared exponential kernel is analytic on its entire domain, and so without any kind of diagonal perturbation the covariance matrices it produces are numerically very ill-conditioned, effectively rank deficient in floating point. As an example:

using LinearAlgebra

const pts = [randn(2) for _ in 1:1024]
const K   = Symmetric([exp(-norm(x - y)^2) for x in pts, y in pts])
@show rank(K) # numerically far below 1024
cholesky!(K)  # throws PosDefException
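
By contrast, rebuilding the same matrix with a small diagonal perturbation is usually enough to make the factorization go through (the 1e-8 below is just an illustrative value, not a recommendation):

K_jittered = Symmetric([exp(-norm(x - y)^2) for x in pts, y in pts] + 1e-8*I)
cholesky(K_jittered) # typically succeeds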

Not sure that that’s what’s going on here, but it’s probably worth keeping in mind. For what it’s worth beyond that, at least in my field of environmental statistics, the squared exponential kernel (or any kernel that is analytic at the origin) yields sample paths with some pretty odd properties that aren’t particularly desirable. For example, observing such a process on an interval [-T, 0] with T > 0 enables perfect prediction of Z(t) for any t (Stein 1999, top of pg. 121 or ex. 2.16). I won’t claim that there are no applications where that property is reasonable or desirable, but it certainly isn’t for anything I’ve ever done or probably will ever do.

In whatever code you use to generate covariance matrices, it might help to have an assert for no repeated measurements, which in pseudocode for my notation here might be

if iszero(diagonal_perturbation_parameter)
    @assert allunique(pts) "Your covariance matrix is rank deficient"
end

or something. If that passes and you’re still having issues, you might consider switching to a kernel that is not analytic at the origin. A Matern class kernel with a convenient half-integer smoothness might be a reasonable drop-in, at the very least in debugging stages where you want to eliminate the possibility of very technical numerical concerns.
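
For concreteness, here is what a Matern 3/2 correlation might look like as a drop-in for the squared exponential, reusing pts and LinearAlgebra from the snippet above (a minimal sketch; the range parameter value of 1.0 is arbitrary):

# Matern correlation with smoothness 3/2: (1 + sqrt(3)*r/rho) * exp(-sqrt(3)*r/rho).
matern32(r, rho) = (1 + sqrt(3)*r/rho) * exp(-sqrt(3)*r/rho)

# Same construction as before, swapping in the Matern kernel:
K_matern = Symmetric([matern32(norm(x - y), 1.0) for x in pts, y in pts])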

To be frank, I don’t have any idea what this means :frowning:. I have the "# for numerical stability" comment, but I copied that from someone else’s code and don’t totally understand what it’s doing. I’m a biologist by training, and my linear algebra knowledge is limited to this forum and 3blue1brown videos :grimacing:.

But isn’t that line in the model (K += LinearAlgebra.I * (sig2 + jitter)) a diagonal perturbation? Are you saying that may not be sufficient?

That said,

This is quite helpful, as is the suggestion to try the Matern kernel.

I have found that if I run the model and comment out the diet and subject modifications to the kernel, so that I only have the SE kernel on time, it seems to run without the PosDefException. If I do the subject thing (to remove inter-subject covariance in time) or add the diet covariance, or both, it happens again.

Hey @kevbonham, sorry for too much jargon, and for not reading your code closely enough. The “jitter for numerical stability” definitely would take care of the issue I’m describing. As a personal soapbox, though, I think changing your model to accommodate the poor numerics of a kernel that is also not itself a good choice (caveat: at least in my applications) is not ideal. I can expand a bit on the first quoted block if you’d like, but the one-line form of the suggestion is to use a kernel function that isn’t infinitely differentiable at zero.

Taking a closer look at your code, and with regard to the second line: I have never used any of the Bayesian GP methods in Julia or anywhere else, so I’m probably not the most capable person to help with what’s going wrong. But if my interpretation of subjectcorrmat is correct, in that it could also be written as subjectcorrmat(subjects, t) = convert.(t, I(length(subjects))), then you’re making K an entirely diagonal matrix, which means that no information about, for example, the range parameter is making its way into the likelihood. From the specification of the parameters, it doesn’t seem like it would be possible to feed invalid parameters into the kernel function. But if you were doing, say, maximum likelihood with a gradient-free optimizer, the fact that the range parameter doesn’t affect the likelihood at all would definitely make me worry that the optimizer would do wacky things and find some way to break. Maybe something similar is happening here. In any case, I think it would make sense to follow up with somebody about why you’re specifying a covariance function and then zeroing out everything but the diagonal of the covariance matrix. Can you also comment on why you’re adding dietcov?

I don’t like answering questions that weren’t actually asked, but if you’re interested in considering maximum likelihood methods and willing to use GPL code, I think your formulation of shoving everything into the covariance function (my favorite approach) would make it pretty easy to try that as well. Let me know if that is at all interesting to you.

No worries! This project has been a constant opportunity to confront my ignorance. It’s good for me :grinning:

Well, for what it’s worth, I put in a Matern32Kernel in place of the SE, and it worked! Ha! :tada::tada:

I don’t think so. The subjects vector is like [1,1,1,2,2,3,3,3,3...]. So the function makes a matrix like

[1 1 1 0 0 0 0 0 0 ...
 1 1 1 0 0 0 0 0 0 ...
 1 1 1 0 0 0 0 0 0 ...
 0 0 0 1 1 0 0 0 0 ...
 0 0 0 1 1 0 0 0 0 ...
 ...]

The idea is to keep the covariance modeled by the kernel within a given subject the same, but remove any time covariance between subjects, since time 1 for subject 1 and time 1 for subject 2 have no relationship. That’s the idea anyway, it’s entirely possible I did that wrong.
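
In Julia terms, the idea is roughly this (a sketch, not my exact code):

# 1 where two observations share a subject, 0 otherwise, giving the
# block-diagonal pattern shown above.
subjectcorrmat(subjects) = [s1 == s2 ? 1.0 : 0.0 for s1 in subjects, s2 in subjects]

subjectcorrmat([1, 1, 1, 2, 2]) # reproduces the 5x5 corner of the matrix above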

I’m adding the diet covariance because that’s the thing I’m actually trying to model. Or, to be more accurate, I’m trying to determine whether including diet improves the model (the plan is to compare with a model that doesn’t include diet). My original thought was to have a separate linear kernel for that and add them together at the end, but the colleague I consulted suggested I do it this way (at least, that’s how I interpreted what he said).

Definitely interested in the possibility of including everything in the same covariance function. Given the success of your previous idea, I’m all ears :stuck_out_tongue_closed_eyes:.

Glad Matern32 works! I think that function—which corresponds to sample paths that have one (mean-square) derivative—is a nice default in exploratory stages like these. Not too rough, but also pretty okay for numerics.

Thanks for the clarification on subjects, and sorry to have missed that. I bet it is somewhere above and I just scrolled poorly. With regard to diet, I see your point. If you really tried to write everything in terms of a covariance function, I gather it might look like this for two observations in the same subject group:

K( (t, d), (t', d')) = \alpha \mathcal{M}_{3/2}\left(\frac{|t-t'|}{\rho}\right) + d d' + (\sigma^2 + \epsilon) \delta_{(t,d) = (t',d')}

where \mathcal{M}_{3/2} is the Matern correlation function with smoothness 3/2 and \epsilon is your numerical jitter on the diagonal. Is that correct? This suggests that the variability grows as diet increases. If that’s what you want, you might consider at least adding another parameter, say \theta, to make it \theta d d', unless there’s again some physical reason to think that \theta really has to be one. If you’re still not happy with your output, that might be the easiest thing to try and check again.
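
As a minimal Julia sketch of that first formulation, with the extra theta parameter included (all names here are mine, and the jitter value is just illustrative):

# Sketch of the first formulation: Matern 3/2 correlation in time, a scaled
# linear term in diet, and the nugget + jitter on exact coincidences of (t, d).
function diet_kernel_1((t, d), (t2, d2), params; jitter = 1e-8)
    alpha, rho, theta, sig2 = params
    r = abs(t - t2) / rho
    alpha * (1 + sqrt(3)*r) * exp(-sqrt(3)*r) +   # Matern 3/2 in time
        theta * d * d2 +                          # scaled linear diet term
        ((t, d) == (t2, d2) ? sig2 + jitter : 0.0)
end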

If we pretended like t and d were both spatial coordinates and wanted to use the most obvious model, another formulation you might try is

K( (t, d), (t', d')) = \alpha \mathcal{M}_{3/2}\left( \sqrt{\frac{(t-t')^2}{\rho} + \frac{(d-d')^2}{\eta}} \right) + (\sigma^2 + \epsilon) \delta_{(t,d) = (t',d')} .

Of course, neither of your coordinates is actually spatial, and this changes the interpretation a lot: now the covariance between measurements decays with the difference in diet, and the variance is the same whether both measurements have diet 1 or 5 or whatever, which maybe isn’t what you want.
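
A similar sketch of that second formulation, where time and diet are folded into one distance (again, names and the jitter value are just illustrative):

# Sketch of the second formulation: one Matern 3/2 correlation in a combined
# time/diet distance, plus the nugget + jitter on exact coincidences.
function diet_kernel_2((t, d), (t2, d2), params; jitter = 1e-8)
    alpha, rho, eta, sig2 = params
    r = sqrt((t - t2)^2 / rho + (d - d2)^2 / eta)
    alpha * (1 + sqrt(3)*r) * exp(-sqrt(3)*r) +
        ((t, d) == (t2, d2) ? sig2 + jitter : 0.0)
end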

In any case, if you organize your values and measurements like this, then a few lines of code can give you an objective function to minimize (please treat this as pseudo-code, as I haven’t checked it for bugs):

using LinearAlgebra, ForwardDiff

# Load in and reasonably shape your data and coordinates, which this code assumes
# are organized so that data[j] is measured at coordinates[j]. In your case, if I
# understand correctly, coordinates[j] = (t_j, diet_j, subject_j), for example.
const DATA = ...
const COORDINATES = ...

function kernel(x, y, params)
    # Different subjects get zero covariance; zero(eltype(params)) keeps the
    # return type AD-friendly instead of hard-coding a Float64.
    x[3] == y[3] || return zero(eltype(params))
    ... # whatever covariance function you end up choosing
end

# Negative log-likelihood. Totally AD-able.
function nll(params, data, coordinates)
    K = cholesky!(Symmetric([kernel(x, y, params) for x in coordinates, y in coordinates]))
    0.5 * (logdet(K) + dot(data, K \ data))
end

nll_grad(p) = ForwardDiff.gradient(z -> nll(z, DATA, COORDINATES), p)
nll_hess(p) = ForwardDiff.hessian(z -> nll(z, DATA, COORDINATES), p)

So you have a likelihood and two derivatives, which is plenty for some good optimization. I use and worship Ipopt, called directly via Ipopt.jl, and I can share that setup as well if you’d like. But this is getting long, so maybe I’ll stop here for now. Even if you aren’t interested in doing maximum likelihood estimation and want to stay Bayesian, something like this can be a good sanity check that the model parameterization is sensible: you can look at gradients and Hessians of the likelihood and confirm that all of your parameters actually inform it, and stuff like that.

Finally, as a little aside: I have no idea how easy this is to do in Turing, but if I knew my covariance matrix had that kind of block-diagonal structure, I’d personally write my likelihood as a sum of a bunch of smaller per-subject likelihoods. Speed probably isn’t an issue here, though, so this isn’t crucial.
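
If it’s useful, here is a sketch of that idea, reusing the nll and the coordinate layout assumed above (subject_of is a hypothetical accessor; adjust it to however you actually store the subject id):

# Sum of per-subject negative log-likelihoods. When the kernel is zero across
# subjects, this equals the full nll but factorizes over much smaller blocks.
function nll_blockwise(params, data, coordinates; subject_of = c -> c[3])
    out = zero(eltype(params))
    for s in unique(subject_of.(coordinates))
        idx = findall(c -> subject_of(c) == s, coordinates)
        out += nll(params, data[idx], coordinates[idx])
    end
    out
end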