This is pretty much code-gold. I have some regression methods that someone else might like using. Is there a place to put handy snippits of Prob. prog. models?
That’s very kind, thank you
For Soss models, I’d love to flesh out GitHub - cscherrer/SossExamples.jl . PRs welcome!
Thanks! That does look useful. I’ve been working my way through this one (which has the benefit that it’s free), but the math quickly goes over my head and I’m finding it somewhat challenging to go from book examples into practical code (though I have to say, the design of these packages does make it easier to fiddle and get something working even if I only have a vague idea why)
Carl’s book is a really good introduction to GPs in my opinion.
Yeah, it’s far more comprehensible than many other things I’ve looked at. The deficit is entirely on me, and the fact that the last math class I took was during my freshman year of college… 18 years ago…
Question for the KernelFunctions/AbstractGP authors, though if anyone else has any idea feel free to chime in too. Whats the usual procedure for diagnosing/fixing PosDef Errors during estimation of hyperparameters? I’ve got a fairly large model (~6000 observations) and it’s seems somewhat random if I get the error or not, so I’ve just been repeating dynamicHMC sample calls until it doesn’t error. Obviously this is not sustainable and rather uniformed. Is the conventional wisdom just to increase the value added to the diagonal of the covariance matrix until I stop getting errors or should is this an indication that my choice of prior for hyperparameters is off? Again, awesome work on both packages!
It’s not clear to me whether a singular covariance is a problem for the model, or just for the way it’s currently being computed. Maybe PositiveFactorizations.jl could help with this?
Ah actually looks like the KernelFunctions team already answered my question. Check out https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/blob/6bca6a17284dcd12fc4eca945435697e50a524b3/src/matrix/kernelpdmat.jl
if anyone else is running into this issue. Still errors after switching from kernelmatrix to kernelpdmat in the model specification, but at least the error message is much more informative . Looks like I might need to fiddle with some priors or reformat the data to make the feature matrix more sparse.
Hey @nucklass, one of the authors here. I designed kernelpdmat exactly for this purpose but it can still not be enough for some situations. What you might want to look for is if your kernel matrix is low-rank, i.e. you have some columns which are numerically identical. This is usually due to having 2 observations extremely close to each other, This closeness can happen for example with very large lengthscales for instance (I imagine this is what is happening in your optimization). To be honest there is no really good solution for this… Sparse GPs tend to avoid this problem and given that you have a lot of observations this might be the way to go.
EDIT: We should probably write a full paragraph about the “good GP practices” somewhere in some docs
Thanks for the reply!
This is looking more and more like the direction I should go. I’m only familiar with the simplest sparse approximations, ie Subsets of Data or Subsets of Regressors. Would it be entirely foolish to try and combine sparse approximations and the use of MCMC sampling of kernel hyperparamters? I.E. try and extend the Soss + Turing programs above with some sparse approximation of the latent function. From a cursory glance it seems like quite a few of Sparse methods attempt to optimize hyperparameters and inducing points simultaneously, and I’m not quite sure what estimation methods are being used in each instance. From a performance standpoint, I run a lot of Bayesian models so waiting a few days for sampling to finish is nothing new. Running an analysis that some one could reproduce/rerun is rather important, though. Also if Augmented Gaussian Processes implements any of this already, thanks in advance.
FYI, theres a PR to Stheno right now that will let you use Sparse GPs in Turing. Not battle-tested yet, but I’ve got it to run for some simple problems. And here’s the thread that led to it, if you didn’t see it:
Oh that’s very cool! I’ll definitely give this a try. I’ve heard Stheno’s kernel interface is going to switch over to KernelFunctions.jl fairly soon, so I guess everything’s coming together .
Alright, I’m back! This is my focus until I get something working. To reiterate the goals here, because this post has meandered a bit (which is great, I’m learning a ton!), the situation is this:
- I have longitudinal patient biosamples. That is, I’ve taken measurements from multiple people (
subject
s) over time. - Many of the measurements are continuous, in this case in particular, I’m looking at the dependent variable
bug
, and one of the predictor variables isnutrient
-
bug
typically changes over time, but not in a single direction. We expect measurements for a given subject to be highly correlated for samples close in time, less so for samples more distant in time. - we expect measurements from different
subject
s to be completely independent of one another (that is, there’s no relationship betweenbug1
forsubj1
atdate1
andbug1
forsubj2
atdate1
. Akin to a random effect. - There are other categorical and boolean variables that do not tend to change for a particular subject (eg gender, obesity)
We would like to model the measurement of bug
over time as a gaussian process, where (in the language of LMs) things like obese
are fixed effects, subject
is a random effect, and the covariance of nutrient
over time
is squared exponential.
After doing some additional reading, I think what I really want is to encode at least subject
, and possibly obese
as index variables, a la https://github.com/TuringLang/Turing.jl/pull/1422. But I’m having issues figuring this out, and I think it has to do with a misunderstanding of how different things interact. So first things first, I tried building a slightly extended version of the demo
model in #1422, such that the g
variable is an index variable, and X is a matrix of typical continuous variables (also using the linreg tutorial as inspiration. This seems to work:
# Dummy data
x = vcat(randn(10), randn(10).+10, randn(10).-5)
y = vcat(randn(10), randn(10).+10, randn(10).-5) .* 2
g = vcat(ones(Int, 10), ones(Int, 10)*2, ones(Int, 10)*3)
@model function demo(x, y, g)
ng = length(unique(g))
# group coefficients priors
a ~ filldist(Normal(), ng)
# x coefficient prior
b ~ Normal()
mu = a[g] + b * x
y .~ Normal.(mu)
end
model = demo(x,y,g);
sample(model, NUTS(0.8), 1_000)
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Float64 Float64 Float64
a[1] -0.6291 0.2943 0.0132 0.0117 380.2325 1.0030
a[2] 2.6719 0.8799 0.0394 0.0438 174.4973 1.0048
a[3] -2.0860 0.4941 0.0221 0.0229 175.6049 1.0010
b 1.7736 0.0921 0.0041 0.0048 167.5996 1.0039
So now imagine I have a tp
argument that includes timepoints. How do I incorporate the above specification into a GP? The following code runs, but I’m not 100% sure that it’s actually doing what I’m intending…
tp = repeat(1:10, 3)
sekernel(alpha, rho) =
alpha^2 * KernelFunctions.transform(SEKernel(), sqrt(0.5)/rho)
@model function demo(x, y, g, tp, jitter=1e-6)
ng = length(unique(g))
# group coefficients priors
a ~ filldist(Normal(), ng)
# x coefficient prior
b ~ Normal()
# gp priors
sig2 ~ LogNormal(0, 1)
alpha ~ LogNormal(0, 0.1)
rho ~ LogNormal(0, 1)
# lm specs
mu = a[g] + b * x
# gp specs
kernel = sekernel(alpha, rho) # covariance function
K = kernelmatrix(kernel, reshape(tp, length(tp), 1), obsdim=1) # cov matrix
K += LinearAlgebra.I * (sig2 + jitter)
y .~ MvNormal(mu, K)
end
model = demo(x,y,g,tp);
sample(model, NUTS(0.8), 1_000)
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Float64 Float64 Float64
a[1] -0.4016 0.7114 0.0318 0.0164 746.6413 0.9994
a[2] 0.5394 0.9962 0.0446 0.0565 430.2212 0.9982
a[3] -0.6955 0.8198 0.0367 0.0488 503.1865 0.9993
alpha 0.9929 0.0944 0.0042 0.0012 653.7770 0.9980
b 1.9989 0.1195 0.0053 0.0074 382.8640 0.9982
rho 1.7239 2.2199 0.0993 0.0645 376.1499 1.0025
sig2 5.9109 1.6832 0.0753 0.0698 497.3395 0.9982
Does this model make sense / fit what I described above?
This also runs on my real data (no PosDefException
, yay!) where the number of unique g
is ~300), but is suuuuper slow. Any suggestions on this front would be greatly appreciated.
Bump. Also some updates.
I’ve been trying to do a bit of benchmarking and optimizing on the model above, following the performance tips. I’ve already avoided loops, and when I do @code_warntype model = demo(x,y,g,tp)
, everything is blue (though this is my first time using that macro, not sure that’s right).
Then I experimented with different backends. First thing to note - Zygote
gives me an error (something about not being able to iterate nothing
). I’m assuming this has to do with this statement:
In case of
:tracker
and:zygote
, it is necessary to avoid loops for now. This is mainly due to the reverse-mode AD backendsTracker
andZygote
which are inefficient for such cases.ReverseDiff
does better but vectorized operations will still perform better.
I don’t explicitly use a loop, is it possible that the GP is using one implicitly?
My real data is 921 rows, each subject has 1-4 timepoints, and there are 305 total subjects. When I do the first 50 rows, 100 samples for the chain and forwarddiff
:
NUTS(0.8)
takes 17.6 secHMC(0.1, 10)
takes 12.9 secHMC(0.1, 20)
takes 6.7 sec
The first 50 rows has 19 subjects. When I bump up to 100 rows (36 subjects), that last one takes 51 sec (almost 10 times as long), and using reversediff
takes about 10.5 min. At 150 rows (53 subjects), it’s 230 sec. If things keep scaling like this, I’m looking at days for the whole set. Is there a better way to do this?
I also made an attempt to do what was suggested earlier and add a separate subject GP with a really short length scale - it’s possible I did that wrong, but it did not improve things much.
Note that the relevant benchmark is probably not time/sample, but time/effective sample size.
Sorry to hijack this thread, but this is moving towards a more general question I’ve had for Bayesian modelling in general and Turing specifically for a while: how can I form a reasonable expectation about how long a model will take to estimate in advance? Is it about the number of observations, the number of parameters, or the structure of the model itself? And what is effective sample size in this context?
For someone only trained in classical econometrics, where one estimates mostly linear models that take a fraction of a second to estimate, it’s hard to guess when I could feasibly switch my analysis strategy over to Bayesian.
Generally,
- evaluation time of the posterior (with gradients, if you are using those), which does depend on the number of observations etc (unless you have summary statistics),
- how well-behaved the log posterior is (eg non-uniform curvature is difficult to deal with)
See the Stan manual for an explanation of ESS.
For example in this situation, a kernel matrix K
of size length(tp) x length(tp)
is created and passed to MvNormal
. Now MvNormal
needs to invert this matrix with complexity length(tp)^3
, which is usually the bottleneck with GPs.
So basically every sampling step has at least complexity N^3
, which, for N=300
is not negligible!
Note that I don’t really understand the distinction
Yeah, this doesn’t seem great . In this case, I really only care about the GP and linear hyperparameters, not the subject-specific stuff (a[g]
in the model). I found a presentation from a former colleague, where he wrote that we want to
- approximate the latent posterior and integrate it out.
- approximation by laplace method
- use saddle-free Newton to avoid saddle nodes
I did some googling and didn’t find any julia-specific information about this - any ideas where I should look?
@kevbonham Can you restate the actual model you are interested in?