# Gaussian Processes with missing data

Hi everyone,

My data (and data generating process) is probably quite peculiar, but nevertheless me (and my supervisor) are convinced it should be possible to model using probabilistic programming, and we believe Gaussian Processes make the most sense.
So in my data about dogs, we have ‘tests’ at certain time points, with some test scores at those time points.
At time point 1, we additionally have a behavioral score that sort of indicates the occurrence of certain problem behavior for a dog (based on a validated questionnaire).
The idea is to model this behavioral score over the timepoints, where it is only really observed at time point 1. Additionally, the test scores essentially depend on the behavioral score, because the behavioral score indicates to some extent how we expect the dog to behave (conditional on certain circumstances).
So we have some theory about how we expect the behavioral score to vary over time, which is also how we sort of heuristically pick our kernel parameters for the gaussian process (instead of learning them from the data like you normally would).

Now, the issue is, in order to model the behavioral score over time, we need to pass a mix of Float values and missing values to the GP, with the expectation that the missing values will be sampled from the kernel accordingly, but that is where it throws an error. I also tried the approach where I use an MvNormal instead of AbstractGP’s, but it throws a similar error. So I am not sure how I could do my modeling instead in order to achieve my goal, or whether this is a bug somewhere related to missing values.

Error for AbstractGP's implementation (when running the code below)
ERROR: LoadError: MethodError: no method matching loglikelihood(::AbstractGPs.FiniteGP{GP{AbstractGPs.ConstMean{Float64},ScaledKernel{TransformedKernel{SqExponentialKernel,ScaleTransform{Float64}},Float64}},Array{Int64,1},LinearAlgebra.Diagonal{Float64,FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}}}, ::Array{Union{Missing, Float64},1})
Closest candidates are:
loglikelihood(::AbstractGPs.FiniteGP, ::AbstractArray{var"#s16",2} where var"#s16"<:Real) at /home/michiel/.julia/packages/AbstractGPs/TRhxZ/src/abstract_gp/finite_gp.jl:222
loglikelihood(::Distribution{Univariate,S} where S<:ValueSupport, ::AbstractArray) at /home/michiel/.julia/packages/Distributions/hAr7L/src/univariates.jl:522
loglikelihood(::Distribution{Multivariate,S} where S<:ValueSupport, ::AbstractArray{var"#s123",1} where var"#s123"<:Real) at /home/michiel/.julia/packages/Distributions/hAr7L/src/multivariates.jl:264
...
Stacktrace:
[1] observe(::DynamicPPL.SampleFromUniform, ::AbstractGPs.FiniteGP{GP{AbstractGPs.ConstMean{Float64},ScaledKernel{TransformedKernel{SqExponentialKernel,ScaleTransform{Float64}},Float64}},Array{Int64,1},LinearAlgebra.Diagonal{Float64,FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}}}, ::Array{Union{Missing, Float64},1}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/context_implementations.jl:152
[2] _tilde(::DynamicPPL.SampleFromUniform, ::AbstractGPs.FiniteGP{GP{AbstractGPs.ConstMean{Float64},ScaledKernel{TransformedKernel{SqExponentialKernel,ScaleTransform{Float64}},Float64}},Array{Int64,1},LinearAlgebra.Diagonal{Float64,FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}}}, ::Array{Union{Missing, Float64},1}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/context_implementations.jl:109
[3] tilde(::DynamicPPL.DefaultContext, ::DynamicPPL.SampleFromUniform, ::AbstractGPs.FiniteGP{GP{AbstractGPs.ConstMean{Float64},ScaledKernel{TransformedKernel{SqExponentialKernel,ScaleTransform{Float64}},Float64}},Array{Int64,1},LinearAlgebra.Diagonal{Float64,FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}}}, ::Array{Union{Missing, Float64},1}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/context_implementations.jl:67
[4] tilde_observe(::DynamicPPL.DefaultContext, ::DynamicPPL.SampleFromUniform, ::AbstractGPs.FiniteGP{GP{AbstractGPs.ConstMean{Float64},ScaledKernel{TransformedKernel{SqExponentialKernel,ScaleTransform{Float64}},Float64}},Array{Int64,1},LinearAlgebra.Diagonal{Float64,FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}}}, ::Array{Union{Missing, Float64},1}, ::DynamicPPL.VarName{:behavioral_score,Tuple{}}, ::Tuple{}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/context_implementations.jl:89
[5] #35 at /mnt/c/Users/michiel.verburg/Documents/repositories/master_project_gitlab/scripts/cbarq/small_reproducible_model_discourse_missing_values.jl:30 [inlined]
[6] (::var"#35#36")(::Random._GLOBAL_RNG, ::DynamicPPL.Model{var"#35#36",(:behavioral_score, :tp, :test_score, :jitter),(:jitter,),(),Tuple{Array{Union{Missing, Float64},1},Array{Int64,1},Array{Float64,1},Float64},Tuple{Float64}}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}, ::DynamicPPL.SampleFromUniform, ::DynamicPPL.DefaultContext, ::Array{Union{Missing, Float64},1}, ::Array{Int64,1}, ::Array{Float64,1}, ::Float64) at ./none:0
[7] macro expansion at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/model.jl:0 [inlined]
[8] _evaluate(::Random._GLOBAL_RNG, ::DynamicPPL.Model{var"#35#36",(:behavioral_score, :tp, :test_score, :jitter),(:jitter,),(),Tuple{Array{Union{Missing, Float64},1},Array{Int64,1},Array{Float64,1},Float64},Tuple{Float64}}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}, ::DynamicPPL.SampleFromUniform, ::DynamicPPL.DefaultContext) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/model.jl:154
[9] evaluate_threadunsafe(::Random._GLOBAL_RNG, ::DynamicPPL.Model{var"#35#36",(:behavioral_score, :tp, :test_score, :jitter),(:jitter,),(),Tuple{Array{Union{Missing, Float64},1},Array{Int64,1},Array{Float64,1},Float64},Tuple{Float64}}, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}, ::DynamicPPL.SampleFromUniform, ::DynamicPPL.DefaultContext) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/model.jl:127
[10] (::DynamicPPL.Model{var"#35#36",(:behavioral_score, :tp, :test_score, :jitter),(:jitter,),(),Tuple{Array{Union{Missing, Float64},1},Array{Int64,1},Array{Float64,1},Float64},Tuple{Float64}})(::Random._GLOBAL_RNG, ::DynamicPPL.VarInfo{DynamicPPL.Metadata{Dict{DynamicPPL.VarName,Int64},Array{Distribution,1},Array{DynamicPPL.VarName,1},Array{Real,1},Array{Set{DynamicPPL.Selector},1}},Float64}, ::DynamicPPL.SampleFromUniform, ::DynamicPPL.DefaultContext) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/model.jl:92
[11] DynamicPPL.VarInfo(::Random._GLOBAL_RNG, ::DynamicPPL.Model{var"#35#36",(:behavioral_score, :tp, :test_score, :jitter),(:jitter,),(),Tuple{Array{Union{Missing, Float64},1},Array{Int64,1},Array{Float64,1},Float64},Tuple{Float64}}, ::DynamicPPL.SampleFromUniform, ::DynamicPPL.DefaultContext) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/varinfo.jl:126
[12] DynamicPPL.VarInfo(::Random._GLOBAL_RNG, ::DynamicPPL.Model{var"#35#36",(:behavioral_score, :tp, :test_score, :jitter),(:jitter,),(),Tuple{Array{Union{Missing, Float64},1},Array{Int64,1},Array{Float64,1},Float64},Tuple{Float64}}, ::DynamicPPL.SampleFromUniform) at /home/michiel/.julia/packages/DynamicPPL/u14IH/src/varinfo.jl:125
[14] macro expansion at /home/michiel/.julia/packages/AbstractMCMC/Nw3Wn/src/sample.jl:78 [inlined]
[15] macro expansion at /home/michiel/.julia/packages/ProgressLogging/6KXlp/src/ProgressLogging.jl:328 [inlined]
[17] with_logstate(::Function, ::Any) at ./logging.jl:408
[18] with_logger(::Function, ::LoggingExtras.TeeLogger{Tuple{LoggingExtras.EarlyFilteredLogger{TerminalLoggers.TerminalLogger,AbstractMCMC.var"#1#3"{Module}},LoggingExtras.EarlyFilteredLogger{Logging.ConsoleLogger,AbstractMCMC.var"#2#4"{Module}}}}) at ./logging.jl:514
[19] with_progresslogger(::Function, ::Module, ::Logging.ConsoleLogger) at /home/michiel/.julia/packages/AbstractMCMC/Nw3Wn/src/logging.jl:34
[20] macro expansion at /home/michiel/.julia/packages/AbstractMCMC/Nw3Wn/src/logging.jl:10 [inlined]
[23] sample at /home/michiel/.julia/packages/Turing/gQLRw/src/inference/hmc.jl:123 [inlined]
[24] #sample#2 at /home/michiel/.julia/packages/Turing/gQLRw/src/inference/Inference.jl:142 [inlined]
[25] sample at /home/michiel/.julia/packages/Turing/gQLRw/src/inference/Inference.jl:142 [inlined]
[26] #sample#1 at /home/michiel/.julia/packages/Turing/gQLRw/src/inference/Inference.jl:132 [inlined]
[28] top-level scope at /mnt/c/Users/michiel.verburg/Documents/repositories/master_project_gitlab/scripts/cbarq/small_reproducible_model_discourse_missing_values.jl:38
[29] include(::String) at ./client.jl:457
[30] top-level scope at REPL[9]:1
in expression starting at /mnt/c/Users/michiel.verburg/Documents/repositories/master_project_gitlab/scripts/cbarq/small_reproducible_model_discourse_missing_values.jl:38

Error for MvNormal implementation
ERROR: LoadError: MethodError: no method matching loglikelihood(::MvNormal{Float64,PDMats.PDMat{Float64,Array{Float64,2}},Array{Float64,1}}, ::Array{Union{Missing, Float64},1})

using Turing
using KernelFunctions
using AbstractGPs
using StatsFuns

# n_days = 50
tp = [1, 15, 42]
# the true scores, in reality we only know the first one.
behavioral_score = [3.0, 2.5, 1.9]
# let's say the predictor is exactly linearly related with the outcome score.
# Divide by 4 because behavioral scores are in the range of 0 to 4, and test scores are in the range of 0 to 1.
test_score = behavioral_score/4
# set all scores to missing that aren't the first timepoint.
behavioral_score_missing = Array{Union{Missing,Float64}}(behavioral_score)
behavioral_score_missing[2:length(behavioral_score_missing)] .= missing

sekernel(alpha, rho) = alpha^2 * KernelFunctions.transform(SEKernel(), 1/(rho*sqrt(2)))

@model function score_model(behavioral_score, tp, test_score, jitter=1e-6 )
sigma ~ LogNormal(0.0, 1.0)

f = GP(2.0, sekernel(2.0, 20.0))

behavioral_score ~ f(tp, sigma^2 + jitter)

# the mean is the same as the behavioral_score/4, since we expect the behavioral
# score to linearly predict the score on the test.
test_score .~ Normal.(behavioral_score/4, 0.1)
end

model = score_model(behavioral_score_missing, tp, test_score)
chain = sample(model, NUTS(0.65), 100)


Note:
At first I thought maybe it is not a bug, but that gaussian processes are not meant to be able to do this, and hence I also considered that maybe I should just do the predictions of the behavioral score on timepoints where it is missing after obtaining the chain. However, aside from the fact that I run into some troubles to do that practically, I also realized that doing this will make me lose some information. Because in theory it is possible that 2 tests are really close together in time (e.g. timepoint 1 and 2), and thus the behavioral score, despite not being known, can be estimated with a high probability, and thus it would contribute to the process of finding relationships between behavioral scores and test scores.

Hi,

First off, the current implementation of AbstractGPs will not work with missing data. It’s not completely trivial to replace missing data and therefore it is absent from our implementation.
Now about your model, if I understand correctly, you have observations, the test scores, and hidden variable, the behavioral score, which you only know at time 1.
GPs aim at representing a function y = f(x), by using existing data x and y. If you only have one instance of y I am not sure if they are really the solution for what you aim at.
To me it sounds like a hidden Markov model would be generally much more adapted.

1 Like

Hmm that is interesting, because we did consider hidden markov models initially too, but then when I was exploring time-series methods, it seemed like gaussian processes naturally model variable time intervals, as they do happen in my data too, whereas hidden markov models seem to assume this to be constant.
So I do still wonder if there is another way to work around this with gaussian processes somehow.

Well there are very strong connection between the two. For example in this paper they view GPs as some kind of Kalman filtering to allow linear scalability : https://arxiv.org/pdf/1811.06588.pdf
But again I think you would need more observations.

If your problem is about having a fixed step size between your observations, note that there are continuous versions of HMMs, and that more generally ODE could do the trick (at least I think)

Building on what @theogf is saying, from your code example it looks like you have some time-ordered “locations” \{x_j\}_{j=1}^k and you have the first behavioral score y_1. I think it might be helpful to provide a little bit more information about what you have in some more mathematical notation. If you really only have a single measurement, then it’s hard to imagine how you’ll be able to start thinking about dependence structure, which is certainly a primary point of Gaussian process models.

Also, I am not familiar with Turing or Bayesian methods for GPs, but I might also say that jumping in to a pretty sophisticated estimation procedure and trying to work backwards from those error messages seems like a pretty brutal job. It would probably save a lot of headache to work out exactly what your model is on pen and paper before you start writing code and trying to parse some beefy error messages.

1 Like

Good to know! I will check out if ODE’s make sense for my application (and if I can make sense of them :P). I also encountered Linear Dynamical Systems in the book by Bishop (Pattern Recognition and Machine Learning), which actually almost seems to be a synonym for continuous HMM’s anyway.

Here is my attempt to sort of ‘formalize’ my problem using mathematical notation:

Let D be the number of dogs, let T_d be the number of timepoints for a dog d, let P be the number of features/attributes (i.e. test scores) at every timepoint.

Then for a dog d we have \{z_{d,t}\}_{t=1}^{T_d} with only z_{d,1} being observed, where 0 \leq z \leq 4 and z could potentially be discretized as well if desired (because in the original data z are multiples of 0.25 because it is calculated based on the answers to a questionnaire), but I believe it being continuous might work to our advantage anyway.
Additionally, for a dog d we have \{X_{d,t}\}_{t=1}^{T_d} where X is P-dimensional, because in actuality there are multiple test scores at each timepoint (whereas in my initial post I assumed there to be only 1 for simplicity’s sake). X is an integer with 0 \leq X \leq 100.

I thought it might also be helpful to include a diagram:

Lastly, I did an attempt at forming a bit of an equation, inspired by Arthur’s reply in another topic.

x_{d,t,p} = \beta_{0,p} + z_{d,t} \cdot \beta_{p} + \epsilon_{d,t,p} for d=1,\dots,D, and t=1,\dots,T_d, and p=1,\dots,P.

Where:

• x_{d,t,p} is test score p for dog d at time point t
• \beta_{0,p} is the intercept for test score p
• z_{d,t} is the latent behavioral score for dog d at time t
• \beta_p is the coefficient for the behavioral score z on the test score x_p
• \epsilon_{d,t,p} are the residuals

z_{d,t} = f(z_{d, t-1}) for d=1,\dots,D, and t=2,\dots,T_d

where f is some function of time.

1 Like

Wow, that is a very nice clarifying post. Complete with a diagram! That’s awesome. If you only have z_{d,1} observed, it’s hard to see how any inference about f (or any time dependence) will be possible. I am sympathetic to people who don’t want to write a Markov model, because arguably one purpose of Gaussian process models is to conveniently account for dependence without a Markovian assumption. But with that said, your setup here does really seem well-suited to a nice vector-valued Markov chain model.

To step back again, is there any chance that you could get more or different data for this? Like, is there dog data from last year that you could fit, and then I suppose do some kind of forecasting to your dog data from this year? Or do you have some other data that is at least related that you could use to try and extract information about the time-dependence in the problem?

I get your point about not being able to do inference about f, and indeed essentially that is why I had a ‘hardcoded’ Gaussian process as well, because the parameters of it aren’t learned, because they indeed simply cannot be learned. I have to make hard assumptions abouts the maximum change per unit, because the behavioral score measures the behavioral ‘problems’ of a dog on a certain dimension, so a dog with the maximum score has a lot of problems, and hence we don’t expect such a dog to quickly go to 0.
But the idea is that there might/should be some kind of relationship between the test scores and the behavioral scores already, and that that in combination with some additional assumptions (such as no jumps between 4 and 0 in a short period of time) should be able to kind of make it work.

But I will definitely look into Markov chain models again then!

1 Like

As a sidenote, in case anyone ever runs into an issue with this.
I tried some other things, and after discussion with my supervisor we figured we could in theory do our own sampling from a MvNormal distribution, and together with Martin Trapp’s contribution of a code sample of how to do this in Turing using the Cholesky decomposition, I managed to essentially create my own implementation of a MvNormal specific to my case, where I essentially drawn n-1 independent samples from a Normal distribution, and then do the cholesky decomposition to produce values from the Multivariate normal (similar as described in the book by Rasmussen on Gaussian Processes) where I exclude the first column of the covariance/kernel matrix which I multiply with my z which is produced from the Normal distributions, and then add to that the first column multiplied with my known value ‘z1’.

This is sort of a quick rough description of what I did. Maybe it helps someone in the future. I definitely could sort of see the nontrivialness of doing this in every possible case with missing values, but in my case luckily it was possible at least.

Great to hear that it eventually works. Having arbitrary missing values is somewhat tricky in this case as Turing would have to know how to obtain the marginals, which is not impossible in some tractable cases but not possible at the moment.

Instead of using Turing it might also make sense for you to just directly use AdvancedHMC and define the log joint by hand.

Hmm I’ll have to look into that indeed. What do you think I would gain by implementing the log joint by hand, would it be more efficient mostly? Or could it also be that in its current state it could produce ‘wrong’ results to some extent?

Mostly more efficient, and it will be easier to handle more complex scenarios in which you encounter missing data. Depending on the model you have (I assume it’s not just a GP) it could be beneficial but would also require more hand-coding, e.g. you would need to use Bijectors for constraint RVs while Turing takes care of this automatically. But I would keep it in mind. For complex models and in case you know how the model looks like, like for BNNs, implementing the log joint by hand can be quite beneficial as you can optimise the code easier.