Weights in a Gaussian process

I have a model like @willtebbutt’s TemporalGPs example https://github.com/JuliaGaussianProcesses/TemporalGPs.jl#learning-kernel-parameters-with-optimjl-parameterhandlingjl-and-zygotejl

In my dataset, different observations should have different weights to account for the sampling procedure. In StatsBase, there are functions like fit(formula, tbl, wts=weightvec) (StatsBase.fit). How can I apply sampling weights to data points in the GP fit?

I realize there are different types of weighting and I don’t want to use the wrong invocation by accident (as cautioned here).

> In my dataset, different observations should have different weights to account for the sampling procedure.

Could you elaborate a bit on this? My first thought would be to incorporate your weights through the observation variance, but I’d like to figure out whether this is a reasonable thing to do or not.
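
For concreteness, here is one way to read "weights through the observation variance". This is only a sketch: it assumes a zero-mean GP prior with kernel matrix K_\theta, a Gaussian likelihood with base noise variance \sigma^2, and that a weight w_n scales the precision of observation n. Whether that is the right treatment for survey weights is exactly the open question.

```latex
y_n = f(x_n) + \varepsilon_n, \qquad
\varepsilon_n \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{w_n}\right), \qquad
W = \operatorname{diag}(w_1, \dots, w_N)

\log p(y_{1:N} \mid x_{1:N}, \theta)
  = -\tfrac{1}{2}\, y^\top \left(K_\theta + \sigma^2 W^{-1}\right)^{-1} y
    - \tfrac{1}{2} \log \left| K_\theta + \sigma^2 W^{-1} \right|
    - \tfrac{N}{2} \log 2\pi
```

Under this reading, a weight of 0.8 inflates that observation's noise variance by a factor of 1/0.8 = 1.25, so it pulls the posterior less strongly than an observation with weight 1.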

The dataset is a phone survey produced by a survey company, conducted over a period of time. It includes demographic information about respondents such as sex, age, and race. I am producing estimates of a certain variable over time. In the collected survey data, some demographic groups appear more or less often than they do in the real population (as determined by census data). To produce estimates representative of the population, the survey company adds a weight to each row, generated by iterative proportional fitting, indicating how much that response should count in order to match the population distribution. For example, if 77-year-old white women appeared more often on a certain day of surveying than they do in the population, such a response might be assigned a sampling weight of 0.8. My estimates and predictions should then match the population distribution, rather than the distribution of responses that happened to occur on that day.
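
To make the weighting step concrete, here is a minimal sketch of iterative proportional fitting (raking) in plain Python. It is not the survey company's actual procedure: the function names `rake` and `weighted_share`, the two demographic variables, and the target shares are all made up for illustration.

```python
from collections import defaultdict


def rake(rows, targets, n_iter=200, tol=1e-10):
    """Iterative proportional fitting over categorical margins.

    rows:    list of dicts, one per respondent, e.g. {"sex": "F", "age": "old"}
    targets: {variable: {category: population share}}, shares sum to 1 per variable
    Returns one weight per row, normalised to have mean 1.
    """
    n = len(rows)
    weights = [1.0] * n
    for _ in range(n_iter):
        max_change = 0.0
        for var, shares in targets.items():
            # Current weighted total in each category of this variable.
            totals = defaultdict(float)
            for row, w in zip(rows, weights):
                totals[row[var]] += w
            grand = sum(totals.values())
            # Rescale so the weighted share of each category hits its target.
            for i, row in enumerate(rows):
                cat = row[var]
                factor = shares[cat] * grand / totals[cat]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    mean_w = sum(weights) / n
    return [w / mean_w for w in weights]


def weighted_share(rows, weights, var, cat):
    """Weighted fraction of respondents whose `var` equals `cat`."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == cat)
    return hit / sum(weights)


# Illustrative (made-up) sample: older women are over-represented.
rows = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "F", "age": "old"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
]
# Hypothetical census margins.
targets = {
    "sex": {"F": 0.5, "M": 0.5},
    "age": {"young": 0.4, "old": 0.6},
}
weights = rake(rows, targets)
```

After raking, the weighted margins of the sample match the target margins, and rows from over-represented groups end up with weights below 1, which is the situation described for the 0.8 example above.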

Hmm interesting. I’m really not sure what the appropriate way to think about these weights is :grimacing: . It’s not obvious to me that it’s appropriate to downweight observations via changing the likelihood variance based on your description, and there’s not really another way that I’m aware of to re-weight observations in the log marginal likelihood / for doing posterior inference.

edit: this isn’t a problem I’ve got a lot of experience with though, so I might be missing something.

For the question about weights, I often go back to this SAS article: https://blogs.sas.com/content/iml/2017/10/02/weight-variables-in-statistics-sas.html

With this terminology, Julia’s GLM provides frequency weights, and my LinearRegression package provides analytical weights. To my knowledge, there is currently no implementation of regression with survey weights.
That said, StatsBase defines probability weights, which I think are survey weights.
https://juliastats.org/StatsBase.jl/stable/weights/ could be a starting point.

Correct me if I’m wrong, but don’t these various weighting schemes assume models of the form

\log p(y_{1:N} | x_{1:N}, \theta) = \sum_{n=1}^N \log p(y_n | x_n, \theta)

? If so, this isn’t really what’s going on with GPs, at least not obviously (maybe there’s a way to cast them in this framework then marginalise over \theta?)
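
One hedged way to connect the two views: conditional on the latent function values f_n = f(x_n), a GP with Gaussian noise does factorise over observations, so a weighted pseudo-likelihood can be written down and then marginalised over f rather than over \theta. The symbols w_n (weights), f_n, and \sigma^2 (noise variance) are my own notation, not something taken from the posts above.

```latex
\log \tilde{p}(y_{1:N} \mid f_{1:N}, \sigma^2)
  = \sum_{n=1}^N w_n \log \mathcal{N}(y_n \mid f_n, \sigma^2)

\mathcal{N}(y_n \mid f_n, \sigma^2)^{w_n}
  \propto \exp\!\left( -\frac{w_n (y_n - f_n)^2}{2 \sigma^2} \right)
  \propto \mathcal{N}\!\left( y_n \,\middle|\, f_n, \frac{\sigma^2}{w_n} \right)
```

Here \propto means proportional as a function of f_n. For fixed \sigma^2, the posterior over f therefore matches a GP with per-observation noise variance \sigma^2 / w_n. The dropped factor depends on \sigma^2 and w_n, so the two formulations are not interchangeable when \sigma^2 is learned by maximising the marginal likelihood.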

I can’t comment on the math formulation.
I would also expect the needed tool to have features that make it possible to correctly:

  • relate the sampled population to the overall population, accounting for the strata of interest in the study;
  • account for non-respondents (in this case, the people who do not answer the phone) and how they relate to the population and/or its strata.

While this is far from my comfort zone, I guess there is a voluminous literature on the subject.

I have seen two packages that appear to be related to the topic:
https://github.com/jamanrique/SurveyAnalysis.jl
and
https://github.com/grahamstark/SurveyDataWeighting.jl

Hope this helps a little.

This might be terribly hacky, but perhaps just do a separate GP for each demographic cell of interest? Then maybe look at the weighted mean and variance of those individual processes?
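
A rough sketch of what that combination step could look like at a single time point, in plain Python. It assumes each cell's GP gives a posterior mean and variance at that time, that the cell posteriors are independent (which ignores any shared hyperparameters), and that the population shares come from something like census data. The function name `poststratify` and the numbers in the usage note are hypothetical.

```python
def poststratify(cell_means, cell_vars, cell_shares):
    """Combine per-cell posterior summaries into a population-level estimate.

    cell_means:  posterior mean of each cell's GP at one time point
    cell_vars:   posterior variance of each cell's GP at the same time point
    cell_shares: population share of each cell (renormalised to sum to 1)

    Assumes the cell posteriors are independent, so the variance of the
    weighted sum is the sum of squared shares times the cell variances.
    """
    total = sum(cell_shares)
    shares = [s / total for s in cell_shares]
    mean = sum(s * m for s, m in zip(shares, cell_means))
    var = sum(s * s * v for s, v in zip(shares, cell_vars))
    return mean, var
```

For example, `poststratify([1.0, 3.0], [0.04, 0.16], [0.25, 0.75])` combines two cells whose population shares are 25% and 75%. Cells with few respondents will have wide posteriors, and that uncertainty carries through to the combined variance in proportion to the squared population share.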