Weights in Gaussian process

jzr · October 30, 2021, 1:11am

I have a model like @willtebbutt’s TemporalGPs example https://github.com/JuliaGaussianProcesses/TemporalGPs.jl#learning-kernel-parameters-with-optimjl-parameterhandlingjl-and-zygotejl

In my dataset, different observations should have different weights to account for the sampling procedure. In StatsBase, there are functions like fit(formula, tbl, wts=weightvec) (StatsBase.fit). How can I apply sampling weights to data points in the GP fit?

I realize there are different types of weighting and I don’t want to use the wrong invocation by accident (as cautioned here).

willtebbutt · October 31, 2021, 10:14pm

> In my dataset, different observations should have different weights to account for the sampling procedure.

Could you elaborate a bit on this? My first thought would be to incorporate your weights through the observation variance, but I’d like to figure out whether this is a reasonable thing to do or not.
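
For concreteness, the observation-variance idea could be written as follows. This is only a sketch: the symbols w_n (the weight attached to observation n), \sigma^2 (a shared noise variance), k (the kernel), and K (the kernel matrix at the inputs) are notation assumed here, not something defined in the thread, and whether this treatment is appropriate for survey weights is exactly the open question.

```latex
y_n = f(x_n) + \varepsilon_n, \qquad
\varepsilon_n \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{w_n}\right), \qquad
f \sim \mathcal{GP}(0, k)
\\
y_{1:N} \mid x_{1:N} \sim \mathcal{N}\!\left(0,\; K + \sigma^2 \operatorname{diag}(w_1, \dots, w_N)^{-1}\right)
```

Under this model, a weight below 1 inflates that observation's noise variance so it pulls the posterior less, and a weight above 1 does the opposite.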

jzr · October 31, 2021, 11:08pm

The dataset is a phone survey produced by a survey company, conducted over a period of time. It includes demographic information about respondents, such as sex, age, and race. I am producing estimates of a certain variable over time. In the collected survey data, some demographic groups appear more or less often than they do in the real population (as determined by census data). To produce estimates representative of the population, the survey company adds a weight to each row, indicating how much that response should count in order to match the population distribution. The weights are generated by iterative proportional fitting. For example, if 77-year-old white women appeared more often on a certain day of surveying than they do in the population, such a response might be assigned a sampling weight of 0.8. My estimates and predictions should then match the population distribution, rather than the distribution of responses that happened to occur on that day.
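
For readers unfamiliar with iterative proportional fitting (raking), here is a minimal Python sketch of how such weights might be derived for a two-way table. The function name `rake`, the toy counts, and the target margins are all made up for illustration; the survey company's actual procedure and demographic dimensions are not described in this thread.

```python
def rake(counts, row_targets, col_targets, max_iters=1000, tol=1e-10):
    """Iterative proportional fitting on a 2-D table of sample counts.

    Returns one weight per cell such that the weighted counts match
    both the row targets and the column targets (hypothetical helper).
    """
    n_rows, n_cols = len(counts), len(counts[0])
    fitted = [[float(c) for c in row] for row in counts]
    for _ in range(max_iters):
        # Scale each row so its total matches the row target.
        for i in range(n_rows):
            total = sum(fitted[i])
            if total > 0:
                scale = row_targets[i] / total
                fitted[i] = [v * scale for v in fitted[i]]
        # Scale each column so its total matches the column target.
        for j in range(n_cols):
            total = sum(fitted[i][j] for i in range(n_rows))
            if total > 0:
                scale = col_targets[j] / total
                for i in range(n_rows):
                    fitted[i][j] *= scale
        # Columns now match exactly; stop once the rows also match.
        row_err = max(abs(sum(fitted[i]) - row_targets[i]) for i in range(n_rows))
        if row_err < tol:
            break
    # Weight per respondent in a cell = fitted count / observed count.
    return [
        [fitted[i][j] / counts[i][j] if counts[i][j] > 0 else 0.0
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy example: rows are two sex categories, columns are two age bands.
counts = [[30, 20], [10, 40]]   # observed respondents per cell (made up)
row_targets = [50.0, 50.0]      # census totals by row (made up)
col_targets = [45.0, 55.0]      # census totals by column (made up)
weights = rake(counts, row_targets, col_targets)
```

Every respondent in cell (i, j) would carry the weight `weights[i][j]`; cells that are over-represented relative to the targets end up with weights below 1.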

willtebbutt · November 1, 2021, 4:47pm

Hmm, interesting. I’m really not sure what the appropriate way to think about these weights is. Based on your description, it’s not obvious to me that it’s appropriate to downweight observations by changing the likelihood variance, and I’m not aware of another way to re-weight observations in the log marginal likelihood or when doing posterior inference.

edit: this isn’t a problem I’ve got a lot of experience with though, so I might be missing something.

Eric · November 4, 2021, 10:09am

For the question about weights, I often go back to this SAS article: https://blogs.sas.com/content/iml/2017/10/02/weight-variables-in-statistics-sas.html

With this terminology, Julia’s GLM provides frequency weights, and my LinearRegression package provides analytical weights. To my knowledge, there is currently no implementation of regression with survey weights.
That said, StatsBase defines probability weights, which I think are survey weights.
https://juliastats.org/StatsBase.jl/stable/weights/ could be a starting point.

willtebbutt · November 4, 2021, 10:21am

Correct me if I’m wrong, but don’t these various weighting schemes assume models of the form

\log p(y_{1:N} | x_{1:N}, \theta) = \sum_{n=1}^N \log p(y_n | x_n, \theta)

? If so, this isn’t really what’s going on with GPs, at least not obviously (maybe there’s a way to cast them in this framework then marginalise over \theta?)
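
To make the contrast concrete, here is a sketch in LaTeX. The weights w_n, the noise variance \sigma^2, the kernel matrix K, and the latent values f_n = f(x_n) are notation assumed for illustration. In a factorising model, weights typically enter as per-term multipliers, whereas the GP log marginal likelihood with Gaussian noise is a single joint term with no per-observation sum:

```latex
\text{weighted, factorising model:} \quad
\sum_{n=1}^N w_n \log p(y_n \mid x_n, \theta)
\\
\text{GP log marginal likelihood:} \quad
\log p(y_{1:N} \mid x_{1:N})
= -\tfrac{1}{2}\, y^\top (K + \sigma^2 I)^{-1} y
  - \tfrac{1}{2} \log \lvert K + \sigma^2 I \rvert
  - \tfrac{N}{2} \log 2\pi
\\
\text{conditional on } f: \quad
w_n \log \mathcal{N}(y_n;\, f_n,\, \sigma^2)
= \log \mathcal{N}\!\left(y_n;\, f_n,\, \tfrac{\sigma^2}{w_n}\right) + c_n
```

Here c_n depends on \sigma^2 and w_n but not on f. So if the latent function values play the role of \theta, weighting the conditional likelihood terms gives the same posterior over f as scaling each noise variance by 1 / w_n, but the objective used to learn \sigma^2 is not the same. Whether either reading is the right one for survey weights remains unclear.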

Eric · November 4, 2021, 12:41pm

I can’t comment on the math formulation.
I would also expect the tool to need features that correctly:

- relate the sampled population to the overall population, accounting for the strata of interest in the study;
- account for non-respondents (in this case, people who do not answer the phone) and how they relate to the population and/or its strata.

While this is far from my comfort zone, I guess there is a voluminous literature on the subject.

I have seen two packages that appear to be related to the topic:
https://github.com/jamanrique/SurveyAnalysis.jl
and
https://github.com/grahamstark/SurveyDataWeighting.jl

Hope this helps a little.

opera_malenky · November 4, 2021, 1:51pm

This might be terribly hacky, but perhaps just do a separate GP for each demographic cell of interest? Then maybe look at the weighted mean and variance of those individual processes?
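
A rough Python sketch of that combination step at a single time point, assuming each cell's GP has already produced a posterior mean and variance, and assuming the cells are modelled independently (which is what makes the variance formula below valid). The function name `combine_cells` and all numbers are hypothetical.

```python
def combine_cells(means, variances, shares):
    """Population-weighted combination of per-cell posterior marginals.

    means, variances: posterior mean and variance from each cell's GP
    at one time point. shares: population share of each cell (for
    example from census data). Assumes independent cells.
    """
    total = sum(shares)
    p = [s / total for s in shares]  # normalise shares to sum to 1
    mean = sum(pi * m for pi, m in zip(p, means))
    # Variance of a weighted sum of independent variables.
    var = sum(pi ** 2 * v for pi, v in zip(p, variances))
    return mean, var


# Three made-up demographic cells at one time point.
cell_means = [0.40, 0.55, 0.30]
cell_vars = [0.01, 0.04, 0.0025]
cell_shares = [0.5, 0.3, 0.2]
pop_mean, pop_var = combine_cells(cell_means, cell_vars, cell_shares)
```

With this approach the weights never enter the GP fit itself: each cell is fit unweighted, and the population shares are applied only when the per-cell predictions are combined. Sparse cells would give wide per-cell posteriors, which is one practical cost of the idea.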
