Usage of different types of weights

jeffwong · July 11, 2017, 2:52am

The weights.jl file describes three types of weights: frequency weights, probability weights, and analytic weights.

This is an amazing feature of Julia, as only commercial software like Stata and SAS understands the differences between these three weights. R and Python only understand one type of weight, which I think is something like an importance weight.

Being able to use these three types of weights properly is crucial in the field that I work in (causal machine learning). Most software implementations use a single weight vector for everything. This will get you through weighted maximum likelihood, but it will not get you through proper causal inference (the covariance matrix depends on which type of weight the user passed in). If Julia used fweights, pweights, and aweights consistently, it would be a very distinguishing factor. Are there any plans to standardize this for all functions throughout JuliaStats?

gdkrmr · July 11, 2017, 7:52am

It would be great to have a recommended/standardized way to pass weights to fit, either as a positional argument (e.g. fit(Type, X, y, weights, args...)) or as a named argument.

nalimilan · July 11, 2017, 8:51am

I agree with both of you. There’s no detailed plan right now, but indeed our idea when we added these weight types (only a few months ago) was to use them progressively everywhere it makes sense. I thought GLM.jl would be a good start. Currently that package takes a vector of reals for weights and interprets them as frequency weights, but that’s not documented anywhere. It shouldn’t be hard to make it accept the three types of weights and do the right thing with them.

Regarding the fit interface, there has been some discussion in StatsBase and in StatsModels, but no decision so far. Weights should probably be a keyword argument, because having more than, say, 4-5 positional arguments is confusing. The weights argument would accept either a weights vector or the name of a data frame column that holds a weights vector.

I’m not sure which functions need to take into account the type of weights they accept. var, cov and std were the main ones, but I haven’t looked beyond them yet. Ideas?
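
To make the point about var concrete, here is a minimal sketch in plain Python (not Julia, and not StatsBase's actual code) of how the bias correction of a weighted variance can depend on the weight type. The three denominators follow the corrections described in the StatsBase documentation as far as I understand them, so treat them as an assumption and check the docs before relying on them. The function name weighted_var and the toy data are hypothetical.

```python
def weighted_var(x, w, kind):
    """Bias-corrected weighted variance for one of three weight kinds.

    The corrections are assumed to match those documented for StatsBase;
    this is an illustration, not the library's implementation.
    """
    sw = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ss = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
    if kind == "frequency":
        # Weights are counts: divide by (total count - 1).
        return ss / (sw - 1)
    if kind == "analytic":
        # Weights are precisions: divide by sum(w) - sum(w^2) / sum(w).
        return ss / (sw - sum(wi * wi for wi in w) / sw)
    if kind == "probability":
        # Weights are inverse sampling probabilities: n counts nonzero weights.
        n = sum(1 for wi in w if wi != 0)
        return ss * n / ((n - 1) * sw)
    raise ValueError(f"unknown weight kind: {kind}")


x = [1.0, 2.0, 4.0]
w = [2.0, 1.0, 1.0]
variances = {k: weighted_var(x, w, k)
             for k in ("frequency", "analytic", "probability")}
print(variances)
```

The weighted mean is the same in all three cases; only the denominator of the variance changes, which is why a bare vector of reals is not enough for the function to pick the right estimator.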

juliohm · July 11, 2017, 3:34pm

Hi @jeffwong, thank you for sharing these concepts. Can you or someone else elaborate on the differences? The docstrings don’t say much; perhaps some example code would clarify the differences better.

nalimilan · July 11, 2017, 4:52pm

See this description.

juliohm · July 11, 2017, 6:44pm

I see the value in having different types of weights in the language because it helps package writers exploit multiple dispatch in Julia. Other than that, I am not very convinced that this distinction is worthwhile, mathematically speaking.

nalimilan · July 11, 2017, 8:17pm

That means you never found yourself in a situation of using weights when fitting models or computing a variance. There’s no way around them in these contexts.

juliohm · July 11, 2017, 9:54pm

Maybe we are talking about different things? I have a package that is all about combining weights into means/variances and I never felt the need to differentiate or define what a weight vector is. There is a whole family of methods for estimating spatial variance known as Kriging estimators available in my GeoStats.jl package: http://juliohm.github.io/GeoStats.jl/stable/estimation.html

More than that, these weights can be used to estimate variance everywhere as seen in this example: http://nbviewer.jupyter.org/github/juliohm/GeoStats.jl/blob/master/examples/AnisotropicModels.ipynb

jeffwong · July 12, 2017, 7:27am

Hi @juliohm, here are some good lecture notes I found describing fweights, pweights, and aweights.

For anyone doing inference it is crucial to know the difference between these three types of weights. One example is that they affect the covariance matrix of the regression coefficients in GLMs, and hence whether or not a coefficient is significant. The difference between these weights can be pretty subtle, as they do not affect the coefficient itself (only the covariance).
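
A small sketch of this, in plain Python rather than Julia: an intercept-only weighted regression (i.e. a weighted mean), where the estimate is identical under frequency and analytic weights but the standard error is not. The treatment of analytic weights here (rescaling them to sum to the number of rows) is an assumption modeled on how Stata describes aweights; the function name weighted_mean_se and the data are hypothetical.

```python
import math


def weighted_mean_se(y, w, kind):
    """Intercept-only weighted least squares: return (estimate, std. error)."""
    n_rows = len(y)
    sw = sum(w)
    beta = sum(wi * yi for wi, yi in zip(w, y)) / sw
    if kind == "frequency":
        # Each weight stands for that many identical observations.
        scaled = list(w)
        n_eff = sw
    elif kind == "analytic":
        # Assumed convention: rescale weights to sum to the number of rows.
        scaled = [wi * n_rows / sw for wi in w]
        n_eff = n_rows
    else:
        raise ValueError(f"unknown weight kind: {kind}")
    rss = sum(wi * (yi - beta) ** 2 for wi, yi in zip(scaled, y))
    sigma2 = rss / (n_eff - 1)          # residual variance, one parameter
    se = math.sqrt(sigma2 / sum(scaled))  # sigma2 * (X'WX)^-1 with X = 1
    return beta, se


y = [1.0, 2.0, 4.0]
w = [2.0, 1.0, 1.0]
beta_f, se_f = weighted_mean_se(y, w, "frequency")
beta_a, se_a = weighted_mean_se(y, w, "analytic")
print(beta_f, se_f)
print(beta_a, se_a)
```

With this toy data both calls give the same coefficient, while the frequency-weight standard error is smaller because the weights are counted as extra observations.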

The inability to get the right covariance matrix has implications in a lot of applications. One clear example is in finance, where the covariance of your portfolio is used as a kind of risk assessment.

R and Python have made the mistake of building a ton of modeling algorithms that only understand one type of weight (I think it’s an importance weight), or no weights at all. Here is an example on R-bloggers highlighting the confusion in R’s lm and glm functions; getting good inference there is really difficult. As far as I know, only commercial software like SAS and Stata gets it right.

juliohm · July 12, 2017, 3:57pm

Hi @jeffwong, thank you for the links, they are very useful in order to put this topic into context. From what I’ve understood, the distinction that is being made between weights has the purpose of 1) encompassing duplicates in the data or 2) performing regression on aggregated points.

My understanding is that 1) is a design choice where one has to decide between asking the user to remove duplicates in the data before applying regression, or implementing adjustments to the regression coefficients to accommodate the repetition explicitly. If the two are equivalent and there is no situation where cleaning the data beforehand fails to solve the issue, then I don’t see much value in modifying the implementation with weights; to me it feels like unnecessary complexity. Please let me know if they are not equivalent, I’d be interested to learn.

For 2), in the lecture notes you linked, they have an example with an individual/village regression where they state that the “random unit” is the village, which is an aggregate of individuals. In geostatistics, this is a well-known problem in which one has to perform estimates or inference on blocks that are on a different support than that of the samples. We have developed plenty of methods for this problem that take this weighting into account, but at no point did we have to define different weight types explicitly. Sadly, GeoStats.jl doesn’t have these methods implemented yet, so I can’t demonstrate what I mean, but they will be there at some point.

With that said, it is good to see weight types in Julia anyway, for multiple dispatch and for triggering the appropriate variant of the estimator, especially for 2), when the data comes already aggregated and there is no way to undo the aggregation.

nalimilan · July 12, 2017, 4:38pm

It makes a big difference if by using frequency weights you can avoid repeating each line 10 or 100 times. Also frequency weights are very useful for resampling procedures like bootstrap or survey/imputation replicate weights. They are usually quite simple to implement, so it’s not a big deal to support them.
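
As a quick sanity check of the "repeating each line" point, here is a sketch in plain Python (standard library only, names and data hypothetical) showing that a frequency-weighted mean and variance match what you get from physically expanding the rows:

```python
import statistics

values = [1.0, 2.0, 4.0]
counts = [2, 1, 1]  # frequency weights: how many times each value occurs

# The expanded data set that the frequency weights stand in for.
expanded = [v for v, c in zip(values, counts) for _ in range(c)]


def freq_mean_var(values, counts):
    """Mean and sample variance computed directly from counts."""
    n = sum(counts)
    mean = sum(c * v for v, c in zip(values, counts)) / n
    var = sum(c * (v - mean) ** 2 for v, c in zip(values, counts)) / (n - 1)
    return mean, var


mean_w, var_w = freq_mean_var(values, counts)
print(mean_w, statistics.mean(expanded))
print(var_w, statistics.variance(expanded))
```

The weighted version never materializes the repeated rows, which is what matters when the counts are in the hundreds or when they come from bootstrap replicates.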

I think you just don’t realize you need different weight types as long as you deal with only one type in a given field or family of methods. If you account for the averaging within units via weights, then you are actually using analytic/precision weights without using the name. But these are quite different from frequency or sampling weights.

juliohm · July 12, 2017, 4:48pm

@nalimilan yes, it is probably a naming convention that is not popular in our field. We do this analytic weighting and other things all the time without calling it that.

jeffwong · July 12, 2017, 4:54pm

I’d like to help extend GLM.jl to support these three weights. One thing that I think will be important is the ability to pass multiple weight types, for example both an fweight and a pweight. The fweights can easily come from aggregated data, and the pweights might come from some kind of stratified sampling. It will be important to accommodate datasets that were generated using both. I personally haven’t come across a scenario where a dataset was constructed with all three types of weights, though.
