The weights.jl file describes three types of weights: frequency weights, probability weights, and analytic weights.
This is an amazing feature of Julia, as only commercial software like Stata and SAS understands the differences between these three types of weights. R and Python only understand one type of weight, which I think is something like an importance weight.
Being able to use these three types of weights properly is crucial to the field that I work in (causal machine learning). Most software implementations use a single weight vector for everything. This will get you through weighted maximum likelihood, but it will not get you through proper causal inference, because the covariance matrix depends on which type of weight the user passed in. If Julia used fweights, pweights, and aweights consistently, it would be a very distinguishing factor. Are there any plans to standardize this for all functions throughout JuliaStats?
It would be great to have a recommended/standardized way to pass weights to fit, either as a positional argument (e.g. fit(Type, X, y, weights, args...)) or as a named argument.
I agree with both of you. There’s no detailed plan right now but indeed our idea when we added these weights types (only a few months ago) was to use them progressively everywhere it makes sense. I thought GLM.jl would be a good start. Currently that package takes a vector of reals for weights and interprets them as frequency weights, but that’s not documented anywhere. It shouldn’t be hard to make it accept the three types of weights and do the right thing with them.
Regarding the fit interface, there has been some discussion in StatsBase and in StatsModels, but no decision so far. Weights should probably be a keyword argument, because having more than, say, 4-5 positional arguments is confusing. The weights argument would accept either a weights vector or the name of a data frame column which is a weights vector.
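As a rough sketch of the keyword-argument idea (this is not the actual GLM.jl or StatsModels interface), a fit-like function could accept any AbstractWeights and dispatch on the weight type only where the type matters. The names wls and resid_dof below are made up for illustration; fweights, aweights, FrequencyWeights and AbstractWeights come from StatsBase. Probability weights would additionally need a robust (sandwich) covariance estimator, which is left out here.

```julia
using LinearAlgebra, StatsBase

# Hypothetical helper: residual degrees of freedom depend on the weight type.
resid_dof(w::FrequencyWeights, p::Integer) = sum(w) - p    # a weight of k stands for k rows
resid_dof(w::AbstractWeights, p::Integer) = length(w) - p  # otherwise one row per observation

# Hypothetical fit-like function taking the weights as a keyword argument.
function wls(X::AbstractMatrix, y::AbstractVector; weights::AbstractWeights)
    W = Diagonal(collect(weights))
    coef = (X' * W * X) \ (X' * W * y)   # same formula for every weight type
    return (coef = coef, dof = resid_dof(weights, size(X, 2)))
end

X = [ones(4) [1.0, 2.0, 3.0, 4.0]]
y = [1.0, 3.0, 2.0, 5.0]
w = [2, 1, 1, 3]

fit_f = wls(X, y; weights = fweights(w))
fit_a = wls(X, y; weights = aweights(w))
```

With this sketch the two fits share the same coefficients and differ only in the degrees of freedom that would feed into the standard errors.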
I’m not sure which functions need to take the type of weights into account. var, cov and std were the main ones, but I haven’t looked beyond them yet. Ideas?
Hi @jeffwong, thank you for sharing these concepts. Can you or someone else elaborate on the differences? The docstrings don’t say much; perhaps a code example would clarify the differences better.
I see the value in having different types of weights in the language because it helps package writers exploit multiple dispatch in Julia. Other than that, I am not convinced that this distinction is worthwhile, mathematically speaking.
That means you never found yourself in a situation of using weights when fitting models or computing a variance. There’s no way around them in these contexts.
Maybe we are talking about different things? I have a package that is all about combining weights into means/variances and I never felt the need to differentiate or define what a weight vector is. There is a whole family of methods for estimating spatial variance known as Kriging estimators available in my GeoStats.jl package: http://juliohm.github.io/GeoStats.jl/stable/estimation.html
Hi @juliohm, here are some good lecture notes I found describing fweights, pweights, and aweights.
For anyone doing inference it is crucial to know the difference between these three types of weights. One example is they affect the covariance matrix for regression coefficients in GLMs, hence it affects whether or not a coefficient is significant. The difference in these weights can be pretty subtle as they do not affect the coefficient itself (only the covariance).
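A minimal illustration with StatsBase, using toy numbers chosen only for this example: the weighted mean is the same for all three weight types, but the bias-corrected variance is not, because each type implies a different correction factor.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 4.0]
w = [2, 1, 1]

fw = fweights(w)   # frequency weights: each value was observed w[i] times
aw = aweights(w)   # analytic weights: each value is an average with precision proportional to w[i]
pw = pweights(w)   # probability weights: inverse sampling probabilities

# The point estimate does not depend on the weight type.
means = (mean(x, fw), mean(x, aw), mean(x, pw))

# The corrected variance does.
v_f = var(x, fw; corrected = true)
v_a = var(x, aw; corrected = true)
v_p = var(x, pw; corrected = true)
```

The same thing happens in a regression: the coefficients agree, while the standard errors depend on which kind of weight was meant.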
The inability to get the right covariance matrix has implications in a lot of applications. One clear example is in finance, where the covariance of your portfolio is used as a kind of risk assessment.
R and Python have made the mistake of building a ton of modeling algorithms that only understand one type of weight (I think it’s an importance weight), or no weights at all. Here is an example on Rbloggers highlighting the confusion in R’s lm and glm functions, where getting good inference is really difficult. As far as I know, only commercial software like SAS and Stata gets it right.
Hi @jeffwong, thank you for the links, they are very useful in order to put this topic into context. From what I’ve understood, the distinction that is being made between weights has the purpose of 1) encompassing duplicates in the data or 2) performing regression on aggregated points.
My understanding is that 1) is a design choice where one has to decide between asking the user to remove duplicates in the data before applying regression, or implementing adjustments to the regression to accommodate the repetition explicitly. If the two are equivalent, and there is no situation where cleaning the data beforehand fails to solve the issue, then I don’t see much value in modifying the implementation with weights. To me it feels like unnecessary complexity. Please let me know if they are not equivalent, I’d be interested to learn.
For 2), in the lecture notes you linked, they have an example with an individual/village regression where they state that the “random unit” is the village, which is an aggregate of individuals. In geostatistics, this is a well-known problem in which one has to perform estimates or inference on blocks that are on a different support than that of the samples. We have developed plenty of methods for this problem that take this weighting into account, but at no point did we have to define different weight types explicitly. Sadly, GeoStats.jl doesn’t have these methods implemented yet, so I cannot demonstrate what I mean, but they will be there at some point.
With that said, it is good to see weight types in Julia anyway, for multiple dispatch and for triggering the appropriate variant of the estimator, especially for 2), when the data comes already aggregated and there is no way to undo the aggregation.
It makes a big difference if by using frequency weights you can avoid repeating each line 10 or 100 times. Also frequency weights are very useful for resampling procedures like bootstrap or survey/imputation replicate weights. They are usually quite simple to implement, so it’s not a big deal to support them.
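A small sketch of that equivalence with StatsBase (toy data, assumed only for illustration): the frequency-weighted variance of the compact data matches the ordinary variance of the data with each value repeated as many times as its count.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 4.0]
counts = [2, 1, 1]

# Expand the data by hand: repeat x[i] counts[i] times.
expanded = reduce(vcat, fill.(x, counts))

v_expanded = var(expanded)                              # ordinary sample variance
v_weighted = var(x, fweights(counts); corrected = true) # same result without expanding
```

With counts in the hundreds the expanded vector gets large, while the weighted call still only touches three values.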
I think you just don’t realize you need different weight types as long as you deal with only one type in a given field or family of methods. If you take the aggregation into units into account via weights, then you are actually using analytic/precision weights without using the name. But these are quite different from frequency or sampling weights.
@nalimilan yes, it is probably a naming convention that is not popular in our field. We do this analytical weighting and other things all the time without calling it such.
I’d like to help extend GLM.jl to support these three weight types. One thing that I think will be important is the ability to pass multiple weight types, for example both an fweight and a pweight. The fweights can easily come from aggregated data, and pweights might come from some kind of stratified sampling. It will be important to accommodate datasets that were generated using both. I personally haven’t come across a scenario where a dataset was constructed with all three types of weights, though.