Do I need to specify what kind of weights I am providing?

jo-fleck · March 9, 2021, 2:03pm

I found helpful instructions on using weights and its source code (here, here, here) but I am still not sure I understand how to use it correctly. Maybe I am just confusing/misunderstanding something basic here…

My question: Do I need to specify what kind of weights I am supplying? If no, how does weights infer if I am providing frequency, analytic, or probability weights?

I am wondering about this because I need to transform frequency weights included in a dataset. A few of these transformed weights are less than 1. Hence, the distinction between frequency and probability or analytic weights is no longer obvious.

As a result, I am not sure if differences in weighted outcomes are due to my transformation of the weights or because weights assumes that the transformed weights are not of the same kind as the original ones (or both). See example below.

Many thanks for clarifying and explaining.

using Distributions, Random, StatsBase

rng = MersenneTwister(1234);
N = 1000;

a_dist = Normal(1000, 100); a = rand(rng, a_dist, N); # vector with obs

weights_frequency_orig = rand(rng, 1:100, N); # vector with ‘original’ freq weights
weights_frequency_transformed = rand(rng, 0.1:0.1:100.0, N); # transformed freq weights

a_std_weight_frequency_orig = std(a, weights(weights_frequency_orig));
a_std_weight_frequency_transformed = std(a, weights(weights_frequency_transformed));

a_std_weight_frequency_orig == a_std_weight_frequency_transformed # false

Is this because weights_frequency_orig != weights_frequency_transformed or because weights assumes different weights?

FPGro · March 9, 2021, 10:00pm

Probably a misunderstanding: you should explicitly use FrequencyWeights or fweights, ProbabilityWeights or pweights, or AnalyticWeights or aweights. Plain weights makes no assumption about the type and just errors out if you try to do anything that needs this information (see your third link, a bit further down).
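A minimal sketch of what that looks like in practice, assuming a recent StatsBase where the weighted std accepts a corrected keyword. The data x and w are made up for illustration. As far as I understand it, the weight type only matters once bias correction is requested, and generic weights refuses that request:

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 4.0, 7.0]   # illustrative observations
w = [1, 2, 2, 5]           # illustrative weights

# The same numbers, tagged with three different interpretations.
s_f = std(x, fweights(w); corrected=true)   # frequency weights
s_a = std(x, aweights(w); corrected=true)   # analytic weights
s_p = std(x, pweights(w); corrected=true)   # probability weights

# Without bias correction the weight type plays no role.
s_u = std(x, weights(w); corrected=false)

# Frequency weights behave like repeating each observation w[i] times.
expanded = vcat(fill.(x, w)...)
@show s_f ≈ std(expanded)

# Generic weights cannot apply a correction, so this call throws.
try
    std(x, weights(w); corrected=true)
catch err
    println("generic weights with corrected=true: ", typeof(err))
end
```

The three corrected results generally differ from each other, which is why the type has to be stated explicitly.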

maxkapur · March 10, 2021, 1:04am

Am I missing something, or is the fact that you call rand() twice (generating independent weights for the original and “transformed” cases) the culprit here? What happens if you replace == in the last line with ≈?

Notice that setting rng fixes the random state for the first trial but it keeps updating after that:

julia> rng = MersenneTwister(1234);

julia> rand(rng)
0.5908446386657102

julia> rand(rng)
0.7667970365022592

julia> rand(rng)
0.5662374165061859
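A small sketch of the same point with vectors, assuming only the Random standard library. The variable names are illustrative. Two consecutive draws from one rng are independent samples; resetting the seed replays the first draw:

```julia
using Random

rng = MersenneTwister(1234)
first_draw  = rand(rng, 1:100, 5)
second_draw = rand(rng, 1:100, 5)   # the state has advanced, so this is a new sample

Random.seed!(rng, 1234)             # reset rng to its initial state
replay = rand(rng, 1:100, 5)        # same values as first_draw

@show first_draw == replay
```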

jo-fleck · March 15, 2021, 1:37pm

Thanks! Makes sense.

So weights does not make any inference about the type of weights and just forwards responses of functions in which a given type of weights leads to errors?

jo-fleck · March 15, 2021, 1:42pm

Thanks for looking into this - I realize now my illustrative example is confusing.

The weights are supposed to differ by construction: weights_frequency_transformed lie in [0.1, 100] while weights_frequency_orig lie in [1, 100].

But I see your point that calling rng repeatedly brings in additional differences. I am not sure this makes a big difference since N = 1000 but it’s definitely confusing.

maxkapur · March 16, 2021, 12:02am

Several things going on here:

std() is defined in Statistics; StatsBase adds the weighted methods and provides weights()
Generating the weights randomly ensures that the outputs won’t be exactly the same
0.1:0.1:100.0 should have an upper limit of 10.0 instead to be equivalent to 1:100
1000 is not enough test points for the stdevs to come out close

Here is a corrected example with random weights:

using Random, Statistics, StatsBase   # std() comes from Statistics; weights() needs StatsBase

N = 100000                       # More points
a = 100 .* 1000 .+ randn(N)      # Mean 100000, std 1; for Normal(1000, 100) use 1000 .+ 100 .* randn(N)

weights_frequency_orig = rand(1:100, N) 
weights_frequency_transformed = rand(0.1:0.1:10.0, N)  # Upper limit corrected to 10.0

a_std_weight_frequency_orig = std(a, weights(weights_frequency_orig));
a_std_weight_frequency_transformed = std(a, weights(weights_frequency_transformed));

@show a_std_weight_frequency_orig
@show a_std_weight_frequency_transformed
# a_std_weight_frequency_orig = 0.9992338257618083
# a_std_weight_frequency_transformed = 0.9968675255497309

And one where I actually transform the original weights instead of generating new ones:

using Random, Statistics, StatsBase   # weights() needs StatsBase

N = 1000 
a = 100 .* 1000 .+ randn(N)

weights_frequency_orig = rand(1:100, N) 
weights_frequency_transformed = weights_frequency_orig ./ 10  # Transformed instead

a_std_weight_frequency_orig = std(a, weights(weights_frequency_orig));
a_std_weight_frequency_transformed = std(a, weights(weights_frequency_transformed));

@show a_std_weight_frequency_orig
@show a_std_weight_frequency_transformed
# a_std_weight_frequency_orig = 1.00674757512939
# a_std_weight_frequency_transformed = 1.0067475751293908

Notice that the result matches up to floating-point error.
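Why the match is expected can be sketched by writing the formulas out by hand. The two helper functions below are hypothetical, not part of StatsBase; they assume the usual definitions of the uncorrected weighted standard deviation and of the frequency-weight correction (divide by sum(w) - 1). Rescaling the weights cancels in the first but not in the second, which is where weights below 1 start to matter:

```julia
# Uncorrected weighted std: sqrt(sum(w .* (x .- m).^2) / sum(w)),
# where m is the weighted mean. (Hypothetical helper for illustration.)
function wstd_uncorrected(x, w)
    m = sum(w .* x) / sum(w)
    return sqrt(sum(w .* (x .- m) .^ 2) / sum(w))
end

# Frequency-weight correction: divide by sum(w) - 1 instead of sum(w).
# (Hypothetical helper for illustration.)
function wstd_frequency(x, w)
    m = sum(w .* x) / sum(w)
    return sqrt(sum(w .* (x .- m) .^ 2) / (sum(w) - 1))
end

x = [1.0, 2.0, 4.0, 7.0]   # illustrative observations
w = [1, 2, 2, 5]           # illustrative frequency weights
w_scaled = w ./ 2          # rescaled weights, some below 1

@show wstd_uncorrected(x, w) ≈ wstd_uncorrected(x, w_scaled)   # true: the scale cancels
@show wstd_frequency(x, w) ≈ wstd_frequency(x, w_scaled)       # false: sum(w) - 1 depends on the scale
```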

To answer your original question, it isn’t clear to me which type of weights is assumed by the generic weights() function. For a side-by-side comparison like this, rather than hoping that weights() will figure out what you intended, it’d be better to use

a_std_weight_frequency_orig = std(a, ProbabilityWeights(weights_frequency_orig))
a_std_weight_frequency_transformed = std(a, ProbabilityWeights(weights_frequency_transformed))

or similar to remove the ambiguity, where the different weight constructors are as listed here.
