Do I need to specify what kind of weights I am providing?

I found helpful instructions on using weights and its source code (here, here, here) but I am still not sure I understand how to use it correctly. Maybe I am just confusing/misunderstanding something basic here…

My question: Do I need to specify what kind of weights I am supplying? If no, how does weights infer if I am providing frequency, analytic, or probability weights?

I am wondering about this because I need to transform frequency weights included in a dataset. A few of these transformed weights are less than 1. Hence, the distinction between frequency and probability or analytic weights is no longer obvious.

As a result, I am not sure if differences in weighted outcomes are due to my transformation of the weights, or because weights assumes that the transformed weights are not of the same kind as the original ones (or both). See example below.

Many thanks for clarifying and explaining.

using Distributions, Random, StatsBase

rng = MersenneTwister(1234);
N = 1000;

a_dist = Normal(1000, 100); a = rand(rng, a_dist, N); # vector with obs

weights_frequency_orig = rand(rng, 1:100, N); # vector with ‘original’ freq weights
weights_frequency_transformed = rand(rng, 0.1:0.1:100.0, N); # transformed freq weights

a_std_weight_frequency_orig = std(a, weights(weights_frequency_orig));
a_std_weight_frequency_transformed = std(a, weights(weights_frequency_transformed));

a_std_weight_frequency_orig == a_std_weight_frequency_transformed # false

Is this because weights_frequency_orig != weights_frequency_transformed or because weights assumes different weights?

Probably a misunderstanding: you should explicitly use FrequencyWeights (fweights), ProbabilityWeights (pweights), or AnalyticWeights (aweights). Plain weights makes no assumption about the type and simply errors out if you try to do anything that needs this information (see your third link, a bit further down).
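A minimal sketch of the difference, assuming a recent StatsBase and passing corrected explicitly so that nothing depends on its default. The variable names are illustrative only. With integer frequency weights, the corrected std should agree with the std of the sample in which each observation is repeated w[i] times, while generic weights are expected to refuse a corrected std:

```julia
using Random, Statistics, StatsBase

rng = MersenneTwister(1234)
a = rand(rng, 10)          # observations
w = rand(rng, 1:5, 10)     # integer frequency weights

# Frequency weights: observation i counts w[i] times.
expanded   = reduce(vcat, fill.(a, w))     # sample with repeats written out
s_freq     = std(a, fweights(w); corrected=true)
s_expanded = std(expanded)                 # Statistics.std is bias-corrected by default

# Analytic and probability weights use their own bias corrections.
s_analytic = std(a, aweights(w); corrected=true)
s_prob     = std(a, pweights(w); corrected=true)

# Generic weights carry no type information, so bias correction is refused.
generic_fails = try
    std(a, weights(w); corrected=true)
    false
catch err
    err isa ArgumentError
end

@show s_freq s_expanded s_analytic s_prob generic_fails
```

Here s_freq and s_expanded should agree up to floating-point error, and generic_fails should be true.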

Am I missing something, or is the fact that you call rand() twice (generating independent weights for the original and “transformed” cases) the culprit here? What happens if you replace == in the last line with ≈?

Notice that setting rng fixes the random state for the first trial but it keeps updating after that:

julia> rng = MersenneTwister(1234);

julia> rand(rng)
0.5908446386657102

julia> rand(rng)
0.7667970365022592

julia> rand(rng)
0.5662374165061859
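A small sketch of the same point with vectors of draws (names are illustrative): consecutive calls keep advancing the generator, whereas re-seeding it with Random.seed! replays the stream from the start.

```julia
using Random

rng = MersenneTwister(1234)
first_draw  = rand(rng, 1:100, 5)
second_draw = rand(rng, 1:100, 5)   # continues the stream, so it is a new set of draws

Random.seed!(rng, 1234)             # reset the generator to its initial state
replayed = rand(rng, 1:100, 5)      # reproduces first_draw

@show first_draw == replayed
```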

Thanks! Makes sense.

So weights does not infer the type of weights at all, and functions that need a specific type simply throw an error when given generic weights?

Thanks for looking into this - I realize now my illustrative example is confusing.

The weights are supposed to differ by construction: weights_frequency_transformed lie in [0.1, 100] while weights_frequency_orig lie in [1, 100].

But I see your point that calling rng repeatedly brings in additional differences. I am not sure this makes a big difference since N = 1000 but it’s definitely confusing.

Several things going on here:

  • std() itself lives in Statistics; weights() and the weighted std method come from StatsBase, so load both
  • Generating the weights randomly ensures that the outputs won’t be exactly the same
  • 0.1:0.1:100.0 should have an upper limit of 10.0 instead to be equivalent to 1:100
  • N = 1000 is too few points for the standard deviations to come out close

Here is a corrected example with random weights:

using Random, Statistics, StatsBase   # std() is from Statistics; weights() is from StatsBase

N = 100000                       # More points
a = 100 .* 1000 .+ randn(N)      # As written: mean 100_000, std 1. Normal(1000, 100) would be 1000 .+ 100 .* randn(N)

weights_frequency_orig = rand(1:100, N)
weights_frequency_transformed = rand(0.1:0.1:10.0, N)  # Upper limit corrected to 10.0

a_std_weight_frequency_orig = std(a, weights(weights_frequency_orig));
a_std_weight_frequency_transformed = std(a, weights(weights_frequency_transformed));

@show a_std_weight_frequency_orig
@show a_std_weight_frequency_transformed
# a_std_weight_frequency_orig = 0.9992338257618083
# a_std_weight_frequency_transformed = 0.9968675255497309

And one where I actually transform the original weights instead of generating new ones:

using Random, Statistics, StatsBase   # weights() is from StatsBase

N = 1000 
a = 100 .* 1000 .+ randn(N)

weights_frequency_orig = rand(1:100, N) 
weights_frequency_transformed = weights_frequency_orig ./ 10  # Transformed instead

a_std_weight_frequency_orig = std(a, weights(weights_frequency_orig));
a_std_weight_frequency_transformed = std(a, weights(weights_frequency_transformed));

@show a_std_weight_frequency_orig
@show a_std_weight_frequency_transformed
# a_std_weight_frequency_orig = 1.00674757512939
# a_std_weight_frequency_transformed = 1.0067475751293908

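Notice that the results match up to floating-point error: dividing every weight by the same constant leaves the relative weights, and hence this weighted standard deviation, unchanged.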
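Whether rescaling is harmless depends on the weight type once bias correction is requested. The sketch below assumes a recent StatsBase and passes corrected explicitly; the names are illustrative. Analytic and probability weights should be unaffected by dividing all weights by 10, whereas frequency weights are not, because their correction divides by sum(w) - 1:

```julia
using Random, Statistics, StatsBase

rng = MersenneTwister(1234)
a = randn(rng, 50)
w = rand(rng, 1:100, 50)
w_scaled = w ./ 10                  # some weights now fall below 1

# Without bias correction only the relative weights matter.
same_uncorrected = std(a, weights(w); corrected=false) ≈
                   std(a, weights(w_scaled); corrected=false)

# With bias correction, analytic and probability weights stay scale-invariant...
same_analytic = std(a, aweights(w); corrected=true) ≈
                std(a, aweights(w_scaled); corrected=true)
same_prob     = std(a, pweights(w); corrected=true) ≈
                std(a, pweights(w_scaled); corrected=true)

# ...but frequency weights do not: sum(w) is read as the number of observations.
s_f        = std(a, fweights(w); corrected=true)
s_f_scaled = std(a, fweights(w_scaled); corrected=true)

@show same_uncorrected same_analytic same_prob s_f s_f_scaled
```

So if the transformed weights are no longer counts, treating them as frequency weights changes the corrected result, which is one reason to pick the weight type deliberately.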

To answer your original question, it isn’t clear to me which type of weights is assumed by the generic weights() function. For a side-by-side comparison like this, rather than hoping that weights() will figure out what you intended, it’d be better to use

a_std_weight_frequency_orig = std(a, ProbabilityWeights(weights_frequency_orig))
a_std_weight_frequency_transformed = std(a, ProbabilityWeights(weights_frequency_transformed))

or similar to remove the ambiguity, where the different weight constructors are as listed here.