I found helpful instructions on using weights and its source code (here, here, here) but I am still not sure I understand how to use it correctly. Maybe I am just confusing/misunderstanding something basic here…
My question: Do I need to specify what kind of weights I am supplying? If not, how does weights infer whether I am providing frequency, analytic, or probability weights?
I am wondering about this because I need to transform frequency weights included in a dataset. A few of these transformed weights are less than 1. Hence, the distinction between frequency and probability or analytic weights is no longer obvious.
As a result, I am not sure whether differences in weighted outcomes are due to my transformation of the weights, or because weights assumes that the transformed weights are not of the same kind as the original ones (or both). See example below.
Many thanks for clarifying and explaining.
using Distributions, Random, StatsBase
rng = MersenneTwister(1234);
N = 1000;
a_dist = Normal(1000, 100); a = rand(rng, a_dist, N); # vector with obs
Probably a misunderstanding: you should explicitly use FrequencyWeights (or fweights), ProbabilityWeights (or pweights), or AnalyticWeights (or aweights). Plain weights does not make any assumption about the type and simply errors out if you try to do anything that needs this information (see your third link, a bit further down).
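To make the point above concrete, here is a minimal sketch comparing the four constructors on a toy vector. The data `x` and `w` are made up for illustration and are not from the original example. It assumes a reasonably recent StatsBase, where the bias-corrected variance depends on the weight type and the generic type refuses the correction.

```julia
using Statistics, StatsBase

# Toy data, purely illustrative (not the data from the question)
x = [1.0, 2.0, 3.0, 4.0]
w = [1, 2, 3, 4]

# The weighted mean does not depend on the kind of weights
m_generic  = mean(x, weights(w))
m_freq     = mean(x, fweights(w))
m_analytic = mean(x, aweights(w))
m_prob     = mean(x, pweights(w))

# The bias-corrected variance does: each type applies its own correction
v_freq     = var(x, fweights(w); corrected=true)
v_analytic = var(x, aweights(w); corrected=true)
v_prob     = var(x, pweights(w); corrected=true)

# Generic weights carry no type information, so asking for a bias
# correction is expected to throw (an ArgumentError in the StatsBase
# versions I am aware of) rather than guess
generic_error = try
    var(x, weights(w); corrected=true)
    nothing
catch err
    err
end

println((m_generic, m_freq, m_analytic, m_prob))
println((v_freq, v_analytic, v_prob))
println(typeof(generic_error))
```

If the three corrected variances differ while the means agree, any difference you see in your own comparison beyond that comes from the transformation of the weights, not from a hidden guess about their type.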
Am I missing something, or is the fact that you call rand() twice (generating independent weights for the original and “transformed” cases) the culprit here? What happens if you replace == in the last line with ≈?
Notice that seeding rng fixes the random state only for the first draw; the state keeps advancing with every call after that.
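A small sketch of that behaviour, using only the Random standard library. The variable names are illustrative and not taken from the original example.

```julia
using Random

rng = MersenneTwister(1234)
first_draw  = rand(rng, 3)   # consumes part of the stream
second_draw = rand(rng, 3)   # continues from where first_draw stopped

# A fresh generator with the same seed replays the stream from the start
rng_replay = MersenneTwister(1234)
replayed = rand(rng_replay, 3)

# Re-seeding the existing generator has the same effect
Random.seed!(rng, 1234)
reseeded = rand(rng, 3)

println(first_draw == second_draw)  # two consecutive draws from one rng
println(first_draw == replayed)     # same seed, same position in the stream
println(first_draw == reseeded)
```

To give two weight vectors the same underlying random draws, either draw once and transform the result, or re-seed before the second draw.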
So weights does not infer anything about the type of weights, and functions that need to know the type simply throw an error when they receive generic weights?
Thanks for looking into this - I realize now my illustrative example is confusing.
The weights are supposed to differ by construction: weights_frequency_transformed are [0.1,100] while weights_frequency_orig are [1,100].
But I see your point that drawing from rng repeatedly brings in additional differences. I am not sure this makes a big difference since N = 1000, but it’s definitely confusing.
Notice that the result matches up to floating-point error.
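As a hedged illustration of why ≈ is the safer comparison here: rescaling all weights by a constant should leave a weighted mean unchanged mathematically, but the floating-point results are only guaranteed to agree approximately. The vectors below are made up for the sketch.

```julia
using Statistics, StatsBase

# Classic case: exact equality fails, approximate equality holds
exact_sum  = (0.1 + 0.2 == 0.3)
approx_sum = (0.1 + 0.2 ≈ 0.3)

# Toy data, purely illustrative
x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 2.0, 3.0, 4.0]

m_orig   = mean(x, weights(w))
m_scaled = mean(x, weights(w ./ 3))   # same weights up to a constant factor

# Compare with ≈ (isapprox); == may or may not hold depending on rounding
same_up_to_rounding = m_orig ≈ m_scaled

println((exact_sum, approx_sum, same_up_to_rounding))
```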
To answer your original question, it isn’t clear to me which type of weights is assumed by the generic weights() function. For a side-by-side comparison like this, rather than hoping that weights() will figure out what you intended, it’d be better to use