Reduce "overweighted" collection entries to a given weight

djholiver · March 3, 2021, 9:45pm

Hi,

I’ve been searching around (StatsBase, LinearAlgebra) for a solution to the following problem, ideally with a Julia implementation. I believe it is some form of normalisation, but I don’t know its name.

Note that this is a simplified form but illustrates the need:

I start with a collection of values: [“a”,“a”,“b”,“a”,“a”,“c”,“b”,“a”,“b”,“c”]

Determine the weight by distinct count:

a → 50%
b → 30%
c → 20%

I now have to recursively remove (whole) items from each bucket to the extent that a threshold % is achieved (<=) for all values, e.g., with 35%:

a goes from 5 entries to 2 (2 / 7 = 28%) by removing 3
b is now 42% (3 / 7) so remove 1 and becomes (2 / 6 = 33%)
c is now 2 / 6 = 33%
a is now 2 / 6 = 33%

end

overall,

a → remove 3 (33%)
b → remove 1 (33%)
c → remove 0 (33%)
10 → 6
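
The steps above can be sketched in code. This is only one reading of the procedure, not a named or standard algorithm: repeatedly take the largest bucket, and if its share exceeds the threshold t, shrink it to the largest whole count n' satisfying n' / (n' + r) <= t, where r is the number of entries in all other buckets (equivalently n' <= t * r / (1 - t)). The sketch is in Python rather than Julia, and the function name `cap_counts` and the infeasibility check are assumptions made for illustration.

```python
from collections import Counter
from math import floor


def cap_counts(items, threshold):
    """Return {value: count_to_keep} so that no value's share exceeds threshold.

    Hypothetical helper: one interpretation of the worked example in the post.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    keep = Counter(items)
    if not keep:
        return {}
    # With k distinct values, at least one share is >= 1/k, so a threshold
    # below 1/k cannot be met without emptying the collection.
    if threshold * len(keep) < 1:
        raise ValueError("threshold is below 1 / (number of distinct values)")
    while True:
        total = sum(keep.values())
        value, count = max(keep.items(), key=lambda kv: kv[1])
        if count <= threshold * total:
            # The largest bucket is within the threshold, so all buckets are.
            return dict(keep)
        rest = total - count
        # Largest whole count n' with n' / (n' + rest) <= threshold.
        cap = floor(threshold * rest / (1 - threshold))
        # Guard against floating-point rounding at the boundary.
        while cap > 0 and cap > threshold * (cap + rest):
            cap -= 1
        keep[value] = cap


values = ["a", "a", "b", "a", "a", "c", "b", "a", "b", "c"]
kept = cap_counts(values, 0.35)
removed = {v: n - kept[v] for v, n in Counter(values).items()}
print(kept, removed)
```

For the example collection and a 35% threshold this keeps 2 of each value, i.e. it removes 3 of a, 1 of b and 0 of c, matching the hand calculation. Each pass strictly shrinks the largest bucket, so the loop terminates.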

Is this a known statistical / mathematical algorithm? The intention is to prevent any of the original collection entries “dominating” the statistics of any accompanying values, e.g. the entries could be tuples:

[(“a”,1000),(“a”,500),(“b”,1000),(“a”,200),(“a”,10000),(“c”,200),(“b”,250),(“a”,200),(“b”,4000),(“c”,10000)]

I won’t go into how the entries are removed after the fact (unless anyone is keen). Also note that the above are just example values, so they wouldn’t stand up to much “challenge” on why this is valuable.

Regards

cchderrick · March 8, 2021, 2:52am

I mean, if you are willing to modify a collection of data to the extent of removing data points, I don’t know if your application would allow you to just rebuild it exactly the way you like it?

I think Distributions.jl allows you to make a new sampler based on a defined distribution. Otherwise, I suppose you could also under-sample an over-sampled collection.
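
As a rough illustration of the under-sampling idea, here is a minimal sketch in Python (not Julia, and not using Distributions.jl). It assumes the number of entries to keep per key is already known, and it assumes entries are chosen uniformly at random within each bucket; the thread does not say how entries should be picked. The name `undersample` and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(records, keep, seed=None):
    """Keep at most keep[key] records per key, preserving the original order.

    records: list of (key, value) tuples.
    keep:    dict mapping key -> number of records to retain (assumed given).
    """
    rng = random.Random(seed)
    # Group record positions by key.
    buckets = defaultdict(list)
    for i, (key, _value) in enumerate(records):
        buckets[key].append(i)
    # Pick which positions survive in each bucket, uniformly at random.
    chosen = set()
    for key, positions in buckets.items():
        n = min(keep.get(key, 0), len(positions))
        chosen.update(rng.sample(positions, n))
    return [rec for i, rec in enumerate(records) if i in chosen]


records = [("a", 1000), ("a", 500), ("b", 1000), ("a", 200), ("a", 10000),
           ("c", 200), ("b", 250), ("a", 200), ("b", 4000), ("c", 10000)]
sampled = undersample(records, {"a": 2, "b": 2, "c": 2}, seed=0)
print(Counter(key for key, _ in sampled))
```

Random selection is only one option; if the accompanying values matter (for example, dropping the smallest or oldest entries first), the `rng.sample` call is the place to change.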

djholiver · March 8, 2021, 9:29pm

Hi @cchderrick, many thanks for your response - I’d all but given up on this, so I appreciate you taking the time. Yes, rebuilding the collection is appropriate - it’s more the determination of how many entries to remove that I was seeking an implementation for; potentially this isn’t a standard approach.

I will take a look at Distributions.jl.

Regards
