# Reduce "overweighted" collection entries to a given weight

Hi,

I’ve been searching around (StatsBase.jl, LinearAlgebra) for a solution to the following (which I believe is some form of normalisation, but I don’t know its name) and a Julia implementation.

Note that this is a simplified form but illustrates the need:

Determine the weight by distinct count:

• a → 50%
• b → 30%
• c → 20%

I now need to recursively remove (whole) items from each bucket until a threshold percentage is satisfied (<=) for all values, e.g. with a 35% threshold:

• a goes from 5 entries to 2 (2 / 7 ≈ 29%) by removing 3
• b is now 3 / 7 ≈ 43%, so remove 1, giving 2 / 6 ≈ 33%
• c is now 2 / 6 ≈ 33%
• a is now 2 / 6 ≈ 33%

and the process ends.

Overall:

• a → remove 3 (33%)
• b → remove 1 (33%)
• c → remove 0 (33%)
• total entries: 10 → 6
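I don’t know whether this has a standard name, but the steps above can be sketched as a greedy loop: repeatedly find a bucket whose share of the *remaining* total exceeds the threshold, and remove the smallest whole number of items that brings it under. A sketch (`cap_weights` is a hypothetical helper name; it assumes the threshold exceeds 1/number-of-labels, otherwise no solution exists):

```julia
# Iteratively remove whole items from any over-weighted bucket until every
# bucket's share of the remaining total is <= threshold.
# counts: label => count; returns label => number of items removed.
function cap_weights(counts::Dict{String,Int}, threshold::Float64)
    remaining = Dict(counts)
    removed = Dict(k => 0 for k in keys(counts))
    total = sum(values(remaining))
    changed = true
    while changed
        changed = false
        for k in collect(keys(remaining))
            n = remaining[k]
            if n / total > threshold
                # smallest whole d with (n - d) / (total - d) <= threshold,
                # i.e. d >= (n - threshold * total) / (1 - threshold)
                d = ceil(Int, (n - threshold * total) / (1 - threshold))
                remaining[k] -= d
                removed[k] += d
                total -= d
                changed = true
            end
        end
    end
    return removed
end

cap_weights(Dict("a" => 5, "b" => 3, "c" => 2), 0.35)
# removed: "a" => 3, "b" => 1, "c" => 0, matching the worked example
```

The closed-form `d` means each bucket is only visited when it is actually over the threshold, and removing items from one bucket can push another bucket over, which is why the outer loop repeats until nothing changes.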

Is this a known statistical / mathematical algorithm? The intention is to prevent any of the original collection entries from “dominating” the statistics of any accompanying values, e.g. it could be a list of tuples:

```julia
[("a",1000),("a",500),("b",1000),("a",200),("a",10000),("c",200),("b",250),("a",200),("b",4000),("c",10000)]
```
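(The 50/30/20 weights above come from counting the first element of each tuple; a sketch using `countmap` from StatsBase.jl, assuming the label is the first tuple element:)

```julia
using StatsBase  # provides countmap

data = [("a",1000),("a",500),("b",1000),("a",200),("a",10000),
        ("c",200),("b",250),("a",200),("b",4000),("c",10000)]

labels = first.(data)                # label of each entry
counts = countmap(labels)            # Dict("a" => 5, "b" => 3, "c" => 2)
weights = Dict(k => v / length(data) for (k, v) in counts)
# weights: "a" => 0.5, "b" => 0.3, "c" => 0.2
```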

I won’t go into how the entries are removed after the fact (unless anyone is keen). Also note that the above are just example values, so they wouldn’t stand up to much “challenge” on why this is valuable.

Regards

I mean, if you’re willing to modify a collection of data to the extent of removing data points, would your application allow you to just rebuild it exactly the way you like?

I think `Distributions.jl` allows you to make a new sampler based on a defined distribution. Otherwise, I suppose you could also under-sample an over-sampled collection.

Hi @cchderrick, many thanks for your response - I’d all but given up on this, so I appreciate you taking the time. Yes, rebuilding the collection is appropriate - it’s more the determination of how many entries to remove that I was seeking an implementation for; potentially this isn’t a standard approach.

I will take a look at Distributions.jl.

Regards