# Reduce "overweighted" collection entries to a given weight

Hi,

I’ve been searching around (StatsBase.jl, LinearAlgebra) for a solution to the following (which I believe is some form of normalisation, but I don’t know its name) and a Julia implementation.

Note that this is a simplified form but illustrates the need:

Determine the weight by distinct count:

• a → 50%
• b → 30%
• c → 20%

I now need to recursively remove (whole) items from each bucket until a threshold percentage is satisfied (<=) for all values, e.g. with a 35% threshold:

• a goes from 5 entries to 2 (2 / 7 ≈ 29%) by removing 3
• b is now 3 / 7 ≈ 43%, so remove 1, giving 2 / 6 ≈ 33%
• c is now 2 / 6 ≈ 33%
• a is now 2 / 6 ≈ 33%

and the process ends.

Overall:

• a → remove 3 (33%)
• b → remove 1 (33%)
• c → remove 0 (33%)
• total entries: 10 → 6
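I don’t know whether this has a standard name, but the steps above can be sketched as a greedy loop: repeatedly find a bucket whose share of the *remaining* total exceeds the threshold, and remove the smallest whole number of items that brings it under. A sketch (`cap_weights` is a hypothetical helper name; it assumes the threshold exceeds 1/number-of-labels, otherwise no solution exists):

```julia
# Iteratively remove whole items from any over-weighted bucket until every
# bucket's share of the remaining total is <= threshold.
# counts: label => count; returns label => number of items removed.
function cap_weights(counts::Dict{String,Int}, threshold::Float64)
    remaining = Dict(counts)
    removed = Dict(k => 0 for k in keys(counts))
    total = sum(values(remaining))
    changed = true
    while changed
        changed = false
        for k in collect(keys(remaining))
            n = remaining[k]
            if n / total > threshold
                # smallest whole d with (n - d) / (total - d) <= threshold,
                # i.e. d >= (n - threshold * total) / (1 - threshold)
                d = ceil(Int, (n - threshold * total) / (1 - threshold))
                remaining[k] -= d
                removed[k] += d
                total -= d
                changed = true
            end
        end
    end
    return removed
end

cap_weights(Dict("a" => 5, "b" => 3, "c" => 2), 0.35)
# removed: "a" => 3, "b" => 1, "c" => 0, matching the worked example
```

The closed-form `d` means each bucket is only visited when it is actually over the threshold, and removing items from one bucket can push another bucket over, which is why the outer loop repeats until nothing changes.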

Is this a known statistical / mathematical algorithm? The intention is to prevent any of the original collection entries from “dominating” the statistics of any accompanying values, e.g. it could be a list of tuples:

```julia
[("a",1000),("a",500),("b",1000),("a",200),("a",10000),("c",200),("b",250),("a",200),("b",4000),("c",10000)]
```
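(The 50/30/20 weights above come from counting the first element of each tuple; a sketch using `countmap` from StatsBase.jl, assuming the label is the first tuple element:)

```julia
using StatsBase  # provides countmap

data = [("a",1000),("a",500),("b",1000),("a",200),("a",10000),
        ("c",200),("b",250),("a",200),("b",4000),("c",10000)]

labels = first.(data)                # label of each entry
counts = countmap(labels)            # Dict("a" => 5, "b" => 3, "c" => 2)
weights = Dict(k => v / length(data) for (k, v) in counts)
# weights: "a" => 0.5, "b" => 0.3, "c" => 0.2
```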

I won’t go into how the entries are removed after the fact (unless anyone is keen). Also note that the above are just example values, so they wouldn’t stand up to much “challenge” on why this is valuable.

Regards

I mean, if you’re willing to modify a collection of data to the extent of removing data points, would your application allow you to just rebuild it exactly the way you like?

I think `Distributions.jl` allows you to make a new sampler based on a defined distribution. Otherwise, I suppose you could also under-sample an over-sampled collection.

Hi @cchderrick, many thanks for your response - I’d all but given up on this, so I appreciate you taking the time. Yes, rebuilding the collection is appropriate - it’s more the determination of how many entries to remove that I was seeking an implementation for; potentially this isn’t a standard approach.

I will take a look at Distributions.jl.

Regards