Finding low frequency numbers in a large collection

Hey!

I can know the numbers beforehand (just not their frequency). Ideally I would find a solution that can handle up to the max of UInt64. For a test set, I have the distribution of numbers that looks evenly sampled from the range:

The counts for the numbers are heavily skewed, in fact, 75% of the numbers are <=100. This is because these numbers originate from DNA data and hence are not random:


(The x-scale is log10)


I’m not sure if I get your second proposal. Could you elaborate as it sounds nice :slight_smile:

if you were to sort your list before counting
…as you can start each thread from a different part in the array

I do not really have a list beforehand as I have to read the numbers from a file (or actually parse them as it is text format with some mess in between). I could pass bytes to different threads, parse the numbers, and then maybe do what you propose?