Finding low frequency numbers in a large collection

CodeGodz · January 26, 2023, 6:44pm

Hey!

I can know the numbers beforehand (just not their frequencies). Ideally I would find a solution that can handle values up to the maximum of UInt64. For a test set, the numbers look evenly sampled from that range:

The counts for the numbers are heavily skewed; in fact, 75% of the numbers have a count <=100. This is because the numbers originate from DNA data and hence are not random:

(The x-scale is log10)
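
To make the goal concrete, here is a minimal single-threaded sketch of picking out the low frequency numbers, assuming the numbers are already parsed and fit in UInt64. The names `low_frequency` and `cutoff` are hypothetical and only there for illustration; the cutoff value is likewise an assumption.

```julia
# Count occurrences of each number, then keep those at or below a cutoff.
# `low_frequency` and `cutoff` are hypothetical names for illustration.
function low_frequency(numbers, cutoff::Integer)
    counts = Dict{UInt64,Int}()
    for n in numbers
        counts[n] = get(counts, n, 0) + 1
    end
    # Keep only the numbers whose count does not exceed the cutoff.
    return sort!([n for (n, c) in counts if c <= cutoff])
end

# Tiny made-up sample: 7 occurs three times, everything else less often.
sample = UInt64[7, 7, 7, 42, 42, 1000, typemax(UInt64)]
rare = low_frequency(sample, 2)
```

A hash-based count like this keeps one entry per distinct number, so memory grows with the number of distinct values rather than with the UInt64 range.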

I’m not sure I get your second proposal. Could you elaborate? It sounds nice.

if you were to sort your list before counting
…as you can start each thread from a different part in the array

I do not really have a list beforehand, as I have to read the numbers from a file (or rather parse them, since it is a text format with some mess in between). I could pass bytes to different threads, parse the numbers, and then maybe do what you propose?
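
The "pass bytes to different threads" idea could look roughly like the sketch below. It rests on several assumptions that are not confirmed anywhere in this thread: the whole file is already in memory as a byte vector, every run of ASCII digits is one number, everything else is "mess" to skip, and no number exceeds the UInt64 maximum. The function names `count_chunk` and `count_numbers` and the `ntasks` keyword are made up for this example.

```julia
using Base.Threads: @spawn

isdigitbyte(b::UInt8) = UInt8('0') <= b <= UInt8('9')

# Parse every run of ASCII digits in bytes[lo:hi] and count the values.
# Assumption: values fit in UInt64 (larger ones would silently wrap around).
function count_chunk(bytes::AbstractVector{UInt8}, lo::Int, hi::Int)
    counts = Dict{UInt64,Int}()
    value = UInt64(0)
    in_number = false
    for i in lo:hi
        b = bytes[i]
        if isdigitbyte(b)
            value = value * UInt64(10) + UInt64(b - UInt8('0'))
            in_number = true
        elseif in_number
            # A non-digit byte ends the current number.
            counts[value] = get(counts, value, 0) + 1
            value = UInt64(0)
            in_number = false
        end
    end
    # Flush a number that runs up to the end of the chunk.
    in_number && (counts[value] = get(counts, value, 0) + 1)
    return counts
end

function count_numbers(bytes::AbstractVector{UInt8}; ntasks::Int = Threads.nthreads())
    n = length(bytes)
    # cuts[k] is the last byte of chunk k-1; chunk k covers cuts[k]+1 : cuts[k+1].
    cuts = [div(k * n, ntasks) for k in 0:ntasks]
    for k in 2:ntasks
        c = max(cuts[k], cuts[k-1])
        # Move the cut forward until the next chunk no longer starts on a digit,
        # so that no number is split between two chunks.
        while c < n && isdigitbyte(bytes[c+1])
            c += 1
        end
        cuts[k] = c
    end
    # One task per chunk, each with its own private dictionary (no locking).
    tasks = [@spawn count_chunk(bytes, cuts[k] + 1, cuts[k+1]) for k in 1:ntasks]
    total = Dict{UInt64,Int}()
    for t in tasks
        mergewith!(+, total, fetch(t))  # add up the per-chunk counts
    end
    return total
end

# Made-up input with separators and junk between the numbers.
text = "12 7,7;junk 12\n18446744073709551615 7"
counts = count_numbers(Vector{UInt8}(text); ntasks = 3)
```

Each task only writes to its own dictionary, and the merge at the end happens on a single task, so the threads never touch shared state while counting. Whether this beats a single-threaded pass depends on how expensive the final merge is compared to the parsing.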
