I don’t think you need a “user-land” task pool mechanism if it is just for load balancing. The Julia runtime already has a “thread pool” and we can just use it. For example, Transducers.foldxt provides a simple parameter, basesize, to control the load balancing while directly using Julia’s task scheduler (I think basesize=1 corresponds to ThreadPools.tmap and the default basesize = length(data) ÷ nthreads() corresponds to Threads.@threads).
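To make the role of basesize concrete, here is a minimal sketch of the idea using only Base. It is not how Transducers.jl is implemented; chunked_mapsum is a hypothetical helper that halves the input until a chunk has at most basesize elements and hands each half to Julia's scheduler with Threads.@spawn. It assumes a non-empty, 1-based vector.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper (not part of Transducers.jl): sum f(x) over xs by
# recursively halving the input until a chunk has at most `basesize` elements.
# Assumes xs is non-empty and uses 1-based indexing.
function chunked_mapsum(f, xs::AbstractVector;
                        basesize::Integer = max(1, length(xs) ÷ nthreads()))
    basesize >= 1 || throw(ArgumentError("basesize must be at least 1"))
    if length(xs) <= basesize
        # Base case: process the whole chunk sequentially in the current task.
        return sum(f, xs)
    end
    mid = length(xs) ÷ 2
    # Hand the left half to Julia's scheduler; work on the right half here.
    left = @spawn chunked_mapsum(f, view(xs, 1:mid); basesize = basesize)
    right = chunked_mapsum(f, view(xs, mid+1:length(xs)); basesize = basesize)
    return fetch(left) + right
end

data = collect(1:100)
per_element = chunked_mapsum(x -> x^2, data; basesize = 1)  # one task per element
per_thread = chunked_mapsum(x -> x^2, data)                 # roughly one chunk per thread
```

With basesize = 1 every element ends up in its own task, so a slow element never holds up its neighbours. With the default, each thread gets roughly one large chunk, which is cheaper to schedule but balances poorly when costs are uneven.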
A potential benefit of a ThreadPools.jl-like approach is that, when you have compute-intensive tasks mixed with I/O, it can be used to limit the number of concurrently executing tasks (I’m not sure if this is already implemented in ThreadPools.jl, but I’ve been wanting to add it to Transducers.jl). This is sometimes useful because you can cap the resources (e.g., memory) required by the entire process even when there is a lot of I/O. However, it comes at a cost, because communicating through a Channel (used in ThreadPools.jl and handy for implementing this kind of thing) is more expensive than simply spawning a task.
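As a rough illustration of limiting concurrency with a Channel, the sketch below uses only Base. It is not the ThreadPools.jl implementation; bounded_map and its ntasks keyword are hypothetical names. A fixed number of worker tasks pull indices from a shared Channel, so at most ntasks calls to f are in flight at once.

```julia
using Base.Threads: @spawn

# Hypothetical helper: apply f to every element of xs while running at most
# `ntasks` calls at the same time. Results come back in input order.
# Assumes xs uses 1-based indexing.
function bounded_map(f, xs::AbstractVector; ntasks::Integer = Threads.nthreads())
    ntasks >= 1 || throw(ArgumentError("ntasks must be at least 1"))
    results = Vector{Any}(undef, length(xs))
    # The channel acts as the work queue; fill it with every index up front.
    jobs = Channel{Int}(length(xs))
    foreach(i -> put!(jobs, i), eachindex(xs))
    close(jobs)  # workers stop once the queue has been drained
    @sync for _ in 1:ntasks
        # Each worker repeatedly takes the next index until the queue is empty.
        @spawn for i in jobs
            results[i] = f(xs[i])
        end
    end
    return results
end

squares = bounded_map(x -> x^2, collect(1:10); ntasks = 3)
```

Every element costs one trip through the Channel, which is the overhead mentioned above. In exchange, the number of simultaneously live calls (and whatever memory they hold) stays bounded by ntasks.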
So, imagine you want to run 1000 simulations, and they take anywhere from 1 ms to 40 minutes, with an average of, say, 1 minute. Suppose you have 6 cores. If you use ThreadPools you can queue up 6 simulations at a time, and as soon as one of them finishes, a new one will spawn. You’ll always have 6 of them running at any one time. You’ll never have the situation you would get with @threads for i = 1:1000 ..., where the first 100 of them all take 40 minutes because the parameter you’re sweeping through is similar for all of them, so that one thread takes 4000 minutes while the rest of the calculation, spread across the other 5 threads, takes 2000/5 = 400 minutes.
With thread pools, instead of waiting 4000 minutes for the first thread to finish, you’d wait 4000/6 + 2000/6 = 1000 minutes.
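The estimate above can be checked with a small scheduling model. This is a sketch that takes the round numbers at face value as assumptions: 100 runs of 40 minutes come first, and the remaining 900 runs share 2000 minutes of work evenly. The function names are hypothetical, and nothing is actually run in parallel; the code only adds up durations.

```julia
# Durations in minutes (assumed): 100 long runs first, then 900 short runs
# that together account for 2000 minutes of work.
durations = vcat(fill(40.0, 100), fill(2000 / 900, 900))
nworkers = 6

# Static split: each worker gets one contiguous block of the index range.
function static_makespan(durations, nworkers)
    n = length(durations)
    # Worker k handles indices bounds[k]+1 : bounds[k+1].
    bounds = [round(Int, k * n / nworkers) for k in 0:nworkers]
    return maximum(sum(durations[bounds[k]+1:bounds[k+1]]) for k in 1:nworkers)
end

# Pool-style: whichever worker becomes free first takes the next run.
function pooled_makespan(durations, nworkers)
    busy_until = zeros(nworkers)
    for d in durations
        k = argmin(busy_until)  # the worker that frees up earliest
        busy_until[k] += d
    end
    return maximum(busy_until)
end

static_minutes = static_makespan(durations, nworkers)
pooled_minutes = pooled_makespan(durations, nworkers)
```

In this model the static split comes out somewhat above 4000 minutes, because the first block holds all 100 long runs plus a few short ones, while the pooled schedule lands within a few minutes of 6000/6 = 1000.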
What you are describing is just a limitation of @threads, not of Julia’s task system. Nor is it the task pool provided by ThreadPools.jl that gives you the load-balancing behavior. That comes from the implementation of a higher-level API like ThreadPools.tmap, which happens to use one task per element. This can be done without implementing a task pool. I’m pretty sure there are a few other packages (besides Transducers.jl and ThreadsX.jl) with a similar implementation that directly uses Julia’s thread pool (e.g., Parallelism.tmap uses basesize=1 by default).
Just to be clear, I’m not discouraging anyone from using ThreadPools.jl in application (= non-library) code, especially if you don’t care about the throughput of task spawns (some tasks taking 40 min while others finish in 1 ms is a good example). It is tested in the wild and comes with an excellent profiling facility. I just wanted to discuss its properties so that we can understand the pros and cons.