Automatically fusing together several for-loops

I don’t think you need a “user-land” task pool mechanism if it is just for load balancing. The Julia runtime already has a thread pool, and we can just use it. For example, Transducers.foldxt provides a simple basesize parameter to control load balancing while directly using Julia’s task scheduler (I think basesize=1 corresponds to ThreadPools.tmap and the default basesize = length(data) ÷ nthreads() corresponds to Threads.@threads).
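As a sketch of what this looks like (the `x -> x^2` workload is just a stand-in), the two scheduling extremes differ only in the `basesize` keyword:

```julia
using Transducers  # provides foldxt and Map
using Base.Threads: nthreads

xs = 1:1000

# One task per element; maximal load balancing, roughly what
# ThreadPools.tmap-style scheduling gives you:
s1 = foldxt(+, Map(x -> x^2), xs; basesize = 1)

# Default basesize (≈ length(xs) ÷ nthreads()); static chunking,
# roughly what Threads.@threads gives you:
s2 = foldxt(+, Map(x -> x^2), xs)

@assert s1 == s2 == sum(x -> x^2, xs)
```

Both calls compute the same result; `basesize` only changes how the work is split into tasks handed to Julia’s scheduler.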

A potential benefit of the ThreadPools.jl-like approach is that, when you have compute-intensive tasks mixed with I/O, it can be used to limit the number of concurrently executing tasks (I’m not sure if this is already implemented in ThreadPools.jl, but I’ve been wanting to add it to Transducers.jl). This is sometimes useful because you can bound the resources (e.g., memory) required for the entire process even when there is a lot of I/O. However, it comes with a cost: communication through a Channel (used in ThreadPools.jl and handy for implementing this kind of thing) is more expensive than simply spawning a task.
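A minimal sketch of that concurrency-limiting idea, using a Channel as a counting semaphore (the `work` and `limit` names are illustrative, not any package’s API):

```julia
using Base.Threads: @spawn

# Run `work(x)` for each x in xs, but never more than `limit` at once.
function limited_foreach(work, xs; limit = 4)
    tokens = Channel{Nothing}(limit)
    for _ in 1:limit
        put!(tokens, nothing)        # fill the channel with `limit` slots
    end
    @sync for x in xs
        take!(tokens)                # blocks while `limit` tasks are running
        @spawn begin
            try
                work(x)
            finally
                put!(tokens, nothing)  # release the slot even if work errors
            end
        end
    end
end
```

`Base.Semaphore` with `Base.acquire`/`Base.release` would work similarly; the Channel version is shown because that is the mechanism the post mentions.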


So, imagine you want to run 1000 simulations that take anywhere from 1 ms to 40 minutes, with an average of, say, 1 minute, and suppose you have 6 cores. If you use ThreadPools, you can queue up 6 simulations at a time; as soon as one finishes, a new one is spawned. You’ll always have 6 running at any one time, and you’ll never hit the situation you can get with @threads for i = 1:1000 ..., where the first 100 iterations all take 40 minutes because the parameter you’re sweeping is similar for all of them, and then that one thread takes 4000 minutes while the rest of the calculation across all the other threads takes only 2000/5 = 400 minutes.

With a thread pool, instead of waiting 4000 minutes for that first thread to finish, you’d wait 4000/6 + 2000/6 = 1000 minutes.
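The dynamic behavior described above can be sketched without any pool at all, with one task per simulation handed straight to Julia’s scheduler (`simulate` here is a hypothetical stand-in for the real workload):

```julia
using Base.Threads: @spawn

simulate(i) = sum(sin, 1:i)  # stand-in for a simulation of varying cost

# One task per iteration: idle worker threads pick up the next pending
# task as soon as they finish, so a handful of 40-minute simulations
# cannot trap hundreds of 1-ms ones behind them on the same thread.
tasks = [@spawn simulate(i) for i in 1:1000]
results = map(fetch, tasks)
```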

What you are describing is a limitation of @threads, not of Julia’s task system. It is also not the task pool provided by ThreadPools.jl that gives you the load-balancing behavior; it is the implementation of higher-level APIs like ThreadPools.tmap, which happen to use one task per element. This can be done without implementing a task pool. I’m pretty sure there are a few other packages (besides Transducers.jl and ThreadsX.jl) with a similar implementation that directly uses Julia’s task scheduler (e.g., Parallelism.tmap uses basesize=1 by default).
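A minimal load-balanced tmap in this style is only a couple of lines (this is a sketch, not the actual implementation of any of the packages mentioned):

```julia
using Base.Threads: @spawn

# One task per element, scheduled directly by Julia's runtime;
# no user-land pool or Channel is involved.
tmap1(f, xs) = map(fetch, [@spawn f(x) for x in xs])

@assert tmap1(x -> 2x, 1:10) == collect(2:2:20)
```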


Just to be clear, I’m not discouraging anyone from using ThreadPools.jl in application (= non-library) code, especially if you don’t care about the throughput of task spawns (some tasks taking 40 minutes while others finish in 1 ms is a good example). It is tested in the wild and comes with an excellent profiling facility. I just wanted to discuss its properties so that we can understand the pros and cons.