Consider the following scenario (in local scope):
```julia
@threads for i ∈ 1:128
    y[i] = dosomethingexpensive(x[i])
end
```
Say I'm running this on a machine with 32 physical cores. Then often, the last n < 32 calls to `dosomethingexpensive` are completed by fewer than n threads.
What would be the best way to achieve greater balancing?
Background: an example would be the case in which each i corresponds to a replication in a simulation study, where each replication can take a few minutes, but where there is no ex ante expectation that one replication would take longer than another.
Currently, only direct segmentation is supported by the `@threads` macro. That is, there is no work-stealing API directly available yet - you'll have to do the balancing yourself. One way "around" that is to use a `Channel` of tasks, which are created ahead of time and pushed into that channel. After all tasks are created, `take!` from the channel on all threads and execute the given task, thereby emulating a work-stealing scheduler.
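A minimal sketch of that pattern (the `x`, `y`, and `dosomethingexpensive` here are hypothetical stand-ins for the ones in the question):

```julia
using Base.Threads

# Hypothetical placeholder for the expensive, variable-duration work.
dosomethingexpensive(v) = (sleep(0.01); v^2)

x = collect(1:128)
y = zeros(Int, length(x))

# Push all work items into a Channel ahead of time, then close it so that
# iteration on each worker terminates once the channel is drained.
jobs = Channel{Int}(length(x))
foreach(i -> put!(jobs, i), eachindex(x))
close(jobs)

# One long-lived task per thread; iterating a Channel take!s items until
# the channel is closed and empty, so fast threads simply grab more work.
@sync for _ in 1:nthreads()
    @spawn for i in jobs
        y[i] = dosomethingexpensive(x[i])
    end
end
```

Because each worker pulls the next index only when it finishes the previous one, an unlucky thread stuck on a slow replication doesn't hold up a pre-assigned chunk of iterations.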
You can also take a look at ThreadPools.jl, though that comes with some caveats, since Julia doesn't pin its threads to particular CPU threads, etc.
`Threads.@spawn` does load balancing.
I've played with something like `@sync for ...`, but I recall reading on this forum that `@threads` is preferable for load-balancing reasons… I'll play around some more.
`@threads` has lower overhead ("is cheaper") but doesn't do load balancing at all: the iteration range of the loop is split into equal parts according to the number of available threads. OTOH, `@spawn` implements a form of load balancing but has more overhead. See the blog post Announcing composable multi-threaded parallelism in Julia.
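For the loop in the question, the `@spawn`-based variant would look roughly like this (again with hypothetical stand-ins for `x`, `y`, and `dosomethingexpensive`):

```julia
using Base.Threads

dosomethingexpensive(v) = (sleep(0.001); v + 1)  # hypothetical placeholder

x = collect(1:128)
y = zeros(Int, length(x))

# One task per iteration: Julia's scheduler runs tasks on whichever thread
# becomes free, giving dynamic load balancing at the cost of per-task overhead.
@sync for i in eachindex(x)
    @spawn y[i] = dosomethingexpensive(x[i])
end
```

With only 128 iterations of a few minutes each, the per-task overhead is negligible relative to the work, so this trade-off clearly favors `@spawn`.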
`@spawn` is basically the same as managing the tasks explicitly by hand via a `Channel`. In the case of `@spawn`, it's the Julia task system that's doing the "balancing" for you implicitly.
As of Julia 1.5, `@threads` accepts a `schedule` argument, though currently only `:static` ("which creates one task per thread and divides the iterations equally among them") is supported. In the future, when more kinds of scheduling are supported, `@threads` may be the better option (though I'm not sure what the current direction of things in that regard is).
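For reference, the `:static` schedule is spelled like this; it reproduces the pre-1.5 behavior of one contiguous chunk per thread:

```julia
using Base.Threads

# :static splits 1:8 into nthreads() contiguous chunks, one task per thread.
ids = zeros(Int, 8)
@threads :static for i in 1:8
    ids[i] = threadid()  # with :static, threadid() is stable within an iteration
end
```

Inspecting `ids` after the loop shows which thread handled which chunk of the range.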
I've implemented load-balancing threaded parallel loops in FLoops.jl, which also supports a wide class of scheduling policies depending on your needs (plus other things, like distributed and GPU-based parallel loops and reductions).
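A sketch of how that might look with FLoops.jl, assuming the package is installed (the `dosomethingexpensive` placeholder is hypothetical; `ThreadedEx` and its `basesize` option are the executor API FLoops.jl exports):

```julia
using FLoops  # assumes FLoops.jl has been added to the environment

dosomethingexpensive(v) = v^2  # hypothetical placeholder

function run_sim(x)
    y = zeros(eltype(x), length(x))
    # basesize = 1 makes each iteration its own chunk, i.e. maximal load
    # balancing; a larger basesize trades balancing for lower scheduling overhead.
    @floop ThreadedEx(basesize = 1) for i in eachindex(x)
        y[i] = dosomethingexpensive(x[i])
    end
    return y
end
```

The `basesize` knob is the scheduling policy choice: for replications that each take minutes, `basesize = 1` is the natural setting.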