using Base.Threads

function ....
    @threads for i ∈ 1:128
        y[i] = dosomethingexpensive(x[i])
    end
end
Say I’m running this on a machine with 32 physical cores. Then, towards the end of the loop, the last n < 32 calls to dosomethingexpensive are often left to fewer than n threads, so most cores sit idle while the remaining work finishes.
What would be the best way to achieve better load balancing?
Background: as an example, each i might correspond to a replication in a simulation study, where each replication can take a few minutes but there is no ex ante expectation that one replication will take longer than another.
Currently, only static segmentation (even chunking) is supported by the @threads macro. That is, there is no work-stealing API directly available yet; you’ll have to do the balancing yourself. One way around that is to use a Channel of tasks: create them ahead of time and push them into the channel. After all tasks are created, take! from the channel on all threads and execute each task, thereby emulating a work-stealing scheduler.
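A minimal sketch of that approach: for simplicity it pushes indices rather than Task objects into the Channel, and balanced_map! / dosomethingexpensive are placeholder names, not an established API.

using Base.Threads

function balanced_map!(y, x)
    jobs = Channel{Int}(length(x))
    foreach(i -> put!(jobs, i), eachindex(x))  # enqueue all work items up front
    close(jobs)  # lets consumers stop once the channel is drained

    @threads for _ in 1:nthreads()
        for i in jobs  # each thread repeatedly take!s the next pending index
            y[i] = dosomethingexpensive(x[i])
        end
    end
    return y
end

Whichever thread finishes its current item early simply pulls the next one, so no thread sits idle while work remains.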
You can also take a look at ThreadPools.jl, though that comes with some caveats, since Julia doesn’t pin its threads to specific CPU threads etc.
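For reference, a sketch using ThreadPools.jl’s queued scheduling; this assumes its @qthreads macro, so check the package README for the current API:

using ThreadPools

# @qthreads queues the iterations and hands each one to whichever thread
# frees up first, unlike the even chunking of Base's @threads.
@qthreads for i in 1:128
    y[i] = dosomethingexpensive(x[i])
end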
@threads has lower overhead (“is cheaper”) but doesn’t do load balancing at all: the iteration range of the loop is split into equal parts according to the number of available threads. OTOH, @spawn implements a form of load balancing but has more overhead. See Announcing composable multi-threaded parallelism in Julia.
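For comparison, a sketch of the @spawn version, which creates one task per iteration and lets the scheduler place each task on a free thread:

using Base.Threads

@sync for i in 1:128
    @spawn y[i] = dosomethingexpensive(x[i])  # one task per item; the scheduler balances
end

The per-task overhead only pays off when each iteration is expensive, as in the simulation example above.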
Yes, using @spawn is basically the same as managing the tasks explicitly by hand via a Channel. In the case of @spawn, it’s the Julia task system that’s doing the “balancing” for you implicitly.
As of Julia 1.5, @threads takes a schedule argument, though currently only :static (“which creates one task per thread and divides the iterations equally among them”) is supported. In the future, when more kinds of scheduling are supported, @threads may be the better option (though I’m not sure what the current direction is in that regard).
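In code, the schedule argument looks like this; the :static schedule just reproduces the default even chunking:

using Base.Threads

# Julia 1.5+: explicit schedule argument; :static is the only option so far.
@threads :static for i in 1:128
    y[i] = dosomethingexpensive(x[i])
end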