GCC vs Threads.@threads vs Threads.@spawn for large loops

A statically scheduled loop in OpenMP splits the loop range over num_threads and typically uses a tree to fan out the work to the threads in parallel. At the end of the loop, the inverse of the tree is typically used for a barrier. This broadcast-barrier pair of synchronization constructs is pretty much the entire overhead for the loop, and both are very well studied: each takes only hundreds to a few thousand cycles, depending on the processor and the number of threads.
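For reference, the closest Julia analogue of such a statically scheduled loop is `Threads.@threads :static`, which likewise carves the iteration range into one contiguous chunk per thread up front. A minimal sketch (the function name, the coefficient, and the array sizes are made up for illustration):

```julia
using Base.Threads

# Statically scheduled loop: @threads :static splits eachindex(y, x) into
# nthreads() contiguous chunks at loop entry, one per thread, much like
# OpenMP's schedule(static).
function axpy_static!(y, a, x)
    @threads :static for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end

x = rand(1_000_000)
y = zeros(1_000_000)
axpy_static!(y, 2.0, x)
```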

But what happens if, in the loop body, you call another function that also has a parallel loop? With OpenMP you have to analyze your call graph, determine the thread allocation at each level, and then carefully set up thread affinities, and possibly environment variables for libraries, etc., in order to use static scheduling all the way down. It is not impossible, just very, very hard. So OpenMP added tasks and teams, but those aren't pervasive in libraries anyway.

What I'm getting at is that it isn't just variable-duration loop iterations that require dynamic scheduling of the sort Julia's scheduler manages; nested parallel calls do too, as sketched below.
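Here is a minimal sketch of that nested case in Julia; the function names, the `sin` kernel, and the sizes are made up. On recent Julia versions (1.8+, where `:dynamic` is the default schedule) the inner parallel loop simply contributes tasks to the same thread pool as the outer one, with no affinity or environment setup:

```julia
using Base.Threads

# A hypothetical "library" routine that is itself parallel.
function inner!(v)
    @threads :dynamic for i in eachindex(v)
        v[i] = sin(v[i])
    end
    return v
end

# An outer parallel loop whose body calls the parallel inner routine.
# All the nested tasks are multiplexed onto the same thread pool by
# Julia's dynamic scheduler; nothing else needs to be configured.
function outer!(vs)
    @sync for v in vs
        @spawn inner!(v)
    end
    return vs
end

vs = [rand(10_000) for _ in 1:32]
outer!(vs)
```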

As you say, ‘not everything can be done at once’, or IOW, there’s no magic bullet. Nonetheless, we still hope to improve the common case.