Are there any plans to improve the composability of @batch and/or reduce overhead of @threads?

Does it only require a lot of work? Would it require Julia 2.0? Etc.

This would definitely be useful info when planning projects.

Related to what I learned here:

I think there are inherent tradeoffs depending on the design choices made that might make it impossible to create the one best solution.

I’d say the Task-based system offered by Julia is fairly performant for the safety that it offers. There was quite a bit of effort to reduce the memory footprint of each Task while retaining safety. I am certainly no expert on this, but I have read a bit on this forum, e.g. with regards to how the global RNG is split for spawned Tasks such that each Task gets a distinct stream of random numbers. Of course, if you don’t use this feature, then that is “unnecessary” overhead for your use case. Another great feature is the composability of the multithreading, which comes with a bit of scheduling overhead. All in all, the overhead of spawning a Task and having it run some code is \mathcal{O}(\mathrm{\mu s}). If your workloads are much faster than that, you need to look for other ways to speed them up, or parallelize at a higher level.
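To make those orders of magnitude concrete, here is a minimal sketch (function names are my own, not from any library) that shows the per-Task RNG split and roughly times a bare `Threads.@spawn` round trip:

```julia
# Each spawned Task gets its own split of the default RNG, so tasks
# drawing random numbers see independent streams by default.
function task_rng_demo()
    t1 = Threads.@spawn rand()
    t2 = Threads.@spawn rand()
    return fetch(t1), fetch(t2)   # (almost surely) distinct values
end

# Rough measure of spawn + schedule + fetch overhead, in microseconds.
# Expect something on the order of a microsecond, not nanoseconds.
function spawn_overhead_us()
    t0 = time_ns()
    fetch(Threads.@spawn nothing)
    return (time_ns() - t0) / 1e3
end
```

This is only a crude single-shot timing (a proper measurement would use BenchmarkTools.jl), but it makes the point that per-Task overhead dwarfs a nanosecond-scale loop body.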

Your example shows a function that runs in 20\,\mathrm{ns} and is thus just not a good target for multithreading (probably in any language!). What your example also shows is that you can get away with less overhead if you choose different tradeoffs (by using Polyester.jl). Now, I am not an expert on Polyester.jl either, but I seem to recall that it somewhat works around Julia’s scheduler by having the threads wait for new Polyester jobs in a spin-loop (I might be wrong). I think this does not have the composability of Julia’s scheduler. Additionally, I don’t know if there are other differences with regard to safety (e.g. the README hints at fewer bounds checks being performed by default). Polyester.jl seems to be aimed more at speeding up hot inner loops of (numeric) computations than at general-purpose multithreading, so it makes sense that it chooses different tradeoffs.
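For loops whose iterations are fully independent, the two macros are roughly drop-in replacements for each other. A minimal sketch using only Base.Threads (with Polyester.jl loaded, you could swap `Threads.@threads` for `@batch` to trade composability for lower per-call overhead):

```julia
# Independent per-element work: safe under Threads.@threads, and the
# same loop body would work under Polyester.jl's @batch (which, per its
# README, elides bounds checks by default).
function square_all!(y, x)
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] = x[i]^2
    end
    return y
end
```

Whether the lower-overhead macro actually pays off depends on how much work each iteration does relative to the microsecond-scale scheduling cost discussed above.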

In the end, I think it is good to have a safe, general-purpose default system, plus different libraries with different design choices for specialized applications. Of course, you need to understand the features and limitations of these systems in order to use them correctly in the right situation.


The MWE was meant to demonstrate the “offending” allocations. In the real code (requiring 10 threads) the timings are ≈1.1 ms, 400 μs, and 150 μs for sequential, @threads, and @batch.

However, when nested inside another parallel loop, @threads will outperform @batch due to its better scheduling. And that scenario will likely always be the case when performance is a priority here.
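The composability point can be sketched with a toy example (my own, not the real code from this thread): an outer `Threads.@threads` loop calls an inner routine that itself spawns a Task, and Julia’s depth-first scheduler interleaves the nested work across the same threads instead of oversubscribing. A Polyester `@batch` nested inside another parallel region would not compose this way.

```julia
# Inner routine that is itself multithreaded.
function inner_sumsq(v)
    h = length(v) ÷ 2
    t = Threads.@spawn sum(abs2, view(v, 1:h))     # runs concurrently
    s = sum(abs2, view(v, h+1:length(v)))
    return s + fetch(t)
end

# Outer parallel loop over columns; each iteration spawns more work,
# which Julia's scheduler handles without spinning up extra threads.
function outer_sumsq(mat)
    out = zeros(size(mat, 2))
    Threads.@threads for j in axes(mat, 2)
        out[j] = inner_sumsq(view(mat, :, j))
    end
    return sum(out)
end
```

The same pattern works with a single thread, too: the nested Tasks simply run sequentially, which is exactly the graceful degradation the composable design buys you.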

So multithreading is definitely a benefit, despite those allocations. It’s just bugging me to have a few remaining allocations purely due to multithreading overhead.

It sounds like these are necessary tradeoffs, which is good to know. Thanks.