I’m trying to understand the multithreading options in Julia. It seems there are at least three: `Threads.@threads`, `Polyester.@batch`, and `LoopVectorization.@tturbo`. From what I’ve read, it sounds like `@batch` can be faster than `Threads.@threads`. Can anyone give a quick summary of the pros and cons of these three methods and when each might be preferred?
Bump. Why does `Polyester.@batch` have lower overhead than `Threads.@threads`? Does this lower overhead make it less flexible? And how does it compare with OpenMP applied to for-loops in C/C++/Fortran?

P.S. I found some information about the difference between Polyester and LoopVectorization. Beyond that, does Polyester aim to be an eventual replacement for `Threads.@threads`?
@Elrod is the mastermind behind these tools, so he can better answer how they evolved, but here is a quick description of how I use them.
`@tturbo` is threading + SIMD instructions (CPU instructions that act simultaneously on 4 or 8 neighboring array elements). It is just a threaded version of `@turbo`, and it uses `Polyester` for the threading. It is meant for parallelizing simple inner loops, typically operations where a single execution of the bare loop takes no more than a couple hundred nanoseconds. You would probably never use `@tturbo` on something that is not an array of `isbits` objects.
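For concreteness, here is a hypothetical example (the function name and sizes are just for illustration) of the kind of simple `isbits`-element loop `@tturbo` targets:

```julia
using LoopVectorization

# A bare inner loop over Float64 (isbits) elements: ideal @tturbo territory.
function axpy!(y, a, x)
    @tturbo for i in eachindex(y, x)
        y[i] = a * x[i] + y[i]
    end
    return y
end

x = rand(10_000); y = rand(10_000)
axpy!(y, 2.0, x)
```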
`Polyester.@batch` hijacks the threads provided by Julia and uses a much faster but simpler scheduler. It simply does not provide as many ways to nest threads as `@threads`. Because of its simplicity it has drastically lower overhead, so it is useful for multithreading things that are already very fast (where setting up the `@threads` scheduling might take longer than the fast operation itself). Usually you should use `@batch` only if your threaded jobs might often be small.
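A hypothetical sketch of where `@batch` shines: many cheap, independent iterations whose total runtime is comparable to the scheduling cost of `@threads` (`scale!` is an illustrative name, not an API):

```julia
using Polyester

function scale!(y, x, a)
    # @batch spreads iterations across Julia's threads with minimal
    # scheduling overhead, so even a very short loop can benefit.
    @batch for i in eachindex(y, x)
        y[i] = a * x[i]
    end
    return y
end

x = rand(1_000); y = similar(x)
scale!(y, x, 2.0)
```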
You can nest `@batch` inside of `@threads`, but the scheduling of the threads might get very confused. I usually just disable the `Polyester` threads when I do such nesting. I think the documentation (and accompanying benchmarks) of this thread-disabling feature (implemented 2 days ago) would be of interest to you: https://github.com/JuliaSIMD/Polyester.jl#disabling-polyester-threads (at the bottom of the README).
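Roughly, the pattern from that README section looks like the sketch below (check the README for the exact current API; `disable_polyester_threads` takes a do-block and reenables the threads afterwards):

```julia
using Polyester, Base.Threads

xs = [rand(1_000) for _ in 1:8]
ys = [similar(x) for x in xs]

# Turn off Polyester's threads around the @threads region, so the
# @batch loop inside scale! (defined above) runs serially instead of
# fighting with @threads over the same threads.
Polyester.disable_polyester_threads() do
    @threads for j in eachindex(xs)
        scale!(ys[j], xs[j], 2.0)
    end
end
```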
`@tturbo` is best applied to the outermost loop where it is valid. It may then parallelize any of the loops in the nest. In this way, it is different from `@simd`.
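For example, in a loop nest like a plain matrix multiply, `@tturbo` goes on the outermost loop and is then free to thread and vectorize any of the loops inside (a hypothetical sketch):

```julia
using LoopVectorization

function mymul!(C, A, B)
    # @tturbo is placed on the outermost loops; it may parallelize
    # across n, vectorize across m, unroll k, etc., as it sees fit.
    @tturbo for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end

A = rand(64, 64); B = rand(64, 64); C = similar(A)
mymul!(C, A, B)
```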
Otherwise, @Krastanov’s summary is good.
When it is valid, `@tturbo` will generally be the fastest. `@tturbo` also handles reductions, e.g.
```julia
using LoopVectorization

function mysum(x)
    s = 0.0
    # @tturbo recognizes the reduction on s and correctly combines
    # the per-thread, per-SIMD-lane partial sums.
    @tturbo for i in eachindex(x)
        s += x[i]
    end
    return s
end
```
which will either not work or lead to incorrect answers if you use `@batch` or `Threads.@threads` instead.
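For example, a naive `Threads.@threads` translation races on `s`, so it silently computes a wrong sum (a hypothetical sketch of the failure mode):

```julia
using Base.Threads

function mysum_racy(x)
    s = 0.0
    @threads for i in eachindex(x)
        s += x[i]  # unsynchronized read-modify-write: updates get lost
    end
    return s
end

x = rand(1_000_000)
mysum_racy(x) ≈ sum(x)  # usually false with more than one thread
```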
In general, you should be able to add or remove `@tturbo` on a loop without changing the behavior. `@tturbo` does the most, so it is the most vulnerable to bugs, which makes that guarantee helpful: if your answer changes when you add or remove `@tturbo`, it is `@tturbo`’s fault rather than your own.
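One hypothetical way to use that guarantee: keep a macro-free reference implementation around and compare against it (use `isapprox` rather than `==`, since `@tturbo` may reassociate floating-point operations):

```julia
# Plain reference version of mysum from above, with no macros.
function mysum_ref(x)
    s = 0.0
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

x = rand(10_000)
mysum(x) ≈ mysum_ref(x)  # should hold; if not, suspect @tturbo
```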
Thank you for the explanation. It sounds like `@tturbo` or `@batch` are the preferred way to parallelize simple loops, while `@threads` gives more flexibility for complex loops.