Threading: Threads.@threads vs. Polyester.@batch vs. LoopVectorization.@tturbo

I’m trying to understand the multithreading options in Julia. It seems that there are at least three: Threads.@threads, Polyester.@batch, and LoopVectorization.@tturbo. From what I’ve read, it sounds like @batch can be faster than Threads.@threads. Can anyone give a quick summary of the pros and cons of these three methods and when each might be preferred?


Bump. Why does Polyester.@batch have lower overhead than Threads.@threads? Does this lower overhead make it less flexible? And how does it compare with OpenMP applied to for-loops in C/C++/Fortran?

P.S. I found some information about the difference between Polyester and LoopVectorization. Beyond that, does Polyester aim to be an eventual replacement for Threads.@threads?

@Elrod is the mastermind behind these tools, so he can better answer how they evolved, but here is a quick description of how I use them.

@tturbo is threading+SIMD instructions (CPU instructions that act simultaneously on 4 or 8 neighboring array elements). It is just a threaded version of @turbo and it uses Polyester for the threading. It is meant for parallelizing simple inner loops, which typically are operations where a single execution of the bare loop takes no more than a couple hundred nanoseconds. You would probably never use @tturbo on something that is not an array of isbits objects.
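For instance, a minimal sketch (the function name is my own, untested) of @tturbo on exactly this kind of simple inner loop over an array of isbits elements:

using LoopVectorization

# threaded + SIMD fused multiply-add; the regime @tturbo targets
function saxpy!(y, a, x)
    @tturbo for i in eachindex(x)
        y[i] = a * x[i] + y[i]
    end
    return y
end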

Polyester.@batch hijacks the threads provided by Julia and uses a much faster but simpler scheduler. It does not provide as many ways to nest threads as @threads. Because of its simplicity it has drastically lower overhead, so it is useful for multithreading things that are already very fast (where setting up the @threads scheduling might take longer than your fast operation itself). Usually you should use @batch only when your threaded jobs are small.
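As a rough sketch (my own example) of the same kind of loop with @batch, which threads it without the @turbo vectorization machinery:

using Polyester

# low-overhead threading: chunks of the iteration space are handed to
# Julia's threads through Polyester's lightweight scheduler
function batch_square!(y, x)
    @batch for i in eachindex(x)
        y[i] = x[i] * x[i]
    end
    return y
end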

You can nest @batch inside of Threads.@threads, but the scheduling of the threads might get very confused. I usually just disable the Polyester threads when I do such nesting. I think the documentation (and accompanying benchmarks) of this thread-disabling feature (implemented 2 days ago) would be of interest to you: https://github.com/JuliaSIMD/Polyester.jl#disabling-polyester-threads (at the bottom of the README)
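The pattern looks roughly like this (an untested sketch based on that README section; inside the do-block, @batch loops run serially, so the outer Threads.@threads loop gets all the Julia threads to itself):

using Polyester

function inner!(y, x)
    @batch for i in eachindex(x)   # runs single-threaded while disabled
        y[i] = sqrt(abs(x[i]))
    end
end

function outer!(ys, xs)
    Polyester.disable_polyester_threads() do
        Threads.@threads for j in eachindex(ys)
            inner!(ys[j], xs[j])
        end
    end
end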


@tturbo is best applied to the outermost loop of a nest where it is valid.
It may then parallelize any of the loops in that nest.

In this way, it is different from @simd.
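For example, in a sketch like this (my own, untested), @tturbo is placed on the outer loop of a matrix-vector product, and LoopVectorization is free to decide which loop to thread and which to vectorize:

using LoopVectorization

# @tturbo on the outer loop of a loop nest
function gemv!(y, A, x)
    @tturbo for i in eachindex(y)
        acc = zero(eltype(y))
        for j in eachindex(x)
            acc += A[i, j] * x[j]
        end
        y[i] = acc
    end
    return y
end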

Otherwise, @Krastanov’s summary is good.

When it is valid, @tturbo will generally be the fastest.
@tturbo also handles reductions, e.g.

using LoopVectorization

function mysum(x)
    s = 0.0
    @tturbo for i in eachindex(x)  # threaded SIMD reduction
        s += x[i]
    end
    return s
end

which will either not work or lead to incorrect answers if you use @batch or Threads.@threads instead.
In general, you should be able to add or remove @tturbo from a loop without changing the behavior.
@tturbo does the most, so it is the most vulnerable to bugs, which makes that property helpful: if your answer changes when you add or remove @tturbo, it is @tturbo’s fault rather than your own.
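As a quick sanity check (my own usage example), compare against Base.sum; ≈ accounts for the different floating-point summation order:

x = rand(10_000)
mysum(x) ≈ sum(x)   # should be true; if not, suspect the @tturbo version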


Thank you for the explanation. It sounds like @tturbo or @batch is the preferred way to parallelize simple loops, while @threads gives more flexibility for complex loops.