Gcc vs Threads.@threads vs Threads.@spawn for large loops

Hi, Schönauer vector triad again, see also the benchmarking site of Georg Hager.

This time we compare multithreading vs. scalar, and also compare to gcc.

Here ist the generating code.

We see that for scalar performance, julia shines vs. gcc -Ofast. For multithreading performance (4 threads), the picture appears to be different. It seems that the bookkeeping overhead for handling threading is still larger compared to what we see for gcc. For small array sizes we see this overhead problem also for gcc. For large array sizes, performance is limited by memory access (it’s a laptop…).

This is the @threads based loop:

            Threads.@threads for i=1:N
                @inbounds @fastmath  d[i]=a[i]+b[i]*c[i]
            end

And this is the gist of the faster, @spawn implementation:

mapreduce(task->fetch(task),+,[Threads.@spawn _kernel(a,b,c,d,loop_begin[i],loop_end[i]) for i=1:ntasks])

A very similar picture one finds in a 2013 post by G. Hager on intel vs. gnu compiler:

He also suspects barrier performance for gcc to be the reason for the performance difference.

I am aware of the fact that it is still early times for multithreading in Julia, and things are clearly marked as experimental, so this post is meant as an encouragement to continue the endeavour of trying to be on par with C performance-wise (and thanks anyway for starting this!)

8 Likes

Have you looked at https://github.com/tro3/ThreadPools.jl or https://github.com/chriselrod/LoopVectorization.jl

Thanks for the post. We do have some work to do in reducing overheads. In particular, @threads for is not an optimized implementation.

However, I do want to note that a fairer comparison would be with #pragma omp parallel for schedule(dynamic) since Julia’s multi-threading is inherently dynamically scheduled. This is a part of how we achieve nested parallelism which is very difficult to accomplish with OpenMP. A less happy consequence is that we are unlikely to match a statically scheduled OpenMP parallel loop in terms of overheads – we simply do far more.

3 Likes

Out of personal interest, is there a recommended alternative? Or this a case of, in general this is thing to use, we just haven’t optimized it yet?

1 Like

There is no “recommended” alternative, but there are at least 2 and probably more packages that layer additional/different parallelism functionality above the language’s runtime. We do plan to enhance the parallel loop support in the runtime and it may stay in this form or have a different name/semantics – not concrete yet.

1 Like

Hi, thanks for the hints.

I am trying out ThreadPools, but need to understand more about this package before I would say anything.

@avx doesn’t give any additional performance here as it is memory acces bound, but surprisingly it changes the picture with caching and SharedArrays,
see my other post which I will update.

Thank you for the info. Below I update the figure with the result for
#pragma omp for schedule(dynamic, N/omp_max_threads()). Indeed this is slower than omp with static scheduling. I tried out several different ways of “by hand” scheduling for Julia, but these don’t make much of a difference.
I think that these large, statically scheduleable loops are not that seldom in PDE numerics, especially iterative solvers. But then, of course, not everything can be done at once…

I thought @threads for was a static scheduling over the range you give it. But perhaps the underlying scheduler doesn’t exploit this so there is still the same overhead as for dynamic scheduling.

A statically scheduled loop in OpenMP splits the loop range over num_threads and typically uses a tree to fan-out the work to the threads in parallel. At the end of the loop, the inverse of the tree is typically used for a barrier. This broadcast-barrier pair of synchronization constructs are pretty much the entire overhead for the loop and these are very well studied – each takes only hundreds to a few thousand cycles, depending on the processor and the number of threads.

But what happens if in the loop body you call another function that also has a parallel loop? With OpenMP you have to analyze your call graph, determine thread allocation at each level, and then carefully set up thread affinities, and possibly environment variables for libraries, etc. in order to use static scheduling all the way down. It is not impossible, just very very hard. So OpenMP added tasks and teams. And those aren’t pervasive in libraries anyway.

What I’m getting at is that it isn’t just variable duration loop iterations that require dynamic scheduling of the sort Julia’s scheduler manages.

As you say, ‘not everything can be done at once’, or IOW, there’s no magic bullet. Nonetheless, we still hope to improve the common case.

Hi, some update here:
While the above results have been obtained on a laptop with 1 NUMA node + 12MB L3 cache, here I show the result from a Xeon server with 2 NUMA nodes with 256GB + 40MB L3 cache each. As above, the parallel case uses 4 threads.

So here the picture is a bit different. The processor is slightly slower, and cache is larger, so we see a much larger region where parallelization works. Moreover we have two lanes to memory. Now, @threads makes a better impression ;-).
So if chunk sizes are large enough, Julia appears to be on par.

1 Like

I thought @threads for was a static scheduling over the range you give it. But perhaps the underlying scheduler doesn’t exploit this so there is still the same overhead as for dynamic scheduling.

It is. But the broadcast+barrier implementations here are completely serial. As N increases, these overheads become relatively smaller (edit: as shown in the most recent graph), but there’s lots of scope for improvement. :slight_smile:

3 Likes

What is the difference between those two packages?

One is for threaded execution of the code, the other takes advantage of multiple instructions per cycle when working on array data.

Then ThreadPools.jl is an alternative to Threads .@ spawn?
And is it a good idea to try to use both ThreadPools and LoopVectorization at the same time?

I do use threads and LoopVectorization in the same program.
I believe the ThreadPools free the programmer from having to
manage the threads manually. I haven’t used it myself.