Overhead of `Threads.@threads`

There is a certain overhead to using `Threads.@threads`, even if only a single thread is used.

For example, running Julia v1.5.3 with `julia --threads=1` yields:

julia> using BenchmarkTools

julia> function foo_serial!(dest, src)
           for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> function foo_threaded!(dest, src)
           Threads.@threads for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> Threads.nthreads()
1

julia> src = rand(10^3); dest = similar(src);

julia> @benchmark foo_serial!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.426 μs (0.00% GC)
  median time:      10.532 μs (0.00% GC)
  mean time:        10.653 μs (0.00% GC)
  maximum time:     46.086 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark foo_threaded!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  832 bytes
  allocs estimate:  6
  --------------
  minimum time:     12.653 μs (0.00% GC)
  median time:      12.929 μs (0.00% GC)
  mean time:        13.255 μs (0.00% GC)
  maximum time:     44.788 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Of course, the threaded version becomes faster if multiple threads are used, e.g. when running `julia --threads=4`:

julia> using BenchmarkTools

julia> function foo_serial!(dest, src)
           for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> function foo_threaded!(dest, src)
           Threads.@threads for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> Threads.nthreads()
4

julia> src = rand(10^3); dest = similar(src);

julia> @benchmark foo_serial!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.048 μs (0.00% GC)
  median time:      10.289 μs (0.00% GC)
  mean time:        10.407 μs (0.00% GC)
  maximum time:     29.898 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark foo_threaded!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  2.98 KiB
  allocs estimate:  21
  --------------
  minimum time:     5.005 μs (0.00% GC)
  median time:      5.827 μs (0.00% GC)
  mean time:        6.155 μs (1.51% GC)
  maximum time:     333.063 μs (95.86% GC)
  --------------
  samples:          10000
  evals/sample:     6

We are writing a library of numerical algorithms for partial differential equations that should be relatively easy to use for students and extensible for researchers. At the same time, the performance should not be too bad. Currently, we just use `Threads.@threads` for loops where multithreading gives us improvements for common problem sizes and multiple threads. However, we would also like to avoid the overhead of this approach when only one thread is used. Hence, we would like to know which approach you think would be best for us.

  • Just use Threads.@threads for ... as we do now and hope that the overhead will be reduced in the future. Are there any plans/roadmaps for that?
  • Write each function that can benefit from threads twice: one version with `Threads.@threads`, the other without, maybe using a macro internally to simplify things. There are different ways to realize something like this.
    • We could just check the number of threads at runtime, e.g. end up having something like
      julia> function foo_maybethreaded!(dest, src)
                 if Threads.nthreads() == 1
                     foo_serial!(dest, src)
                 else
                     foo_threaded!(dest, src)
                 end
             end
      
      This is probably okay since the loops are usually more costly than a single if check.
    • We could add another type parameter to our discretization structs and use that to dispatch, e.g. something like parallelization::Val{:serial} or parallelization::Val{:threaded} (a rough sketch of this option is shown below this list).
    • We could use functions returning true or false and use these to decide whether to use threading or not globally in our package, similar to enable_debug_timing/disable_debug_timing in TimerOutputs.jl
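
For illustration, a rough sketch of the Val-based dispatch option could look like the following (all names are made up for this example and not taken from our actual package):

struct Discretization{Parallelization}
    parallelization::Parallelization
    # ... other fields of the discretization ...
end

foo!(dest, src, disc::Discretization) = foo!(dest, src, disc.parallelization)

function foo!(dest, src, ::Val{:serial})
    for i in eachindex(dest, src)
        @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
    end
    return nothing
end

function foo!(dest, src, ::Val{:threaded})
    Threads.@threads for i in eachindex(dest, src)
        @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
    end
    return nothing
end

# usage: foo!(dest, src, Discretization(Val(:serial)))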

What would you propose?

Did you benchmark to see whether the overhead is something you even need to worry about?

I once did this PR https://github.com/JuliaLang/julia/pull/33964, which made @threads a no-op if nthreads() == 1. But apparently that was not safe, and no workaround was suggested. The roadmap PR is linked in that PR of mine.

Yes, we do see an overhead. If I remove all Threads.@threads, we get the following timings for different numbers of DG elements for one right-hand side evaluation.

  #Elements | Runtime in seconds
          1 | 1.39e-06
          4 | 1.66e-06
         16 | 2.86e-06
         64 | 7.70e-06
        256 | 2.69e-05
       1024 | 1.04e-04
       4096 | 4.27e-04
      16384 | 1.87e-03
      65536 | 9.75e-03

If we use Threads.@threads with a single thread as we do now, I get

  #Elements | Runtime in seconds
          1 | 1.86e-05
          4 | 1.87e-05
         16 | 2.03e-05
         64 | 2.68e-05
        256 | 4.62e-05
       1024 | 1.25e-04
       4096 | 4.44e-04
      16384 | 1.93e-03
      65536 | 9.67e-03

Thanks for this link!

One option would be to make your own upgraded @threads macro, which generates both a multi-threaded and a single-threaded body. It could switch between them based on nthreads(), but perhaps ideally also based on problem size. Maybe @threads 1000 for i in x only launches a second thread if length(x)>1000, and a third only when length(x)>2000?
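
For concreteness, a rough and untested sketch of such a macro (the name @maybe_threads and the size threshold are made up) might look like

macro maybe_threads(minlength, loop)
    loop.head === :for || error("@maybe_threads expects a for loop")
    itr = loop.args[1].args[2]  # the iterator of `for i in itr`
    # escape everything so the loop runs in the caller's scope
    return esc(quote
        if Threads.nthreads() > 1 && length($itr) >= $minlength
            Threads.@threads $loop
        else
            $loop
        end
        nothing
    end)
end

# usage: only use Threads.@threads when the loop is long enough
function foo_maybe!(dest, src)
    @maybe_threads 1000 for i in eachindex(dest, src)
        @inbounds dest[i] = sin(cos(src[i]))
    end
    return nothing
end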

Something like this is built into FLoops.jl; from the documentation here (Parallel loops · FLoops) it looks like you write @floop ThreadedEx(basesize = 1000) for i in x.
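
Applied to the example above, that would presumably look something like this (assuming FLoops.jl is installed; the basesize value is just illustrative):

using FLoops

function foo_floop!(dest, src)
    # chunks smaller than basesize are not split further across tasks,
    # so sufficiently small inputs effectively run serially
    @floop ThreadedEx(basesize = 1000) for i in eachindex(dest, src)
        @inbounds dest[i] = sin(cos(src[i]))
    end
    return nothing
end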

Thanks for this suggestion. However, there seems to be a typo mixing i and j. The corrected version should read something like

julia> function foo_threaded2!(dest, src)
           p = collect(Iterators.partition(eachindex(src), length(src) ÷ Threads.nthreads()))
           Threads.@threads for tid in 1:Threads.nthreads()
               for i in p[tid]
                   @inbounds dest[i] = sin(cos(src[i]))
               end
           end
           return nothing
       end

julia> @benchmark foo_threaded2!($dest, $src)
BenchmarkTools.Trial: 
  memory estimate:  944 bytes
  allocs estimate:  7
  --------------
  minimum time:     13.395 μs (0.00% GC)
  median time:      13.601 μs (0.00% GC)
  mean time:        14.251 μs (0.00% GC)
  maximum time:     67.795 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Thus, this has even more overhead on my system than Threads.@threads.
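
As an aside, if length(src) is not a multiple of Threads.nthreads(), a chunk length of length(src) ÷ Threads.nthreads() makes Iterators.partition return more chunks than there are threads, so the loop over 1:Threads.nthreads() would skip the tail. A variant using cld (hypothetical, not benchmarked here) avoids that:

function foo_threaded3!(dest, src)
    # cld rounds up, so at most Threads.nthreads() chunks cover all indices
    chunklength = cld(length(src), Threads.nthreads())
    p = collect(Iterators.partition(eachindex(src), chunklength))
    Threads.@threads for tid in eachindex(p)
        for i in p[tid]
            @inbounds dest[i] = sin(cos(src[i]))
        end
    end
    return nothing
end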

Thanks! If there are no better suggestions or improvements to base Julia in the near future, that will probably be the approach we will use as a workaround.

Sorry, my mistake, I removed the suggestion.

A curiosity here: in your use case, is that loop launched thousands of times, so that this overhead becomes important? If that is the case, shouldn't the thread splitting be one level above that one?

We're dealing with time-dependent PDEs, so the right-hand side coming from our spatial discretization is called thousands of times for the time stepping. If we wanted to use parallelization one level higher, we would need to look into parallel-in-time schemes (instead of standard ODE solvers).

@Elrod has a package for low-latency threading that maintains its own thread pools instead of constantly spinning them back up.

Note that the threads do go to sleep after probably a few milliseconds of nothing to do. A thread currently begins waiting after 2^20 (around a million) pause instructions without work, so the actual time is CPU dependent; aside from clock speeds differing, the pause instruction can take wildly different numbers of clock cycles from one arch to the next.
We can configure that value if different settings yield better performance.

LoopVectorization 0.10 uses these thread pools for vmapt!:

julia> using LoopVectorization

julia> src = randn(10_000);

julia> dest = similar(src);

julia> function foo_threaded2!(dest, src)
                  p = collect(Iterators.partition(eachindex(src), length(src) ÷ Threads.nthreads()))
                  Threads.@threads for tid in 1:Threads.nthreads()
                      for i in p[tid]
                          @inbounds dest[i] = sin(cos(src[i]))
                      end
                  end
                  return nothing
              end
foo_threaded2! (generic function with 1 method)

julia> @benchmark foo_threaded2!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  16.78 KiB
  allocs estimate:  182
  --------------
  minimum time:     104.429 μs (0.00% GC)
  median time:      129.044 μs (0.00% GC)
  mean time:        133.739 μs (0.92% GC)
  maximum time:     5.737 ms (95.97% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark vmap!(sin ∘ cos, $dest, $src) # single threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.836 μs (0.00% GC)
  median time:      17.051 μs (0.00% GC)
  mean time:        17.050 μs (0.00% GC)
  maximum time:     48.747 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark vmapt!(sin ∘ cos, $dest, $src) # multi threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.235 μs (0.00% GC)
  median time:      2.370 μs (0.00% GC)
  mean time:        2.400 μs (0.00% GC)
  maximum time:     14.836 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> Threads.nthreads()
36

EDIT:
A couple caveats:

  1. Of course, the code has to be vmap!-able for this to work. Maybe I should make a non-v version that just uses the threads.
  2. Throwing errors isn't supported in these threads. Try not to do that.

Our actual code is not really vmap!-able or easily expressed in this form; it's often more like

for element in some_range
  # do stuff accessing `u[some_set_of_indices..., element]` and
  # `some_cache[some_indices_again..., element]` 
  # store some result in `du[some_other_indices..., element]`
end

Do you refer to https://github.com/JuliaSIMD/ThreadingUtilities.jl (utilities for low overhead threading in Julia)?

See also here for a discussion on thread latency and some benchmarks from last year:

https://github.com/JuliaLang/julia/issues/32701#issuecomment-640942249

I'm curious about nested schedulers in general, so I wonder how spin-based task pool management interacts with Julia's cooperative multitasking. Did you look at scenarios like

@sync for ...
    @spawn vmap!(...)
end

?
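
A concrete version of what I have in mind (using the threaded vmapt! so that the inner call actually exercises the pool; the sizes are arbitrary) would be something like

using LoopVectorization

srcs  = [randn(10_000) for _ in 1:8]
dests = [similar(s) for s in srcs]

@sync for k in eachindex(srcs)
    Threads.@spawn vmapt!(sin ∘ cos, dests[k], srcs[k])
end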

Since Julia's parallel runtime randomizes scheduling, I think an interesting statistic for this is the median ("typical case") rather than the minimum ("lucky path").

The API needs work: currently vmap! (no t) is single threaded, and vmapt! (t for threads) is threaded. There's an issue suggesting a better API.

I looked at them in the sense that I added:

    if !((nt > 1) && iszero(ccall(:jl_in_threaded_region, Cint, ())))
        vmap_singlethread!(f, ptry, Zero(), N, Val{NonTemporal}(), ptrargs)
        return
    end

i.e., if it's nested in other threaded code, it will run single-threaded. Octavian.jl also checks to avoid nested threading.

Functions like maps aren't likely to utilize multiple cores as well as more granular threading does.
In my earlier example, I only got an 8x speedup from a system with 18 cores/36 threads with vmapt!.
I do think I can optimize the overhead of the dispatches more; it does still get better with larger arrays:

julia> src = randn(40_000); dest = similar(src);

julia> @benchmark vmapt!(sin ∘ cos, $dest, $src) # multi threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.075 μs (0.00% GC)
  median time:      5.237 μs (0.00% GC)
  mean time:        5.290 μs (0.00% GC)
  maximum time:     35.734 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark vmap!(sin ∘ cos, $dest, $src) # single threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     67.932 μs (0.00% GC)
  median time:      68.133 μs (0.00% GC)
  mean time:        68.188 μs (0.00% GC)
  maximum time:     131.029 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

But I think simd map!s in general are very overhead-sensitive; by the time the problem is big enough to overcome scheduling overhead, you're running into memory bandwidth problems.
FWIW, on this computer:

julia> @benchmark wait(Threads.@spawn nothing)
BenchmarkTools.Trial:
  memory estimate:  448 bytes
  allocs estimate:  5
  --------------
  minimum time:     61.246 μs (0.00% GC)
  median time:      72.259 μs (0.00% GC)
  mean time:        71.841 μs (0.00% GC)
  maximum time:     101.282 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The @spawn overhead can be very large on some computers. On a different computer of mine it's more like 7 microseconds median, which still means a single @spawn is slower than evaluating sin and cos on 40_000 Float64s.

My point here is that there's an advantage to simple (non-existent?) schedulers: they make many operations profitable to thread that were not previously.

Consistency is another advantage of static scheduling. If there is only one path, they're all lucky :wink:.

I don't think @spawn sets jl_in_threaded_region. Also, consider

t = @spawn f()
g()
wait(t)

Arguably, g is "threaded" in some sense. However, it runs on the "root" task in which this code is executed. It's indistinguishable from any other "single-thread" code. So, I don't think there is a valid way to detect "threaded code."

Also, using single-threaded code for the inner computation is the opposite of depth-first scheduling, which is how Julia's parallel runtime was designed; i.e., it's better for the innermost computation to take the available cores since they can share the cache.

This depends on what function is passed to the map. Each invocation can take an arbitrarily long time (and may use very little cache/memory). I feel it'd be limiting to assume that maps do not utilize multiple cores.

I think a better approach is to set the size of the base case so that you don't @spawn anything when it'd slow things down. This yields much more composable parallel code, since code that you don't know can run in parallel can still take usable cores because Julia's runtime can choose to schedule available tasks. Since, IIUC, LoopVectorization.jl has a cost model, can't it infer the base size? I think it'd be really awesome if LoopVectorization.jl could auto-magically determine the appropriate base size.

(Or does it do so already? I didn't check.)
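
To illustrate what I mean by setting the base case size, a rough sketch (the name tmap! and the default basesize are made up) could be

function tmap!(f, dest, src; basesize = 10_000)
    function kernel!(lo, hi)
        if hi - lo + 1 <= basesize
            for i in lo:hi
                @inbounds dest[i] = f(src[i])
            end
        else
            # split recursively and only @spawn for large enough chunks,
            # leaving the scheduling decisions to Julia's parallel runtime
            mid = (lo + hi) >>> 1
            t = Threads.@spawn kernel!(lo, mid)
            kernel!(mid + 1, hi)
            wait(t)
        end
        return nothing
    end
    kernel!(firstindex(src), lastindex(src))
    return dest
end

# usage: tmap!(sin ∘ cos, dest, src; basesize = 4_000)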

FWIW, Tullio has an extremely crude "cost model" which tries to do exactly this. (It counts 1 for any combination of *, +, abs, etc., and 10 per log or other unrecognised function.) Or you can provide a size at which you claim it is worth dividing the work among two threads.

This uses @spawn, and when applied to fairly simple operations (including matrix multiplication) there is a regime where the overhead of this is significant.

If @spawn is an overhead, does it mean the base size takes around tens to a few hundreds of microseconds? I guess it's imaginable to parallelize even such a problem size. But my point is that Tullio could be used in a context where parallelizing at an outer level is much more beneficial than squeezing out the performance at this level. Of course, I wouldn't be making this noise if such optimizations could be done without a significant overhead (compared to parallelization at the outer level) and without loss of composability. But if it is conceivable that manual management of a task pool could be harmful when embedded in a large application, I think it'd be better to avoid making it the default behavior. I see no problem in providing scheduling options (e.g., managed pools) if it's up to the end user to specify this. In fact, I'm playing with a similar idea, too.

Even lower: MKL benefits from multiple threads on 50x50 matrix multiplication, which takes about 5 μs. For Tullio this is below the (default) threshold to turn on any threading.

On larger problems it just calls @spawn at present, so it ought to be composable as usual. I guess there's still a regime where, if it's only just beneficial to spawn threads when working alone, it would be significantly better not to when working within some outer-level parallel loop. I don't know if anything can be done about that (automatically), apart from being fairly conservative about deciding when to turn on threading. On very large problems, of course, these costs wash out entirely.

I haven't properly investigated the Octavian / ThreadingUtilities story. As you say, perhaps it ought to be opt-in.
