Overhead of `Threads.@threads`

There is a certain overhead when using `Threads.@threads`, even if only a single thread is used.

For example, running Julia v1.5.3 with `julia --threads=1` yields

julia> using BenchmarkTools

julia> function foo_serial!(dest, src)
           for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> function foo_threaded!(dest, src)
           Threads.@threads for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> Threads.nthreads()
1

julia> src = rand(10^3); dest = similar(src);

julia> @benchmark foo_serial!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.426 μs (0.00% GC)
  median time:      10.532 μs (0.00% GC)
  mean time:        10.653 μs (0.00% GC)
  maximum time:     46.086 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark foo_threaded!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  832 bytes
  allocs estimate:  6
  --------------
  minimum time:     12.653 μs (0.00% GC)
  median time:      12.929 μs (0.00% GC)
  mean time:        13.255 μs (0.00% GC)
  maximum time:     44.788 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Of course, the threaded version will become faster if multiple threads are used, e.g. using `julia --threads=4`:

julia> using BenchmarkTools

julia> function foo_serial!(dest, src)
           for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> function foo_threaded!(dest, src)
           Threads.@threads for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end

julia> Threads.nthreads()
4

julia> src = rand(10^3); dest = similar(src);

julia> @benchmark foo_serial!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.048 μs (0.00% GC)
  median time:      10.289 μs (0.00% GC)
  mean time:        10.407 μs (0.00% GC)
  maximum time:     29.898 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark foo_threaded!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  2.98 KiB
  allocs estimate:  21
  --------------
  minimum time:     5.005 μs (0.00% GC)
  median time:      5.827 μs (0.00% GC)
  mean time:        6.155 μs (1.51% GC)
  maximum time:     333.063 μs (95.86% GC)
  --------------
  samples:          10000
  evals/sample:     6

We are writing a library of numerical algorithms for partial differential equations that should be relatively easy to use for students and extensible for researchers. At the same time, the performance should not be too bad. Currently, we just use `Threads.@threads` for loops where multithreading gives us improvements at common problem sizes with multiple threads. However, we would also like to avoid the overhead of this approach when only one thread is used. Hence, we would like to know what you think would be the best approach for us.

  • Just use `Threads.@threads for ...` as we do now and hope that the overhead will be reduced in the future. Are there any plans/roadmaps for that?
  • Write the functions that can benefit from threading twice: one version with `Threads.@threads`, the other without, perhaps using a macro internally to simplify things. There are different ways to realize something like this.
    • We could just check the number of threads at runtime, e.g. end up having something like
      julia> function foo_maybethreaded!(dest, src)
                if Threads.nthreads() == 1
                    foo_serial!(dest, src)
                else
                    foo_threaded!(dest, src)
                end
            end
      
      This is probably okay since the loops are usually more costly than a single if check.
    • We could add another type parameter to our discretization structs and use that to dispatch, e.g. something like `parallelization::Val{:serial}` or `parallelization::Val{:threaded}`.
    • We could use functions returning `true` or `false` and use these to decide globally in our package whether to use threading, similar to `enable_debug_timing`/`disable_debug_timing` in TimerOutputs.jl.
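To illustrate the type-parameter option, here is a minimal sketch; the names `Solver`, `foo!`, and `foo_impl!` are hypothetical, not part of any existing package:

```julia
# Serial implementation, selected via Val{:serial}
function foo_impl!(dest, src, ::Val{:serial})
    for i in eachindex(dest, src)
        dest[i] = sin(cos(src[i]))
    end
    return nothing
end

# Threaded implementation, selected via Val{:threaded}
function foo_impl!(dest, src, ::Val{:threaded})
    Threads.@threads for i in eachindex(dest, src)
        dest[i] = sin(cos(src[i]))
    end
    return nothing
end

struct Solver{P}            # P is Val{:serial} or Val{:threaded}
    parallelization::P
end

# Since the choice is a type parameter, dispatch is resolved at compile time
foo!(dest, src, s::Solver) = foo_impl!(dest, src, s.parallelization)
```

Usage would then look like `foo!(dest, src, Solver(Val(:serial)))`, with no runtime branch in the hot loop.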

What would you propose?


Did you benchmark to see whether the overhead is something you even need to worry about?

I once did this PR, Make the at-threads a no-op if nthreads()==1 by mauro3 · Pull Request #33964 · JuliaLang/julia · GitHub, which made `@threads` a no-op if `nthreads()==1`. But apparently that was not safe, and no work-around was suggested. The roadmap PR is linked in that PR of mine.


Yes, we do see an overhead. If I remove all Threads.@threads, we get the following timings for different numbers of DG elements for one right-hand side evaluation.

  #Elements | Runtime in seconds
          1 | 1.39e-06
          4 | 1.66e-06
         16 | 2.86e-06
         64 | 7.70e-06
        256 | 2.69e-05
       1024 | 1.04e-04
       4096 | 4.27e-04
      16384 | 1.87e-03
      65536 | 9.75e-03

If we use Threads.@threads with a single thread as we do now, I get

#Elements | Runtime in seconds
          1 | 1.86e-05
          4 | 1.87e-05
         16 | 2.03e-05
         64 | 2.68e-05
        256 | 4.62e-05
       1024 | 1.25e-04
       4096 | 4.44e-04
      16384 | 1.93e-03
      65536 | 9.67e-03

Thanks for this link!

One option would be to make your own upgraded `@threads` macro, which generates both a multi-threaded and a single-threaded body. It could switch between them based on `nthreads()`, but perhaps ideally also based on problem size. Maybe `@threads 1000 for i in x` only launches a second thread if `length(x) > 1000`, and a third only when `length(x) > 2000`?

Something like this is built into FLoops.jl; from Parallel loops · FLoops it looks like you write `@floop ThreadedEx(basesize = 1000) for i in x`.
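A minimal function-based sketch of this gating idea (the helper name and default threshold are hypothetical, not an existing API):

```julia
# Fall back to a plain loop when only one thread is available or the
# problem is too small for threading to pay off.
function maybe_threaded_foreach!(f, range; basesize = 1000)
    if Threads.nthreads() == 1 || length(range) <= basesize
        for i in range                    # serial path: no task overhead
            f(i)
        end
    else
        Threads.@threads for i in range   # threaded path for large inputs
            f(i)
        end
    end
    return nothing
end
```

Usage: `maybe_threaded_foreach!(i -> dest[i] = sin(cos(src[i])), eachindex(dest, src))`.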


Thanks for this suggestion. However, there seems to be a typo mixing i and j. The corrected version should read something like

julia> function foo_threaded2!(dest, src)
           p = collect(Iterators.partition(eachindex(src), length(src) ÷ Threads.nthreads()))
           Threads.@threads for tid in 1:Threads.nthreads()
               for i in p[tid]
                   @inbounds dest[i] = sin(cos(src[i]))
               end
           end
           return nothing
       end

julia> @benchmark foo_threaded2!($dest, $src)
BenchmarkTools.Trial: 
  memory estimate:  944 bytes
  allocs estimate:  7
  --------------
  minimum time:     13.395 μs (0.00% GC)
  median time:      13.601 μs (0.00% GC)
  mean time:        14.251 μs (0.00% GC)
  maximum time:     67.795 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Thus, this has even more overhead on my system than Threads.@threads.

Thanks! If there are no better suggestions or improvements to base Julia in the near future, that will probably be the approach we will use as a workaround.

sorry, my mistake, I removed the suggestion

A curiosity here: in your use case, is that loop launched thousands of times for that overhead to be important? If that is the case, shouldn’t the thread splitting be one level above that one?


We’re dealing with time-dependent PDEs, so the right-hand side coming from our spatial discretization is called thousands of times for the time stepping. If we wanted to use parallelization one level higher, we would need to look into parallel-in-time schemes (instead of standard ODE solvers).


@Elrod has a package for low latency threading by maintaining the thread pools instead of constantly spinning them back up.


Note that the threads do go to sleep after probably a few milliseconds of nothing to do. They currently begin waiting after 2^20 (around a million) `pause` instructions without work. So the actual time is CPU dependent; aside from clock speeds differing, the `pause` instruction can take wildly different numbers of clock cycles from one arch to the next.
We can configure that if different settings yield better performance.


LoopVectorization 0.10 uses these threadpools for vmapt:

julia> using LoopVectorization

julia> src = randn(10_000);

julia> dest = similar(src);

julia> function foo_threaded2!(dest, src)
                  p = collect(Iterators.partition(eachindex(src), length(src) ÷ Threads.nthreads()))
                  Threads.@threads for tid in 1:Threads.nthreads()
                      for i in p[tid]
                          @inbounds dest[i] = sin(cos(src[i]))
                      end
                  end
                  return nothing
              end
foo_threaded2! (generic function with 1 method)

julia> @benchmark foo_threaded2!($dest, $src)
BenchmarkTools.Trial:
  memory estimate:  16.78 KiB
  allocs estimate:  182
  --------------
  minimum time:     104.429 μs (0.00% GC)
  median time:      129.044 μs (0.00% GC)
  mean time:        133.739 μs (0.92% GC)
  maximum time:     5.737 ms (95.97% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark vmap!(sin ∘ cos, $dest, $src) # single threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.836 μs (0.00% GC)
  median time:      17.051 μs (0.00% GC)
  mean time:        17.050 μs (0.00% GC)
  maximum time:     48.747 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark vmapt!(sin ∘ cos, $dest, $src) # multi threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.235 μs (0.00% GC)
  median time:      2.370 μs (0.00% GC)
  mean time:        2.400 μs (0.00% GC)
  maximum time:     14.836 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> Threads.nthreads()
36

EDIT:
A couple caveats:

  1. Of course, the code has to be vmap!-able for this to work. Maybe I should make a non-v version that just uses the threads.
  2. Throwing errors isn’t supported in these threads. Try not to do that.

Our actual code is not really vmap!-able or easily expressed in this form, it’s often more like

for element in some_range
  # do stuff accessing `u[some_set_of_some indices..., element]` and
  # `some_cache[some_indices_again..., element]` 
  # store some result in `du[some_other_indices..., element]`
end

Do you refer to ThreadingUtilities.jl (GitHub - chriselrod/ThreadingUtilities.jl: Utilities for low overhead threading in Julia)?

See also here for a discussion on thread latency and some benchmarks from last year:

I’m curious about nested schedulers in general. So, I wonder how spin-based task pool management interacts with Julia’s cooperative multitasking. Did you look at scenarios like

@sync for ...
    @spawn vmap!(...)
end

?

Since Julia’s parallel runtime randomizes scheduling, I think a more interesting statistic for this is the median (“typical case”) rather than the minimum (“lucky path”).


The API needs work; currently `vmap!` (no t) is single threaded, and `vmapt!` (t for threads) is threaded. There’s an issue suggesting a better API.

I looked at them in the sense that I added:

    if !((nt > 1) && iszero(ccall(:jl_in_threaded_region, Cint, ())))
        vmap_singlethread!(f, ptry, Zero(), N, Val{NonTemporal}(), ptrargs)
        return
    end

i.e., if it’s nested in other threaded code, it will run single threaded. Octavian.jl also checks to avoid nested threading.

Functions like maps aren’t likely to utilize multiple cores as well as more granular threading does.
In my earlier example, I only got an 8x speedup from a system with 18 cores/36 threads with vmap!.
I do think I can optimize the overhead of the dispatches more, so it does still get better with larger arrays:

julia> src = randn(40_000); dest = similar(src);

julia> @benchmark vmapt!(sin ∘ cos, $dest, $src) # multi threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.075 μs (0.00% GC)
  median time:      5.237 μs (0.00% GC)
  mean time:        5.290 μs (0.00% GC)
  maximum time:     35.734 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark vmap!(sin ∘ cos, $dest, $src) # single threaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     67.932 μs (0.00% GC)
  median time:      68.133 μs (0.00% GC)
  mean time:        68.188 μs (0.00% GC)
  maximum time:     131.029 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

But I think simd-map!s in general are very overhead sensitive; by the time the problem is big enough to overcome scheduling overhead, you’re running into memory bandwidth problems.
FWIW, on this computer:

julia> @benchmark wait(Threads.@spawn nothing)
BenchmarkTools.Trial:
  memory estimate:  448 bytes
  allocs estimate:  5
  --------------
  minimum time:     61.246 μs (0.00% GC)
  median time:      72.259 μs (0.00% GC)
  mean time:        71.841 μs (0.00% GC)
  maximum time:     101.282 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The @spawn overhead can be very large on some computers. On a different computer of mine it’s more like 7 microseconds median, which still means a single @spawn is slower than sin and cos on 40_000 Float64.

My point here is that there’s an advantage to simple (non-existent?) schedulers, making many operations profitable to thread that were not previously.

Consistency is another advantage of static scheduling. If there is only one path, they’re all lucky :wink: .


I don’t think @spawn sets jl_in_threaded_region. Also, consider

t = @spawn f()
g()
wait(t)

Arguably, g is “threaded” in some sense. However, it runs on the “root” task in which this code is executed. It’s indistinguishable from any other “single-thread” code. So, I don’t think there is a valid way to detect “threaded code.”

Also, using single threaded code for the inner computation is the opposite of depth-first scheduling which is how Julia’s parallel runtime was designed; i.e., it’s better for the innermost computation to take available cores since they can share the cache.

This depends on what function is passed to the map. Each invocation can take an arbitrarily long time (and can use very small cache/memory). I feel it’d be limiting to assume that maps do not utilize multiple cores.

I think a better approach is to set the size of the base case so that you don’t @spawn anything when it’d slow things down. This yields much more composable parallel code since the code you don’t know that can run in parallel can take usable cores because Julia’s runtime can choose to schedule available tasks. Since, IIUC, LoopVectorization.jl has a cost model, can’t it infer the base size? I think it’d be really awesome if LoopVectorization.jl can auto-magically determine the appropriate base size.

(Or does it do so already? I didn’t check.)
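For reference, the base-size idea can be sketched as a divide-and-conquer helper that stops spawning below a threshold; the name `spawn_foreach` and the default `basesize` are hypothetical, and since no tasks are spawned below the threshold, nested use composes with an outer parallel loop:

```julia
# Recursively split the range in half, spawning a task for one half,
# until chunks are at most `basesize` long; then run serially.
function spawn_foreach(f, r::AbstractUnitRange; basesize = 10_000)
    if length(r) <= basesize
        foreach(f, r)                     # base case: plain serial loop
    else
        mid = (first(r) + last(r)) ÷ 2
        t = Threads.@spawn spawn_foreach(f, first(r):mid; basesize = basesize)
        spawn_foreach(f, mid + 1:last(r); basesize = basesize)
        wait(t)
    end
    return nothing
end
```

Choosing `basesize` so that each chunk amortizes the `@spawn` overhead is exactly the tuning question discussed above.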


FWIW, Tullio has an extremely crude “cost model” which tries to do exactly this. (It counts 1 for any combination of *, +, abs etc, and 10 per log or other unrecognised function.) Or you can provide a size at which you claim it is worth dividing the work among two threads.

This uses @spawn, and when applied to fairly simple operations (including matrix multiplication) there is a regime where the overhead of this is significant.

If `@spawn` is an overhead, does it mean the base size takes around a few hundreds/tens of microseconds? I guess it’s imaginable to parallelize even such a problem size. But my point is that Tullio could be used in a context where parallelizing at an outer level is much more beneficial than squeezing out the performance at this level. Of course, I wouldn’t be making noise if such optimizations could be done without a significant overhead (compared to parallelization at the outer level) and without loss of composability. But if it is conceivable that manual management of a task pool could be harmful when embedded in a large application, I think it’d be better to avoid making it the default behavior. I see no problem in providing scheduling options (e.g., managed pools) if it’s up to the end-user to specify this. In fact, I’m playing with a similar idea, too.


Even lower, MKL benefits from multiple threads on 50x50 matrix multiplication, which takes about 5μs. For Tullio this is below the (default) threshold to turn on any threading.

On larger problems, it just calls @spawn at present, so ought to be composable as usual. I guess there’s still a regime where, if it’s only just beneficial to spawn threads when working alone, it would be significantly better not to when working within some outer-level parallel loop. I don’t know if anything can be done about that (automatically) apart from being fairly conservative about deciding when to turn on threading. On very large problems of course these costs wash out entirely.

I haven’t properly investigated the Octavian / ThreadingUtilities story. As you say perhaps it ought to be opt-in.
