Threads.@threads with ONE thread: how to remove the overhead?

Even when Julia is started with only one thread, the overhead of Threads.@threads is significant.

julia> Threads.nthreads()
1

julia> function single_thread(v)
           s = 0.0
           @inbounds for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread (generic function with 1 method)

julia> function single_thread_mt(v)
           s = 0.0
           @inbounds Threads.@threads for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread_mt (generic function with 1 method)

julia> using BenchmarkTools

julia> const v = randn(100000);

julia> @btime single_thread($v)
  837.122 μs (0 allocations: 0 bytes)
39.45839937127804

julia> @btime single_thread_mt($v)
  2.119 ms (200007 allocations: 3.05 MiB)
39.45839937127804

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)

We see that even with a single thread, Threads.@threads incurs many allocations and runs much slower than the non-threaded version. It thus seems that setting JULIA_NUM_THREADS=1 is not a good way to turn off multi-threading.

My question is: how can we easily switch multi-threading on/off in an API, e.g., via a parameter parallel::Bool?
A workaround is of course to write two versions internally: a single-threaded version for parallel=false and a Threads.@threads version otherwise. However, it is tedious to repeat the code when the only difference is the presence of Threads.@threads (though we could extract the for-loop body into a separate function, as sketched below). Is there an easy or recommended way to do this?
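
For concreteness, here is a minimal sketch of that workaround with the loop body factored out into a helper, so the kernel itself is not duplicated (sum_sin and _kernel are placeholder names, not from any package):

_kernel(v, i) = sin(v[i])

function sum_sin(v; parallel::Bool = false)
    if parallel
        # one accumulator per thread, to avoid a data race on a shared s
        partials = zeros(Threads.nthreads())
        Threads.@threads for i in eachindex(v)
            partials[Threads.threadid()] += _kernel(v, i)
        end
        return sum(partials)
    else
        s = 0.0
        @inbounds for i in eachindex(v)
            s += _kernel(v, i)
        end
        return s
    end
end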

Related: Overhead of `Threads.@threads`

This issue was already discussed in the thread Overhead of Threads.@threads you linked above. I was the one who started that thread and ended up implementing my own macro @threaded, which checks whether only a single thread is used. If so, it just executes the code without multithreading; otherwise, it uses Threads.@threads, see introduce at-threaded macro wrapping Base at-threads by ranocha · Pull Request #426 · trixi-framework/Trixi.jl · GitHub. It’s not perfect, but it removed some overhead for us. In particular, it removed all allocations in our use cases when using only a single thread, which is really helpful for debugging performance problems. We should probably switch to something like ThreadingUtilities.jl or CheapThreads.jl, but these packages do not have a public API yet (according to the dev docs of both).

Sidenote: Your example code is incorrect with multiple threads, since all of them try to update the same s concurrently.


Many thanks. The linked thread is very long, and I am sorry that I did not notice the @threaded solution :joy:. However, I still cannot make it work. Is there anything wrong with the following?

(continuing the above code)

# copied from https://github.com/trixi-framework/Trixi.jl/pull/426/files
julia> macro threaded(expr)
         # esc(quote ... end) as suggested in https://github.com/JuliaLang/julia/issues/23221
         return esc(quote
           if Threads.nthreads() == 1
             $(expr)
           else
             Threads.@threads $(expr)
           end
         end)
       end
@threaded (macro with 1 method)


julia> function f_threaded(v)
           s = 0.0
           @inbounds @threaded for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
f_threaded (generic function with 1 method)

julia> @btime f_threaded($v)
  2.067 ms (200001 allocations: 3.05 MiB)
39.45839937127804

julia> Threads.nthreads()
1

Yes, it is used simply to show the overhead.

Well, the @threaded solution was not posted there explicitly :sweat_smile:
Please try another example, e.g. something along the lines of my example in the other thread. The problem with your code (updating the captured s) makes this example very hard for any threading approach; even when the single-threaded branch is taken, the Threads.@threads branch still creates a closure over s at compile time, so s is boxed throughout the function.

There were a few problems with your @threads benchmark.
A minor problem is that @inbounds doesn’t penetrate the closure that Threads.@threads creates, so you need to write it inside the loop body.
A much bigger problem is that the closure introduces a type instability, making the code slow when single-threaded and wrong when multithreaded.

So let’s fix those and run a few benchmarks (throwing CheapThreads.jl and LoopVectorization.jl into the mix):

julia> using BenchmarkTools, CheapThreads, LoopVectorization

julia> Threads.nthreads()
1

julia> function single_thread(v)
           s = 0.0
           @inbounds for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread (generic function with 1 method)

julia> function single_thread_mt(v)
           s = zeros(8,Threads.nthreads()) # 8 Float64s = one 64-byte cache line per thread, avoiding false sharing
           Threads.@threads for i = 1:length(v)
               @inbounds s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_thread_mt (generic function with 1 method)

julia> function single_batch(v)
           s = zeros(8,Threads.nthreads())
           @batch for i = 1:length(v)
               s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_batch (generic function with 1 method)

julia> function single_avxt(v)
           s = 0.0
           @avxt for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_avxt (generic function with 1 method)

julia> v = randn(100000);

julia> @btime single_thread($v)
  977.804 μs (0 allocations: 0 bytes)
102.03161503504539

julia> @btime single_thread_mt($v)
  991.968 μs (7 allocations: 704 bytes)
102.03161503504539

julia> @btime single_batch($v)
  1.016 ms (1 allocation: 144 bytes)
102.03161503504539

julia> @btime single_avxt($v)
  70.348 μs (0 allocations: 0 bytes)
102.03161503504623

julia> 977.8 / 70.348
13.899471200318416

julia> versioninfo()
Julia Version 1.7.0-DEV.802
Commit 8d998dc8ec* (2021-04-02 13:34 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

The @threads version’s overhead is now fairly negligible.

@avxt should impose no overhead vs @avx if you started Julia with a single thread.
On top of that, it’s compatible with the original loop, not requiring the adjustments necessary to make @threads or CheapThreads.@batch work correctly.

On top of that, @avxt is over 13x faster than the original loop on my AVX512 system :wink:.
Happened to have slightly lower error, too:

julia> sum(sin ∘ big, v) |> Float64
102.03161503504595

julia> 102.03161503504539 - ans, 102.03161503504623 - ans
(-5.542233338928781e-13, 2.8421709430404007e-13)

This is expected: LoopVectorization accumulates the reduction in several independent SIMD accumulators, which should generally pick up less rounding error than the naive serial sum with its single running total.
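
A rough illustration of the effect (a sketch, not part of the benchmarks above): independent partial sums usually track a high-precision reference better than one strictly serial accumulator.

x = randn(10^6)
serial  = foldl(+, x)                             # one left-to-right running accumulator
chunked = sum(sum, Iterators.partition(x, 4096))  # many independent partial sums
exact   = Float64(sum(big, x))                    # high-precision reference
abs(serial - exact), abs(chunked - exact)         # the chunked error is typically smaller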

Multithreaded benchmarks:

julia> using BenchmarkTools, CheapThreads, LoopVectorization

julia> Threads.nthreads()
36

julia> function single_thread(v)
           s = 0.0
           @inbounds for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread (generic function with 1 method)

julia> function single_thread_mt(v)
           s = zeros(8,Threads.nthreads())
           Threads.@threads for i = 1:length(v)
               @inbounds s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_thread_mt (generic function with 1 method)

julia> function single_batch(v)
           s = zeros(8,Threads.nthreads())
           @batch for i = 1:length(v)
               s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_batch (generic function with 1 method)

julia> function single_avxt(v)
           s = 0.0
           @avxt for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_avxt (generic function with 1 method)

julia> v = randn(100000);

julia> @btime single_thread($v)
  986.643 μs (0 allocations: 0 bytes)
152.32081377185875

julia> @btime single_thread_mt($v)
  99.950 μs (182 allocations: 18.50 KiB)
152.32081377185966

julia> @btime single_batch($v)
  40.756 μs (1 allocation: 2.38 KiB)
152.32081377185966

julia> @btime single_avxt($v)
  4.938 μs (0 allocations: 0 bytes)
152.32081377185906

@batch is well ahead of @threads, and of course @avxt leaves the others far behind as before – 5 microseconds is pretty good for multi-threaded code!! – while again being the most accurate:

julia> sum(sin ∘ big, v) |> Float64
152.32081377185915

julia> 152.32081377185875 - ans, 152.32081377185966 - ans, 152.32081377185906 - ans
(-3.979039320256561e-13, 5.115907697472721e-13, -8.526512829121202e-14)

I’d also be interested in seeing how the Ryzen 5950X compares to the Intel 10980XE here.


Wow, thanks for all this information. I will digest it later. For now, let’s focus on the original comparison. Additionally, I have also tested the claim that “@inbounds doesn’t penetrate the closure”. I used a straightforward test case this time to avoid the mistakes mentioned by @Elrod.
(Code in mt.jl)

using BenchmarkTools

macro threaded(expr)
    # esc(quote ... end) as suggested in https://github.com/JuliaLang/julia/issues/23221
    return esc(quote
        if Threads.nthreads() == 1
            $(expr)
        else
            Threads.@threads $(expr)
        end
    end)
end

function single_thread!(out, input)
    @inbounds for i = 1:length(input)
        out[i] = sin(input[i]) * cos(exp(input[i]))
    end
end

function multi_threaded!(out, input)
    @inbounds @threaded for i = 1:length(input)
        out[i] = sin(input[i]) * cos(exp(input[i]))
    end
end

function multi_threaded_2!(out, input)
    @threaded for i = 1:length(input)
        @inbounds out[i] = sin(input[i]) * cos(exp(input[i]))
    end
end


function multi_thread!(out, input)
    @inbounds Threads.@threads for i = 1:length(input)
        out[i] = sin(input[i]) * cos(exp(input[i]))
    end
end

function multi_thread_2!(out, input)
    Threads.@threads for i = 1:length(input)
        @inbounds out[i] = sin(input[i]) * cos(exp(input[i]))
    end
end

@show Threads.nthreads()
N = 100000
const input = randn(N)
const out = zeros(N)
println("no threading")
@btime single_thread!($out, $input)
println("Threads.@threads: outer @inbounds")
@btime multi_thread!($out, $input)
println("Threads.@threads: inner @inbounds")
@btime multi_thread_2!($out, $input)
println("@threaded: outer @inbounds")
@btime multi_threaded!($out, $input)
println("@threaded: inner @inbounds")
@btime multi_threaded_2!($out, $input)

I then ran the script with different numbers of threads, e.g., julia --threads 4 mt.jl for 4 threads.
Results are

shuhua@WS3080:~/github/GEP.jl/tmp$ julia --threads 1 mt.jl
Threads.nthreads() = 1
no threading
  1.935 ms (0 allocations: 0 bytes)
Threads.@threads: outer @inbounds
  2.022 ms (6 allocations: 560 bytes)
Threads.@threads: inner @inbounds
  1.954 ms (6 allocations: 560 bytes)
@threaded: outer @inbounds
  1.940 ms (0 allocations: 0 bytes)
@threaded: inner @inbounds
  1.935 ms (0 allocations: 0 bytes)
shuhua@WS3080:~/github/GEP.jl/tmp$ julia --threads 2 mt.jl
Threads.nthreads() = 2
no threading
  1.850 ms (0 allocations: 0 bytes)
Threads.@threads: outer @inbounds
  1.004 ms (10 allocations: 976 bytes)
Threads.@threads: inner @inbounds
  971.992 μs (11 allocations: 1008 bytes)
@threaded: outer @inbounds
  1.003 ms (11 allocations: 1008 bytes)
@threaded: inner @inbounds
  974.262 μs (11 allocations: 1008 bytes)
shuhua@WS3080:~/github/GEP.jl/tmp$ julia --threads 4 mt.jl
Threads.nthreads() = 4
no threading
  1.846 ms (0 allocations: 0 bytes)
Threads.@threads: outer @inbounds
  510.161 μs (21 allocations: 1.88 KiB)
Threads.@threads: inner @inbounds
  483.981 μs (21 allocations: 1.88 KiB)
@threaded: outer @inbounds
  505.291 μs (21 allocations: 1.88 KiB)
@threaded: inner @inbounds
  487.732 μs (21 allocations: 1.88 KiB)
shuhua@WS3080:~/github/GEP.jl/tmp$ julia --threads 8 mt.jl
Threads.nthreads() = 8
no threading
  1.926 ms (0 allocations: 0 bytes)
Threads.@threads: outer @inbounds
  251.040 μs (41 allocations: 3.66 KiB)
Threads.@threads: inner @inbounds
  241.420 μs (41 allocations: 3.66 KiB)
@threaded: outer @inbounds
  249.431 μs (41 allocations: 3.62 KiB)
@threaded: inner @inbounds
  242.331 μs (41 allocations: 3.66 KiB)

Conclusion

  1. The following statement holds: do not write @inbounds before Threads.@threads; write it inside the for loop instead.
  2. For this kind of loop (no captured variable being rebound), the overhead of Threads.@threads with one thread is negligible, though it still incurs a small number of allocations. The extra custom @threaded may or may not be required.

An additional question

Could you please give more explanation? There are quite a lot of allocations when Threads.@threads is used in my original example.

Type unstable code generally allocates, and updating variables (i.e., changing bindings, like s = x where x may or may not be s + ...) captured in closures is normally type unstable (issue).
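
One quick way to make the boxing visible (a sketch, not from the original reply; g is a placeholder name):

julia> function g()
           s = 0.0
           f = () -> (s += 1.0)  # the closure rebinds the captured s
           f()
           return s
       end;

julia> @code_warntype g()
# the variable listing shows s::Core.Box, i.e., every update of s is
# type unstable and allocates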

Got it. It seems to be caused by boxing. The original function is

function single_thread_mt(v)
    s = 0.0
    @inbounds Threads.@threads for i = 1:length(v)
        s += sin(v[i])
    end
    return s
end
However, changing it to the following removes the boxing:

function single_thread_mt2(v)
    s = [0.0]
    @inbounds Threads.@threads for i = 1:length(v)
        s[1] += sin(v[i])
    end
    return s[1]
end


Of course, the above code is still logically wrong for true multi-threading. Compared with the single_thread_mt in the first post, this type-stable version reduces the time from 2.119 ms to 869.633 μs.
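
For what it’s worth, a Ref is an equivalent sketch of the same trick (single_thread_mt3 is a placeholder name); it avoids the boxing because the binding s is never reassigned, though the data race remains:

function single_thread_mt3(v)
    s = Ref(0.0)
    @inbounds Threads.@threads for i = 1:length(v)
        s[] += sin(v[i])  # mutates the Ref's contents; s itself is never rebound
    end
    return s[]
end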

Threads.@threads creates a closure for us. A more explicit example:
[screenshot lost: a REPL example in which a closure assigns to a captured variable x]
The behavior of Julia differs from Python here: I had thought the above x would remain unchanged.
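
A hedged reconstruction of what such an example likely showed (demo is a placeholder name, not from the lost screenshot):

julia> function demo()
           x = 1
           f = () -> (x = 2)  # assignment inside the closure rebinds the captured x
           f()
           return x
       end;

julia> demo()
2

In Python, the corresponding assignment inside a nested function would create a new local variable, leaving the outer x at 1 unless it were declared nonlocal.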


In case you’d like to read more, here is a recent thread about this behavior (and the difference vs Python).


Ryzen 5950X:
16 CPU Cores
32 Threads

julia> Threads.nthreads()
32

julia> @btime single_avxt($v)
  9.090 μs (0 allocations: 0 bytes)

If we request 36 threads:

julia> Threads.nthreads()
36

julia> @btime single_avxt($v)
  8.970 μs (0 allocations: 0 bytes)

It seems that the Ryzen 5950X takes about twice as long (perhaps due to having only 256-bit AVX2 rather than AVX-512 :sweat_smile:).

As for the built-in Threads.@threads, the numbers are:

julia> Threads.nthreads()
32

julia> @btime single_thread_mt($v)
  68.100 μs (161 allocations: 16.44 KiB)

@avxt is indeed much faster.

By the way, can @avx or @avxt help if the target is a function that includes more complicated logic?

FYI, FLoops.jl has an executor system for decoupling the loop execution mechanism from the algorithm written in your code. So you can write the loop once and then use ThreadedEx for a multi-threaded loop and SequentialEx for a single-threaded loop. These can be specified at run time, without changing your code at all. There are other executors, such as DistributedEx for distributed programming and FoldsCUDA.CUDAEx for GPUs, and FoldsThreads.jl provides various thread-based executors, and so on.
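
A minimal sketch of this pattern (using FLoops.jl's documented @floop/@reduce syntax; sum_sin is a placeholder name):

using FLoops

function sum_sin(v; ex = ThreadedEx())  # the executor is an ordinary run-time argument
    @floop ex for x in v
        @reduce s = 0.0 + sin(x)        # race-free reduction handled by FLoops
    end
    return s
end

sum_sin(v)                       # multi-threaded
sum_sin(v; ex = SequentialEx())  # single-threaded, no threading machinery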
