Understanding nested Threads.@threads scheduling

When I nest Threads.@threads loops, scheduling looks strange:

using Base.Threads
using Dates

function tmap(f, arr)
    out = similar(arr, Any)
    Threads.@threads for i in eachindex(arr)
        out[i] = f(arr[i])
    end
    return [out...]
end

reltime() = now() - t0
function stamp(args...)
    println("$(reltime()) $args")
end

@show Threads.nthreads()
t0 = now()
tmap(1:2) do i
    tmap(10:10:100) do j
        stamp(i,j)
        sleep(1)
    end
end
Threads.nthreads() = 8
97 milliseconds (1, 60)  # 9 threads used
97 milliseconds (1, 70)  # 9 threads used
97 milliseconds (1, 80)  # 9 threads used
97 milliseconds (1, 100)  # 9 threads used
97 milliseconds (2, 10)  # 9 threads used
97 milliseconds (1, 90)  # 9 threads used
97 milliseconds (1, 50)  # 9 threads used
97 milliseconds (1, 30)  # 9 threads used
97 milliseconds (1, 10)  # 9 threads used
1098 milliseconds (2, 20)  # 3 threads used
1098 milliseconds (1, 20)  # 3 threads used
1099 milliseconds (1, 40)  # 3 threads used
2100 milliseconds (2, 30)  # 1 threads used
3101 milliseconds (2, 40)  # 1 threads used
4102 milliseconds (2, 50)  # 1 threads used
5103 milliseconds (2, 60)  # 1 threads used
6105 milliseconds (2, 70)  # 1 threads used
7106 milliseconds (2, 80)  # 1 threads used
8107 milliseconds (2, 90)  # 1 threads used
9108 milliseconds (2, 100)  # 1 threads used

So initially this uses 9 threads (I had expected 8). Then it uses only 3 threads, and after that it runs single-threaded for a long time. Is this expected? Is it a bug? Why does it behave like this?

I don’t believe nested threads are really “handled”. My understanding of the @threads macro is that it breaks the loop into N chunks based on the number of items in the loop, and each chunk is processed on its own thread. So if there are only 2 items in the loop, this will only use 2 threads: thread 1 gets item 1 and thread 2 gets item 2.

If you have 8 threads and 12 items, then I believe the distribution would be something like:

  1. [1, 2]
  2. [3, 4]
  3. [5, 6]
  4. [7, 8]
  5. [9]
  6. [10]
  7. [11]
  8. [12]

In that case the last 4 threads do half the work of the first 4, so all 8 threads run at the start, and only 4 remain once the others complete their chunks. The imbalance grows if some items are quick to process: if items 10 and 11 finish fast, their threads drop out early.
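To make that split concrete, here is a minimal sketch of this kind of static partition. `static_chunks` is a made-up helper, not something Base exports, and it assumes contiguous blocks with the leftover items going to the first threads, which is what the list above shows.

```julia
# Hypothetical helper (not part of Base): split 1:n into contiguous chunks,
# one per thread, giving the first `rem` threads one extra item each.
function static_chunks(n::Integer, nt::Integer)
    len, rem = divrem(n, nt)
    chunks = UnitRange{Int}[]
    start = 1
    for tid in 1:nt
        sz = len + (tid <= rem ? 1 : 0)
        sz == 0 && break   # fewer items than threads: remaining threads get nothing
        push!(chunks, start:(start + sz - 1))
        start += sz
    end
    return chunks
end

chunks_12 = static_chunks(12, 8)   # 8 threads, 12 items, as in the list above
chunks_10 = static_chunks(10, 8)   # inner loop from the question (10 items)
chunks_2  = static_chunks(2, 8)    # outer loop from the question (2 items)
```

Under the same assumption, the 10-item inner loop from the question gives two chunks of 2 items and six chunks of 1, and the 2-item outer loop occupies only two threads.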

This is in the process of changing. In version 1.4, a nested inner @threads only scheduled work across threads when it was itself running on thread 1. The outer loop splits onto two threads, and only one of those two sees multithreading inside it. So the “three threads” portion is simply the two straggling elements of the inner loop from that first outer iteration, plus the second outer iteration running serially. This changes in 1.5 and will likely change again so that nested loops fully participate in the depth-first queue. Version 1.5:

julia> @show Threads.nthreads()
Threads.nthreads() = 8
8

julia> t0 = now()
2020-05-29T10:28:55.45

julia> tmap(1:2) do i
           tmap(10:10:100) do j
               stamp(i,j)
               sleep(1)
           end
       end
167 milliseconds (2, 10)
167 milliseconds (1, 10)
1202 milliseconds (1, 20)
1202 milliseconds (2, 20)
2206 milliseconds (2, 30)
2206 milliseconds (1, 30)
3209 milliseconds (1, 40)
3209 milliseconds (2, 40)
4210 milliseconds (2, 50)
4210 milliseconds (1, 50)
5217 milliseconds (1, 60)
5217 milliseconds (2, 60)
6220 milliseconds (2, 70)
6220 milliseconds (1, 70)
7222 milliseconds (2, 80)
7222 milliseconds (1, 80)
8225 milliseconds (1, 90)
8225 milliseconds (2, 90)
9230 milliseconds (2, 100)
9230 milliseconds (1, 100)

https://docs.julialang.org/en/v1.6-dev/base/multi-threading/#Base.Threads.@threads

There is a discussion about this on the GitHub issue tracker (https://github.com/JuliaLang/julia/pull/35646#issuecomment-622012366), so the behaviour will change in the future.
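Until nested @threads composes, one option that may be worth trying is spawning one task per element with Threads.@spawn (available since Julia 1.3) and letting the scheduler place the tasks. The sketch below is only an illustration: `tmap_spawn` is a hypothetical variant of the `tmap` from the question, and I have not timed it against the runs above.

```julia
# Hypothetical variant of `tmap` from the question: one task per element
# instead of one chunk per thread.
function tmap_spawn(f, arr)
    tasks = [Threads.@spawn f(x) for x in arr]   # schedule every element as a task
    return [fetch(t) for t in tasks]             # wait for and collect the results in order
end

# Same nesting as in the question, with a shorter sleep and returning (i, j).
result = tmap_spawn(1:2) do i
    tmap_spawn(10:10:100) do j
        sleep(0.1)
        (i, j)
    end
end
```

Because every element is its own task, the inner calls do not depend on which thread the outer iteration happens to run on. The trade-off is per-task overhead, so this suits work items that are expensive relative to spawning a task.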

Thanks a lot, your response explained very well what’s going on!