When I nest Threads.@threads loops, scheduling looks strange:
using Base.Threads
using Dates

# Threaded map: @threads splits eachindex(arr) into one contiguous chunk per thread.
function tmap(f, arr)
    out = similar(arr, Any)
    Threads.@threads for i in eachindex(arr)
        out[i] = f(arr[i])
    end
    return [out...]
end

# Elapsed time since the run started.
reltime() = now() - t0

function stamp(args...)
    println("$(reltime()) $args")
end

@show Threads.nthreads()
t0 = now()
tmap(1:2) do i
    tmap(10:10:100) do j
        stamp(i, j)
        sleep(1)
    end
end
Threads.nthreads() = 8
97 milliseconds (1, 60) # 9 threads used
97 milliseconds (1, 70) # 9 threads used
97 milliseconds (1, 80) # 9 threads used
97 milliseconds (1, 100) # 9 threads used
97 milliseconds (2, 10) # 9 threads used
97 milliseconds (1, 90) # 9 threads used
97 milliseconds (1, 50) # 9 threads used
97 milliseconds (1, 30) # 9 threads used
97 milliseconds (1, 10) # 9 threads used
1098 milliseconds (2, 20) # 3 threads used
1098 milliseconds (1, 20) # 3 threads used
1099 milliseconds (1, 40) # 3 threads used
2100 milliseconds (2, 30) # 1 threads used
3101 milliseconds (2, 40) # 1 threads used
4102 milliseconds (2, 50) # 1 threads used
5103 milliseconds (2, 60) # 1 threads used
6105 milliseconds (2, 70) # 1 threads used
7106 milliseconds (2, 80) # 1 threads used
8107 milliseconds (2, 90) # 1 threads used
9108 milliseconds (2, 100) # 1 threads used
So initially this runs 9 iterations in parallel (I had expected at most 8, since nthreads() == 8?). Then it uses only 3 threads, and then it is single-threaded for a long time. Is this expected? Is this a bug? Why does it behave this way?
I don’t believe nested threads are really “handled”. My understanding of the @threads macro is that it breaks the loop up into nthreads() chunks of items, and each chunk is processed on its own thread. So if there are only 2 items in the loop, this will only use 2 threads: thread 1 gets item 1 and thread 2 gets item 2.
If you have 8 threads and 12 items, then I believe the distribution would be something like:
[1, 2]
[3, 4]
[5, 6]
[7, 8]
[9]
[10]
[11]
[12]
In which case the last 4 threads are doing half the work of the first 4, so you will have all 8 threads running to start, then only 4 once the others complete their chunks. The gap can widen further if items such as 10 and 11 happen to be quick to process, since those threads drop out almost immediately.
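A rough way to see that chunking for yourself, as a sketch (this assumes Julia 1.5+ for the explicit :static schedule option; earlier versions chunk this way by default, and the exact split depends on nthreads()):

using Base.Threads

# Record which thread handles each of 12 items under static scheduling.
owner = zeros(Int, 12)
Threads.@threads :static for i in 1:12
    owner[i] = threadid()
end
@show owner   # e.g. [1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8] with 8 threads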
This is in the process of changing. Through version 1.4, an inner nested @threads would only schedule work across threads if it was itself running on thread 1. The outer loop here splits into two threads, and only one of those two sees multithreading inside it; the other runs its inner loop serially on its own thread. So the “three threads” portion is simply the straggling second elements of the two 2-item chunks from the inner loop on i == 1, plus the serial inner loop for i == 2 still ticking along; after that, only the serial i == 2 loop remains. This is changing in 1.5 and will likely change again so that nested @threads fully participates in the depth-first scheduling queue. Version 1.5:
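For comparison, here is a rough sketch of the task-based style that composes under nesting, built only on Threads.@spawn (available since Julia 1.3); the name tmap_spawn is mine, not an existing API:

using Base.Threads

# One task per element. Nested calls all feed the same scheduler, so the
# inner and outer levels can share the available threads instead of each
# @threads loop pinning whole chunks to particular threads.
function tmap_spawn(f, arr)
    tasks = [Threads.@spawn f(x) for x in arr]
    return map(fetch, tasks)
end

tmap_spawn(1:2) do i
    tmap_spawn(10:10:100) do j
        sleep(1)   # stand-in for real work
        (i, j)
    end
end

With this version every (i, j) pair becomes its own task, so the scheduler is free to keep threads busy until the work runs out rather than leaving one long serial tail.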