I have a simple test program to learn how to divide work over threads. If I launch Julia with one thread, the output is:
────────────────────────────────────────────────────────────────────
                        Time                  Allocations
                ───────────────────────  ────────────────────────
Tot / % measured:   15.2s / 22.9%          0.96GiB / 0.1%
Section  ncalls    time    %tot     avg    alloc    %tot      avg
────────────────────────────────────────────────────────────────────
do_work      10   3.47s  100.0%   347ms   670KiB  100.0%  67.0KiB
multi        10   1.74s   50.1%   174ms     320B    0.0%    32.0B
single       10   1.73s   49.8%   173ms  8.02KiB    1.2%     821B
────────────────────────────────────────────────────────────────────
If I run with two threads, the output is:
────────────────────────────────────────────────────────────────────
                        Time                  Allocations
                ───────────────────────  ────────────────────────
Tot / % measured:   16.1s / 14.4%          1.07GiB / 0.2%
Section  ncalls    time    %tot     avg    alloc    %tot      avg
────────────────────────────────────────────────────────────────────
do_work      10   2.32s  100.0%   232ms  1.91MiB  100.0%   195KiB
multi        10   1.76s   76.0%   176ms  1.89MiB   99.2%   194KiB
single       10   1.72s   74.0%   172ms  9.03KiB    0.5%     925B
────────────────────────────────────────────────────────────────────
The benchmark above suggests that LoopVectorization does not play nicely with a thread already being in use by the task that runs do_single!: do_work only drops from 3.47 s to 2.32 s, a speedup of about 1.5 rather than 2. If I manually set the number of threads used in do_multi! to one less than the total, the two-thread benchmark comes out well, with a speedup of roughly a factor of 2 (3.47 s to 1.58 s):
────────────────────────────────────────────────────────────────────
                        Time                  Allocations
                ───────────────────────  ────────────────────────
Tot / % measured:   12.4s / 12.8%          0.96GiB / 0.1%
Section  ncalls    time    %tot     avg    alloc    %tot      avg
────────────────────────────────────────────────────────────────────
do_work      10   1.58s  100.0%   158ms   685KiB  100.0%  68.5KiB
multi        10   1.57s   99.5%   157ms  8.58KiB    1.3%     878B
single       10   1.57s   99.5%   157ms  17.5KiB    2.6%  1.75KiB
────────────────────────────────────────────────────────────────────
I can fix this for any thread count by manually setting the number of threads to one less than the total, but I would like Julia to pick this up automatically by keeping track of how many threads are already in use. How can I do this?
## Load packages.
using LoopVectorization
using TimerOutputs

## Define functions.
function do_single!(a)
    @turbo a .+= sin.(a)
end

function do_multi!(a)
    @tturbo a .+= sin.(a)
    # Line below solves the problem if Julia is launched with -t2.
    # @turbo thread=1 a .+= sin.(a)
    # Line below solves the problem if Julia is launched with -t3.
    # @turbo thread=2 a .+= sin.(a)
end

function do_work(a, b)
    @sync begin
        Threads.@spawn @timeit to "single" do_single!(a)
        Threads.@spawn @timeit to "multi" do_multi!(b)
    end
end

## Do benchmark and print output to screen.
n = 512
a = rand(n, n, n)
b = rand(n, n, n)
to = TimerOutput()
@notimeit to do_work(a, b)
for i in 1:10
    @timeit to "do_work" do_work(a, b)
end
show(TimerOutputs.flatten(to))
println("")
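
For comparison, here is a minimal sketch of the "one less than the total" workaround generalised to any thread count, using only Base Julia. It is not a LoopVectorization feature and does not use @tturbo: the names free_threads and do_chunked! are hypothetical, the chunking is done by hand with Threads.@spawn, and the caller still has to say how many threads are busy (one by default, for the task running do_single!). It only illustrates the behaviour I am after, not automatic tracking.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: threads left over when `busy` threads are already
# occupied by other tasks. Never returns less than 1.
free_threads(busy::Integer=1) = max(1, nthreads() - busy)

# Hypothetical stand-in for do_multi!: split `a` into at most `ntasks`
# contiguous chunks and update each chunk in its own task.
# Plain broadcasting is used here instead of @tturbo.
function do_chunked!(a::AbstractArray, ntasks::Integer=free_threads())
    isempty(a) && return a
    v = vec(a)                      # flat view sharing memory with `a`
    len = length(v)
    chunk = cld(len, ntasks)        # chunk length, rounded up
    @sync for lo in 1:chunk:len
        hi = min(lo + chunk - 1, len)
        @spawn begin
            w = view(v, lo:hi)
            w .+= sin.(w)           # same update as in do_single!
        end
    end
    return a
end

# Small usage check against the unthreaded result.
x = rand(8, 8, 8)
expected = x .+ sin.(x)
do_chunked!(x)
@assert x ≈ expected
```

With -t2 this runs the update in a single task, and with -t3 in two tasks, which mirrors the manual thread=1 and thread=2 settings in do_multi! above. Whether the per-chunk update can be replaced by @turbo on the view is not checked here.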