I have a simple test program (full listing at the bottom of this post) to learn how to divide work over threads. If I launch Julia with one thread, the output is:
────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:      15.2s /  22.9%           0.96GiB /   0.1%    
 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 do_work       10    3.47s  100.0%   347ms    670KiB  100.0%  67.0KiB
 multi         10    1.74s   50.1%   174ms      320B    0.0%    32.0B
 single        10    1.73s   49.8%   173ms   8.02KiB    1.2%     821B
 ────────────────────────────────────────────────────────────────────
If I run with two threads, the output is:
────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:      16.1s /  14.4%           1.07GiB /   0.2%    
 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 do_work       10    2.32s  100.0%   232ms   1.91MiB  100.0%   195KiB
 multi         10    1.76s   76.0%   176ms   1.89MiB   99.2%   194KiB
 single        10    1.72s   74.0%   172ms   9.03KiB    0.5%     925B
 ────────────────────────────────────────────────────────────────────
The benchmark above shows that LoopVectorization does not play nicely with the thread that is already occupied by the task running do_single!: the speedup is not a factor of 2. If I manually set the number of threads used in do_multi! to one less than the total, the two-thread benchmark comes out as expected, with a factor-of-2 speedup:
────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:      12.4s /  12.8%           0.96GiB /   0.1%    
 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 do_work       10    1.58s  100.0%   158ms    685KiB  100.0%  68.5KiB
 multi         10    1.57s   99.5%   157ms   8.58KiB    1.3%     878B
 single        10    1.57s   99.5%   157ms   17.5KiB    2.6%  1.75KiB
 ────────────────────────────────────────────────────────────────────
I can fix this for any thread count by manually setting the number of threads to one less than the total (see the sketch after the code listing below), but I would like Julia to pick this up automatically, by keeping track of how many threads are already in use. How can I do this?
## Load packages.
using LoopVectorization
using TimerOutputs
## Define functions.
function do_single!(a)
    @turbo a .+= sin.(a)
end
function do_multi!(a)
    @tturbo a .+= sin.(a)
    # Line below solves the problem if Julia is launched with -t2.
    # @turbo thread=1 a .+= sin.(a)
    # Line below solves the problem if Julia is launched with -t3.
    # @turbo thread=2 a .+= sin.(a)
end
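# Run do_single! and do_multi! concurrently, each in its own task.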
function do_work(a, b)
    @sync begin
        Threads.@spawn @timeit to "single" do_single!(a)
        Threads.@spawn @timeit to "multi" do_multi!(b)
    end
end
## Do benchmark and print output to screen.
n = 512
a = rand(n, n, n)
b = rand(n, n, n)
to = TimerOutput()
@notimeit to do_work(a, b)
for i in 1:10
    @timeit to "do_work" do_work(a, b)
end
show(TimerOutputs.flatten(to))
println("")