How to control threads when combining LoopVectorization and @spawn

I have a simple test program to learn how to divide work over threads. If I launch Julia with one thread, the output is:

                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:      15.2s /  22.9%           0.96GiB /   0.1%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 do_work       10    3.47s  100.0%   347ms    670KiB  100.0%  67.0KiB
 multi         10    1.74s   50.1%   174ms      320B    0.0%    32.0B
 single        10    1.73s   49.8%   173ms   8.02KiB    1.2%     821B

If I run with two threads, the output is:

                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:      16.1s /  14.4%           1.07GiB /   0.2%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 do_work       10    2.32s  100.0%   232ms   1.91MiB  100.0%   195KiB
 multi         10    1.76s   76.0%   176ms   1.89MiB   99.2%   194KiB
 single        10    1.72s   74.0%   172ms   9.03KiB    0.5%     925B

The benchmark above shows that LoopVectorization does not account for the thread that is already in use by the task running do_single!, because the speedup falls well short of a factor of 2. If I manually set the number of threads used in do_multi! to one less than the total, the two-thread benchmark comes out well, with roughly a factor of 2 speedup:

                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:      12.4s /  12.8%           0.96GiB /   0.1%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 do_work       10    1.58s  100.0%   158ms    685KiB  100.0%  68.5KiB
 multi         10    1.57s   99.5%   157ms   8.58KiB    1.3%     878B
 single        10    1.57s   99.5%   157ms   17.5KiB    2.6%  1.75KiB

I can fix this for any thread count by manually setting the number of threads in do_multi! to one less than the total, but I would like Julia to handle this automatically by keeping track of how many threads are already in use. How can I do this?

## Load packages.
using LoopVectorization
using TimerOutputs

## Define functions.
function do_single!(a)
    @turbo a .+= sin.(a)
end

function do_multi!(a)
    @tturbo a .+= sin.(a)
    # Line below solves the problem if Julia is launched with -t2.
    # @turbo thread=1 a .+= sin.(a)
    # Line below solves the problem if Julia is launched with -t3.
    # @turbo thread=2 a .+= sin.(a)
end

function do_work(a, b)
    @sync begin
        Threads.@spawn @timeit to "single" do_single!(a)
        Threads.@spawn @timeit to "multi" do_multi!(b)
    end
end

## Do benchmark and print output to screen.
n = 512
a = rand(n, n, n)
b = rand(n, n, n)

to = TimerOutput()
@notimeit to do_work(a, b)

for i in 1:10
    @timeit to "do_work" do_work(a, b)
end

show(to)

Threads.nthreads() will tell you how many threads there are in total.

But how do I know how many threads are available, rather than how many there are in total? In my example it looks like @tturbo uses nthreads, but one thread is already in use by the @spawn that runs the do_single! function, with the result that the do_work function loses performance. I would need nthreads-1 in this case, and I do not know how to give that number to @turbo.
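
One way to get this kind of manual control with only Base Julia is to split the work into a chosen number of chunks and spawn one task per chunk. The sketch below is an illustration under assumptions: `add_sin_chunked!` and `ntasks_multi` are hypothetical names, and it uses a plain loop rather than @tturbo, so it says nothing about how LoopVectorization schedules its own threads.

```julia
using Base.Threads: nthreads, @spawn

# Hypothetical helper: compute a .+= sin.(a) using at most `ntasks` tasks.
function add_sin_chunked!(a::AbstractArray, ntasks::Integer)
    isempty(a) && return a
    ntasks = clamp(ntasks, 1, length(a))
    # Split the index range into at most `ntasks` contiguous chunks.
    chunks = Iterators.partition(eachindex(a), cld(length(a), ntasks))
    @sync for chunk in chunks
        @spawn begin
            for i in chunk
                a[i] += sin(a[i])
            end
        end
    end
    return a
end

# Leave one thread free for the task that runs the single-threaded work.
ntasks_multi = max(1, nthreads() - 1)

a = rand(64, 64)
expected = a .+ sin.(a)
add_sin_chunked!(a, ntasks_multi)
```

The number of tasks is still chosen by hand here; the sketch only shows where nthreads-1 would go, not how to discover how many threads are idle.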

That seems to be difficult, indeed.

If someone wants to make a PR to Polyester to add support for an @spawn-like macro based on PolyesterWeave.jl, I could describe how to do this/answer any questions.
Because LoopVectorization.jl also uses PolyesterWeave.jl, the two should achieve the desired behavior here.

If someone wants to get started, I suggest looking through the ThreadingUtilities.jl tests, e.g. the static array tests, to see how to launch a single thread using ThreadingUtilities.
The Polyester source shows how it uses PolyesterWeave to request threads, as well as how it sets up calls. You’d of course only be requesting threads.
The correct use would then look something like

function do_work(a, b)
    # `Polyester.@spawn` is the proposed macro; it does not exist yet.
    t = Polyester.@spawn @timeit to "single" do_single!(a)
    # It is important NOT to `@spawn` the last work item you wish to execute,
    # so the second one below is not behind an `@spawn`.
    @timeit to "multi" do_multi!(b)
end
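
The control flow of "spawn everything except the last work item" can be tried out today with Base's `Threads.@spawn`. This is only a sketch of the pattern: `work_single!`, `work_multi!` and `do_work_sketch` are hypothetical stand-ins using plain broadcasts, and it does not give the worker tracking that the proposed Polyester macro would provide.

```julia
using Base.Threads: @spawn

# Stand-ins for the two work items (plain broadcasts, no LoopVectorization).
work_single!(a) = (a .+= sin.(a); a)
work_multi!(b) = (b .+= sin.(b); b)

function do_work_sketch(a, b)
    # Spawn every work item except the last one ...
    t = @spawn work_single!(a)
    # ... and run the last item on the current task, so the calling
    # task does useful work instead of only waiting.
    work_multi!(b)
    # Make sure the spawned item has finished before returning.
    wait(t)
    return nothing
end

a = fill(1.0, 8)
b = fill(2.0, 8)
do_work_sketch(a, b)
```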

It sounds like an interesting option to have, if that prevents the problem I have. In this example it is easy to keep track of, but in a complex code it would be nice if thread management, including spawn, gave some manual control over how many threads are used where.

I could give it a try, but first I would like to better understand the relation between Polyester, PolyesterWeave, and ThreadingUtilities.

ThreadingUtilities.jl launches tasks on __init__(), and lets you run code on these tasks.
If the tasks are actively looking for work, the latency between submitting work to a task and the task getting started is very low, maybe one or two orders of magnitude lower than when spawning a new task with @spawn. Whether this matters depends on how long the function actually takes to run; by the time something takes a millisecond, the overhead from @spawn becomes pretty negligible.
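
To make the idea of "tasks launched up front that wait for work" concrete, here is a conceptual sketch using only Base. It is not how ThreadingUtilities.jl is implemented and it will not reproduce its low latency; the `jobs` and `results` channels and the `worker` task are hypothetical names chosen for illustration.

```julia
using Base.Threads: @spawn

# A long-lived worker task that takes closures from a Channel.
jobs = Channel{Function}(32)
results = Channel{Any}(32)

worker = @spawn for job in jobs   # loops until `jobs` is closed
    put!(results, job())
end

# Submitting work reuses the already-running task instead of creating a new one.
put!(jobs, () -> sum(sin, 1:1000))
r_worker = take!(results)

# The same work through a freshly spawned task, for comparison.
r_spawn = fetch(@spawn sum(sin, 1:1000))

close(jobs)    # lets the worker's loop end
wait(worker)
```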

PolyesterWeave.jl provides some code that can track workers. Most significantly, it lets you request a batch of workers, or free a batch. When requesting a batch, it will mark all those it provides as busy. It will only provide worker tasks that are not busy.
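
The request/free bookkeeping described above can be pictured with a toy model. This is a sketch under assumptions, not PolyesterWeave.jl's API: `WorkerPool`, `request_workers!` and `free_workers!` are hypothetical names, and the pool is limited to 64 workers because it stores the busy flags in one `UInt64`.

```julia
# Toy model: worker i is busy when bit i-1 of `busy[]` is set.
struct WorkerPool
    nworkers::Int
    busy::Base.RefValue{UInt64}
    lock::ReentrantLock
end
WorkerPool(n::Integer) = WorkerPool(n, Ref(zero(UInt64)), ReentrantLock())

# Request up to `n` idle workers; returns their ids and marks them busy.
function request_workers!(pool::WorkerPool, n::Integer)
    lock(pool.lock) do
        ids = Int[]
        for i in 1:pool.nworkers
            length(ids) == n && break
            bit = one(UInt64) << (i - 1)
            if pool.busy[] & bit == 0
                pool.busy[] |= bit
                push!(ids, i)
            end
        end
        return ids
    end
end

# Mark the given workers as idle again.
function free_workers!(pool::WorkerPool, ids)
    lock(pool.lock) do
        for i in ids
            pool.busy[] &= ~(one(UInt64) << (i - 1))
        end
    end
    return nothing
end

pool = WorkerPool(3)
first_batch = request_workers!(pool, 1)    # one worker is claimed
second_batch = request_workers!(pool, 3)   # only the remaining idle workers are handed out
free_workers!(pool, first_batch)
third_batch = request_workers!(pool, 3)    # the freed worker is available again
```

In this model a second request never receives a worker that an earlier request still holds, which is the property that would let a threaded loop automatically use fewer threads when some are already taken.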

Polyester.jl uses ThreadingUtilities.jl to run code on tasks PolyesterWeave.jl says aren’t busy.
LoopVectorization.jl’s @tturbo does the same. Using PolyesterWeave.jl makes it compatible.
Ditto Octavian.jl.