How to enforce spawned tasks to be run on different threads?

Hey!
I would like to run a job several times in parallel (as many times as the number of available threads), and I would like to do this several times in a row. MWE:

function mytask() # The task to be parallelized
    println(Threads.threadid())
    x = 1.0
    for _ in 1:10^9   # burn some CPU time
        x += rand()
    end
end

for i in 1:2 # Say we do two iterations
    println("Iteration $i")
    @sync for _ in 1:Threads.nthreads()
        Threads.@spawn begin
            mytask()
        end
    end
end

Unfortunately, when I do this, say with 4 threads, the first iteration is perfectly parallelized, but starting from the second iteration one thread will run the task several times sequentially, yielding the following output:

Iteration 1
1
3
2
4
Iteration 2
1
3
1
1

Thus, iteration 2 takes three times as long as iteration 1 to complete. If I sketch the execution of threads vs. time, it looks like the following (a colored cell means the thread is running):

How can I enforce the task to be run in different threads in parallel for the 2nd iteration (and the next ones)?

Thanks for reading!

What is this for? I get that you want two iterations, but does your real task count depend on the number of threads? Are you studying something related to threading?

BTW, it just does whatever the scheduler thinks is best:

julia> for i in 1:2 # Say we do two iterations
           println("Iteration $i")
           @sync for _ in 1:Threads.nthreads()
               Threads.@spawn begin
                   mytask()
               end
           end
       end
Iteration 1
1
4
3
2
Iteration 2
1
4
2
3

Unless you’re seeing specific issues with this (e.g. if you rely on a thread-local cache), I’d say this is fine.

My real task number does not depend on the number of threads. I am working on Machine Learning and the task is actually some kind of Monte Carlo simulation. In my real application I should run 25 tasks in parallel and I have up to 36 available threads on a remote machine.

Notice that the parallelized tasks are embarrassingly parallel: they do not share any memory or data.

You mean there is some kind of scheduler optimization behind this? Any idea how to be 100% sure of reproducing the same “perfect parallelization” of iteration 1 in iteration 2?

If your real task is slow enough that, by the time the next @spawn starts, none of the previous tasks has finished, it should already be doing that.

Basically, if by the time you spawn your 24th task the 1st thread is available again, why wouldn’t you want the 1st thread to work?

ThreadPools.tmap
#or
ThreadPools.tforeach

maybe
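A sketch of what that could look like, assuming the third-party ThreadPools.jl package is installed and using a shorter dummy workload in place of the real Monte Carlo task:

```julia
using ThreadPools  # third-party package: ] add ThreadPools

# Dummy stand-in for the real Monte Carlo task
function mytask(i)
    x = 1.0
    for _ in 1:10^6
        x += rand()
    end
    return x
end

# tmap spreads the 25 jobs over the available threads and collects results
results = ThreadPools.tmap(mytask, 1:25)
```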

I could not achieve full parallelism with ThreadPools.tmap or ThreadPools.tforeach, though I succeeded using Threads.@threads for (see below).

for i in 1:2 # Say we do two iterations
    println("Iteration $i")
    Threads.@threads for _ in 1:4
        mytask()
    end
end

However, this forces the user to have a single outer for loop encompassing the multithreaded code, which is less convenient than @spawn IMO. It would be a problem with nested multithreaded for loops, for instance. I’d be interested in achieving fully parallelized code using something like @spawn. I remember being able to do that with apply_async in Python, by creating a pool of jobs wherever I wanted in nested for loops and “getting” them afterwards (executing the tasks and fetching the results).
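For what it’s worth, Threads.@spawn already returns a Task handle that can be stored and fetched later, which gives an apply_async-like pattern. A minimal sketch with a dummy workload in place of the real task:

```julia
# Dummy stand-in for the real Monte Carlo task
function mytask()
    x = 1.0
    for _ in 1:10^6
        x += rand()
    end
    return x
end

# Create the pool of jobs wherever convenient (works in nested loops too)...
tasks = [Threads.@spawn mytask() for _ in 1:25]

# ...and "get" them afterwards; fetch blocks until each task is done
results = fetch.(tasks)
```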

Yes, I would like this! However, my tasks are pretty long and of similar length, so I’d be happy just being able to run the 25 tasks in parallel.
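(For reference: if you truly need each iteration pinned to a distinct thread, @threads accepts the :static scheduling option on Julia 1.5 and later, which statically assigns chunks of iterations to threads. A sketch:)

```julia
# With :static and nthreads() iterations, iteration i is assigned to
# thread i, so every iteration runs on a different thread (Julia >= 1.5)
ids = zeros(Int, Threads.nthreads())
Threads.@threads :static for i in 1:Threads.nthreads()
    ids[i] = Threads.threadid()
end
```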