I am trying to use @spawn to parallelize a computation, but I am experiencing some problems which I do not understand. Here is the simplest code with which I could reproduce the issue:
```julia
using Dates

function f()
    s = 0.
    for i in 1:10^10
        s += rand()
    end
    return s
end

function t()
    for i in 1:Threads.nthreads()
        Threads.@spawn f()
        println(i, " ", Dates.now())
    end
    println("end")
end

println(Threads.nthreads())
t()
```
The output is, of course:
```
4
1 2020-08-08T20:41:15.736
2 2020-08-08T20:41:15.851
3 2020-08-08T20:41:55.117
4 2020-08-08T20:41:55.117
end
```
What I expected, however, is that at least nthreads-1 of the dates would be printed almost instantaneously (the -1 because perhaps one thread is used by the main interpreter; that is part of the question).
However, that is not what I get. The dates take a while to be printed, as if the loop were waiting for each spawned task to finish before continuing.
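To separate "the loop blocks on each `@spawn`" from "the tasks are simply slow", one way is to collect the `Task` handles first and only `fetch` them at the end. This is a sketch of that check, not the original code: the reduced loop bound `10^8` and the helper name `t2` are my additions, chosen so the test finishes quickly. If `@spawn` returns immediately, all the timestamps should print within milliseconds of each other, well before the final sum appears.

```julia
using Dates

function f()
    s = 0.0
    for i in 1:10^8   # smaller workload than 10^10, just for a quick check
        s += rand()
    end
    return s
end

# Spawn all tasks first, then print, then block once at the end.
function t2()
    tasks = [Threads.@spawn f() for _ in 1:Threads.nthreads()]
    for i in eachindex(tasks)
        println(i, " ", Dates.now())   # should appear almost instantly
    end
    results = fetch.(tasks)            # the only blocking point
    println("end, sum of sums = ", sum(results))
    return results
end

t2()
```

If the timestamps here print together but the original version's do not, the delay in the original is about scheduling or output buffering rather than `@spawn` itself.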
I recorded a video of that behavior, which is available here. You will see that the date for the second task takes a long time to be printed (even though the "2" itself appears). Perhaps that part is just output buffering, but my real problem is that the program seems to take much longer than it should if the tasks were actually being scheduled and the computations run in parallel.
The number of threads doing work also seems to vary: sometimes two tasks run at a time, sometimes four. It is not clear to me what to expect here, and the same code behaves differently each time it is run.
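To see which OS threads the tasks actually land on, one can have each task report `Threads.threadid()`. This is a small diagnostic sketch (the function name `where_do_they_run` is my own); with `JULIA_NUM_THREADS` set above 1 you would typically see several distinct ids, though the scheduler is free to place tasks differently on each run, which is consistent with the run-to-run variation described above.

```julia
# Spawn a handful of trivial tasks and record the thread each one ran on.
function where_do_they_run(n::Int = 8)
    ids = [fetch(Threads.@spawn Threads.threadid()) for _ in 1:n]
    println("thread ids used: ", sort(unique(ids)))
    return ids
end

where_do_they_run()
```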