How to Maximize CPU Utilization - @spawn Assigning to Busy Workers - Use pmap Instead

@spawn does assign tasks to threads that are not busy. Let's say I have a function slow(n::Int) calibrated so that it keeps a core on my machine busy for roughly 1 ms per unit of n:

using Base.Threads, BenchmarkTools

# Busy-work kernel: keeps one core busy for ~1 ms per unit of n on my machine
function slow(n)
    res = 0
    for _ in 1:n*2310
        res += sum(sin(1/rand())^rand(1:5) for _ in 1:10)
    end
    return res
end

julia> @benchmark slow(1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     967.756 μs (0.00% GC)
  median time:      996.769 μs (0.00% GC)
  mean time:        1.026 ms (0.00% GC)
  maximum time:     2.520 ms (0.00% GC)
  --------------
  samples:          4872
  evals/sample:     1

julia> @benchmark slow(1000)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.022 s (0.00% GC)
  median time:      1.028 s (0.00% GC)
  mean time:        1.034 s (0.00% GC)
  maximum time:     1.058 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1

Then I can check sequential vs parallel execution:

julia> @time foreach(_->slow(1_000), 1:nthreads())
  9.477004 seconds (104.81 k allocations: 5.884 MiB)

julia> @time @sync foreach(_->(Threads.@spawn slow(1_000)), 1:nthreads())
  1.385115 seconds (117.54 k allocations: 6.567 MiB)

julia> @time @sync @threads for _ in 1:nthreads()
           slow(1_000)
       end
  1.376207 seconds (118.42 k allocations: 6.578 MiB)

You can see that when all tasks take equally long, there is no difference between @spawn and @threads. You can also check with htop that all CPUs are employed.
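For reference, the degree of parallelism above is capped by the number of threads Julia was started with; it cannot be changed at runtime. A quick check (8 is my machine's core count, yours may differ):

$ julia --threads 8        # or: export JULIA_NUM_THREADS=8

julia> Threads.nthreads()
8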

Let’s check dynamic scheduling. If I randomize the n argument to slow uniformly over 1:1000, a task takes about 0.5 s on average. So if I spawn 160 such randomized tasks on 8 cores, it should take roughly 160 × 0.5 s / 8 ≈ 10 s if all cores are kept busy:

julia> @time @sync for _ in 1:160
           Threads.@spawn slow(rand(1:1000))
       end
 13.303092 seconds (19.02 k allocations: 1.198 MiB)

htop shows that all 8 cores are employed equally well.
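If you don’t want to rely on htop, here is a minimal sketch to check the distribution programmatically: it counts how many tasks finish on each thread (atomic counters avoid races; and since tasks may migrate between threads on newer Julia versions, treat this as a rough picture only):

thread_counts = [Threads.Atomic{Int}(0) for _ in 1:nthreads()]
@sync for _ in 1:160
    Threads.@spawn begin
        slow(rand(1:1000))
        # record the thread this task finished on
        Threads.atomic_add!(thread_counts[threadid()], 1)
    end
end
[c[] for c in thread_counts]   # roughly balanced counts expected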

Let’s check with @threads:

julia> @time @sync @threads for _ in 1:160
           slow(rand(1:1000))
       end
 14.933920 seconds (35.20 k allocations: 1.976 MiB)

This takes a bit longer: @threads splits the 160 iterations into equal chunks up front (static scheduling), so toward the end of the computation the CPU load gets more unbalanced as threads that drew cheap chunks sit idle.
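On Julia 1.8 and newer the schedule can be chosen explicitly; a minimal sketch, assuming a recent Julia (on older versions @threads only had the static behavior shown above):

# :dynamic load-balances iterations across threads (the default since 1.8);
# :static reproduces the fixed up-front chunking.
@time @threads :dynamic for _ in 1:160
    slow(rand(1:1000))
end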

Now let’s check with Distributed’s pmap, as suggested by @marius311:

julia> using Distributed

julia> addprocs();

julia> nprocs()
17

julia> @everywhere function slow(n)
           res = 0
           for _ in 1:n*2310
               res += sum(sin(1/rand()).^rand(1:5) for _ in 1:10)
           end
           return res
       end

julia> @time pmap(_->slow(rand(1:1000)), 1:160);
 10.296406 seconds (236.70 k allocations: 12.383 MiB, 0.09% gc time)

OK, this beats the two approaches above, with htop showing that the hyper-threads are employed as well (addprocs() started one worker per logical core, hence nprocs() == 17).
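Two pmap knobs are worth knowing here, plus cleanup; a sketch assuming the workers from addprocs() above are still alive (for tasks as expensive as these ~0.5 s ones, batching likely changes little):

# batch_size groups cheap tasks to cut messaging overhead;
# on_error keeps a single failing task from aborting the whole map.
@time pmap(_ -> slow(rand(1:1000)), 1:160; batch_size = 4);

rmprocs(workers())   # release the worker processes when done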

Two notes of caution

  1. You speak of @spawn and workers, @everywhere and Threads intermingled. Note that these belong to two different concepts of parallel computing in Julia: Threads.@spawn is shared-memory multithreading, while workers and @everywhere belong to Distributed multiprocessing. Better not to mix them up (see the sketch below).
  2. There are pathological cases where the scheduling of tasks to threads with @spawn does not work properly.
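To make the distinction in note 1 concrete, a minimal sketch (values are illustrative): threads share the memory of one process, while Distributed workers are separate processes whose code and data must be shipped explicitly.

using Base.Threads   # shared-memory multithreading
using Distributed    # multiprocessing with separate worker processes

# Threads: tasks can mutate the same array, since memory is shared.
acc = zeros(Int, 4)
@sync for i in 1:4
    Threads.@spawn (acc[i] = i^2)
end
acc                          # [1, 4, 9, 16]

# Distributed: workers have their own memory, so definitions must be
# shipped with @everywhere; results come back as return values.
addprocs(2)
@everywhere square(i) = i^2
pmap(square, 1:4)            # [1, 4, 9, 16]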