How to Maximize CPU Utilization - @spawn Assigning to Busy Workers - Use pmap Instead

@spawn does assign tasks to threads that are not busy. Let's say I have a function slow(n::Int) calibrated so that it keeps a core on my machine busy for roughly 1 ms per unit of n:

using Base.Threads, BenchmarkTools

# Busy-work kernel: keeps one core busy for ~1 ms per unit of n on my machine
function slow(n)
    res = 0
    for _ in 1:n*2310
        res += sum(sin(1/rand())^rand(1:5) for _ in 1:10)
    end
    return res
end

julia> @benchmark slow(1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     967.756 μs (0.00% GC)
  median time:      996.769 μs (0.00% GC)
  mean time:        1.026 ms (0.00% GC)
  maximum time:     2.520 ms (0.00% GC)
  --------------
  samples:          4872
  evals/sample:     1

julia> @benchmark slow(1000)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.022 s (0.00% GC)
  median time:      1.028 s (0.00% GC)
  mean time:        1.034 s (0.00% GC)
  maximum time:     1.058 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1

Then I can check sequential vs parallel execution:

julia> @time foreach(_->slow(1_000), 1:nthreads())
  9.477004 seconds (104.81 k allocations: 5.884 MiB)

julia> @time @sync foreach(_->(Threads.@spawn slow(1_000)), 1:nthreads())
  1.385115 seconds (117.54 k allocations: 6.567 MiB)

julia> @time @sync @threads for _ in 1:nthreads()
           slow(1_000)
       end
  1.376207 seconds (118.42 k allocations: 6.578 MiB)

You can see that when all tasks take equally long, there is no difference between @spawn and @threads. You can also check with htop that all CPUs are employed.
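For reference, the degree of parallelism above is capped by the number of threads Julia was started with; it cannot be changed at runtime. A quick check (8 is my machine's core count, yours may differ):

$ julia --threads 8        # or: export JULIA_NUM_THREADS=8

julia> Threads.nthreads()
8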

Let’s check dynamic scheduling. If I randomize the n argument to slow uniformly over 1:1000, a task takes about 0.5 s on average. So if I spawn 160 such randomized tasks on 8 cores, it should take roughly 160 × 0.5 s / 8 ≈ 10 s if all cores are kept busy:

julia> @time @sync for _ in 1:160
           Threads.@spawn slow(rand(1:1000))
       end
 13.303092 seconds (19.02 k allocations: 1.198 MiB)

htop shows that all 8 cores are employed equally well.
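If you don’t want to rely on htop, here is a minimal sketch to check the distribution programmatically: it counts how many tasks finish on each thread (atomic counters avoid races; and since tasks may migrate between threads on newer Julia versions, treat this as a rough picture only):

thread_counts = [Threads.Atomic{Int}(0) for _ in 1:nthreads()]
@sync for _ in 1:160
    Threads.@spawn begin
        slow(rand(1:1000))
        # record the thread this task finished on
        Threads.atomic_add!(thread_counts[threadid()], 1)
    end
end
[c[] for c in thread_counts]   # roughly balanced counts expected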

Let’s check with @threads:

julia> @time @sync @threads for _ in 1:160
           slow(rand(1:1000))
       end
 14.933920 seconds (35.20 k allocations: 1.976 MiB)

This takes a bit longer: @threads splits the 160 iterations into equal chunks up front (static scheduling), so toward the end of the computation the CPU load gets more unbalanced as threads that drew cheap chunks sit idle.
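On Julia 1.8 and newer the schedule can be chosen explicitly; a minimal sketch, assuming a recent Julia (on older versions @threads only had the static behavior shown above):

# :dynamic load-balances iterations across threads (the default since 1.8);
# :static reproduces the fixed up-front chunking.
@time @threads :dynamic for _ in 1:160
    slow(rand(1:1000))
end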

Now let’s check with Distributed’s pmap, as suggested by @marius311:

julia> using Distributed

julia> addprocs();

julia> nprocs()
17

julia> @everywhere function slow(n)
           res = 0
           for _ in 1:n*2310
               res += sum(sin(1/rand()).^rand(1:5) for _ in 1:10)
           end
           return res
       end

julia> @time pmap(_->slow(rand(1:1000)), 1:160);
 10.296406 seconds (236.70 k allocations: 12.383 MiB, 0.09% gc time)

OK, this beats the two approaches above, with htop showing that the hyper-threads are employed as well (addprocs() started one worker per logical core, hence nprocs() == 17).
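Two pmap knobs are worth knowing here, plus cleanup; a sketch assuming the workers from addprocs() above are still alive (for tasks as expensive as these ~0.5 s ones, batching likely changes little):

# batch_size groups cheap tasks to cut messaging overhead;
# on_error keeps a single failing task from aborting the whole map.
@time pmap(_ -> slow(rand(1:1000)), 1:160; batch_size = 4);

rmprocs(workers())   # release the worker processes when done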

Two notes of caution

  1. You speak of @spawn and workers, @everywhere and Threads intermingled. Note that these belong to two different concepts of parallel computing in Julia: Threads.@spawn is shared-memory multithreading, while workers and @everywhere belong to Distributed multiprocessing. Better not to mix them up (see the sketch below).
  2. There are pathological cases where the scheduling of tasks to threads with @spawn does not work properly.
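To make the distinction in note 1 concrete, a minimal sketch (values are illustrative): threads share the memory of one process, while Distributed workers are separate processes whose code and data must be shipped explicitly.

using Base.Threads   # shared-memory multithreading
using Distributed    # multiprocessing with separate worker processes

# Threads: tasks can mutate the same array, since memory is shared.
acc = zeros(Int, 4)
@sync for i in 1:4
    Threads.@spawn (acc[i] = i^2)
end
acc                          # [1, 4, 9, 16]

# Distributed: workers have their own memory, so definitions must be
# shipped with @everywhere; results come back as return values.
addprocs(2)
@everywhere square(i) = i^2
pmap(square, 1:4)            # [1, 4, 9, 16]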