Why don't remotecalls to a WorkerPool block when all workers are busy?

Based on the docs for remotecall, I expect the WorkerPool variant to block until a worker in the pool is available for a new job. From the docs:

  remotecall(f, pool::AbstractWorkerPool, args...; kwargs...) -> Future

  WorkerPool variant of remotecall(f, pid, ....). Wait for and take a free worker from pool and perform a remotecall on it.

So I’m confused as to why, when I run the script below, all 10 “starting” messages are printed first, and then a few seconds later all the “finished” messages are printed.

using Distributed

nwork = 2
addprocs(max(0, nwork - nworkers()))   # add worker processes as needed
njobs = 10

ws = workers()
wp = CachingPool(ws)                   # pool over all workers
rrefs = Vector{Any}(undef, njobs)      # holds the Futures

@everywhere function rfunc(i::Int)
    print("starting #$i\n")
    sleep(3)                           # stand-in for real work
    print("finished #$i\n")
    return i
end

for i = 1:njobs
    # I expected this call to block whenever both workers are busy
    rrefs[i] = remotecall(rfunc, wp, i)
end

Here’s the output:

julia> include("demo_pool.jl")

julia>       From worker 2:	starting #1
      From worker 2:	starting #3
      From worker 2:	starting #5
      From worker 2:	starting #7
      From worker 2:	starting #9
      From worker 3:	starting #2
      From worker 3:	starting #4
      From worker 3:	starting #6
      From worker 3:	starting #8
      From worker 3:	starting #10
      From worker 2:	finished #1
      From worker 2:	finished #3
      From worker 2:	finished #5
      From worker 2:	finished #7
      From worker 2:	finished #9
      From worker 3:	finished #2
      From worker 3:	finished #4
      From worker 3:	finished #6
      From worker 3:	finished #8
      From worker 3:	finished #10

If I do an @async remotecall_fetch instead, it behaves the way I intended, but I didn’t think that should be necessary.
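Roughly like this (a sketch rather than my exact script; the @sync wrapper is only there so the snippet stands alone, and note that rrefs then holds results rather than Futures):

@sync for i = 1:njobs
    @async begin
        # remotecall_fetch blocks this task until the worker returns a result
        rrefs[i] = remotecall_fetch(rfunc, wp, i)
    end
end

As far as I can tell, the worker isn’t returned to the pool until the fetch completes, so only two jobs run at a time. Here’s the output in that case: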

julia> include("demo_pool.jl")

julia>       From worker 3:	starting #2
      From worker 2:	starting #1
      From worker 2:	finished #1
      From worker 3:	finished #2
      From worker 3:	starting #3
      From worker 2:	starting #4
      From worker 3:	finished #3
      From worker 2:	finished #4
      From worker 3:	starting #5
      From worker 2:	starting #6
      From worker 3:	finished #5
      From worker 2:	finished #6
      From worker 3:	starting #7
      From worker 2:	starting #8
      From worker 3:	finished #7
      From worker 2:	finished #8
      From worker 3:	starting #9
      From worker 2:	starting #10
      From worker 3:	finished #9
      From worker 2:	finished #10

I had the same problem with remote_do.

The worker is free during sleep. If it were a long-running computation instead, things would look different. I think sleep yields to other tasks.
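For example (just a local illustration, not the pool setup above), two tasks on a single process overlap because sleep yields:

t1 = @async (println("task 1 sleeping"); sleep(1); println("task 1 done"))
t2 = @async (println("task 2 sleeping"); sleep(1); println("task 2 done"))
wait(t1); wait(t2)   # takes about 1 second total, not 2, because sleep yields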

That still wouldn’t explain why 10 tasks are apparently running simultaneously and yielding to one another when the size of the WorkerPool is only 2. Based on the docs I would have expected remotecall to block the main thread until the pool has available workers.

Also note that the behavior doesn’t change when sleep is replaced with computation. If you replace the sleep statement with this loop you’ll see what I mean.

    s = 0   # CPU-bound stand-in for sleep(3)
    for i in 1:10^9
        s += rand()
    end

(I understand that at the OS level the process may still be preempted in the middle of this loop, but regardless I’m still puzzled as to why Julia starts so many jobs in the first place.)
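For reference, here’s the whole function with that substitution (just the rfunc from the script above with the loop dropped in for sleep(3)):

@everywhere function rfunc(i::Int)
    print("starting #$i\n")
    s = 0                # burn CPU instead of sleeping
    for _ in 1:10^9
        s += rand()
    end
    print("finished #$i\n")
    return i
end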

I’d also be interested in what is going on here. In my case I run out of memory when all jobs are queued without blocking.

Thanks for the @async remotecall_fetch() idea. It’s not an intuitive construction!
