Why don't remotecalls to a WorkerPool block when all workers are busy?

Based on the docs for remotecall I expect the WorkerPool variant to block until a worker in the pool is available for a new job. From the docs:

  remotecall(f, pool::AbstractWorkerPool, args...; kwargs...) -> Future

  WorkerPool variant of remotecall(f, pid, ....). Wait for and take a free worker from pool and perform a remotecall on it.

So I’m confused as to why, when I run the script below, all 10 “starting” messages are printed right away, and then a few seconds later all the “finished” messages are printed.

using Distributed

nwork = 2
addprocs(max(0, nwork - nworkers()))  # top up the worker count to nwork
njobs = 10

ws = workers()
wp = CachingPool(ws)  # CachingPool is an AbstractWorkerPool
rrefs = Vector{Any}(undef, njobs)

@everywhere function rfunc(i::Int)
    print("starting #$i\n")
    sleep(3)
    print("finished #$i\n")
    return i
end

for i = 1:njobs
    rrefs[i] = remotecall(rfunc, wp, i)  # I expect this to block while no worker is free
end

Here’s the output:

julia> include("demo_pool.jl")

julia>       From worker 2:	starting #1
      From worker 2:	starting #3
      From worker 2:	starting #5
      From worker 2:	starting #7
      From worker 2:	starting #9
      From worker 3:	starting #2
      From worker 3:	starting #4
      From worker 3:	starting #6
      From worker 3:	starting #8
      From worker 3:	starting #10
      From worker 2:	finished #1
      From worker 2:	finished #3
      From worker 2:	finished #5
      From worker 2:	finished #7
      From worker 2:	finished #9
      From worker 3:	finished #2
      From worker 3:	finished #4
      From worker 3:	finished #6
      From worker 3:	finished #8
      From worker 3:	finished #10

If I do an @async remotecall_fetch it behaves like I intended, but I didn’t think that should be necessary.
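For concreteness, the loop becomes something like this (roughly; rrefs then holds Tasks, so the results come from fetch.(rrefs)):

for i = 1:njobs
    # each task blocks inside remotecall_fetch until a worker in the pool is free
    rrefs[i] = @async remotecall_fetch(rfunc, wp, i)
end

Here’s the output in that case: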

julia> include("demo_pool.jl")

julia>       From worker 3:	starting #2
      From worker 2:	starting #1
      From worker 2:	finished #1
      From worker 3:	finished #2
      From worker 3:	starting #3
      From worker 2:	starting #4
      From worker 3:	finished #3
      From worker 2:	finished #4
      From worker 3:	starting #5
      From worker 2:	starting #6
      From worker 3:	finished #5
      From worker 2:	finished #6
      From worker 3:	starting #7
      From worker 2:	starting #8
      From worker 3:	finished #7
      From worker 2:	finished #8
      From worker 3:	starting #9
      From worker 2:	starting #10
      From worker 3:	finished #9
      From worker 2:	finished #10

I had the same problem with remote_do.

The worker is free during sleep. If it were a long computation that actually kept the CPU busy, things would look different. I think sleep yields to other tasks.

That still wouldn’t explain why 10 tasks are apparently running simultaneously and yielding to one another when the size of the WorkerPool is only 2. Based on the docs I would have expected remotecall to block the caller until the pool has an available worker.

Also note that the behavior doesn’t change when sleep is replaced with real computation. If you replace the sleep(3) statement with this loop you’ll see what I mean:

    s = 0
    for i in 1:10^9
        s += rand()
    end

(I understand that at the OS level the process may yield even in the middle of this loop, but regardless I’m still puzzled as to why Julia starts so many jobs in the first place.)


I’d also be interested in what is going on here. In my case I run out of memory when all jobs are queued without blocking.

Thanks for the @async remotecall_fetch() idea. It’s not an intuitive construction!


If anyone has any insight, I’m still curious about the behavior of remotecall with a WorkerPool here and why it doesn’t block. Are the docs just incorrect? Or are we missing something about why @Cody-G needed an @async remotecall_fetch to get the desired blocking behavior, with only nwork jobs running simultaneously?

I can answer this question now, having dug a bit into the Distributed.jl code.

The WorkerPool version of remotecall does a take! on the pool to get a free worker, runs the remotecall on it, and then puts the worker back into the pool. Since remotecall just schedules rfunc to run on the worker asynchronously, it returns immediately, and so the worker is immediately returned to the pool. As a result, all the jobs are scheduled right away on the available workers.
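The logic is roughly this (paraphrased from Distributed.jl’s workerpool.jl; the pool variants of remotecall, remote_do, etc. all go through a helper of this shape):

function remotecall_pool(rc_f, f, pool::AbstractWorkerPool, args...; kwargs...)
    worker = take!(pool)   # blocks only until some worker is free
    try
        rc_f(f, worker, args...; kwargs...)   # for remotecall this returns a Future immediately
    finally
        put!(pool, worker)   # the worker goes straight back into the pool
    end
end

So remotecall(f, pool, ...) only blocks for as long as take! does, not for the duration of the remote job.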

In contrast, remotecall_fetch waits for the result of rfunc on the worker, so the WorkerPool version of remotecall_fetch doesn’t put the worker back into the pool until rfunc completes. Thus, only nwork jobs run on the workers at any one time.
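If you want the Future-returning style of remotecall but with the pool actually limiting concurrency, you can roll your own along these lines (a hypothetical helper, not part of Distributed; the name is mine):

function remotecall_hold(f, pool::AbstractWorkerPool, args...; kwargs...)
    w = take!(pool)   # blocks until a worker is free
    fut = remotecall(f, w, args...; kwargs...)
    @async begin
        try
            wait(fut)   # hold the worker until the remote job finishes (or fails)
        finally
            put!(pool, w)   # only now release the worker back to the pool
        end
    end
    return fut
end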

Considering this, it makes me think that maybe the WorkerPool version of remotecall isn’t that useful. After all, it just pulls a worker from the pool and immediately puts it back. In the case above you happen to end up with an equal number of tasks running on each worker, but there is no guarantee of that. Anyway, hope this helps!
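For what it’s worth, if the goal is simply “run njobs jobs with at most nwork in flight”, pmap accepts a worker pool and does the take!/remotecall_fetch/put! dance internally, so something like this (sketch) should reproduce the blocking behavior the original post wanted:

results = pmap(rfunc, wp, 1:njobs)   # at most one job per pooled worker at a time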
