Call to `Distributed.remotecall_fetch` never returns?

I am writing a Julia script for a scientific experiment and I want to put a time limit in calls made by the script. To solve this I tried to use this small package from @ararslan. The package works by creating a new worker process that will execute the call, while the main process will be pooling the channel associated to the remote call to check if it has finished, and will interrupt it if it is not finished by the time limit. However, it seems to not be working for me, and I reduced it to a MWE that I do not understand why it is not working. Maybe it is something basic, but I am not well versed in distributed computing.

import Distributed

Distributed.@everywhere function f()
    return 42
end

function ptimeout(f::Function, secs::Real; worker=1, poll=0.5, verbose=true)
    channel = Channel(1)
    @async begin
        println("start of @async block")
        value = remotecall_fetch(f, worker)
        println("remotecall_fetch returned")
        put!(channel, value)
        println("end of @async block")
    end
    println("passed the @async")
    timedwait(
        () -> (status = isready(channel); @show status; status),
        float(secs), pollint=float(poll)
    )
    isready(channel) && return true
    verbose && @warn "Time limit for computation exceeded."
    Distributed.interrupt(worker)
    close(channel)
    println("reached ptimeout end")
    false
end

function main()
    worker_id = only(Distributed.addprocs(1))
    @show ptimeout(f, 10; worker = worker_id)
end

main()

gives

passed the @async
status = false
start of @async block
status = false
status = false
status = false
status = false
...
┌ Warning: Time limit for computation exceeded.
└ @ Main ~/AreaDeTrabalho/mwe_timeout/mwe_timeout.jl:22

So, why my call to Distributed.remotecall_fetch never returns, if the function I called is so simple that it just returns a constant value?

One-time bump (not sure if this is discouraged here or not).

If someone have at least a hypothesis about why it blocks at that point I would be happy to know. Or if someone can run the code in their machine and confirm I am not going insane. I am really out of ideas of what to test next.

There’s a really simple mistake in your example but unfortunately you never get a useful error message to help you track it down. The problem is that your function f is not defined on the worker you try to run it on. You run

Distributed.@everywhere function f()
    return 42
end

at the top of your script but then start a new worker later on in the main function. You then try to run f on the new worker. When you run something with @everywhere it only runs it on the worker processes that are currently running. Julia doesn’t keep track of previous uses of @everywhere and automatically run them when new worker processes are started.

If you change your example to

using Distributed
addprocs(1)
Distributed.@everywhere function f()
    return 42
end
...
function main()
    @show ptimeout(f, 10; worker = 2)
end

then the example works for me.

Having said all that, I’m not sure why you don’t see an error printed to the screen about the function not being defined.

2 Likes

I am very thankful for your help. I did find the bug with your help, but it was not just what you pointed out, XD.

I changed to

import Distributed
const worker_id = only(Distributed.addprocs(1))
Distributed.@everywhere function f() 
    return 42
end
... # ptimeout definition is the same
function main()
    @show ptimeout(f, 10; worker = worker_id)
end

main()

and the same problem kept happening to me. Then I wrapped the @async and the timedwait in a @sync and started getting the exceptions (that ended up latent in the Task created by @async and never @sync'ed). The problem was not only that I should use @everywhere after creating the new process, but that as I used just import Distributed I should have written Distributed.remotecall_fetch(f, worker) inside ptimout not just remotecall_fetch. It was a incredibly rookie mistake, I was just bitten by the fact I was not aware I needed to @sync to see the exceptions bubble up.