Where is the error message on the remote worker?

The following code should give an error since foo is not defined on process 2.

using Distributed

addprocs()

function foo()
    return 1
end

remote_do(foo, 2)

I expect some error message somewhere, but the above code just runs fine. Where can I get the error message?

remote_do will just trigger the remote computation but will not fetch the exception.
remotecall_fetch will.
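For example, in a fresh session (a sketch; the worker id 2 assumes addprocs added at least one worker):

```julia
using Distributed
addprocs(1)

foo() = 1                     # defined on the master process only

remote_do(foo, 2)             # fire-and-forget: the failure is never reported back

ex = try
    remotecall_fetch(foo, 2)  # fetching propagates the worker's exception
catch e
    e
end
ex isa RemoteException        # true: wraps the UndefVarError from worker 2
```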

I understand that remotecall_fetch will return the error. However, the function executed on the remote process never returns: it waits forever for tasks from the master process:

mutable struct ParallelFunction
    jobs::RemoteChannel
    results::RemoteChannel
    nreceived::Int64
    ntaken::Int64
end


function do_work(f, jobs, results, args...)
    while true
        x = take!(jobs)
        result = f(x, args...)
        put!(results, (x, result))
    end
end


function ParallelFunction(f, args...; len=128)
    jobs = RemoteChannel(()->Channel{Any}(len))
    results = RemoteChannel(()->Channel{Any}(len))
    for p in workers()
        remote_do(do_work, p, f, jobs, results, args...)
    end
    return ParallelFunction(jobs, results, 0, 0)
end


function give!(pf::ParallelFunction, x)
    put!(pf.jobs, x)
    pf.nreceived += 1
end


function get!(pf::ParallelFunction)
    result = take!(pf.results)
    if result[2] isa ErrorException
        throw(result[2])
    end
    pf.ntaken += 1
    return result
end

Suppose a function f has four arguments: x, a, b and c. The goal is to evaluate f at several different values of x with the arguments a, b and c held fixed.
A ParallelFunction can be constructed as

pf = ParallelFunction(f, a, b, c; len=128)

To evaluate the function f at x1, x2 and x3, the following statements can be used

give!(pf, x1)
give!(pf, x2)
give!(pf, x3)

The function will be evaluated in parallel on whichever worker processes are available. To obtain the results:

for i in 1:3
    println(get!(pf))
end

get!(pf) returns a tuple: the first element is the argument x and the second element is the corresponding result.

However, it is possible that the user forgets to define the function f on the remote processes. In that case, get!(pf) will hang indefinitely, which is not desirable. I want to throw an error in this case.
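As a stopgap, the blocking take! can be guarded with a deadline so a missing f surfaces as a timeout error rather than a silent hang. A sketch (take_with_timeout is an illustrative helper, not part of the code above; it works for both Channel and RemoteChannel):

```julia
# Sketch: poll the channel with a deadline instead of blocking forever.
# `take_with_timeout` is an illustrative name, not part of any API above.
function take_with_timeout(ch; timeout = 10.0)
    deadline = time() + timeout
    while !isready(ch)
        time() >= deadline &&
            error("no result within $(timeout)s; is `f` defined on the workers?")
        sleep(0.05)
    end
    return take!(ch)
end
```

get! could then call take_with_timeout(pf.results) instead of take!(pf.results). Note this cannot distinguish a slow f from a missing one; it only bounds the wait.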

Why would it hang indefinitely? The thing is, the workers won't know whether the function is defined unless they actually try to execute it, which defeats the purpose of the future/fetch model.

The program hangs because no result is ever put on the remote channel, and taking a value blocks the main process. You could wrap the call to f in do_work in a try ... catch and put any caught exception on the channel as the result…

I tried the following implementation

function do_work(f, jobs, results, args...)
    while true
        x = take!(jobs)
        result = try
            f(x, args...)
        catch e
            ErrorException(sprint(showerror, e))
        end
        put!(results, (x, result))
    end
end

But this do_work will not catch any exception if f is simply not defined on the remote process. What I was asking is simply how to catch the undefined-f error.

As I said, you should NOT be able to before the fetch, because you won't know whether a function is defined for a worker until that worker has tried to execute it.

The thing is, how can I fix this problem in my program? I tried the code here, hoping the error could be caught, but it wasn't. Maybe there is some middleware somewhere, but I cannot find it in the documentation.

For the record, when do you want the error message: when you @spawnat, or when you fetch()?

I can accept the error being thrown anywhere, as long as I get access to it. As I explained here, my worker is designed to wait on the remote process forever for input from the master process, and should not be fetched.

Well, in that case you can let your remote workers put the error into your result queue, so your get! will pop an error message from the queue?

That was my plan, see here, but it didn't work. The error was not caught if f is not defined.

You can certainly catch the error, as you already know, so something must be going wrong in your program when the result is put into the results queue.

julia> er = try
           remotecall_fetch(hello, 2)
       catch e
           e
       end
RemoteException(2, CapturedException(UndefVarError(Symbol("#hello")), Any[(deserialize_datatype at Serialization.jl:1186, 1), (handle_deserialize at Serialization.jl:775, 1), (deserialize at Serialization.jl:722, 1), (handle_deserialize at Serialization.jl:782, 1), (deserialize_msg at Serialization.jl:722, 1), (#invokelatest#1 at essentials.jl:790 [inlined], 1), (invokelatest at essentials.jl:789 [inlined], 1), (message_handler_loop at process_messages.jl:183, 1), (process_tcp_streams at process_messages.jl:140, 1), (#105 at task.jl:268, 1)]))

That's the way to do it. Although I am not replicating all of your code,

r=try
    bar()  # bar is not defined
catch e
    e
end

results in an UndefVarError(:bar) being assigned to r. The problem is probably somewhere else… Try reducing the code a bit to the local worker (thread) and debugging it.

Thank you, but I don't think these will work. I believe (though I am not sure) the issue is that when calling the function

function do_work(f, jobs, results, args...)
  ...
end

with

remote_do(do_work, worker_id, f, jobs, results, args...)

f is deserialized on the worker before do_work is ever invoked. If f is not defined there, the UndefVarError is raised while the worker deserializes the message (as in the stack trace above), so I cannot catch the error inside the function do_work.
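One way around this, assuming f always lives in Main, is to send the function's name rather than the function object: a Symbol always deserializes, so the lookup failure happens inside the worker's loop where the try ... catch can forward it to the results channel. A sketch (do_work_by_name is a modified variant of do_work, not the original):

```julia
# Sketch of a workaround: pass the function name (a Symbol) instead of the
# function object. The Symbol always deserializes on the worker; resolving it
# with `getfield` raises UndefVarError only inside the try/catch below.
function do_work_by_name(fname::Symbol, jobs, results, args...)
    while true
        x = take!(jobs)
        result = try
            f = getfield(Main, fname)   # UndefVarError here if `f` is missing
            f(x, args...)
        catch e
            ErrorException(sprint(showerror, e))
        end
        put!(results, (x, result))
    end
end

# In the ParallelFunction constructor, dispatch by name instead:
#     remote_do(do_work_by_name, p, nameof(f), jobs, results, args...)
```

With this, an undefined f shows up as an ErrorException in the results queue, which the existing get! already rethrows.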