Where is the error message on the remote worker?

The following code should give an error since foo is not defined on process 2.

using Distributed

addprocs()

function foo()
    return 1
end

remote_do(foo, 2)

I expect some error message somewhere, but the above code just runs fine. Where can I get the error message?

remote_do will just trigger the remote computation but will not fetch the exception.
remotecall_fetch will.
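For example, in a fresh session (a sketch; the worker id 2 assumes addprocs added at least one worker):

```julia
using Distributed
addprocs(1)

foo() = 1                     # defined on the master process only

remote_do(foo, 2)             # fire-and-forget: the failure is never reported back

ex = try
    remotecall_fetch(foo, 2)  # fetching propagates the worker's exception
catch e
    e
end
ex isa RemoteException        # true: wraps the UndefVarError from worker 2
```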

I understand that remotecall_fetch will return the error. However, the function executed on the remote process never returns: it waits forever for tasks from the master process:

mutable struct ParallelFunction
    jobs::RemoteChannel
    results::RemoteChannel
    nreceived::Int64
    ntaken::Int64
end


function do_work(f, jobs, results, args...)
    while true
        x = take!(jobs)
        result = f(x, args...)
        put!(results, (x, result))
    end
end


function ParallelFunction(f, args...; len=128)
    jobs = RemoteChannel(()->Channel{Any}(len))
    results = RemoteChannel(()->Channel{Any}(len))
    for p in workers()
        remote_do(do_work, p, f, jobs, results, args...)
    end
    return ParallelFunction(jobs, results, 0, 0)
end


function give!(pf::ParallelFunction, x)
    put!(pf.jobs, x)
    pf.nreceived += 1
end


function get!(pf::ParallelFunction)
    result = take!(pf.results)
    if result[2] isa ErrorException
        throw(result[2])
    end
    pf.ntaken += 1
    return result
end

Suppose a function f has four arguments: x, a, b and c. The goal is to evaluate f at several different values of x with the arguments a, b and c held fixed.
A ParallelFunction can be constructed as

pf = ParallelFunction(f, a, b, c; len=128)

To evaluate the function f at x1, x2 and x3, the following statements can be used

give!(pf, x1)
give!(pf, x2)
give!(pf, x3)

The function will be evaluated in parallel on whichever worker processes are available. To obtain the results:

for i in 1:3
    println(get!(pf))
end

get!(pf) returns a tuple: the first element is the argument x and the second element is the corresponding result.

However, it is possible that the user forgets to define the function f on the remote processes. In that case, get!(pf) will hang indefinitely, which is not desirable. I want to throw an error in this case.
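As a stopgap, the blocking take! can be guarded with a deadline so a missing f surfaces as a timeout error rather than a silent hang. A sketch (take_with_timeout is an illustrative helper, not part of the code above; it works for both Channel and RemoteChannel):

```julia
# Sketch: poll the channel with a deadline instead of blocking forever.
# `take_with_timeout` is an illustrative name, not part of any API above.
function take_with_timeout(ch; timeout = 10.0)
    deadline = time() + timeout
    while !isready(ch)
        time() >= deadline &&
            error("no result within $(timeout)s; is `f` defined on the workers?")
        sleep(0.05)
    end
    return take!(ch)
end
```

get! could then call take_with_timeout(pf.results) instead of take!(pf.results). Note this cannot distinguish a slow f from a missing one; it only bounds the wait.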

Why would it hang indefinitely? The thing is, the workers won't know whether the function is defined unless they actually try to execute it, which defeats the purpose of the future/fetch model.

The program hangs because no result is ever put on the remote channel, and taking a value blocks the main process. You could wrap the call to f in do_work in a try ... catch and put any caught exception on the channel as the result…

I tried the following implementation

function do_work(f, jobs, results, args...)
    while true
        x = take!(jobs)
        result = try
            f(x, args...)
        catch e
            ErrorException(sprint(showerror, e))
        end
        put!(results, (x, result))
    end
end

But this do_work will not catch any exception if f is simply not defined on the remote process. What I was asking is simply how to catch the undefined-f error.

As I said, you should NOT be able to before the fetch, because you won't know whether a function is defined for a worker until that worker has tried to execute it.

The thing is, how can I fix this problem in my program? I tried the code here, hoping the error could be caught, but it wasn't. Maybe there is some middleware somewhere, but I cannot find it in the documentation.

For the record, when do you want the error message: when you @spawnat, or when you fetch()?

I can accept the error being thrown anywhere, as long as I get access to it. As I explained here, my worker is designed to wait on the remote process forever for input from the master process, and should not be fetched.

Well, in that case you can let your remote workers put the error into your result queue, so your get! will pop an error message from the queue?

That was my plan, see here, but it didn't work. The error was not caught if f is not defined.

You can certainly catch the error, as you already know, so something must be going wrong in your program when the result is put into the results queue.

julia> er = try
           remotecall_fetch(hello, 2)
       catch e
           e
       end
RemoteException(2, CapturedException(UndefVarError(Symbol("#hello")), Any[(deserialize_datatype at Serialization.jl:1186, 1), (handle_deserialize at Serialization.jl:775, 1), (deserialize at Serialization.jl:722, 1), (handle_deserialize at Serialization.jl:782, 1), (deserialize_msg at Serialization.jl:722, 1), (#invokelatest#1 at essentials.jl:790 [inlined], 1), (invokelatest at essentials.jl:789 [inlined], 1), (message_handler_loop at process_messages.jl:183, 1), (process_tcp_streams at process_messages.jl:140, 1), (#105 at task.jl:268, 1)]))

That's the way to do it. Although I am not replicating all of your code,

r=try
    bar()  # bar is not defined
catch e
    e
end

results in an UndefVarError(:bar) being assigned to r. The problem is probably somewhere else… Try reducing the code a bit to the local worker (thread) and debugging it.

Thank you, but I don't think these will work. I believe (though I am not sure) the issue is that when calling the function

function do_work(f, jobs, results, args...)
  ...
end

with

remote_do(do_work, worker_id, f, jobs, results, args...)

f is deserialized on the worker before do_work is ever invoked. If f is not defined there, the UndefVarError is raised while the worker deserializes the message (as in the stack trace above), so I cannot catch the error inside the function do_work.
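One way around this, assuming f always lives in Main, is to send the function's name rather than the function object: a Symbol always deserializes, so the lookup failure happens inside the worker's loop where the try ... catch can forward it to the results channel. A sketch (do_work_by_name is a modified variant of do_work, not the original):

```julia
# Sketch of a workaround: pass the function name (a Symbol) instead of the
# function object. The Symbol always deserializes on the worker; resolving it
# with `getfield` raises UndefVarError only inside the try/catch below.
function do_work_by_name(fname::Symbol, jobs, results, args...)
    while true
        x = take!(jobs)
        result = try
            f = getfield(Main, fname)   # UndefVarError here if `f` is missing
            f(x, args...)
        catch e
            ErrorException(sprint(showerror, e))
        end
        put!(results, (x, result))
    end
end

# In the ParallelFunction constructor, dispatch by name instead:
#     remote_do(do_work_by_name, p, nameof(f), jobs, results, args...)
```

With this, an undefined f shows up as an ErrorException in the results queue, which the existing get! already rethrows.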