After reading a bit of Failure-resilient parallel computing and the documentation of pmap, I wrapped my code in retry:
tasks = map(workers()) do w
    @async retry(
        () -> remotecall_wait(w, ...);
        delays = ExponentialBackOff(n=3),
        check = #= that the error is not mine =#,
    )()
end
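Further down, the tasks get awaited with something like this (a sketch; the real collection step differs):

# Any error that survives the retries resurfaces here as a
# TaskFailedException wrapping the original exception.
foreach(wait, tasks)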
On the workers, everything is already wrapped in a try/catch block. That makes the check part of retry a bit easier. Overall, I get past the first error described in the original post. Some stages are retried twice, so n=3 might not even be enough.
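To make that concrete, the check is roughly of this shape (a sketch with made-up names; the real code differs): the worker-side try/catch wraps failures of my own code in a marker exception type, and the check only has to look for that type.

using Distributed

# Sketch, names made up. On the worker, failures of my own code get
# wrapped in a marker exception before they propagate back:
struct MyAlgorithmError <: Exception
    inner::Any
end

function run_stage_safely(args...)
    try
        run_stage(args...)              # the actual algorithm stage (placeholder)
    catch err
        throw(MyAlgorithmError(err))
    end
end

# On the master, retry only if the failure did *not* originate in my code;
# infrastructure problems (e.g. ProcessExitedException) are worth retrying.
is_mine(err) = err isa MyAlgorithmError ||
    (err isa RemoteException && err.captured.ex isa MyAlgorithmError)
not_my_error(state, err) = !is_mine(err)

Something of that shape is what goes into the check keyword above.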
However, when writing the data back to disk (using HDF5.jl v0.15.7 and Julia v1.6.1, writing to one file per process), I almost always hit the Slurm timeout. The stdout/stderr logs are then full of
...
signal (15): Terminated
in expression starting at none:0
epoll_wait at /lib64/libc.so.6 (unknown line)
signal (15): Terminated
in expression starting at none:0
...
without any of my own prints/logs. The other logfiles that I write (one per process) are fine and show that my algorithm executed properly.

As I don’t handle errors that thoroughly in the “store my results” part, I suspected that maybe some worker process died because of a failed remotecall_fetch (similar to the issue in the original post). So I submitted the job once more, but with a stupidly big timeout, to check whether the workers are still alive after the actual algorithm is done. Unfortunately, they are:
$ cat dead-d44.nodes
node043
node044
node045
...
$ cat dead-d44.nodes | xargs -P0 -I{} -n1 ssh {} 'ps -U `whoami` -ocmd= | ...grep for workers... | wc -l' > dead-d44.counts
$ awk '{sum+=$1;} END{print sum;}' dead-d44.counts
450
(which I should have guessed, because in the original issue Slurm caught the failures and aborted the whole job … it did not this time)
Does the epoll_wait hint at HDF5.jl, or maybe at libuv? Julia v1.6.1 is pretty dated, so I am trying v1.6.5 now. What else can I do?
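For completeness, the “store my results” step on each process is roughly of this shape (a simplified sketch; the actual file names, dataset layout, and error handling differ):

using Distributed, HDF5

function store_results(results, outdir)
    # One file per process; the do-block form closes the file even if the
    # write throws.
    path = joinpath(outdir, "results-$(myid()).h5")
    h5open(path, "w") do file
        file["results"] = results       # e.g. an Array{Float64}
    end
end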