After reading a bit of Failure-resilient parallel computing and the documentation of pmap, I wrapped my code in retry:
tasks = map(workers()) do w
    @async retry(
        () -> remotecall_wait(w, ...);
        delays = ExponentialBackOff(n=3),
        check = #= that the error is not mine =#,
    )()
end
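Further down, the tasks get awaited with something like this (a sketch; the real collection step differs):

# Any error that survives the retries resurfaces here as a
# TaskFailedException wrapping the original exception.
foreach(wait, tasks)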
On the workers, everything is already wrapped in a try/catch block. That makes the check part of retry a bit easier. Overall, I get past the first error described in the original post. Some stages are retried twice, so n=3 might not even be enough.
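To make that concrete, the check is roughly of this shape (a sketch with made-up names; the real code differs): the worker-side try/catch wraps failures of my own code in a marker exception type, and the check only has to look for that type.

using Distributed

# Sketch, names made up. On the worker, failures of my own code get
# wrapped in a marker exception before they propagate back:
struct MyAlgorithmError <: Exception
    inner::Any
end

function run_stage_safely(args...)
    try
        run_stage(args...)              # the actual algorithm stage (placeholder)
    catch err
        throw(MyAlgorithmError(err))
    end
end

# On the master, retry only if the failure did *not* originate in my code;
# infrastructure problems (e.g. ProcessExitedException) are worth retrying.
is_mine(err) = err isa MyAlgorithmError ||
    (err isa RemoteException && err.captured.ex isa MyAlgorithmError)
not_my_error(state, err) = !is_mine(err)

Something of that shape is what goes into the check keyword above.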
However, when writing the data back to disk (using HDF5.jl v0.15.7 and Julia v1.6.1, writing to one file per process), I almost always hit the Slurm timeout. The stdout/stderr logs are then full of
...
signal (15): Terminated
in expression starting at none:0
epoll_wait at /lib64/libc.so.6 (unknown line)
signal (15): Terminated
in expression starting at none:0
...
without any of my own prints/logs. The other logfiles that I write (one per process) are fine and show that my algorithm executed properly.

As I don’t handle errors that thoroughly in the “store my results” part, I suspected that maybe some worker process died because of a failed remotecall_fetch (similar to the issue in the original post). So I submitted the job once more, but with a stupidly big timeout, to check whether the workers are still alive after the actual algorithm is done. Unfortunately, they are:
$ cat dead-d44.nodes
node043
node044
node045
...
$ cat dead-d44.nodes | xargs -P0 -I{} -n1 ssh {} 'ps -U `whoami` -ocmd= | ...grep for workers... | wc -l' > dead-d44.counts
$ awk '{sum+=$1;} END{print sum;}' dead-d44.counts
450
(which I should have guessed, because in the original issue Slurm caught the failures and aborted the whole job … it did not this time)
Does the epoll_wait hint at HDF5.jl, or maybe at libuv? Julia v1.6.1 is pretty dated, so I am trying v1.6.5 now. What else can I do?
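For completeness, the “store my results” step on each process is roughly of this shape (a simplified sketch; the actual file names, dataset layout, and error handling differ):

using Distributed, HDF5

function store_results(results, outdir)
    # One file per process; the do-block form closes the file even if the
    # write throws.
    path = joinpath(outdir, "results-$(myid()).h5")
    h5open(path, "w") do file
        file["results"] = results       # e.g. an Array{Float64}
    end
end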