How to debug remotecall_wait/deserialization failures?

After reading a bit of Failure-resilient parallel computing and the documentation of pmap, I wrapped my code in retry:

tasks = map(workers) do w
    @async retry(
        () -> remotecall_wait(w, ...);
        delays = ExponentialBackOff(n=3),
        check = #= that the error is not mine =#,
    )()
end

On the workers, everything is already wrapped in a try/catch block. That makes the check part of retry a bit easier. Overall, I get past the first error described in the original post. Some stages are retried twice, so n=3 might not even be enough.
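For reference, here is a minimal, self-contained sketch of the retry pattern described above. It is not my actual code: `MyAlgorithmError`, `is_transient`, and `flaky_call` are hypothetical names, and `flaky_call` merely stands in for the `remotecall_wait(w, ...)` call so that the snippet runs without any workers. It relies on `retry` calling `check(state, exception)` and retrying only when that returns `true`.

```julia
# Hypothetical error type standing in for "my own" application errors.
struct MyAlgorithmError <: Exception
    msg::String
end

# `retry` passes the current delay state and the exception to `check`.
# Retry everything except errors raised by the algorithm itself.
is_transient(_state, e) = !(e isa MyAlgorithmError)

# Stand-in for `remotecall_wait(w, ...)`: fails twice, then succeeds.
const attempts = Ref(0)
function flaky_call()
    attempts[] += 1
    attempts[] < 3 && error("simulated transport failure")
    return :ok
end

# n = 3 delays allows up to three retries after the first attempt.
robust_call = retry(
    flaky_call;
    delays = ExponentialBackOff(n = 3, first_delay = 0.01),
    check = is_transient,
)

result = robust_call()
```

With `n = 3` the function is attempted at most four times in total, so a stage that already needs two retries leaves only one to spare.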

However, when writing the data back to disk (using HDF5.jl v0.15.7 and Julia v1.6.1, writing to one file per process), I almost always hit the Slurm timeout. The logs of stdout/stderr are then full of

signal (15): Terminated
in expression starting at none:0
epoll_wait at /lib64/ (unknown line)

signal (15): Terminated
in expression starting at none:0

without any of my own prints/logs. The other logfiles that I write (one per process) are ok and show that my algorithm executed properly. As I don’t handle errors that thoroughly on the “store my results” part, I suspected that maybe some worker process died because of a failed remotecall_fetch (similar to the issue in the original post). So I issued the job once more but with a stupidly big timeout to check whether the workers are still alive after the actual algorithm was done. Unfortunately, they are:

$ cat dead-d44.nodes

$ cat dead-d44.nodes | xargs -P0 -I{} -n1 ssh {} 'ps -U `whoami` -ocmd= | ...grep for workers... | wc -l' > dead-d44.counts

$ awk '{sum+=$1;} END{print sum;}' dead-d44.counts

(which I should have guessed, because Slurm caught the failures and aborted the whole job in the original issue … it did not this time)
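The ssh/ps check above only shows that the processes exist, not that they still answer requests. A complementary probe from the master process could look like the sketch below. This is an assumption on my part, not something I have run on the cluster: `responsive_workers` is a hypothetical helper, and the 5-second timeout is arbitrary.

```julia
using Distributed

# Hypothetical helper: return the pids that answer a trivial remote call
# within `timeout` seconds. A worker that exists but is stuck (for example
# inside a blocking write) would not show up in the result.
function responsive_workers(pids; timeout = 5.0)
    alive = Int[]
    for p in pids
        # Ask the worker for its own id; run it as a task so we can time out.
        t = @async remotecall_fetch(myid, p)
        status = timedwait(() -> istaskdone(t), timeout)
        if status === :ok && !istaskfailed(t) && fetch(t) == p
            push!(alive, p)
        end
    end
    return alive
end

# Pids that did not respond in time.
stuck = setdiff(workers(), responsive_workers(workers()))
```

The probe runs sequentially, so with many workers the worst case takes `timeout` seconds per unresponsive worker.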

The epoll_wait hints at HDF5.jl or maybe libuv? Julia v1.6.1 is pretty dated, so I am trying v1.6.5 now. What else can I do?