How to debug remotecall_wait/deserialization failures?

When running code on 450 cores / 29 nodes in a Slurm environment (using the handy SlurmClusterManager.jl), sometimes everything works fine, and sometimes I get failures on 1-7 workers. How do I investigate, debug, and fix this?

My usage pattern looks something like

# issue work
tasks = map(workers) do w
    @async remotecall_wait(w, ...)
end

# wait for completion / collect failures
@sync for (n, t) in enumerate(tasks)
    # This try/catch block is technically not needed for the current design.
    # However, it has proved handy when the package author screwed up again.
    @async try
        stage = fetch_stage(t)
        isfailed(stage) || return
        @warn "Cancelling pipeline due to failure on stage $n"
        cancel_pipeline!(pl)
    catch
        @error "Cancelling pipeline due to hard failure, maybe on stage $n"
        cancel_pipeline!(pl)
        rethrow()
    end
end

where

fetch_stage(t::Union{Task,Future}) = fetch_stage(fetch(t))
fetch_stage(s::StageRef) = s
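
To at least narrow things down, I could record which worker and which exception hit the catch branch before cancelling. A rough sketch of an extended catch branch (assuming the order of tasks still matches workers(); everything else stays as above):

@async try
    stage = fetch_stage(t)
    isfailed(stage) || return
    @warn "Cancelling pipeline due to failure on stage $n"
    cancel_pipeline!(pl)
catch err
    # Attach the worker id and the full exception + backtrace to the log record,
    # so the failing worker can be matched against its logfile later.
    @error "Cancelling pipeline due to hard failure, maybe on stage $n" worker = workers()[n] exception = (err, catch_backtrace())
    cancel_pipeline!(pl)
    rethrow()
end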

Here is the error message:

[ Info: Launching solver
┌ Error: Cancelling pipeline due to hard failure, maybe on stage 376
└ @ ...
┌ Error: Cancelling pipeline due to hard failure, maybe on stage 406
└ @ ...
┌ Error: Cancelling pipeline due to hard failure, maybe on stage 422
└ @ ...
ERROR: LoadError: LoadError: TaskFailedException

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:322 [inlined]
     [2] fetch
       @ ./task.jl:337 [inlined]
     [3] fetch_stage(t::Task)
[...]
        nested task error: On worker 377:
        TypeError: in new, expected ParaReal.LazyFormatLogger, got a value of type Int64
        Stacktrace:
          [1] deserialize
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:1409
          [2] handle_deserialize
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:846
          [3] deserialize
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:782
          [4] #5
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:941
          [5] ntupleany
            @ ./ntuple.jl:43
          [6] deserialize_tuple
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:941
          [7] handle_deserialize
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:825
          [8] deserialize
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:782 [inlined]
          [9] deserialize_msg
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:87
         [10] #invokelatest#2
            @ ./essentials.jl:708 [inlined]
         [11] invokelatest
            @ ./essentials.jl:706 [inlined]
         [12] message_handler_loop
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:169
         [13] process_tcp_streams
            @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:126
         [14] #99
            @ ./task.jl:411
        Stacktrace:
         [1] remotecall_wait(::Function, ::Distributed.Worker, ::GDREProblem{Matrix{Float64}}, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
           @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:436
         [2] remotecall_wait(::Function, ::Distributed.Worker, ::GDREProblem{Matrix{Float64}}, ::Vararg{Any, N} where N)
           @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:427
         [3] remotecall_wait(::Function, ::Int64, ::GDREProblem{Matrix{Float64}}, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
[...]

Could the messages sent by remotecall_wait be corrupted? Judging by [13] process_tcp_streams, I guess that remotecall_wait uses TCP, which I had hoped would handle corrupted messages for me.
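
One way to narrow this down might be to round-trip the payload locally, i.e. serialize and deserialize the same arguments in-process, to see whether the objects themselves survive. A minimal sketch (Distributed uses its own serializer, so this only approximates what goes over the wire; prob and logger are hypothetical stand-ins for whatever is actually passed to remotecall_wait):

using Serialization

# Serialize and deserialize the given arguments in-process; this throws if the
# payload itself cannot be (de)serialized, independent of any transport.
function roundtrips(args...)
    buf = IOBuffer()
    serialize(buf, args)
    seekstart(buf)
    deserialize(buf)
    return true
end

roundtrips(prob, logger)  # hypothetical stand-ins for the remotecall_wait arguments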

(Edit: sorry for posting early and then editing multiple times. I pushed the wrong button on my keyboard and published the post before it was done.)

After reading a bit of Failure-resilient parallel computing and the documentation of pmap, I wrapped my code in retry:

tasks = map(workers) do w
    @async retry(
        () -> remotecall_wait(w, ...);
        delays = ExponentialBackOff(n=3),
        check = #= that the error is not mine =#,
    )()
end

On the workers, everything is already wrapped in a try/catch block, which makes the check part of retry a bit easier. With this, I get past the first error described in the original post. Some stages are retried twice, though, so n=3 might not even be enough.
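
For reference, the check predicate might look something like this (a sketch; MyApplicationError is a hypothetical stand-in for the errors my own code already catches on the workers, and retry calls check with the current delay state and the exception):

using Distributed

# Hypothetical marker for errors thrown by my own code.
struct MyApplicationError <: Exception end

# Only retry errors that look like transport or worker failures,
# not errors raised by my own code.
should_retry(exc) =
    exc isa ProcessExitedException ||
    (exc isa RemoteException && !(exc.captured.ex isa MyApplicationError))

check = (state, exc) -> should_retry(exc)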

However, when writing the data back to disk (using HDF5.jl v0.15.7 and Julia v1.6.1, writing to one file per process), I almost always hit the Slurm timeout. The stdout/stderr logs are then full of

...
signal (15): Terminated
in expression starting at none:0
epoll_wait at /lib64/libc.so.6 (unknown line)

signal (15): Terminated
in expression starting at none:0
...

without any of my own prints/logs. The other logfiles that I write (one per process) are fine and show that my algorithm executed properly. As I don’t handle errors that thoroughly in the “store my results” part, I suspected that maybe some worker process had died due to a failed remotecall_fetch (similar to the issue in the original post). So I submitted the job once more, with a stupidly big timeout, to check whether the workers are still alive after the actual algorithm is done. Unfortunately, they are:

$ cat dead-d44.nodes
node043
node044
node045
...

$ cat dead-d44.nodes | xargs -P0 -I{} -n1 ssh {} 'ps -U `whoami` -ocmd= | ...grep for workers... | wc -l' > dead-d44.counts

$ awk '{sum+=$1;} END{print sum;}' dead-d44.counts
450

(which I should have guessed, because Slurm caught the failures and aborted the whole job in the original issue … it did not this time)
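
A related probe from the Julia side, instead of ssh + ps, would be to ask every worker for something trivial and report the ones that do not answer within a timeout. A minimal sketch, assuming it can be run from the driver process:

using Distributed

# Workers that do not respond to a trivial remotecall within `timeout` seconds
# (or whose probe task fails) are reported; answering only proves that the
# message loop is alive, not that the worker isn't blocked elsewhere.
function unresponsive_workers(; timeout = 30.0)
    stuck = Int[]
    @sync for w in workers()
        @async begin
            t = @async remotecall_fetch(myid, w)
            if timedwait(() -> istaskdone(t), timeout) !== :ok || istaskfailed(t)
                push!(stuck, w)
            end
        end
    end
    return stuck
end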

Does the epoll_wait hint at a problem in HDF5.jl, or maybe in libuv? Julia v1.6.1 is pretty dated, so I am trying v1.6.5 now. What else can I do?

It doesn’t work on Julia v1.6.5 either. I’ve heard of strace but don’t know much about it. According to it, both the manager and the worker processes are stuck in an epoll_wait call, but I don’t know what that means:

# manager
$ strace -p193409
strace: Process 193409 attached
epoll_wait(3, ^Cstrace: Process 193409 detached
 <detached ...>
# worker
$ strace -p197125
strace: Process 197125 attached
epoll_wait(3, 

Locally, I would press Ctrl-T to send a SIGINFO and maybe get a hint on where the program is stuck. I tried doing the same on the running job, but it then failed:

$ kill -s 29 193409
Connection to node006 closed by remote host.

At least the slurm logs contain the result of the signal, but nothing else:

$ cat d44.out
$ cat d44.err
/var/spool/slurm/d/.../slurm_script: line 34: 193409 I/O possible            julia dense-par.jl

Maybe my signal caused it to fail: on Linux, signal 29 is SIGIO rather than SIGINFO, and its default action is to terminate the process, which matches the “I/O possible” status above. But the algorithm was done, and the I/O should have been as well. This is frustrating. Do you have any idea what I should try?

I found the culprit of my second issue: a good old deadlock that’s completely my fault. Thank goodness I wrote logfiles from all the workers (yes, all 450), otherwise I would never have found this.
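
In case it helps someone else: per-worker logfiles can be set up along these lines (a rough sketch, not the actual setup; the code here passes a ParaReal LazyFormatLogger around, as the error above shows):

using Distributed

# Give every worker its own logfile so that failures and deadlocks can be
# reconstructed per worker afterwards.
@everywhere workers() begin
    using Distributed, Logging
    const LOG_IO = open(string("worker-", myid(), ".log"), "w")
    global_logger(SimpleLogger(LOG_IO, Logging.Info))
    # The stream is buffered: flush/close it before the worker exits, or the
    # last messages before a crash may be lost.
end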

This leads us back to the first problem. How do I debug this? Is it an issue worth reporting, and if so, where?