When running code on 450 cores / 29 nodes in a Slurm environment (using the handy SlurmClusterManager.jl), sometimes everything works fine, sometimes I get failures on 1-7 workers. How do I investigate/debug/fix this?
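For context, the workers are brought up roughly like this before any work is issued (a minimal sketch; the actual sbatch script, module loads, and package setup are omitted, and the @everywhere line is an assumption):

using Distributed, SlurmClusterManager

# Inside the Slurm allocation: SlurmManager reads the SLURM_* environment
# variables and starts one Julia worker per allocated task.
addprocs(SlurmManager())

# Make the pipeline code available on every worker.
@everywhere using ParaReal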
My usage pattern looks something like this:
# issue work
tasks = map(workers) do w
    @async remotecall_wait(w, ...)
end

# wait for completion / collect failures
@sync for (n, t) in enumerate(tasks)
    # This try-catch block is technically not needed for the current design.
    # However, it proved to be handy if the package author screwed up again.
    @async try
        stage = fetch_stage(t)
        isfailed(stage) || return
        @warn "Cancelling pipeline due to failure on stage $n"
        cancel_pipeline!(pl)
    catch
        @error "Cancelling pipeline due to hard failure, maybe on stage $n"
        cancel_pipeline!(pl)
        rethrow()
    end
end
where
fetch_stage(t::Union{Task,Future}) = fetch_stage(fetch(t))
fetch_stage(s::StageRef) = s
Here is the error message:
[ Info: Launching solver
┌ Error: Cancelling pipeline due to hard failure, maybe on stage 376
└ @ ...
┌ Error: Cancelling pipeline due to hard failure, maybe on stage 406
└ @ ...
┌ Error: Cancelling pipeline due to hard failure, maybe on stage 422
└ @ ...
ERROR: LoadError: LoadError: TaskFailedException

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:322 [inlined]
     [2] fetch
       @ ./task.jl:337 [inlined]
     [3] fetch_stage(t::Task)
       [...]

        nested task error: On worker 377:
        TypeError: in new, expected ParaReal.LazyFormatLogger, got a value of type Int64
        Stacktrace:
         [1] deserialize
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:1409
         [2] handle_deserialize
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:846
         [3] deserialize
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:782
         [4] #5
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:941
         [5] ntupleany
           @ ./ntuple.jl:43
         [6] deserialize_tuple
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:941
         [7] handle_deserialize
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:825
         [8] deserialize
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:782 [inlined]
         [9] deserialize_msg
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:87
        [10] #invokelatest#2
           @ ./essentials.jl:708 [inlined]
        [11] invokelatest
           @ ./essentials.jl:706 [inlined]
        [12] message_handler_loop
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:169
        [13] process_tcp_streams
           @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:126
        [14] #99
           @ ./task.jl:411
        Stacktrace:
         [1] remotecall_wait(::Function, ::Distributed.Worker, ::GDREProblem{Matrix{Float64}}, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
           @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:436
         [2] remotecall_wait(::Function, ::Distributed.Worker, ::GDREProblem{Matrix{Float64}}, ::Vararg{Any, N} where N)
           @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:427
         [3] remotecall_wait(::Function, ::Int64, ::GDREProblem{Matrix{Float64}}, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
         [...]
Could the messages being sent by remotecall_wait be corrupted? Judging from frame [13] process_tcp_streams, I guess that remotecall_wait uses TCP, which I had hoped would handle failed messages for me.
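To rule out the payload itself (as opposed to the transport or worker state), I suppose I could round-trip the call arguments through Serialization locally before issuing the remote calls. A minimal sketch; roundtrip is just a throwaway helper, and the arguments in the comment stand in for whatever remotecall_wait is actually given:

using Serialization

# Serialize and immediately deserialize a payload in the local process;
# this throws if the payload itself does not survive a round trip.
function roundtrip(args...)
    io = IOBuffer()
    serialize(io, args)
    seekstart(io)
    return deserialize(io)
end

# e.g. roundtrip(solve_stage, prob) with the same arguments passed to remotecall_wait

If the arguments round-trip fine locally, that would point more toward something going wrong on the wire or in the state of the affected workers.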
(Edit: sorry for posting early and then editing multiple times. I pushed the wrong button on my keyboard and published the post before it was done.)