Occasionally I get an error message like the one below. The last time it happened, I lost all the work from a two-week computation. What causes it, and how can I prevent it?
Update: I found this post and tried raising Distributed.worker_timeout() above its default of 60.0 seconds; in my case, 120.0 worked. I can see how a load imbalance would develop in my application, but I cannot predict how long the remaining process will take to finish. So I'll rephrase the question: is there a way to give a process a timeout with a fallback return value, so that the other results from a potentially lengthy calculation can still be saved?
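For reference, here is a minimal sketch of one way to raise the timeout. It assumes that Distributed.worker_timeout() takes its value from the JULIA_WORKER_TIMEOUT environment variable (as the Julia documentation describes) and that locally launched workers inherit the environment of the master process. The worker count of 2 is arbitrary.

```julia
using Distributed

# Assumption: worker_timeout() reads JULIA_WORKER_TIMEOUT (in seconds) and
# falls back to 60.0 when the variable is unset.
ENV["JULIA_WORKER_TIMEOUT"] = "120"

# Set the variable before launching workers, so that they start with it too.
addprocs(2)

# Value seen by the master process.
master_timeout = Distributed.worker_timeout()

# Value seen by each worker (expected to match if the environment was inherited).
worker_timeouts = [remotecall_fetch(Distributed.worker_timeout, p) for p in workers()]

println("master: ", master_timeout, "  workers: ", worker_timeouts)
```

For workers on other machines, the variable would presumably have to be set in their environment as well, for example through the `env` keyword of `addprocs`.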
ERROR: On worker 7:
peer 9 didn't connect to 7 within 59.99997520446777 seconds
Stacktrace:
[1] error
@ ./error.jl:35
[2] wait_for_conn
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:196
[3] check_worker_state
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:170
[4] send_msg_
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:172
[5] send_msg
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:122 [inlined]
[6] #remotecall_fetch#159
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:460
[7] remotecall_fetch
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:454
[8] remotecall_fetch
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492 [inlined]
[9] call_on_owner
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:565 [inlined]
[10] fetch
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:619
[11] iterate
@ ./generator.jl:48 [inlined]
[12] collect_to!
@ ./array.jl:849
[13] collect_to_with_first!
@ ./array.jl:827
[14] _collect
@ ./array.jl:821
[15] collect_similar
@ ./array.jl:720 [inlined]
[16] map
@ ./abstractarray.jl:3371 [inlined]
[17] #194
@ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:91
[18] #invokelatest#2
@ ./essentials.jl:1055
[19] invokelatest
@ ./essentials.jl:1052
[20] #110
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:287
[21] run_work_thunk
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:70
[22] #109
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:287
Stacktrace:
[1] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Future}; kwargs::@Kwargs{})
@ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:465
[2] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Future})
@ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:454
[3] remotecall_fetch(f::Function, id::Int64, args::Vector{Future})
@ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492
[4] remotecall_pool(rc_f::Function, f::Function, pool::WorkerPool, args::Vector{Future}; kwargs::@Kwargs{})
@ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/workerpool.jl:126
[5] remotecall_pool
@ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/workerpool.jl:123 [inlined]
[6] remotecall_fetch(f::Function, pool::WorkerPool, args::Vector{Future})
@ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/workerpool.jl:232
[7] dtransduce(xform::Transducers.Composition{…}, step::Function, init::BangBang.NoBang.Empty{…}, coll0::Vector{…}; simd::Val{…}, basesize::Nothing, threads_basesize::Nothing, pool::WorkerPool, _remote_reduce::Function)
@ Transducers ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:90
[8] dtransduce(xform::Transducers.Composition{…}, step::Function, init::BangBang.NoBang.Empty{…}, coll0::Vector{…})
@ Transducers ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:50
[9] foldxd
@ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:40 [inlined]
[10] dcopy
@ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:129 [inlined]
[11] dcollect
@ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:162 [inlined]
[12] collect(itr::Base.Generator{Vector{Phiesiode.LambertSet}, OrbitIvpBvp.var"#39#42"{OrbitIvpBvp.var"#allres#40"{…}}}, ex::Transducers.DistributedEx{@NamedTuple{}})
@ Folds.Implementations ~/Desktop/language/julia/installed/packages/Folds/qbSal/src/collect.jl:14
[13] parvintibvp(lsseq::OrbitIvpBvp.var"#nextlp#35"{…}, num::Int64, planet::Gravity.Planet, solverparam::Phiesiode.SolverParameters; includels::Bool)
@ OrbitIvpBvp ~/Desktop/satellite/eod/epod/phiesiode/OrbitIvpBvp/src/bvp/validate.jl:41
[14] top-level scope
@ REPL[74]:1
Some type information was truncated. Use `show(err)` to see complete types.
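To make the rephrased question concrete, this is roughly the behavior I am after, sketched with plain Distributed calls rather than Folds/Transducers. Everything here is hypothetical: `slow_task` stands in for one unit of my real computation, `run_with_checkpoints` is a made-up helper, and `missing` is just one possible fallback value. Each result is written to disk on the master as soon as it arrives, and a failed item yields the fallback instead of aborting the whole run. Note that this sketch does not enforce a time limit on an item; it only keeps one failure from discarding the other results.

```julia
using Distributed
using Serialization

# Two local workers, just for the demonstration.
addprocs(2)

# Hypothetical stand-in for one unit of the real computation.
@everywhere function slow_task(x)
    x == 3 && error("simulated failure on item $x")
    sleep(0.1)
    return x^2
end

# Run every item on the default worker pool. Each result (or the fallback
# value `missing` on failure) is serialized to `dir` as soon as it arrives.
function run_with_checkpoints(items, dir)
    mkpath(dir)
    pool = default_worker_pool()
    return asyncmap(eachindex(items); ntasks = nworkers()) do i
        result = try
            remotecall_fetch(slow_task, pool, items[i])
        catch err
            @warn "item $i failed, storing fallback value" exception = err
            missing
        end
        serialize(joinpath(dir, "result_$i.jls"), result)
        result
    end
end

checkpoint_dir = mktempdir()
results = run_with_checkpoints(collect(1:5), checkpoint_dir)
```

After a crash, the finished items could then be reloaded with `deserialize` from the files in `checkpoint_dir`, and only the missing ones rerun.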