Error: peer didn't connect

Occasionally I get a message like this. The last time it happened, I lost all the work from a two-week computation. What causes it and how can I prevent it?

Update: I have found this post and have tried raising Distributed.worker_timeout() above the default of 60.0 seconds; in my case 120.0 was enough. I can see how a load imbalance would develop in my application, but I cannot predict how long the remaining process will take to finish. So I’ll rephrase the question: is there a way to give a process a timeout with a fallback return value, so that the other results from a potentially lengthy calculation can still be saved?
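To make the rephrased question concrete, here is a minimal sketch of the behaviour I am after, written with plain Distributed futures rather than Folds/Transducers. The names `fetch_or_default` and `run_with_checkpoints`, the `missing` fallback, and the checkpoint file are all hypothetical choices for illustration, not an existing API. It assumes the work can be split into independent calls `f(x)`.

```julia
using Distributed
using Serialization

# Hypothetical helper: wait up to `timeout` seconds for a Future and return
# `default` if no result has arrived by then. The fetch runs in a background
# task so the calling process itself never blocks on a busy worker.
function fetch_or_default(fut::Future, timeout::Real; default=missing)
    task = @async fetch(fut)
    status = timedwait(() -> istaskdone(task), timeout)
    # fetch(task) rethrows (as a TaskFailedException) if the remote call failed.
    return status === :ok ? fetch(task) : default
end

# Hypothetical driver: start every call up front, then collect the results one
# by one, writing a checkpoint after each so that finished work survives a
# later failure. The timeout applies to each wait separately.
function run_with_checkpoints(f, inputs; timeout::Real=5.0,
                              path::AbstractString=tempname())
    futures = [@spawnat :any f(x) for x in inputs]
    results = Vector{Any}(undef, length(inputs))
    for (i, fut) in enumerate(futures)
        results[i] = fetch_or_default(fut, timeout)
        serialize(path, results[1:i])  # overwrite the checkpoint with progress so far
    end
    return results, path
end
```

Calling `run_with_checkpoints(f, inputs; timeout=600.0)` would then give a vector with `missing` in the slots that timed out, and `deserialize(path)` would recover whatever had been collected before a crash. Note that a timed-out call is not cancelled in this sketch; it keeps running on its worker.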

ERROR: On worker 7:
peer 9 didn't connect to 7 within 59.99997520446777 seconds
Stacktrace:
  [1] error
    @ ./error.jl:35
  [2] wait_for_conn
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:196
  [3] check_worker_state
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:170
  [4] send_msg_
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:172
  [5] send_msg
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:122 [inlined]
  [6] #remotecall_fetch#159
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:460
  [7] remotecall_fetch
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:454
  [8] remotecall_fetch
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492 [inlined]
  [9] call_on_owner
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:565 [inlined]
 [10] fetch
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:619
 [11] iterate
    @ ./generator.jl:48 [inlined]
 [12] collect_to!
    @ ./array.jl:849
 [13] collect_to_with_first!
    @ ./array.jl:827
 [14] _collect
    @ ./array.jl:821
 [15] collect_similar
    @ ./array.jl:720 [inlined]
 [16] map
    @ ./abstractarray.jl:3371 [inlined]
 [17] #194
    @ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:91
 [18] #invokelatest#2
    @ ./essentials.jl:1055
 [19] invokelatest
    @ ./essentials.jl:1052
 [20] #110
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:287
 [21] run_work_thunk
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:70
 [22] #109
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:287
Stacktrace:
  [1] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Future}; kwargs::@Kwargs{})
    @ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:465
  [2] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Future})
    @ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:454
  [3] remotecall_fetch(f::Function, id::Int64, args::Vector{Future})
    @ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492
  [4] remotecall_pool(rc_f::Function, f::Function, pool::WorkerPool, args::Vector{Future}; kwargs::@Kwargs{})
    @ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/workerpool.jl:126
  [5] remotecall_pool
    @ ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/workerpool.jl:123 [inlined]
  [6] remotecall_fetch(f::Function, pool::WorkerPool, args::Vector{Future})
    @ Distributed ~/Desktop/language/julia/installed/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/workerpool.jl:232
  [7] dtransduce(xform::Transducers.Composition{…}, step::Function, init::BangBang.NoBang.Empty{…}, coll0::Vector{…}; simd::Val{…}, basesize::Nothing, threads_basesize::Nothing, pool::WorkerPool, _remote_reduce::Function)
    @ Transducers ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:90
  [8] dtransduce(xform::Transducers.Composition{…}, step::Function, init::BangBang.NoBang.Empty{…}, coll0::Vector{…})
    @ Transducers ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:50
  [9] foldxd
    @ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:40 [inlined]
 [10] dcopy
    @ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:129 [inlined]
 [11] dcollect
    @ ~/Desktop/language/julia/installed/packages/Transducers/fnznF/src/dreduce.jl:162 [inlined]
 [12] collect(itr::Base.Generator{Vector{Phiesiode.LambertSet}, OrbitIvpBvp.var"#39#42"{OrbitIvpBvp.var"#allres#40"{…}}}, ex::Transducers.DistributedEx{@NamedTuple{}})
    @ Folds.Implementations ~/Desktop/language/julia/installed/packages/Folds/qbSal/src/collect.jl:14
 [13] parvintibvp(lsseq::OrbitIvpBvp.var"#nextlp#35"{…}, num::Int64, planet::Gravity.Planet, solverparam::Phiesiode.SolverParameters; includels::Bool)
    @ OrbitIvpBvp ~/Desktop/satellite/eod/epod/phiesiode/OrbitIvpBvp/src/bvp/validate.jl:41
 [14] top-level scope
    @ REPL[74]:1
Some type information was truncated. Use `show(err)` to see complete types.