I’ve been running into IOError: connect: connection timed out (ETIMEDOUT) with about half my jobs using ClusterManagers on a Slurm cluster. This issue started appearing about a month ago with Julia 1.3 and ClusterManagers 0.4. I’ve since updated to Julia 1.5.4 to see if that fixes things, but it doesn’t. I think some changes to the cluster caused this, and I’ve written to their support. I do have a question for the Julia side as well, though.
One thing I tried is setting ENV["JULIA_WORKER_TIMEOUT"] = 600.0 at the very start of my simulation (before loading any packages). The change was visible in Distributed.worker_timeout(), but my jobs still failed in less than 10 min (after 2-4 min, according to reports that update every 2 min). Why is that the case? Is there another worker timeout variable in effect here?
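For reference, a minimal sketch of the timeout setup described above. The actual run_sim.jl is not shown in this post, so the ordering and the use of a string value are assumptions; the worker launch itself is left out because it needs a Slurm allocation.

```julia
# Set the variable before Distributed (or any other package) is loaded.
# ENV holds strings, so the value is written as a string here (assumption:
# the original script assigned 600.0 directly).
ENV["JULIA_WORKER_TIMEOUT"] = "600.0"

using Distributed

# worker_timeout() is an internal (unexported) helper that parses the variable.
timeout = Distributed.worker_timeout()
println("worker timeout in effect: ", timeout, " s")

# The call to addprocs_slurm from ClusterManagers would follow here.
```

This only confirms that the master process sees the new value. It does not show whether the workers launched through Slurm inherit it.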
Maybe it’s also important to note that I requeued many of the failed jobs with the same setup (no changes in Julia, no changes in the batch file) and they ran just fine. I’ve been using 4-8 nodes, 48 cores x 2 threads each.
Full Error
TaskFailedException:
IOError: connect: connection timed out (ETIMEDOUT)
Stacktrace:
[1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1074
[2] worker_from_id at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1071 [inlined]
[3] #remote_do#154 at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]
[4] remote_do at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]
[5] kill at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:598 [inlined]
[6] create_worker(::SlurmManager, ::WorkerConfig) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:585
[7] setup_launched_worker(::SlurmManager, ::WorkerConfig, ::Array{Int64,1}) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:526
[8] (::Distributed.var"#41#44"{SlurmManager,Array{Int64,1},WorkerConfig})() at ./task.jl:356
...and 35 more exception(s).
Stacktrace:
[1] sync_end(::Channel{Any}) at ./task.jl:314
[2] macro expansion at ./task.jl:333 [inlined]
[3] addprocs_locked(::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:480
[4] addprocs(::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:444
[5] #addprocs_slurm#14 at .../julia_packages/packages/ClusterManagers/Mq0H0/src/slurm.jl:100 [inlined]
[6] top-level scope at timing.jl:174
[7] top-level scope at .../run_sim.jl:18
[8] include(::Function, ::Module, ::String) at ./Base.jl:380
[9] include(::Module, ::String) at ./Base.jl:368
[10] exec_options(::Base.JLOptions) at ./client.jl:296
[11] _start() at ./client.jl:506