Intermittent errors thrown when adding procs

I am on a cluster using Slurm as a workload scheduler. When I try to add processors:

 addprocs(SlurmManager(60), N=2, verbose="", topology=:master_worker, exeflags="--project=.")

it works maybe 2 or 3 times out of 10. Most of the time the command fails with the error below:

ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1092
     [2] worker_from_id
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1089 [inlined]
     [3] #remote_do#170
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
     [4] remote_do
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
     [5] kill
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:687 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:603
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
     [8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:484

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets ~/julia-1.8.0/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:529
     [2] connect
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:564 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:651
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:578
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:599
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
     [7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:484

...and 59 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:436
 [2] macro expansion
   @ ./task.jl:455 [inlined]
 [3] addprocs_locked(manager::SlurmManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:N, :verbose, :topology, :exeflags), Tuple{Int64, String, Symbol, String}}})
   @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:490
 [4] addprocs(manager::SlurmManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:N, :verbose, :topology, :exeflags), Tuple{Int64, String, Symbol, String}}})
   @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:450
 [5] top-level scope
   @ ~/mmrvaccinedelay.jl/scripts/run.jl:22
 [6] include(fname::String)
   @ Base.MainInclude ./client.jl:476
 [7] top-level scope
   @ REPL[1]:1
in expression starting at /home/affans/mmrvaccinedelay.jl/scripts/run.jl:19

The errors vary from run to run as well. Sometimes I see stack traces with the following messages instead:

Unhandled Task ERROR: Version read failed. Connection closed by peer.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)

This seems to be a Julia problem rather than a Slurm or srun problem, because if I remove the --worker argument, srun launches those processes without issue. Not to mention, this code used to work flawlessly until I upgraded to 1.8.
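
To be concrete, the check I did was roughly the following (a sketch only; the flags mirror my addprocs call and are not the exact command ClusterManagers builds internally):

# Launch the same Julia binary through srun, but without the --worker flag,
# just to confirm Slurm can start the processes at all.
run(`srun -N 2 -n 60 julia --project=. -e 'println("hello from ", gethostname())'`)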

Does anyone know what's going on?

It turns out, after a lot of debugging, that my workers were simply not connecting within the default 60-second timeout. Setting

ENV["JULIA_WORKER_TIMEOUT"] = 120

fixed it… at least for now.
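
For completeness, the pattern I use now is below (a sketch, assuming srun propagates the environment to the workers, which it does by default, so they also see the longer timeout):

using Distributed, ClusterManagers

# Raise Distributed's worker connection timeout from the default 60 s.
# Set it before addprocs so it is already in the environment at launch time.
ENV["JULIA_WORKER_TIMEOUT"] = "120"

addprocs(SlurmManager(60), N=2, verbose="", topology=:master_worker, exeflags="--project=.")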
