Intermittent errors thrown when adding procs

I am on a cluster using Slurm as a workload scheduler. When I try to add processors:

 addprocs(SlurmManager(60), N=2, verbose="", topology=:master_worker, exeflags="--project=.")
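
For context, a minimal sketch of the surrounding setup (assuming SlurmManager comes from ClusterManagers.jl):

    # same addprocs call as above, shown with the imports it needs
    using Distributed
    using ClusterManagers   # provides SlurmManager

    addprocs(SlurmManager(60), N=2, verbose="", topology=:master_worker, exeflags="--project=.")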

it works on maybe 2 or 3 out of 10 tries. Most of the time the command fails with the error below:

ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1092
     [2] worker_from_id
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1089 [inlined]
     [3] #remote_do#170
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
     [4] remote_do
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
     [5] kill
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:687 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:603
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
     [8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:484

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets ~/julia-1.8.0/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:529
     [2] connect
       @ ~/julia-1.8.0/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:564 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:651
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:578
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:599
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
     [7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:484

...and 59 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:436
 [2] macro expansion
   @ ./task.jl:455 [inlined]
 [3] addprocs_locked(manager::SlurmManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:N, :verbose, :topology, :exeflags), Tuple{Int64, String, Symbol, String}}})
   @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:490
 [4] addprocs(manager::SlurmManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:N, :verbose, :topology, :exeflags), Tuple{Int64, String, Symbol, String}}})
   @ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:450
 [5] top-level scope
   @ ~/mmrvaccinedelay.jl/scripts/run.jl:22
 [6] include(fname::String)
   @ Base.MainInclude ./client.jl:476
 [7] top-level scope
   @ REPL[1]:1
in expression starting at /home/affans/mmrvaccinedelay.jl/scripts/run.jl:19

The errors vary from run to run as well. Sometimes I see stack traces for the following instead:

Unhandled Task ERROR: Version read failed. Connection closed by peer.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)

This seems to be a Julia problem rather than a Slurm or srun problem, because if I remove the --worker argument, srun is able to launch those processes correctly. Not to mention, this code used to work flawlessly before upgrading to 1.8.

Does anyone know what's going on?


It turns out, after a lot of debugging, that my workers were simply not connecting within the default 60-second timeout. Setting

ENV["JULIA_WORKER_TIMEOUT"] = 120

fixed it … at least for now.
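
In other words, a minimal sketch of the workaround (assuming SlurmManager comes from ClusterManagers.jl; 120 seconds is just the value that happened to work for me):

    # raise the worker connect timeout before launching workers;
    # Distributed reads JULIA_WORKER_TIMEOUT when waiting for workers to come up,
    # and the spawned workers should inherit the environment, so set it first
    ENV["JULIA_WORKER_TIMEOUT"] = "120"   # default is 60 seconds

    using Distributed, ClusterManagers
    addprocs(SlurmManager(60), N=2, verbose="", topology=:master_worker, exeflags="--project=.")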
