I am on a cluster using Slurm as a workload scheduler. When I try to add processors:
addprocs(SlurmManager(60), N=2, verbose="", topology=:master_worker, exeflags="--project=.")
it works maybe 2 or 3 times out of 10. Most of the time the command fails with the error below:
ERROR: LoadError: TaskFailedException
nested task error: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1092
[2] worker_from_id
@ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:1089 [inlined]
[3] #remote_do#170
@ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
[4] remote_do
@ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:557 [inlined]
[5] kill
@ ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:687 [inlined]
[6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:603
[7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
[8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:484
caused by: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] wait_connected(x::Sockets.TCPSocket)
@ Sockets ~/julia-1.8.0/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:529
[2] connect
@ ~/julia-1.8.0/share/julia/stdlib/v1.8/Sockets/src/Sockets.jl:564 [inlined]
[3] connect_to_worker(host::String, port::Int64)
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:651
[4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/managers.jl:578
[5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:599
[6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:544
[7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:484
...and 59 more exceptions.
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:436
[2] macro expansion
@ ./task.jl:455 [inlined]
[3] addprocs_locked(manager::SlurmManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:N, :verbose, :topology, :exeflags), Tuple{Int64, String, Symbol, String}}})
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:490
[4] addprocs(manager::SlurmManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:N, :verbose, :topology, :exeflags), Tuple{Int64, String, Symbol, String}}})
@ Distributed ~/julia-1.8.0/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:450
[5] top-level scope
@ ~/mmrvaccinedelay.jl/scripts/run.jl:22
[6] include(fname::String)
@ Base.MainInclude ./client.jl:476
[7] top-level scope
@ REPL[1]:1
in expression starting at /home/affans/mmrvaccinedelay.jl/scripts/run.jl:19
The errors also vary from run to run; sometimes I see stack traces for these instead:
Unhandled Task ERROR: Version read failed. Connection closed by peer.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
This seems to be a Julia problem rather than a Slurm or srun problem, because if I remove the --worker argument, srun is able to launch those processes correctly. Not to mention, this code used to work flawlessly until I upgraded to 1.8.
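For the record, the manual check looked roughly like this (the srun flags here are illustrative placeholders, not the exact command line that ClusterManagers builds):

# Run from a login-node REPL: plain srun launches of julia come up fine.
run(`srun -N 2 -n 60 julia --project=. -e 'println("up: ", getpid())'`)
# It is only with the --worker flag (which makes each process advertise its
# host:port and wait for the master to connect back) that connections are
# refused intermittently.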
Does anyone know what's going on?
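In the meantime, the stopgap I'm considering is to wrap addprocs in a retry loop and clean up any half-launched workers between attempts. A rough sketch (this assumes every failure surfaces as an exception from addprocs, as in the trace above; the attempt count and sleep are arbitrary):

using Distributed, ClusterManagers

function addprocs_retry(n; attempts = 5, kwargs...)
    for i in 1:attempts
        try
            return addprocs(SlurmManager(n); kwargs...)
        catch err
            @warn "addprocs attempt $i failed" exception = err
            # drop any workers that managed to connect before the failure
            procs() != [1] && rmprocs(workers())
            sleep(5)  # give Slurm a moment before retrying
        end
    end
    error("could not add $n workers after $attempts attempts")
end

# Also worth trying, in case the workers are just slow to come up:
# ENV["JULIA_WORKER_TIMEOUT"] = "120"  # default is 60 seconds

addprocs_retry(60; N = 2, topology = :master_worker, exeflags = "--project=.")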