Hello,
I am having trouble running Julia on a cluster. It randomly crashes while connecting the workers with addprocs(),
failing with the error “connection refused”.
From what I understand, the problem is that several workers end up talking to the master on the same port.
On the cluster, the port is chosen with port_hint = 9000 + (getpid() % 1000),
as defined here.
Does this mean that we can’t use more than 1000 processes? Is there a way to change the port range to avoid overlap? Sometimes several processes on different nodes have the same PID modulo 1000, for example when I use 256 processes (10 nodes).
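To illustrate the overlap I mean, here is a minimal sketch that applies the formula quoted above to a few PIDs. The PIDs are made up for illustration, and the helper name port_hint is my own wrapper around the expression; this is not the actual Distributed code.

```julia
# Port hint formula quoted above, wrapped in a helper (hypothetical name).
port_hint(pid::Integer) = 9000 + (pid % 1000)

# Made-up PIDs, as could occur on different nodes of the cluster.
pids = [12345, 48345, 7001]
hints = port_hint.(pids)

for (pid, hint) in zip(pids, hints)
    println("pid = $pid  ->  port_hint = $hint")
end

# Two PIDs that are equal modulo 1000 map to the same hint.
println("same hint for first two PIDs: ", hints[1] == hints[2])
```

With this formula the hint always falls in the range 9000 to 9999, so any two PIDs that differ by a multiple of 1000 share the same value.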
The script used to launch the job:
using ClusterManagers
using Distributed
using LinearAlgebra
# Use a single BLAS thread per process
LinearAlgebra.BLAS.set_num_threads(1)
# Number of tasks allocated by SLURM
ncores = parse(Int, ENV["SLURM_NTASKS"])
@info "Setting up for SLURM, $ncores tasks detected"; flush(stdout)
# Launch one worker per SLURM task
addprocs(SlurmManager(ncores))
And the full error:
nested task error: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed /home/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
[2] worker_from_id
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
[3] #remote_do#170
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
[4] remote_do
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
[5] kill
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:668 [inlined]
[6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
[7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:429
caused by: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] wait_connected(x::Sockets.TCPSocket)
@ Sockets /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
[2] connect
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
[3] connect_to_worker(host::String, port::Int64)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
[4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:559
[5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
[6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:429
...and 212 more exceptions.
Thank you for your help.
Matthias