Port conflict when running several nodes on a Slurm cluster

matthiasbe · July 30, 2022, 9:49am

Hello,

I am having trouble running Julia on a cluster. It crashes at random when connecting the workers with addprocs(), with the error “connection refused”.

From what I understand, the problem comes from several workers talking to the master over the same port.
On the cluster, the port is set with port_hint = 9000 + (getpid() % 1000), as defined here.

Does this mean we can’t use more than 1000 processes? Is there a way to change the port range to avoid overlap? Sometimes several processes on different nodes have the same PID modulo 1000, for example when I use 256 processes (10 nodes).
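To make the collision concrete, here is a minimal sketch of the port_hint arithmetic quoted above. The PIDs are made up for illustration; only the formula comes from the post.

```julia
# Formula quoted in the post; the PIDs below are hypothetical examples.
port_hint(pid::Integer) = 9000 + (pid % 1000)

pids = [12345, 45345, 2001]
for pid in pids
    println("pid $pid -> port hint $(port_hint(pid))")
end

# Two PIDs that differ by a multiple of 1000 get the same hint.
same_hint = port_hint(12345) == port_hint(45345)
println("12345 and 45345 share a hint: $same_hint")
```

With this formula the hint always falls between 9000 and 9999, so any two processes whose PIDs differ by a multiple of 1000 start from the same value.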

The script used to launch the job:


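  using ClusterManagers
  using Distributed
  using LinearAlgebra

  # Use a single BLAS thread in this (master) process
  LinearAlgebra.BLAS.set_num_threads(1)

  # Number of tasks allocated by Slurm
  ncores = parse(Int, ENV["SLURM_NTASKS"])
  @info "Setting up for SLURM, $ncores tasks detected"; flush(stdout)
  # Launch one worker per task; this is the call that fails
  addprocs(SlurmManager(ncores))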

And the full error:

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /home/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
     [2] worker_from_id
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
     [3] #remote_do#170
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [4] remote_do
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [5] kill
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:668 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:429
    
    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
     [2] connect
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:559
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:429

...and 212 more exceptions.

Thank you for your help.
Matthias

DrChainsaw · July 30, 2022, 10:46am

I have had the same issue on an LSF cluster. I think it should only happen when the same port is used on the same IP, which can happen when a single machine has many slots.
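As far as I can tell, the worker only treats this value as a hint and hands it to Sockets.listenany, which moves on to another port if the hinted one is already taken on that machine. The sketch below assumes that is how the hint is used (I have not checked every Julia version) and only demonstrates the listenany behavior locally:

```julia
using Sockets

# Same hint formula as in the question, applied to the current process.
hint = 9000 + (getpid() % 1000)

# Ask twice for a port starting from the same hint on the same machine.
port1, server1 = listenany(hint)
port2, server2 = listenany(hint)

println("hint = $hint, first listener = $port1, second listener = $port2")

close(server1)
close(server2)
```

The second call cannot reuse the port held by the first listener, so the two returned ports differ. If the workers really do go through listenany, a shared hint by itself would not produce a clash, even on one node.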

Here is a post of how I worked around the above issue on LSF: How to configure which port workers listen - #6 by DrChainsaw

I don’t know whether Slurm does things differently, so that other constraints come into play.

matthiasbe · July 31, 2022, 6:20pm

Thank you for your suggestion.

Your comment made me realize that my conclusion was wrong. In my case the ports are all different on a given IP (node). They only overlap across nodes, and as you pointed out, that should not be a problem. Practice supports this: I had several runs with overlapping ports across nodes and no “connection refused” error.

So I don’t know where the “connection refused” error comes from. Most probably some nodes cannot connect to each other over a given port. It should be related to the cluster configuration, and the administrators have not been able to give me an answer yet.
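One way to test the node-to-node hypothesis is to probe a host and port directly from another node. This is a minimal sketch: can_connect is a hypothetical helper, not part of Distributed or ClusterManagers.jl, and the local listener only stands in for a worker on another node.

```julia
using Sockets

# Hypothetical helper: true if a TCP connection to host:port can be opened.
# There is no timeout here, so a silently filtered port may block for a while.
function can_connect(host::AbstractString, port::Integer)
    try
        sock = connect(host, port)
        close(sock)
        return true
    catch err
        @warn "connection failed" host port err
        return false
    end
end

# Local stand-in for a worker: listen, probe, stop listening, probe again.
port, server = listenany(9000)
open_ok = can_connect("127.0.0.1", port)
close(server)
closed_ok = can_connect("127.0.0.1", port)
println("while listening: $open_ok, after close: $closed_ok")
```

On the cluster, the same helper could be called from one node with another node’s hostname and a port where a listener is known to be running, to separate firewall or routing problems from a worker that is simply not listening yet.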

I could not implement the suggestion from the link you provided, because the perl workaround also produces unwanted quotes (maybe this is new in recent versions?). After reading about the Cmd object, this appears to be the expected behavior: it prevents code injection from Julia on the cluster [1].
So I still don’t know how to control the port range when using ClusterManagers.jl. Doing so could solve my problem, if it is related to ports at all.

I am still investigating. If anyone has pointers on port configuration in Julia’s Distributed and ClusterManagers.jl, or on the causes of the TCP “connection refused” error, any reference would be helpful.

[1] See this thread

Other reference very much related

matthiasbe · August 4, 2022, 1:36pm

Investigating the port numbers did not give any more clues about the “connection refused” problem. Below is a table of the ports used and whether the processes connected successfully.

Each experiment was run on 64 processors across 2 nodes, with 32 processors each.

Node numbers	Port range (successful run)	Port range (failed run)
2487-2488	9299-9330	9747-9778
1096,1098	9476-9507	9980-9011
1552-1553		9528-9558
2345-2346		9456-9487
1984-1985	9969-9000	9460-9491
1059-1060	9984-9015	9471-9502
2491-2492	9666-9699	9239-9270

You can see that the successful and failing port ranges overlap, so there is no obvious rule for selecting the right port range.
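The overlap can be checked mechanically. The sketch below copies the ranges from the table and assumes that a range such as 9980-9011 wraps from 9999 back to 9000, which is what port_hint = 9000 + (getpid() % 1000) would give for consecutive PIDs. The helper names are mine.

```julia
# Expand "lo-hi" into a list of ports. When lo > hi, assume the range wraps
# from 9999 back to 9000 (an assumption based on the port_hint formula).
expand(lo, hi) = lo <= hi ? collect(lo:hi) : vcat(lo:9999, 9000:hi)

# Ranges copied from the table above.
success = [(9299, 9330), (9476, 9507), (9969, 9000), (9984, 9015), (9666, 9699)]
failure = [(9747, 9778), (9980, 9011), (9528, 9558), (9456, 9487),
           (9460, 9491), (9471, 9502), (9239, 9270)]

# All ports covered by a list of ranges.
ports(ranges) = Set(p for (lo, hi) in ranges for p in expand(lo, hi))

shared = intersect(ports(success), ports(failure))
println(length(shared), " ports appear in both a successful and a failed run")
```

For example, port 9476 lies in the successful range 9476-9507 and also in the failed range 9456-9487, so the port number alone does not predict the outcome.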
