Port conflict when running several nodes on a Slurm cluster

Hello,

I have trouble running Julia on a cluster. It randomly crashes when connecting the workers with addprocs(), with the error “connection refused”.

From what I understand, the problem comes from the fact that several workers communicate with the master over the same port.
On the cluster, the port is derived from port_hint = 9000 + (getpid() % 1000), as defined here.

Does that mean we can’t use more than 1000 processes? Is there a way to change the port range to avoid overlaps? Several processes on different nodes sometimes end up with the same PID modulo 1000, for example when I use 256 processes (10 nodes).
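For reference, my understanding of the port selection (paraphrased from the 1.7 Distributed sources, so the details may be slightly off) is roughly:

    using Sockets

    # When no explicit bind port is given, the worker derives a hint from its PID
    port_hint = 9000 + (getpid() % 1000)
    # listenany starts at the hint and keeps incrementing until it finds a free
    # port, so the actual port can end up above the hint when nearby ports are busy
    port, sock = listenany(UInt16(port_hint))
    println("this process would listen on port $port")
    close(sock)

So the hint is only a starting point; listenany then searches upward for a free port on that node.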

The script used to launch the job:


  using ClusterManagers
  using Distributed
  using LinearAlgebra

  # one BLAS thread per process, to avoid oversubscribing the cores
  LinearAlgebra.BLAS.set_num_threads(1)

  # launch one Julia worker per Slurm task allocated to the job
  ncores = parse(Int, ENV["SLURM_NTASKS"])
  @info "Setting up for SLURM, $ncores tasks detected"; flush(stdout)
  addprocs(SlurmManager(ncores))

And the full error:

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /home/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
     [2] worker_from_id
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
     [3] #remote_do#170
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [4] remote_do
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [5] kill
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:668 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:429
    
    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
     [2] connect
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:559
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:429

...and 212 more exceptions.

Thank you for your help.
Matthias

I have had the same issue on an LSF cluster. I think it should only happen when the same port is used with the same IP, something which can happen when a single machine has many slots.

Here is a post on how I worked around the above issue on LSF: How to configure which port workers listen - #6 by DrChainsaw

Don’t know if Slurm does things differently so that other constraints come into play.

Thank you for your suggestion.

Judging from your comment, I realized I had drawn the wrong conclusion. In my case the ports are all different on a given IP (node). They only overlap across nodes, but as you pointed out this should not actually be a problem. Practice supported this idea: I had several runs with overlapping ports across nodes without the “connection refused” error.
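For the runs where the workers do connect, this is how I check which port each worker ended up on (it relies on Distributed internals, so it may be version dependent):

    using Distributed

    # Collect (host, port) for every worker; LPROC is internal to Distributed,
    # so this could change between Julia versions.
    for p in workers()
        host, port = remotecall_fetch(p) do
            string(Distributed.LPROC.bind_addr), Int(Distributed.LPROC.bind_port)
        end
        println("worker $p listens on $host:$port")
    end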

So I don’t know where the “connection refused” error comes from. Most probably some nodes cannot connect to each other over a given port. It should be related to the cluster configuration, but the administrators could not give me answers yet.

I could not implement the suggestion in the link you provided, because the Perl workaround also generates unwanted quotes (maybe it’s new in recent versions?). Actually, after reading about the Cmd object, it seems that this is the expected behavior: interpolated strings are quoted as single arguments, which prevents code injection from Julia onto the cluster [1]. See the small example below.
So I don’t know how to control the port range parameters when using ClusterManagers.jl. If the problem is indeed related to ports, that could solve it.
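To illustrate the quoting behavior (a minimal example, not the actual workaround command):

    # Interpolating a string into a Cmd keeps it as one argument and never goes
    # through a shell, so metacharacters are quoted rather than interpreted.
    arg = "print(9000 + \$\$ % 1000)"   # hypothetical Perl snippet, just for illustration
    cmd = `perl -e $arg`
    println(cmd)                        # the snippet is shown quoted as a single argument

This is safe, but it also means you cannot pass extra shell syntax through an interpolated string.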

Still investigating this. If anyone has pointers on port configuration in Julia’s Distributed and ClusterManagers.jl, or on causes of the TCP “connection refused” error, any reference is helpful.
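In the meantime, here is the kind of minimal check I plan to run to see whether two compute nodes can reach each other on a given port (a rough sketch; the hostname and port are placeholders):

    # porttest.jl -- run `julia porttest.jl listen 9300` on one node, then
    # `julia porttest.jl connect <listening-node> 9300` on another node.
    using Sockets

    mode = ARGS[1]
    if mode == "listen"
        port = parse(Int, ARGS[2])
        server = listen(ip"0.0.0.0", port)      # bind on all interfaces
        @info "waiting for a connection on port $port"
        sock = accept(server)
        println("received: ", readline(sock))   # reaching this line means the port is open
        close(sock); close(server)
    elseif mode == "connect"
        host, port = ARGS[2], parse(Int, ARGS[3])
        sock = connect(host, port)              # throws ECONNREFUSED if blocked or closed
        println(sock, "hello")
        close(sock)
    end

If the connecting side gets ECONNREFUSED while the listener is up, that would point to the cluster network configuration (firewall, routing) rather than to Julia itself.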

[1] See this thread

Another closely related reference

Investigating port numbers did not give any more clues about the “connection refused” problem. Below is a table of the ports used and whether the processes connected successfully.

Each experiment is a run of 64 processes across 2 nodes, with 32 processes each.

| Node numbers | Port range (success) | Port range (fail) |
| --- | --- | --- |
| 2487-2488 | 9299-9330 | 9747-9778 |
| 1096, 1098 | 9476-9507 | 9980-9011 |
| 1552-1553 | 9528-9558 | |
| 2345-2346 | 9456-9487 | |
| 1984-1985 | 9969-9000 | 9460-9491 |
| 1059-1060 | 9984-9015 | 9471-9502 |
| 2491-2492 | 9666-9699 | 9239-9270 |

You can see that the successful and failing port ranges overlap, so there is no obvious rule for selecting the right port range.