Port conflict when running several nodes on a Slurm cluster

Thank you for your suggestion.

Going back over your comment, I realized I had drawn the wrong conclusion. In my case the ports are all different on a given IP (node); they only overlap across nodes, which, as you pointed out, should not be a problem. Practice supports this: I have had several runs with overlapping ports across nodes without the “connection refused” error.

So I still don’t know where the “connection refused” error comes from. Most probably some nodes cannot reach each other on a given port. This is likely related to the cluster configuration, but the administrators have not been able to give me answers yet. A quick way to test connectivity between two nodes is sketched below.
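For anyone in the same situation, this is the kind of minimal check I am using to see whether two nodes can actually talk over a given port, using only the Sockets standard library (the port number and hostname below are placeholders, not values from my setup):

```julia
using Sockets

# On node A: bind a test port on all interfaces (2345 is an arbitrary choice)
srv = listen(IPv4(0), 2345)
@async accept(srv)              # keep the listener alive in the background

# On node B: try to reach node A ("nodeA" stands in for its real hostname)
sock = connect("nodeA", 2345)   # throws an IOError (ECONNREFUSED) if the port is blocked
println("connected: ", isopen(sock))
```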

I could not implement the suggestion in the link you provided, because the perl workaround also generates unwanted quotes (maybe this changed in a recent version?). After reading about the `Cmd` object, this actually seems to be the expected behavior: it prevents code injection from Julia into the cluster shell [1].
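To illustrate what I mean by the quotes being expected: backtick `Cmd`s never go through a shell, so any quotes inside an interpolated value are passed through literally as part of that single argument (the flag below is just a made-up example):

```julia
# A value that already contains literal quote characters (hypothetical flag)
extra = "--export=ALL,MSG=\"hello world\""

# Interpolation keeps `extra` as one argument; no shell ever re-parses it,
# so the embedded quotes reach srun verbatim instead of being stripped.
c = `srun $extra hostname`
println(c)
```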
That said, I still don’t know how to control the port range parameters when using ClusterManagers.jl. If the problem really is port-related, that could solve it. The closest thing I have found so far is sketched below.
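For reference, with plain Distributed you can at least pin the address (and optionally the port) a worker listens on via the `--bind-to` flag, forwarded through the `exeflags` keyword of `addprocs`. I am not sure whether ClusterManagers.jl forwards it the same way, so please treat this as a sketch only (port 9009 is arbitrary, and pinning one fixed port only makes sense with a single worker per node):

```julia
using Distributed
using ClusterManagers   # provides SlurmManager / addprocs_slurm

# --bind-to <addr[:port]> is a documented julia worker option; exeflags is the
# addprocs keyword that passes extra flags to the worker julia processes.
# Whether SlurmManager honours it exactly like this is what I am unsure about.
addprocs(SlurmManager(4); exeflags = `--bind-to 0.0.0.0:9009`)
```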

Still investigating this. If anyone has pointers on port configuration in Julia Distributed and ClusterManagers.jl, or on the causes of TCP “connection refused” errors, any reference is helpful.

[1] See this thread

Another closely related reference