Running Julia with multiple workers via ssh tunnel

Hello,

I have a number of HPC servers that I could use to parallelise my Julia computations. Each server has multiple cores. These servers are on a separate network, so I have to use a jump host to ssh to them. When I try to add a single worker on one of the nodes as below, it works without problems:

addprocs([("node", 1)], tunnel=true, max_parallel=100, topology=:master_worker, sshflags="-J jumphost")

The node in question has 80 cores. I try to utilise all the cores like this:

addprocs([("node", :auto)], tunnel=true, max_parallel=100, topology=:master_worker, sshflags="-J jumphost")

In this scenario, the call blocks forever (it has not returned after an hour), and many Worker x terminated. messages are printed on stdout. Using ("node", 80) instead of ("node", :auto) makes no difference.

When I run ps in a separate shell, I see many ssh .... -J jumphost -Lport1:node_ip:port2 node sleep 60 processes.

My question is: what operating system primitives does Julia use to communicate between the master and the workers? So far it looks as if it uses TCP sockets forwarded via ssh. I am a bit confused by the sleep 60, though. Does it set a keepalive on the connection?
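
For what it is worth, my rough mental model of what each of those tunnels amounts to is something like the sketch below (made-up ports and addresses, and not the actual Distributed code, just how I picture it):

using Sockets

# Made-up values for illustration: the local end of the tunnel and the
# address/port the worker listens on inside the cluster network.
localport = 9201
node_ip, port = "10.0.0.5", 9000

# One forwarding ssh per worker; `sleep 60` seems to just keep the ssh process
# alive long enough for the master to connect through the forwarded port, and
# with -f it stays around while the forwarded connection is open.
run(detach(`ssh -f -o ExitOnForwardFailure=yes -J jumphost -L $localport:$node_ip:$port node sleep 60`))

# The master then talks to the worker over a plain TCP socket via the tunnel.
sock = connect("localhost", localport)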

There are some posts on Stack Overflow that link issues with distributed Julia workers to exhausted ulimits, but I think mine are OK. My open-file limit is set to 1024, so it should not be exhausted by 80 connections. Any advice on what other limits I should check, and which processes might be exhausting them, would be appreciated.

I am running Julia 1.0.3 on CentOS 7, x86_64.

It’s a bit of a stab in the dark, but you could try setting JULIA_WORKER_TIMEOUT

https://docs.julialang.org/en/v1/manual/environment-variables/#JULIA_WORKER_TIMEOUT-1
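
Something along these lines before launching the workers (just a sketch; the variable is read from each process's environment, so it may also need to be set on the worker side, e.g. in your shell startup on the nodes):

using Distributed

# Give slow workers more time to connect back to the master (value in seconds).
ENV["JULIA_WORKER_TIMEOUT"] = "300"

addprocs([("node", :auto)], tunnel=true, max_parallel=100,
         topology=:master_worker, sshflags="-J jumphost")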

I set it to 300; it made no difference.

Can you try reducing the number of processes started to something like 20 and see if that works? You might still be hitting that 1024 open-files ulimit, as a single julia process opens multiple files (shared libraries, sys.so, etc.).

EDIT: 1024 is a pretty small ulimit for the number of open files on a cluster. You or your cluster admin should consider increasing that (assuming you have lots of RAM to make use of, you could safely increase it by at least an order of magnitude).


1024 is the value of ulimit -n, so I believe it is set per process. When I check the number of files opened by Julia via lsof -p <julia_pid>, I get nowhere near this limit. Is lsof the right way to check this? Should I be looking at other processes (perhaps, ssh)?

I’m rather sure ulimit -n is per-user, not per process, but I could be wrong. Also, lsof -p <julia_pid> only works for a single process, and I assume you’re launching many processes (80 of them). I’d multiply whatever number that command reports by 80 to get a rough idea of how many open files you have.
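
If you want to check from within Julia rather than with lsof, something like this should work on Linux (a quick sketch using /proc):

# Count open file descriptors and show the "Max open files" line for a PID.
function fd_usage(pid::Integer)
    nfds = length(readdir("/proc/$pid/fd"))
    limit_line = filter(l -> occursin("Max open files", l),
                        readlines("/proc/$pid/limits"))
    return nfds, limit_line
end

fd_usage(getpid())   # the master process; use remotecall to check the workers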

Also, this could totally be memory exhaustion, so if you can, check dmesg | tail to see if the OOM killer was invoked on some of your Julia processes.

Yes, but you are outside the network that the compute nodes are on.
Could this be a DNS resolution problem?

Hmmm… you are using node_ip, so you are using IP addresses, not hostnames.

Quick test - can you ssh to a node by hand, i.e. ssh -J jumphost node_ip?

On some clusters ssh access to nodes is locked down, but in this case I think probably not.

To clarify, the ssh processes are spawned by Julia itself. Access is not a problem, because adding a single worker works.

Have you resolved the issue?

If not, perhaps try a lower setting for max_parallel (or leave it at the default of 10).
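
I.e. something like (untested, keeping your other options as they are):

using Distributed

# Limit how many concurrent ssh connections are made while launching workers.
addprocs([("node", :auto)], tunnel=true, max_parallel=10,
         topology=:master_worker, sshflags="-J jumphost")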

See also this discussion: ClusterManager should use dispatch instead of function pointers · Issue #8168 · JuliaLang/julia · GitHub

Have you tried increasing MaxSessions in sshd_config? The default is 10.

See e.g.:
https://unix.stackexchange.com/a/28085
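
For reference, that would be a line like the following in /etc/ssh/sshd_config on the jump host (it needs an sshd reload, and of course admin rights):

MaxSessions 100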

Hey,

Thanks for all the links, this is really helpful.

I suspect the original issue was indeed with constrained resources on the jump host. The thing is, the jump host is shared by several clusters, one of which is production-critical. I am using the other one (the ‘research’ one). So, if I consume all file descriptors on the jump box, there will be many unhappy users. The upshot is that I could not persuade our admins to increase the ulimit -n quota on the jump box. They did increase it on the cluster nodes, though.

I managed a workaround by getting a little creative with my ~/.ssh/config:

Host jumphost
        ControlMaster auto
        ControlPath ~/.ssh/control/%n
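        # the ~/.ssh/control directory must already exist; ssh will not create it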
        ControlPersist 10m

Host node*
        ProxyJump jumphost
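
With this in place I add the nodes one at a time, roughly like this (node names are made up):

using Distributed

# One addprocs call per node; ProxyJump now comes from ~/.ssh/config, so the
# -J sshflags should not be needed here.
for node in ["node1", "node2", "node3"]
    addprocs([(node, :auto)], tunnel=true, topology=:master_worker)
end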

With that, the problem is gone when I add the nodes one by one as sketched above, with :auto workers on each node. However, the issue still manifests itself when I try to add them all in one go by supplying an array of node tuples. Looking at the code, I suspect there are (at least) two bugs here:

  • The ssh tunnel sleeps for 60 seconds regardless of the value of the JULIA_WORKER_TIMEOUT environment variable. If my workers are slow, they do not manage to connect to the master in time, and there is no way for me to control that timeout.
  • In this scenario, rather than reporting an error, the addprocs() call just hangs.

Should I raise these as bugs, or has this already been reported?

I can imagine these issues could (possibly) be circumvented by tweaking the settings of sshd on the jump host, as suggested by @KajWiik, but with our shared setup I don’t think the admins will be easily persuaded.