I have a number of HPC servers that I could use to parallelise my Julia computations. Each server has multiple cores. These servers are on a separate network, so I have to use a jump host to ssh to them. When I try to add a single worker on one of the nodes as below, it works without problems:
addprocs([("node", 1)], tunnel=true, max_parallel=100, topology=:master_worker, sshflags="-J jumphost")
The node in question has 80 cores. I try to utilise all the cores like this:
addprocs([("node", :auto)], tunnel=true, max_parallel=100, topology=:master_worker, sshflags="-J jumphost")
In this scenario the call blocks forever (it has not returned after an hour), and many `Worker x terminated.` messages are printed on stdout. Using `("node", 80)` instead of `("node", :auto)` makes no difference.
When I run `ps` in a separate shell, I see many `ssh .... -J jumphost -Lport1:node_ip:port2 node sleep 60` processes.
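As a sanity check, counting those processes should roughly track the number of workers being added, since each worker appears to get its own forwarded port. A minimal sketch, using a captured sample instead of live `ps` output (the flags, ports, and `node_ip` below are placeholders reconstructed from what I see, not the exact command line Julia runs):

```shell
# Two sample lines in the shape of the ssh tunnel processes observed in ps
# (ports and node_ip are made-up placeholders).
cat <<'EOF' > ps_sample.txt
ssh -f -J jumphost -L9158:node_ip:9158 node sleep 60
ssh -f -J jumphost -L9159:node_ip:9159 node sleep 60
EOF

# One tunnel process per worker, so this count should track the worker count.
grep -c 'sleep 60' ps_sample.txt   # prints 2 for this sample
```

Against the live system one would grep the output of `ps -eo args` instead of a captured file.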
My question is: which operating-system primitives does Julia use to communicate between the master and the workers? So far it looks as if it uses TCP sockets forwarded via ssh. The `sleep 60` confuses me a bit, though. Does it set keepalive on the connection?
There are some posts on Stack Overflow that link issues with distributed Julia workers to exhausted `ulimit`s, but I think mine are fine: my open-file limit is 1024, which should not be exhausted by 80 connections. Any advice on which other limits I should check, and which processes might be exhausting them, is appreciated.
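For completeness, this is how I am checking the limits; a minimal sketch assuming a Linux `/proc` filesystem, here applied to the current shell rather than the Julia master:

```shell
# Per-process limits of the shell that launches Julia.
ulimit -n    # max open file descriptors (1024 in my case)
ulimit -u    # max user processes

# Count the file descriptors a process actually holds; $$ is this shell's
# PID -- substitute the Julia master's PID to inspect it instead.
ls /proc/$$/fd | wc -l
```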
I am running Julia 1.0.3 on CentOS 7, x86_64.