Error when using @distributed for on cluster with multiple nodes

I have some code with a few simple @sync @distributed for loops. It works fine when I run it on my computer with 4 processors, or on the cluster with 1 node and any number of processors per node.

But when I run it on more than one node, I get this error:

ERROR: LoadError: On worker 7:
LoadError: peer 8 didn't connect to 7 within 59.99999... seconds

I am only using SharedArray and @distributed for loops. Any suggestions?

It’s hard to say what the issue is without more information. Have you checked that you have successfully launched Julia worker processes on multiple nodes? Just a guess but one potential culprit might be ssh tunneling.

Which version of Julia are you using? I haven't tried 1.0 in a multi-machine setting yet, but with 0.6.4 on my research group's cluster, addprocs fails to connect to workers on nodes other than the one hosting the master process unless I indicate that ssh tunneling is required. E.g., for me (where tera31 and tera32 are the hostnames of two nodes)

procs = ["tera31","tera32"]
addprocs(procs)

fails but

addprocs(procs,tunnel=true)

works. If you’re in an environment where it takes a long time for the connections with remote workers to be established for whatever reason, you can also try setting the JULIA_WORKER_TIMEOUT environment variable on the master process before calling addprocs. This will make Julia wait longer before giving up on connecting to workers.
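Putting the two suggestions together, a minimal sketch might look like the following. The 120-second value is an arbitrary assumption (the default is 60 seconds, which matches the error message above), and add_tunneled_workers is a hypothetical helper name, not part of Distributed. Depending on your setup the variable may also need to be set in the environment of the worker nodes, so treat this as a starting point.

```julia
using Distributed

# Assumption: 120 seconds is only an illustrative value; the default is 60.
# Set this before any call to addprocs.
ENV["JULIA_WORKER_TIMEOUT"] = "120"

# Hypothetical helper: add one worker per listed host over SSH tunnels.
function add_tunneled_workers(hosts)
    return addprocs(hosts; tunnel=true)
end

# Usage on a cluster (hostnames taken from the example above):
# add_tunneled_workers(["tera31", "tera32"])
```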

@Pbellive I was just doing

#PBS -l pmem=8gb,nodes=4:ppn=4,walltime=20:00:100 
julia -p 16 run.jl

in my PBS script. I am not adding procs later.

I am using Julia 1.0.0, but I think you have correctly pointed out my mistake. How do I get the hostnames? I thought nodes were assigned to my job randomly, depending on availability.

Oh, yes, if you’re using a job management system you’ll have to manage things a bit differently. I’m not familiar with PBS. You might want to check out the thread:

and also the ClusterManagers package.

I can say that launching Julia via julia -p n, for some number n, is meant for launching multiple workers on a single machine. To launch workers on multiple machines, you need to launch Julia with a machine file. This post has an example of how to do that with PBS. That’s about all I know. If that doesn’t get you going, I would look around for more resources on, or ask for help with, getting Julia working with cluster job schedulers.
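As a rough sketch of the machine-file idea: PBS normally writes the hostnames allocated to a job into the file named by the PBS_NODEFILE environment variable, one line per allocated core. Assuming that holds on your cluster, julia --machine-file $PBS_NODEFILE run.jl should start workers on every allocated node. The same thing can be done from inside the script, as below. machines_from_nodefile is a hypothetical helper name, and tunnel=true is only needed if direct connections between nodes fail.

```julia
using Distributed

# Hypothetical helper: read a PBS-style node file (one hostname per line,
# repeated once per allocated core) and return (hostname, count) tuples
# in the form accepted by addprocs.
function machines_from_nodefile(path::AbstractString)
    counts = Dict{String,Int}()
    order = String[]          # remember the order in which hosts first appear
    for line in eachline(path)
        host = String(strip(line))
        isempty(host) && continue
        if !haskey(counts, host)
            push!(order, host)
            counts[host] = 0
        end
        counts[host] += 1
    end
    return [(host, counts[host]) for host in order]
end

# Only attempt to add workers when running inside a PBS job.
if haskey(ENV, "PBS_NODEFILE")
    machines = machines_from_nodefile(ENV["PBS_NODEFILE"])
    addprocs(machines; tunnel=true)
end
```

With this approach the script would be started with plain julia run.jl rather than julia -p 16 run.jl, since the workers are added by the script itself.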

Cheers, Patrick

Awesome, thanks! I'll check that out.