Setting up distributed workers on separate nodes of a cluster - PBS and OpenMPI

gobs · May 12, 2021, 10:21am

I’m quite new to distributed computing and high performance computing so please be kind.

My university has access to an HPC which uses PBS/Torque. I’m currently running the ProgressiveHedging.jl package using all the cores of one node, but I have a lot of scenarios and I would like to run it on multiple cores on multiple nodes.

So far I figured out that I should pass --machine-file=$PBS_NODEFILE when I start up Julia, but I get the following error when trying to start julia on 2 nodes with 2 cores each:

✔ [May/12 12:08] <user_and_path> $ julia --machine-file=$PBS_NODEFILE
Joining job 50738709.tier2-p-moab-2.tier2.hpc.kuleuven.be
Joining job 50738709.tier2-p-moab-2.tier2.hpc.kuleuven.be
ERROR: TaskFailedException:
Unable to read host:port string from worker. Launch command exited with error?

So far I’ve managed to glean this information from my sysadmin:

The nodes communicate via an InfiniBand network, which is a topology built on top of the hardware/power/storage setup. The jobs using OpenMPI communicate with one another via Torque, whereas the jobs using Intel MPI use SSH. When your account was created, an internal ssh key was also automatically generated inside your ~/.ssh folder; that is for this purpose.

So the nodes should be able to communicate and start up using SSH, so why does the above lead to an error? Also, are the contents of PBS_NODEFILE enough for Julia to figure out what to do? (see below)

r25i13n16
r25i13n16
r25i13n17
r25i13n17

LeePhillips · May 12, 2021, 5:46pm

Look at "Multi-processing and Distributed Computing" in the Julia manual for the format of the machine-file.

Can you reach those nodes from the shell? What happens if you type ssh r25i13n16 hostname?

gobs · May 13, 2021, 6:08am

Thanks for the reference, I had missed that in the documentation. It would be good for the documentation to include an actual example of a machine-file in addition to the written description of the format. I will make a pull request / issue.

So I tried again, my $PBS_NODEFILE looks like:

r25i13n07
r25i13n07
r25i13n11
r25i13n11

If I type ssh r25i13n07 or ssh r25i13n11 it takes me to that node, so the default user and port work. If I understand correctly, my machine file should then look like:

2*r25i13n07
2*r25i13n11

However I still get the same error, even when I make the hosts and ports explicit and drop the attempt to start 2 workers on each node:

<user>@r25i13n07:22
<user>@r25i13n11:22
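
For reference, the count*host form shown above can be generated from $PBS_NODEFILE instead of being typed by hand. This is only a minimal sketch: nodefile_to_machinefile is a hypothetical helper name, and it assumes the node file lists one hostname per allocated core, as in the listing above. It does not address the launch error itself.

```shell
# Hypothetical helper: collapse a PBS node file (one hostname per allocated
# core) into machine-file lines of the form count*host.
nodefile_to_machinefile() {
    # NF skips blank lines; hosts are printed in order of first appearance.
    awk 'NF {
             if (!($1 in count)) order[++n] = $1
             count[$1]++
         }
         END {
             for (i = 1; i <= n; i++) print count[order[i]] "*" order[i]
         }' "$1"
}
```

It could be used as nodefile_to_machinefile "$PBS_NODEFILE" > machines.txt followed by julia --machine-file=machines.txt, where machines.txt is an arbitrary file name chosen for illustration.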

LeePhillips · May 13, 2021, 2:16pm

Your machine file has the right syntax.

Things to check now might be: can you ssh into the main node from the worker nodes; are all ssh connections passwordless; after ssh logging in to the worker nodes, can you run julia on them?
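
Two of these checks (passwordless ssh, and whether julia can be found on the worker) can be scripted. The following is a rough sketch, not a definitive test: check_worker is a hypothetical helper name, and it assumes that the shell ssh starts for a remote command sees the same PATH that Julia's launcher would, which may not hold on every cluster.

```shell
# Hypothetical helper: report whether a non-interactive, passwordless ssh
# session on the given host can find a julia executable.
check_worker() {
    host=$1
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # -n keeps ssh from reading stdin when this is called in a loop.
    if path=$(ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$host" 'command -v julia' 2>/dev/null); then
        echo "$host: ok ($path)"
    else
        echo "$host: FAILED (ssh refused, or julia not on the non-interactive PATH)"
        return 1
    fi
}
```

From inside a job it could be run over every allocated node with for h in $(sort -u "$PBS_NODEFILE"); do check_worker "$h"; done. A FAILED line for a host you can log in to interactively suggests the PATH is only set up in interactive sessions.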

gobs · May 13, 2021, 4:25pm

I can ssh into all of them, but I can’t start Julia. My setup is that my pbs script calls a setup.rc file which (among other things) adds the Julia binary to the path, but even after sourcing it through ssh I get the same error.

I noticed that even after sourcing it, if I go to that worker and try to start Julia it doesn’t work. So I think I have to somehow run this setup script on all workers before my job starts.

gobs · May 17, 2021, 11:39am

@LeePhillips I contacted my ICT support team and they suggested adding the workers from within Julia:

addprocs([("r26i27n06", 1)]; exename=`/data/leuven/331/vsc33168/julia-1.5.2/bin/julia`, tunnel=true)

This still does not work, even if I specify user@host:port. I even tried specifying the bind address:

addprocs([("vsc33168@r25i27n22:22 r25i27n19:22", 1)]; exename=`/data/leuven/331/vsc33168/julia-1.5.2/bin/julia`, tunnel=true)

No luck. I’m really at a loss here.

healyp · May 17, 2021, 1:07pm

I seem to recall having a problem like that before, too. I was advised to run the same version of julia on all nodes (master and workers) and I did, but I’m not sure if that was what fixed it. When you ssh do you set StrictHostKeyChecking? If I recall that was my issue. The following works for me:

    params=(exename="/home/..../julia-1.6.0/bin/julia",
            dir=remoteDir,
            tunnel=true,
            sshflags=`-o StrictHostKeyChecking=no`)
    addprocs(procLimits; params...)

where procLimits is a vector of Tuple{String,Int} like the first argument of your call to addprocs() above.

gobs · May 17, 2021, 3:21pm

Nope, same problem:

addprocs([("r25i27n23", 1)]; exename=`/data/leuven/331/vsc33168/julia-1.5.2/bin/julia`, tunnel=true, sshflags=`-o StrictHostKeyChecking=no`)
Joining job 50746254.tier2-p-moab-2.tier2.hpc.kuleuven.be
ERROR: TaskFailedException:
Unable to read host:port string from worker. Launch command exited with error?

gobs · May 19, 2021, 9:55am

@healyp @LeePhillips do either of you have any tips on debugging this? The error message I get from Julia is not very helpful, and I’m at a loss as to how to proceed.

EDIT: the IT person helping me suggested I add the flags -vv which gave the output below (screenshot since I can’t copy paste from our service desk ticket). He also noticed that Julia 0.4.7 was able to get the workers running by simply specifying --machinefile=$PBS_NODEFILE, but not so for versions >1.
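
For anyone wanting to reproduce that kind of output outside Julia, a rough sketch follows. It assumes the -vv flags in question are ssh's verbose flags; debug_worker_launch is a hypothetical helper name, and the host and Julia path are whatever applies on your cluster.

```shell
# Hypothetical helper: run julia on a worker by hand with verbose ssh
# logging, so ssh's debug output and any remote error text are both visible.
debug_worker_launch() {
    host=$1
    julia_path=$2   # full path to the julia binary on the worker
    # --version exits immediately, unlike a real worker launch.
    ssh -n -vv -o BatchMode=yes "$host" "$julia_path" --version
}
```

ssh writes its debug messages to stderr, so a call such as debug_worker_launch r25i27n23 /data/leuven/331/vsc33168/julia-1.5.2/bin/julia 2> ssh_debug.log keeps them in a file (ssh_debug.log is an arbitrary name) while anything the remote side prints on stdout stays on the terminal.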

gobs · May 19, 2021, 4:45pm

The solution ended up being very specific to the HPC I’m using. I’ve attached a screenshot below in case it is useful for someone else in the future, but it’s so particular that I somewhat doubt it:

healyp · May 19, 2021, 8:38pm

Happy to hear you have a way forward.

Your fix was most certainly not what did it for me, however :)
