I’m quite new to distributed computing and high-performance computing, so please be kind.
My university has access to an HPC cluster that uses PBS/Torque. I’m currently running the ProgressiveHedging.jl package using all the cores of one node, but I have a lot of scenarios and would like to run it on multiple cores across multiple nodes.
So far I figured out that I should pass --machine-file=$PBS_NODEFILE when I start up Julia, but I get the following error when trying to start julia on 2 nodes with 2 cores each:
✔ [May/12 12:08] <user_and_path> $ julia --machine-file=$PBS_NODEFILE
Joining job 50738709.tier2-p-moab-2.tier2.hpc.kuleuven.be
Joining job 50738709.tier2-p-moab-2.tier2.hpc.kuleuven.be
ERROR: TaskFailedException:
Unable to read host:port string from worker. Launch command exited with error?
So far I’ve managed to glean this information from my sysadmin:
The nodes communicate via an InfiniBand network, which is a topology built on top of the hardware/power/storage setup. The jobs using OpenMPI communicate with one another via Torque, whereas the jobs using Intel MPI use SSH. When your account was created, an internal ssh key was also automatically generated inside your ~/.ssh folder; that is for this purpose.
So the nodes should be able to communicate and start up using SSH, so why does the above lead to an error? Also, are the contents of PBS_NODEFILE enough for Julia to figure out what to do? (see below)
Thanks for the reference, I had missed that in the documentation. It would be good to have an actual example of a machine file there, in addition to the written description. I will make a pull request / issue.
So I tried again. My $PBS_NODEFILE looks like:
r25i13n07
r25i13n07
r25i13n11
r25i13n11
If I type ssh r25i13n07 or ssh r25i13n11, it takes me to that node, so the default user and port work. If I understand correctly, my machine file should then look like:
2*r25i13n07
2*r25i13n11
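For reference, the count*host lines above can be derived from the node file mechanically instead of being typed by hand. This is a minimal sketch, assuming the node file lists one hostname per allocated core as shown above; the function name nodefile_to_machinefile and the output file machines.txt are made up for illustration.

```shell
#!/usr/bin/env bash
# Collapse a PBS node file (one line per allocated core) into
# Julia machine-file lines of the form "count*host".
# NOTE: the function name is hypothetical, not part of PBS or Julia.
nodefile_to_machinefile() {
    # sort groups identical hostnames together; uniq -c prefixes each
    # distinct hostname with the number of times it occurred.
    sort "$1" | uniq -c | awk '{ print $1 "*" $2 }'
}

# Inside a PBS job one might then run (not executed here):
#   nodefile_to_machinefile "$PBS_NODEFILE" > machines.txt
#   julia --machine-file=machines.txt
```

This only reshapes the file; it does not address why the workers fail to start.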
However I still get the same error, even when I make the hosts and ports explicit and drop the attempt to start 2 workers on each node:
Things to check now might be: can you ssh into the main node from the worker nodes; are all ssh connections passwordless; after ssh logging in to the worker nodes, can you run julia on them?
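Those checks can be scripted so that they run the same way Julia's launcher would, that is, non-interactively. The following is a rough sketch rather than a definitive diagnostic: check_workers is a hypothetical helper name, and the 5-second timeout is an arbitrary choice. It relies on the standard OpenSSH options BatchMode and ConnectTimeout.

```shell
#!/usr/bin/env bash
# For each unique host in a node file, report whether a non-interactive,
# passwordless ssh session can find a julia executable on its PATH.
# NOTE: check_workers is a hypothetical helper, not an existing tool.
check_workers() {
    local nodefile=$1 host
    for host in $(sort -u "$nodefile"); do
        # BatchMode=yes makes ssh fail instead of prompting for a password,
        # so a host that needs interactive login shows up as FAILED.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" 'command -v julia' >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED (no passwordless ssh, or julia not on PATH)"
        fi
    done
}

# Inside a PBS job (not executed here):
#   check_workers "$PBS_NODEFILE"
```

A host that works when you log in interactively can still fail here, because a command run through ssh does not necessarily read the same startup files as a login shell.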
I can ssh into all of them, but I can’t start Julia. My setup is that my pbs script calls a setup.rc file which (among other things) adds the Julia binary to the path, but even after sourcing it through ssh I get the same error.
I noticed that even after sourcing it, if I go to that worker and try to start Julia, it doesn’t work. So I think I have to somehow run this setup script on all workers before my job starts.
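One generic way to run a setup script on every worker before Julia starts is a small wrapper that sources the script and then hands over to julia. This is only a sketch under assumptions: write_julia_wrapper and both paths are hypothetical, it presumes setup.rc can be sourced from a non-interactive bash shell, and it is not necessarily the fix that applies on this cluster.

```shell
#!/usr/bin/env bash
# Generate a wrapper script that loads the cluster environment and then
# replaces itself with julia, forwarding all arguments unchanged.
# NOTE: write_julia_wrapper and the paths passed to it are hypothetical.
write_julia_wrapper() {
    local setup_rc=$1 wrapper=$2
    cat > "$wrapper" <<EOF
#!/usr/bin/env bash
# Load the cluster environment (PATH, modules, ...) before starting julia.
source "$setup_rc"
exec julia "\$@"
EOF
    chmod +x "$wrapper"
}

# Example (not executed here), using a location visible from every node:
#   write_julia_wrapper "$HOME/setup.rc" "$HOME/julia_wrapper.sh"
```

On the Julia side, addprocs from the Distributed standard library accepts an exename keyword, so a wrapper like this could be passed there in place of the plain julia executable. Sourcing setup.rc in one ssh session probably has no effect on later ones, since each connection starts a fresh shell.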
I seem to recall having a problem like that before, too. I was advised to run the same version of Julia on all nodes (master and workers) and I did, but I’m not sure if that was what fixed it. When you ssh, do you set StrictHostKeyChecking? If I recall correctly, that was my issue. The following works for me:
@healyp @LeePhillips, do either of you have any tips on debugging this? The error message I get from Julia is not very helpful, and I’m at a loss for how to move forward.
EDIT: the IT person helping me suggested I add the flags -vv which gave the output below (screenshot since I can’t copy paste from our service desk ticket). He also noticed that Julia 0.4.7 was able to get the workers running by simply specifying --machinefile=$PBS_NODEFILE, but not so for versions >1.
The solution ended up being very specific to the HPC I’m using. I’ve attached a screenshot below in case it is useful for someone else in the future, but it’s so particular that I somewhat doubt it: