I’m quite new to distributed computing and high performance computing so please be kind.
My university has access to an HPC cluster that uses PBS/Torque. I’m currently running the ProgressiveHedging.jl package using all the cores of one node, but I have a lot of scenarios and I would like to run it on multiple cores across multiple nodes.
So far I have figured out that I should pass --machine-file=$PBS_NODEFILE when I start Julia, but I get the following error when trying to start Julia on 2 nodes with 2 cores each:
✔ [May/12 12:08] <user_and_path> $ julia --machine-file=$PBS_NODEFILE
Joining job 50738709.tier2-p-moab-2.tier2.hpc.kuleuven.be
Joining job 50738709.tier2-p-moab-2.tier2.hpc.kuleuven.be
ERROR: TaskFailedException: Unable to read host:port string from worker. Launch command exited with error?
So far I’ve managed to glean this information from my sysadmin:
The nodes communicate via the InfiniBand network, which is a topology built on top of the hardware/power/storage setup. The jobs using OpenMPI communicate with one another via Torque, whereas the jobs using Intel MPI use SSH. When your account was created, an internal SSH key was also automatically generated inside your ~/.ssh folder; that is for this purpose.
So the nodes should be able to communicate and start up over SSH, so why does the above lead to an error? Also, are the contents of PBS_NODEFILE enough for Julia to figure out what to do? (see below)
r25i13n16
r25i13n16
r25i13n17
r25i13n17
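For reference, here is a minimal sketch of how the same node list could be handed to Julia from inside a session instead of through --machine-file. It assumes the standard Distributed library, that passwordless SSH between the nodes works, and that julia is available at the same path on every node. machines_from_nodefile is a hypothetical helper name used only for illustration, not part of any package.

```julia
using Distributed

# Hypothetical helper: collapse repeated host names from a PBS node file
# into (host, count) pairs, keeping the order in which hosts first appear.
function machines_from_nodefile(lines)
    hosts = String[]
    counts = Dict{String,Int}()
    for line in lines
        host = String(strip(line))
        isempty(host) && continue          # skip blank lines
        if !haskey(counts, host)
            push!(hosts, host)
            counts[host] = 0
        end
        counts[host] += 1
    end
    return [(h, counts[h]) for h in hosts]
end

# Only attempt to launch workers when running inside a PBS job.
if haskey(ENV, "PBS_NODEFILE")
    machines = machines_from_nodefile(readlines(ENV["PBS_NODEFILE"]))
    # addprocs starts `count` workers on each host over SSH.
    addprocs(machines)
    println("Workers started: ", nworkers())
end
```

With the node file shown above, the helper would produce one entry per node with a count of 2 each. If addprocs fails here in the same way, that would at least suggest the problem lies in the SSH launch itself rather than in the contents of the node file.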