Julia parallel computing over multiple nodes in SGE cluster

parallel

#1

I already asked this on Stack Overflow, but none of the answers solved my problem.

Here is the link to the question:

I want to run Julia across multiple nodes of an SGE cluster and still be able to use the native @parallel parallelization macro. I tried julia --machinefile $PE_HOSTFILE and also ClusterManagers.addprocs_sge(X), where X is the number of processes, but both failed.

Can anyone give me any advice?


#2

Try addprocs_qrsh instead. If your file system is tuned for high throughput at the cost of high latency, inter-process communication through the file system can time out.


#3

OK, I was told by the IT team that these methods need passwordless ssh, which the cluster I’m using doesn’t allow. They’ll look into it over the next few days, and I’ll post here if I find a solution.


#4

The IT team got back to me and told me that the SGE cluster does not allow passwordless ssh, which is why addprocs_sge() and everything else I tried didn’t work. However, they have now added a file to the job that I can pass to Julia, and they told me to run the job with this script:

qlogin -pe mpi_28_tasks_per_node 56
module load julia/0.5.1
julia --machinefile $TMPDIR/machines

The machines file looks like this:


::::::::::::::
/scratch/8548498.1.u/machines
::::::::::::::
{hostname1}
{hostname1}
...
{hostname2}
{hostname2}
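
Before starting Julia it can help to check what the machines file actually contains. The following is only a sketch: it assumes, as the repeated hostnames above suggest, that each non-empty line corresponds to one worker. The function names and the MACHINEFILE override variable are made up for illustration and are not part of SGE or Julia.

```shell
#!/bin/sh
# Summarize a machines file: one line per worker, hostnames may repeat.
# MACHINEFILE is a hypothetical override; by default the path from the
# job script above ($TMPDIR/machines) is used.

# Print "<count> <host>" for each distinct host, ignoring empty lines.
workers_per_host() {
    grep . "$1" | sort | uniq -c | awk '{print $1, $2}'
}

# Print the number of non-empty lines, i.e. the assumed worker count.
total_workers() {
    grep -c . "$1"
}

machinefile="${MACHINEFILE:-${TMPDIR:-/tmp}/machines}"
if [ -r "$machinefile" ]; then
    workers_per_host "$machinefile"
    echo "total: $(total_workers "$machinefile")"
fi
```

Run inside the qlogin session, this prints one line per node with the number of entries for it, followed by the total, which should match the slot count requested with -pe if the file is laid out as assumed.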