Hi. I used PBS qsub to get four nodes on our cluster. I put the contents of $PBS_NODEFILE in the array hosts below. After doing “using Distributed”, I tried using addprocs as below, and I don’t understand what to do about the error message. Suggestions welcome. Thanks.
**julia>** hosts = ["r409i4n16","r433i2n4","r433i2n8","r433i3n11"]
exception launching on machine r409i4n16 : TypeError(:if, "", Bool, "true")
exception launching on machine r433i2n8 : TypeError(:if, "", Bool, "true")
exception launching on machine r433i3n11 : TypeError(:if, "", Bool, "true")
exception launching on machine r433i2n4 : TypeError(:if, "", Bool, "true")
You could try using ClusterManagers.jl, though I think the PBS support there might not be working.
The TypeError says that the string “true” was used where a Bool was expected. Pass the Boolean value true or false, not the string “true”.
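To make that concrete, here is a minimal sketch. The original addprocs call is not shown in the thread, so I am assuming the keyword in question was `tunnel`; `describe_flag` is a hypothetical helper that only reproduces the error locally. The cluster part runs only when `PBS_NODEFILE` is set, and it assumes one worker per node (hence `unique`).

```julia
using Distributed

# Hypothetical helper: reproduces the failure without a cluster.
# A conditional on a String raises a TypeError in Julia.
function describe_flag(flag)
    try
        return flag ? "tunnel on" : "tunnel off"
    catch err
        err isa TypeError || rethrow()
        return "TypeError: expected Bool, got $(typeof(flag))"
    end
end

println(describe_flag("true"))  # String: triggers the TypeError branch
println(describe_flag(true))    # Bool: works

# Only attempt to launch workers inside a PBS job.
if haskey(ENV, "PBS_NODEFILE")
    # Assumption: one worker per node, so drop repeated entries.
    hosts = unique(readlines(ENV["PBS_NODEFILE"]))
    # Assumption: `tunnel` is the keyword that was given as a string.
    addprocs(hosts; tunnel=true)    # not tunnel="true"
    println(nworkers(), " workers started")
end
```

If a different keyword was the one passed as a string, the same fix applies: give it a Bool, not a quoted value.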
An SGI ICE cluster with over 400 racks! I am impressed!
Also I admit my ignorance. Why ssh tunnel to cluster nodes - you should be able to reach them directly?
Clue stick coming my way…
The ssh tunnel is needed on our cluster (under PBS) - otherwise the master has trouble connecting to workers.
Also, yesterday I tried addprocs_pbs in ClusterManagers, but couldn’t get that to work. It seemed to want to add an illegal qsub parameter -t with the node range (1-numnodes) that I was asking for.
@Jeff_Becker That ClusterManagers package targets an older version of PBS. I think I have a comment on the issues page there.
Sorry to be pushy here - why would you need to tunnel within a cluster?
I have managed three SGI ICE clusters, using PBS.
And two very large clusters using PBS in the Netherlands.
There is a PBS configuration using a PAM module which prevents users from sshing into cluster nodes unless they are running a job on them. I think you are using this?
You are right - the default interface for my PBS nodes was selected incorrectly - if I specify node IPoIB addresses in my host array, it works fine without tunneling. Thanks for pushing 8^)
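For anyone who hits the same thing, a rough sketch of what that looks like. The `-ib0` suffix and the `ipoib_name` helper are hypothetical; substitute whatever names or addresses your site assigns to the IPoIB interfaces. The launch step only runs inside a PBS job.

```julia
using Distributed

# Hypothetical naming convention: IPoIB hostname = node hostname plus a suffix.
# Check your site's /etc/hosts or DNS for the real names or addresses.
ipoib_name(host; suffix="-ib0") = host * suffix

hosts = ["r409i4n16", "r433i2n4", "r433i2n8", "r433i3n11"]
ib_hosts = ipoib_name.(hosts)   # broadcast the helper over the host list
println(ib_hosts)

# Only attempt to launch workers inside a PBS job.
if haskey(ENV, "PBS_NODEFILE")
    # No tunnel keyword here: workers are reached directly over IPoIB.
    addprocs(ib_hosts)
    println(nworkers(), " workers started")
end
```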
@Jeff_Becker take a look at this guide: https://github.com/juliohm/julia-distributed-computing
If it doesn’t work for your PBS cluster, please open an issue. The idea is to centralize all issues related to distributed computing in a single place where people can suggest improvements and update the instructions.