I am having issues getting the ClusterManagers package to work on an SGI cluster. I have referenced the following discussion heavily but to no avail:
At this point, I have edited the ~/.julia/v0.6/ClusterManagers/src/qsub.jl file so that my qsub statement works: the new job launches and runs, but my current Julia session never connects to it. In reviewing qsub.jl, I noticed that it waits for a SET of files to be created, with the base name defined by this line in the script:
filename(i) = isPBS ? "$home/julia-$(getpid()).o$id-$i" : "$home/julia-$(getpid()).o$id.$i"
However, when I go into my home directory, there is only one file created for the whole node, not individual files for each core. Since the file name is intended to vary with $i for each core on the node, this single file obviously can't satisfy the set of files the script is waiting on.
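For context, my understanding of the relevant logic (a paraphrased sketch, not the verbatim qsub.jl source; `filename`, `np`, `home`, and `id` as above) is that the manager polls until every per-task output file exists before reading the workers' connection info from them:

```julia
# Paraphrased sketch of the wait logic in qsub.jl (not the verbatim source):
# block until each of the np expected per-task output files appears.
filename(i) = isPBS ? "$home/julia-$(getpid()).o$id-$i" :
                      "$home/julia-$(getpid()).o$id.$i"

for i in 1:np
    # spin until task i's output file shows up on the shared filesystem
    while !isfile(filename(i))
        sleep(1.0)
    end
end
```

So if only one output file is ever written, this loop never completes, which would explain why my session hangs without connecting.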
So at this point, without being enough of an expert to trace through the entire ClusterManagers program, I am not sure what to do next.
Is there some simple way to get my PBS job to write a file for each core?
Or is there a simple mod I can make to the qsub.jl file to get it to connect and move on without having individual files for each core?
Just for reference, my edits to the qsub.jl file are as follows:
Changed the line:
qsub -N $jobname -j oe -k o -t 1-$np $queue $qsub_env $res_list
to:
qsub -l select=1:ncpus=36:mpiprocs=36 -q debug -A ERDCV02221SPS -N $jobname -j oe -k o -l walltime=0:30:00
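One thing I notice is that my edit dropped the -t 1-$np part of the original line, which I believe is the array-job request that makes PBS write one output file per task index. A variant I have not yet tried (a hedged guess, assuming our scheduler is PBS Pro, where the array-job flag is spelled -J rather than Torque's -t) would keep the site-specific resource list but restore the array request:

```shell
# Hypothetical, untested variant: PBS Pro spells the array-job option -J
# (Torque uses -t). Each array index should then write its own .o file.
qsub -J 1-$np -l select=1:ncpus=36:mpiprocs=36 -q debug -A ERDCV02221SPS \
     -N $jobname -j oe -k o -l walltime=0:30:00
```

Would something along these lines be the right direction?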
I will also note that I haven't been able to figure out what the -k option is doing in the PBS line. It does not appear in our machine's documentation, but the command is accepted, since the job does enter a run status.
I know parallel stuff on different systems can be tough to diagnose, but I'd be very grateful for some help. I'm one of the few Julia users on our machine at this point, and our machines typically rank in the top 50 worldwide, so there's definitely potential for Julia expansion here if I can demonstrate to colleagues that it's a viable alternative.
V/R,
-Bob Browning