I have a `run_jobs.jl` file that supports distributed-memory parallelism. What should the beginning of `run_jobs.jl` look like if I want to run it on a cluster (Univa Grid Engine) via `qsub`?
I can use ClusterManagers.jl to run my code interactively:

```julia
using ClusterManagers, Distributed

# addprocs_sge will request 16 cores (possibly on different nodes)
ClusterManagers.addprocs_sge(16; qsub_flags=`-l h_rt=24:00:00,h_data=4G,arch=intel-gold-61\*`)

# run my actual code that supports distributed memory
...
```
But this job will run for a long time, and I don’t want to wait until it finishes.
If I `qsub` a single-core job to run the script above, I get this error:

```
Base.IOError("could not spawn `qsub -N julia-2229 -wd /u/home/b/biona001 -terse -j y -R y -t 1-4 -V -l 'h_rt=24:00:00,h_data=4G,arch=intel-gold-61*'`: no such file or directory (ENOENT)", -2)
```
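For concreteness, the single-core wrapper I submit is essentially a one-line SGE script along these lines (the resource values mirror the interactive example above; the script name and flags are placeholders, not my exact file):

```shell
#!/bin/bash
#$ -cwd
#$ -j y
# single core for the wrapper itself; run_jobs.jl then calls
# addprocs_sge to request the 16 workers
#$ -l h_rt=24:00:00,h_data=4G

julia run_jobs.jl
```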
I also tried submitting a distributed-memory job (specifying `-pe dc* 16` in my shell script), with the heading of `run_jobs.jl` simply `using Distributed; addprocs(16)`, but the job crashes and produces core dumps. If I instead request a shared-memory node (`-pe shared 16`), the job runs successfully on a single node with 16 cores, but this massively increases queue time.
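The two batch variants I tried differ only in the parallel-environment request; sketched below with placeholder values (only one `-pe` line is active at a time):

```shell
#!/bin/bash
#$ -cwd
#$ -j y
#$ -l h_rt=24:00:00,h_data=4G

# distributed-memory attempt: crashes and core-dumps
#$ -pe dc* 16
# shared-memory attempt: works on one node, but queues much longer
## -pe shared 16

# run_jobs.jl begins with: using Distributed; addprocs(16)
julia run_jobs.jl
```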
Any tips are appreciated. Thanks!