I have a `run_jobs.jl` file that supports distributed memory parallelism. How should the beginning of `run_jobs.jl` look if I want to run it on a cluster (Univa Grid Engine) via `qsub`?

I can use ClusterManagers.jl to run my code interactively:
```julia
# addprocs_sge will request 16 cores (possibly on different nodes)
using ClusterManagers, Distributed
ClusterManagers.addprocs_sge(16; qsub_flags=`-l h_rt=24:00:00,h_data=4G,arch=intel-gold-61\*`)

# run my actual code that supports distributed memory
...
```
But this job will run for a long time, and I don’t want to wait until it finishes. If I `qsub` a single-core job to run the script above, I get this error:
```
Base.IOError("could not spawn `qsub -N julia-2229 -wd /u/home/b/biona001 -terse -j y -R y -t 1-4 -V -l 'h_rt=24:00:00,h_data=4G,arch=intel-gold-61*'`: no such file or directory (ENOENT)", -2)
```
I also tried submitting a distributed memory job (specifying `-pe dc* 16` in my shell script) where the heading of `run_jobs.jl` is simply `using Distributed; addprocs(16)`, but the job crashes and produces core dumps. If I instead submit to a shared memory node (specifying `-pe shared 16`), the job runs successfully on a single node with 16 cores, but this massively increases queue time.
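For reference, the shared-memory submission that does work looks roughly like this (a minimal sketch of my job script; the resource values and script name reflect my setup and may need adjusting):

```
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe shared 16                # request all 16 cores on a single node (works, but queues slowly)
#$ -l h_rt=24:00:00,h_data=4G

# run_jobs.jl begins with: using Distributed; addprocs(16)
julia run_jobs.jl
```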
Any tips are appreciated. Thanks!