How to submit distributed-memory jobs to a cluster?

I have a run_jobs.jl file that supports distributed-memory parallelism. What should the beginning of run_jobs.jl look like if I want to run it on a cluster (Univa Grid Engine) via qsub?

I can use ClusterManagers.jl to run my code interactively with:

# addprocs_sge will request 16 cores (possibly on different nodes)
using ClusterManagers, Distributed
ClusterManagers.addprocs_sge(16; qsub_flags=`-l h_rt=24:00:00,h_data=4G,arch=intel-gold-61\*`)

# run my actual code that supports distributed memory
...

But this job will run for a long time, and I don’t want to wait until it finishes.

If I qsub a single-core job to run the script above, I get this error:

Base.IOError("could not spawn `qsub -N julia-2229 -wd /u/home/b/biona001 -terse -j y -R y -t 1-4 -V -l 'h_rt=24:00:00,h_data=4G,arch=intel-gold-61*'`: no such file or directory (ENOENT)", -2)
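For reference, the single-core submission script I used looked roughly like this (the module-loading line is a placeholder for whatever my site uses, not my exact setup):

```shell
#!/bin/bash
#$ -cwd                          # run from the current working directory
#$ -j y                          # merge stdout and stderr
#$ -l h_rt=24:00:00,h_data=4G    # same resource requests as in the interactive run

module load julia                # placeholder for the site's Julia module

# run the driver script; it then tries to invoke qsub itself via addprocs_sge
julia run_jobs.jl
```

My guess is that qsub is not on the PATH of the compute node this job lands on (ENOENT = file not found), so addprocs_sge cannot spawn it from inside the batch job.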

I also tried submitting a distributed-memory job (specifying -pe dc* 16 in my shell script), with run_jobs.jl beginning simply with using Distributed; addprocs(16), but the job crashes and produces core dumps. If I instead request a shared-memory allocation (-pe shared 16), the job runs successfully on a single node with 16 cores, but this massively increases queue time.
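The shared-memory variant that does run looks roughly like this (again, the module name is a placeholder):

```shell
#!/bin/bash
#$ -cwd                          # run from the current working directory
#$ -j y                          # merge stdout and stderr
#$ -l h_rt=24:00:00,h_data=4G    # per-slot resource requests
#$ -pe shared 16                 # 16 cores, all on a single node

module load julia                # placeholder for the site's Julia module

# run_jobs.jl begins with: using Distributed; addprocs(16)
julia run_jobs.jl
```

This works, but because all 16 cores must come from one node, the job can sit in the queue for a long time, which is why I'd prefer the multi-node (-pe dc*) route.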

Any tips are appreciated. Thanks!