Hi, I tried to use FluxMPI on Slurm but couldn’t figure out the configuration. Can anyone provide a minimal example?
Currently, I’m basically doing something like this:
#SBATCH --ntasks=3
export JULIA_CUDA_MEMORY_POOL=none
mpiexecjl --project=. -n 3 julia mpitraining.jl
where mpitraining.jl is the example code from the FluxMPI repo. Currently, I get an error like this:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=15447183.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: skl-a-62: task 2: Out Of Memory
I’m not sure how to resolve this, since I’m quite new to distributed computing. Another question: what arguments should I provide in the sbatch file, and what arguments should I pass to ClusterManagers.jl? What would be a good
Any suggestion is welcome and much appreciated.
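For reference, the oom-kill message comes from Slurm's cgroup enforcement, which generally means a task used more memory than the job was allocated. Here is a minimal sketch of a batch script that requests memory explicitly, assuming one MPI rank per Slurm task. The memory, CPU, and time values are placeholders rather than recommendations, and the right numbers depend on the cluster defaults and the model being trained.

```shell
#!/bin/bash
#SBATCH --ntasks=3            # one MPI rank per task, as in the post
#SBATCH --cpus-per-task=1     # assumption: one CPU core per rank
#SBATCH --mem-per-cpu=8G      # assumption: placeholder; raise it if ranks are still OOM-killed
#SBATCH --time=00:30:00       # assumption: placeholder wall-time limit

# Disable the CUDA.jl memory pool, as in the original script
export JULIA_CUDA_MEMORY_POOL=none

# Launch as many ranks as Slurm allocated tasks
mpiexecjl --project=. -n "$SLURM_NTASKS" julia mpitraining.jl
```

After a run, `sacct -j <jobid> --format=JobID,MaxRSS,ReqMem` reports how much memory each step actually used against what was requested, which can help pick a realistic value for `--mem-per-cpu`.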