A good way of running Julia MPI jobs with Slurm on HPC platforms

For some reason, the Julia MPI jobs that I submit fail with an error about
the project not being instantiated/activated. I run them as

mpiexec julia --project=~/a64fx/FinEtoolsDDParallel.jl/examples ~/a64fx/FinEtoolsDDParallel.jl/examples/heat/Poisson2D_cg_mpi_driver.jl

and when I run just julia --project ~/a64fx/FinEtoolsDDParallel.jl/examples/ on its own, it works fine.
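
One thing I have not ruled out (this is just a guess on my part, not something I have verified) is that the ~ fails to expand when the command runs inside the batch script; with the path spelled out via $HOME, the invocation would be

# Same invocation, with the project path written out via $HOME instead of ~
mpiexec julia --project=$HOME/a64fx/FinEtoolsDDParallel.jl/examples \
    $HOME/a64fx/FinEtoolsDDParallel.jl/examples/heat/Poisson2D_cg_mpi_driver.jl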

How do you guys run Julia on HPC?

For me, Distributed.jl was always enough. But I only had embarrassingly parallel problems anyway, and MPI always looked too complicated for that :sweat_smile:
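
For that kind of workload, a minimal sketch (a toy example, nothing from a real job) looks like this:

# Toy embarrassingly parallel run with Distributed.jl: -p starts 8 local
# worker processes, @everywhere defines the work function on all of them,
# and pmap farms the inputs out across the workers.
julia -p 8 -e 'using Distributed;
    @everywhere f(x) = x^2;
    println(pmap(f, 1:100))'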

On NERSC, one of my recent Slurm scripts looks like this (I cd to the project directory, but one could replace the --project=. with a specific path):

#!/bin/bash
#SBATCH -A mp107d
#SBATCH --qos=regular
#SBATCH -C cpu
#SBATCH -t 1:00:00
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=16
#SBATCH -J big_filter
#SBATCH --exclusive

cd YOUR_PROJECT_DIR
# System MPI and parallel HDF5 for MPI.jl / HDF5.jl to bind against
module load cray-mpich
module load cray-hdf5-parallel
export JULIA_NUM_THREADS=8   # Julia threads within each process
which julia

# Instantiate and precompile serially, once, before the parallel launch,
# and print some diagnostics into the job log.
julia --project=. -e \
    'using Pkg; using InteractiveUtils;
     Pkg.instantiate(); Pkg.precompile(); Pkg.status(); versioninfo();
     using MPI; println("MPI: ", MPI.identify_implementation());'

# mpiexecjl consumes --project and passes the remaining flags (here
# --cpu-bind=cores) through to the underlying launcher.
mpiexecjl --project=. --cpu-bind=cores julia fft_filter_6144.jl
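
The mpiexecjl wrapper on the last line ships with MPI.jl; if you do not have it yet, it can be installed once with (this setup step is not shown in the script above):

# One-time setup: puts the mpiexecjl wrapper into the depot's bin directory
julia --project=. -e 'using MPI; MPI.install_mpiexecjl()'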

On my university cluster, I need to set some environment variables/flags to get a multi-node MPI setup to work. Here is a template for the Slurm script I use for distributed training of a neural network.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=80
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --time=4:00:00
#SBATCH --mem=0

module load julia/1.9.2
module load mpi/openmpi-4.1.5

...

# --mca btl_tcp_if_include eth0 restricts Open MPI's TCP transport (BTL) to
# the eth0 interface, so it does not pick one the nodes cannot reach each
# other on.
~/.julia/bin/mpiexecjl --project=$PROJECT --mca btl_tcp_if_include eth0 -n 80 julia --project=$PROJECT script.jl
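
Depending on how Slurm and Open MPI were built on a given cluster, launching straight through srun can also work; a sketch, assuming Slurm was compiled with PMI/PMIx support (I have not tested this variant on my cluster):

# Let srun do the process launching instead of mpiexecjl
srun -n 80 julia --project=$PROJECT script.jl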

Thank you to all who responded. The setup was weird in some way: when I removed the depot, things changed. Not for the (much) better, though.

Now all the processes try to precompile and it is a huge mess of stale lock files and such. How do you handle that?

I would like to have a single depot on the Lustre file system, with each process using that depot. So now I need to prevent each process from trying to precompile stuff…
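
Concretely, what I am aiming for (a sketch with hypothetical paths, not something I have working yet) is:

# One shared depot on Lustre for every rank; instantiate and precompile
# once, serially, so the MPI ranks only ever read the compile caches.
export JULIA_DEPOT_PATH=/lustre/$USER/.julia
julia --project=$PROJECT -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
mpiexecjl --project=$PROJECT -n $SLURM_NTASKS julia --project=$PROJECT script.jl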

So you have a separate Julia depot on all nodes?

Using MPI is a pain! Could not agree more.

I figured out one way to run the sim with MPI, only to be stopped by a weird error: Ookami: MPI error opal_libevent2022_evthread_use_pthreads · Issue #835 · JuliaParallel/MPI.jl · GitHub

Edit: Looks like there is a solution.