For some reason, the Julia MPI jobs that I submit fail with an error about the project not being instantiated/activated. I run them as
mpiexec julia --project=~/a64fx/FinEtoolsDDParallel.jl/examples ~/a64fx/FinEtoolsDDParallel.jl/examples/heat/Poisson2D_cg_mpi_driver.jl
whereas when I run just
julia --project ~/a64fx/FinEtoolsDDParallel.jl/examples/
it works fine.
How do you guys run Julia on HPC?
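A minimal sketch (reusing the project path from the question) of instantiating the project once, serially, before the parallel launch; this is the same pattern the batch scripts further down in the thread use:
# one-off, serial: instantiate the project so the MPI ranks do not have to
julia --project=~/a64fx/FinEtoolsDDParallel.jl/examples -e 'using Pkg; Pkg.instantiate()'
# then launch as before
mpiexec julia --project=~/a64fx/FinEtoolsDDParallel.jl/examples ~/a64fx/FinEtoolsDDParallel.jl/examples/heat/Poisson2D_cg_mpi_driver.jl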
For me, Distributed.jl was always enough. But I only had embarrassingly parallel problems anyway, and MPI always looked too complicated for that.
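As a rough illustration of that style (the function and worker count here are placeholders, not from the post), an embarrassingly parallel map with Distributed.jl looks roughly like this:
using Distributed
addprocs(4)                          # on a cluster, size this from the allocation (e.g. via ClusterManagers.jl)

@everywhere expensive_work(x) = sum(abs2, rand(10_000)) + x   # placeholder per-task work

results = pmap(expensive_work, 1:100)   # workers pull tasks independently; no MPI involved
println(length(results), " tasks finished")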
On NERSC, one of my recent SLURM scripts looks like this (I cd into the project directory, but one could replace the --project=. with a specific path):
#!/bin/bash
#SBATCH -A mp107d
#SBATCH --qos=regular
#SBATCH -C cpu
#SBATCH -t 1:00:00
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=16
#SBATCH -J big_filter
#SBATCH --exclusive
cd YOUR_PROJECT_DIR
module load cray-mpich
module load cray-hdf5-parallel
export JULIA_NUM_THREADS=8
which julia
julia --project=. -e \
'using Pkg; using InteractiveUtils;
Pkg.instantiate(); Pkg.precompile(); Pkg.status(); versioninfo();
using MPI; println("MPI: ", MPI.identify_implementation());'
mpiexecjl --project=. --cpu-bind=cores julia fft_filter_6144.jl
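In case it is not obvious where mpiexecjl on the last line comes from: it is the launcher wrapper that MPI.jl can install into the depot's bin directory. A sketch of the one-off installation step, assuming MPI.jl is already a dependency of the project:
julia --project=. -e 'using MPI; MPI.install_mpiexecjl()'
# the wrapper lands in ~/.julia/bin by default; add that to PATH or call it by full path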
On my university cluster I need to set some environment variables/flags to get a multi-node MPI setup to work. Here is a template for the SLURM script I use for distributed training of a neural network.
#SBATCH --nodes=2
#SBATCH --ntasks=80
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --time=4:00:00
#SBATCH --mem=0
module load julia/1.9.2
module load mpi/openmpi-4.1.5
...
~/.julia/bin/mpiexecjl --project=$PROJECT --mca btl_tcp_if_include eth0 -n 80 julia --project=$PROJECT script.jl
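To check that a launch like this really spans both nodes before running the actual training, a minimal smoke test can stand in for script.jl (hypothetical file name mpi_hello.jl):
# mpi_hello.jl: print one line per rank with its host
using MPI, Sockets
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
println("rank $rank of $nranks on $(gethostname())")
MPI.Barrier(comm)
MPI.Finalize()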
Thank you to all who responded. The setup was odd in some way: I removed the depot and things changed, though not for (much) the better.
Now all the processes try to precompile, and it becomes a huge mess of stale lock files and the like. How do you handle that?
I would like to have a single depot on the Lustre file system, with each process using that depot. So now I need to prevent each process from trying to precompile things…
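One common way to handle this (a sketch with a hypothetical Lustre path and driver name, not taken from the thread) is to point every rank at the shared depot and run the instantiate/precompile step once, serially, before the parallel launch, so the ranks only read caches that already exist:
# hypothetical shared depot on Lustre, visible from all nodes
export JULIA_DEPOT_PATH=/lustre/$USER/julia_depot
# serial warm-up: build the precompile caches exactly once
julia --project=. -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
# now the parallel launch; every rank reuses the caches instead of racing to build them
mpiexec -n 64 julia --project=. driver.jl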
So you have a separate Julia depot on all nodes?
Using MPI is a pain! Could not agree more.
I figured out one way to run the sim with MPI, only to be stopped by a weird error: Ookami: MPI error opal_libevent2022_evthread_use_pthreads · Issue #835 · JuliaParallel/MPI.jl · GitHub
Edit: Looks like there is a solution.