Does anyone run Singularity images across many Nodes?

Noel_Araujo · May 3, 2021, 3:45pm

As I pointed out here, I want to use ClusterManagers.jl with Singularity images, but as @kescobo says, the problem seems to lie in Singularity or in my cluster configuration.

Does anyone have experience running addprocs() with Singularity across nodes and could share a complete example? That way I can test where the problem is.

Thanks

vmmhep · May 4, 2022, 9:10pm

Did you happen to get to the bottom of this? I’m now trying the same thing on our HPC cluster, and indeed it doesn’t seem to work across nodes.

Noel_Araujo · May 4, 2022, 10:07pm

Hi, I solved it the hard way: trial and error.

I used MPIClusterManagers.jl with the MPI_TRANSPORT_ALL option:

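using Distributed, MPIClusterManagers
# Rank 0 continues as the master process; the other MPI ranks become workers.
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

# your code here

MPIClusterManagers.stop_main_loop(manager)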

My batch script looks something like this:

#SBATCH --nodes=3  # node count
#SBATCH --ntasks-per-node=2
(...)
srun --mpi=pmi2 singularity run --bind=/scratch:/scratch \
    --bind=/var/spool/slurm:/var/spool/slurm \
    /home/user/folder1/work.simg /opt/julia/bin/julia -t 4 ~/folder1/folder2/example.jl

It is crucial to give the full path to both the Singularity image and the Julia script being executed.

If you need it, I can try to write a minimal working example.
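One way to guard against relative paths is to resolve them before calling srun. This is only a minimal sketch, assuming a POSIX shell on the submit node. `abs_path` is a hypothetical helper written for illustration and is not part of Slurm or Singularity. The paths in the usage comment are the ones from the batch script above.

```shell
# Sketch: resolve a file to an absolute path and fail early if it is missing.
# abs_path is a hypothetical helper name, not a Slurm or Singularity command.
abs_path() {
    if [ ! -e "$1" ]; then
        echo "not found: $1" >&2
        return 1
    fi
    # cd into the parent directory and let pwd -P report its physical path
    _dir=$(cd "$(dirname "$1")" && pwd -P) || return 1
    # avoid a doubled slash when the file sits directly under /
    [ "$_dir" = / ] && _dir=
    printf '%s/%s\n' "$_dir" "$(basename "$1")"
}

# Possible use inside the batch script:
#   image=$(abs_path /home/user/folder1/work.simg) || exit 1
#   script=$(abs_path ~/folder1/folder2/example.jl) || exit 1
#   srun --mpi=pmi2 singularity run --bind=/scratch:/scratch \
#       --bind=/var/spool/slurm:/var/spool/slurm \
#       "$image" /opt/julia/bin/julia -t 4 "$script"
```

Failing before srun starts means a wrong path shows up as one clear message in the job log rather than as an error from every task.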
