Does anyone run Singularity images across many nodes?

As I pointed out here, I want to use ClusterManagers.jl with Singularity images, but as @kescobo says, the problem seems to lie with Singularity or with my cluster configuration.

Does anyone already have experience running addprocs() + Singularity across nodes and could give me a complete example? That way I can test where the problem is.

Thanks :heart:

Did you happen to get to the bottom of this? I’m now trying the same thing on our HPC and indeed it doesn’t seem to work across nodes :thinking:

Hi, I solved it the hard way: trial and error.

I used MPIClusterManagers with the MPI_TRANSPORT_ALL option:

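using Distributed, MPIClusterManagers

# All MPI ranks run this script; rank 0 becomes the master and the other ranks become workers
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

# your code here

# Shut the workers down cleanly at the end
MPIClusterManagers.stop_main_loop(manager)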

My batch script looks something like this:

#!/bin/bash
#SBATCH --nodes=3             # node count
#SBATCH --ntasks-per-node=2   # MPI ranks (Julia processes) per node
srun --mpi=pmi2 singularity run --bind=/scratch:/scratch \
    --bind=/var/spool/slurm:/var/spool/slurm \
    /home/user/folder1/work.simg /opt/julia/bin/julia -t 4 ~/folder1/folder2/example.jl

It is crucial to give the full (absolute) path to both the Singularity image and the Julia file being executed.
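
One way to make sure of that is to resolve both paths at the top of the batch script and fail early if either file is missing. This is only a rough sketch: `abs_path` is a hypothetical helper name I made up (it is not part of Slurm or Singularity), and the paths in the commented usage are just the ones from my script above.

```shell
# Hypothetical helper: print the absolute path of an existing file,
# or return non-zero (with a message on stderr) if it does not exist.
abs_path() {
    [ -e "$1" ] || { echo "not found: $1" >&2; return 1; }
    # Resolve the directory part to a physical absolute path,
    # then re-attach the file name.
    printf '%s/%s\n' "$(cd "$(dirname -- "$1")" && pwd -P)" "$(basename -- "$1")"
}

# Assumed usage inside the batch script (paths taken from the example above):
#   IMAGE=$(abs_path /home/user/folder1/work.simg) || exit 1
#   SCRIPT=$(abs_path ~/folder1/folder2/example.jl) || exit 1
#   srun --mpi=pmi2 singularity run ... "$IMAGE" /opt/julia/bin/julia -t 4 "$SCRIPT"
```

With this, a typo in either path stops the job right away instead of failing separately on every node.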

If you need it, I can try to write a minimal working example.