As I pointed out here, I want to use ClusterManagers.jl with Singularity images, but as @kescobo says, the problem seems to be either Singularity or my cluster configuration.
Does anyone already have experience running addprocs() + Singularity across nodes and could share a complete example? That way I can test where the problem is.
Thanks
Did you happen to get to the bottom of this? I’m now trying the same thing on our HPC, and indeed it doesn’t seem to work across nodes.
Hi, I solved it the hard way: trial and error.
I used MPIClusterManagers with the option MPI_TRANSPORT_ALL:
using Distributed, MPIClusterManagers
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)
# your code here
My batch script looks something like this:
#SBATCH --nodes=3 # node count
#SBATCH --ntasks-per-node=2
(...)
srun --mpi=pmi2 singularity run --bind=/scratch:/scratch \
--bind=/var/spool/slurm:/var/spool/slurm \
/home/user/folder1/work.simg /opt/julia/bin/julia -t 4 ~/folder1/folder2/example.jl
It is crucial to give the full path to both the Singularity image and the Julia file being executed.
If you need it, I can try to write up a minimal working example.
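As a rough sketch (not the exact script used above), example.jl could look something like the following. It assumes the job is launched through srun as in the batch script, so that every MPI rank starts the same file. The hostname check and the small reduction are only illustrative placeholders for "your code here".

```julia
using Distributed, MPIClusterManagers

# With MPI_TRANSPORT_ALL, rank 0 returns from this call and acts as the
# Distributed master; the other MPI ranks become workers and stay inside
# the call until the main loop is stopped.
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

println("Number of workers: ", nworkers())

# Ask every worker for its hostname to confirm that the workers
# really are spread across more than one node.
hosts = [remotecall_fetch(gethostname, w) for w in workers()]
println("Worker hosts: ", unique(hosts))

# Illustrative workload: a small distributed reduction as a sanity check.
total = @distributed (+) for i in 1:100
    i^2
end
println("Sum of squares: ", total)

# Shut the workers down cleanly before the job ends.
MPIClusterManagers.stop_main_loop(manager)
```

If the list of worker hosts shows only one node name, the problem is more likely in the srun/Singularity launch line than in the Julia code.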