MPI + Multithreading with ParallelStencil.jl + ImplicitGlobalGrid.jl

Hi,

I tried using ParallelStencil + ImplicitGlobalGrid with MPI on CPU clusters, but I cannot get multithreading to work within the MPI processes. I took the acoustic2D.jl example from the ParallelStencil repo and added ImplicitGlobalGrid to it, and I also took the diffusion3D_multicpu_novis.jl example from the ImplicitGlobalGrid repo and integrated ParallelStencil into it (both modified scripts are provided here).

In both cases, when I run the script with 'julia --threads 16 $script', I can see that all the threads are utilized. However, when I use 'mpiexec -np 1 julia --threads 16 $script', the threads are spawned but don't seem to be utilized (see the htop screenshot at the bottom for comparison). I used Open MPI 4.1.1, and setting the environment variables 'OMP_NUM_THREADS=16' and 'JULIA_NUM_THREADS=16' didn't help either.

Given the scripts, what is the proper way to use ParallelStencil + ImplicitGlobalGrid so that I can combine MPI with multithreading on each MPI rank? It's worth mentioning that I was able to get ImplicitGlobalGrid to work on multiple GPUs using CUDA-aware MPI.

Thanks,

@luraess, @samo

@thoth291

Hi @ali-vaziri, thanks for reporting.

I cloned your repo on my machine and tried the 2D example you are referring to.

As you can see below, both versions run fine, and I could not reproduce the case where running on 8 threads under MPI leads to unused spawned threads.

$ mpirun -n 1 --bind-to socket julia --project -t 8 acoustic2D_ImplicitGlobalGrid.jl
Global grid: 4095x4095x1 (nprocs: 1, dims: 1x1x1)
rank=0 - nthreads = 8
rank=0 - time (s) = 5.351983070373535
$ julia --project -t 8 acoustic2D_ImplicitGlobalGrid.jl
Global grid: 4095x4095x1 (nprocs: 1, dims: 1x1x1)
rank=0 - nthreads = 8
rank=0 - time (s) = 5.9138031005859375

It could be, though, that issues arise when you try to share resources (CPU cores) between MPI ranks and Julia threads. To better assess what is going on, you could report the number of threads actually used and the wall time, timing only the work done in the iterations and not the initialisation.
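Something along these lines, for instance (a self-contained sketch with a stand-in diffusion kernel, not your actual script; the array and variable names are placeholders):

using ImplicitGlobalGrid, ParallelStencil, ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)

@parallel function step!(T2, T, lam, dt, dx, dy)
    @inn(T2) = @inn(T) + dt*lam*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2)
    return
end

function main()
    nx = ny = 512; nt = 200
    lam = 1.0; dx = dy = 1.0; dt = min(dx,dy)^2/lam/4.1
    me, dims = init_global_grid(nx, ny, 1)   # sets up MPI and the Cartesian communicator
    T  = @rand(nx, ny)
    T2 = copy(T)
    println("rank=$me - nthreads = $(Threads.nthreads())")
    t0 = 0.0
    for it = 1:nt
        if it == 11  t0 = time()  end        # exclude compilation/warm-up from the timing
        @parallel step!(T2, T, lam, dt, dx, dy)
        update_halo!(T2)                     # halo exchange between MPI ranks
        T, T2 = T2, T
    end
    println("rank=$me - time (s) = $(time() - t0)")
    finalize_global_grid()
end

main()

That way the reported time excludes both the package initialisation and the first compilation of the kernels.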

What are the hardware specs of your device/server?

Thanks for bringing up the issue. However, at first glance, it does not look to me like a problem caused by the packages ParallelStencil and ImplicitGlobalGrid. It rather seems to be a more general problem that could also be observed when using just MPI and Threads. Have you tried to reproduce it with MPI and Threads only?
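For instance, a minimal MPI.jl + Threads reproducer could look roughly like this (a sketch, independent of ParallelStencil and ImplicitGlobalGrid; the array size and iteration count are arbitrary):

using MPI

function threaded_work!(a)
    Threads.@threads for i in eachindex(a)
        a[i] = sin(a[i])^2 + cos(a[i])^2
    end
    return
end

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
a    = rand(10^8)                 # adjust the size to your machine
threaded_work!(a)                 # warm-up / compilation
t0 = time()
for _ in 1:10
    threaded_work!(a)
end
println("rank=$rank - nthreads = $(Threads.nthreads()) - time (s) = $(time() - t0)")
MPI.Finalize()

If 'mpiexec -np 1 julia -t 16 $script' shows the same under-utilisation as with your ParallelStencil scripts, the problem lies in the interplay of MPI process binding and Julia threads rather than in the packages.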

Are you submitting your job with a job scheduler? If so, you might not be using the right job submission options. The following job script generator could serve as inspiration even if you're not using Slurm:
https://user.cscs.ch/access/running/jobscript_generator/
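For a hybrid MPI + threads run, the essential point is to request one task per MPI rank and as many CPUs per task as Julia threads per rank, e.g. with Slurm something along the lines of (a sketch; the script name and the numbers are placeholders):

srun --ntasks=2 --cpus-per-task=16 julia -t 16 --project script.jl

with the Julia thread count matching --cpus-per-task.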

@samo @luraess Thanks for the hints!

I'm using dedicated machines (with passwordless SSH) without Slurm, SGE, etc. Each machine has 4 NUMA sockets, 24 cores per socket, and 1 thread per core (96 cores in total). I was able to utilize the entire machine using:

mpiexec -np 1 --map-by node:PE=96 julia -t 96 $script

But I'm getting a segfault in the cases below:

  • 2 MPI ranks on one machine (48 cores per MPI rank) using:

mpiexec -np 2 --map-by node:PE=48 julia -t 48 $script

  • 2 MPI ranks on two machines (one MPI rank per machine, each using 96 cores) using:

mpiexec -np 2 --map-by node:PE=96 --prefix $mpi_dir --hostfile $h_file julia -t 96 $script

where the hostfile is:
node_1_name slots=1
node_2_name slots=1

I'm probably missing something at the moment and will post the correct configuration here once I get it working.
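(One variant I still want to try, assuming the PE= syntax also works per socket, is mapping one rank per NUMA socket instead of splitting a node in half, e.g.:

mpiexec -np 4 --map-by socket:PE=24 julia -t 24 $script

i.e. 4 ranks per machine with 24 threads each; I haven't verified yet whether this avoids the segfault.)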