MPIClusterManagers task split across nodes

Hi all,

I’ve been using Julia on the cluster quite successfully with MPIClusterManagers.jl and MPI.jl, starting workers via the MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL) option and then using native Julia constructs such as remotecall and friends. However, I just noticed something a little odd about how worker processes get distributed across cluster nodes, where 1 node = 104 cores.
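(By “native constructs” I mean the usual Distributed primitives once the workers are up; a minimal sketch, nothing beyond what the stdlib provides:)

using Distributed
f = remotecall(sum, 2, 1:100)     # start sum(1:100) on worker 2; returns a Future
fetch(f)                          # wait for and retrieve the result
remotecall_fetch(gethostname, 3)  # call-and-fetch in one step on worker 3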

Specifically, if I ask mpirun for 208 ranks, i.e., 2 nodes with 104 CPUs each, I get the expected 207 workers and 1 manager process. However, the first 104 workers land on the first node and the remaining 103 on the second. This seems a little strange to me, because it means that if the manager is doing moderate work (e.g., in one-sided communication), we are oversubscribing the first node (104 workers + 1 manager on 104 CPUs) and undersubscribing the second (103 workers on 104 CPUs). This makes my task parallelization a little wonky, since I try to keep parallel communication within a node and minimize comms between nodes.

I’m on Julia 1.8, MPI.jl v0.19.2, and MPIClusterManagers.jl v0.2.4.

My question: Is there a way to get an even 104 processes on all nodes?

Here’s my MWE to figure out the node/task split

# mpitest.jl
## MPI Init
using MPIClusterManagers, Distributed
import MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
sz = MPI.Comm_size(MPI.COMM_WORLD)
if rank == 0
    @info "size is $sz"
end
# all ranks except 0 turn into workers here; only the manager (rank 0) returns
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)
@info "there are $(nworkers()) workers"
@everywhere using Distributed
@everywhere begin
    function getdetails()
        [myid() gethostname()]  # 1x2 row: pid and hostname
    end
end
using DelimitedFiles
r = reduce(vcat, pmap(x->getdetails(), 1:nworkers()))
idx = sortperm(r[:,1])          # sort rows by pid
writedlm("test.txt", r[idx,:])
MPIClusterManagers.stop_main_loop(manager)  # shut workers down cleanly

which I run from a login node on the cluster as

mpirun -np 208 julia mpitest.jl
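(Aside: this placement is decided by mpirun, not by Julia. If your MPI is Open MPI — an assumption on my part — the default mapping policy fills the first node before spilling to the second; to round-robin ranks across nodes instead you can do

mpirun -np 208 --map-by node julia mpitest.jl

with the caveat that other MPI implementations spell this option differently.)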

The output is

size is 208
there are 207 workers

and test.txt looks like

2   blah-cpu-spr-0570.blah
3   blah-cpu-spr-0570.blah
4   blah-cpu-spr-0570.blah
5   blah-cpu-spr-0570.blah
6   blah-cpu-spr-0570.blah
7   blah-cpu-spr-0570.blah
...
104 blah-cpu-spr-0570.blah
105 blah-cpu-spr-0570.blah
106 blah-cpu-spr-0569.blah
107 blah-cpu-spr-0569.blah
...
207 blah-cpu-spr-0569.blah
208 blah-cpu-spr-0569.blah

What node is the main process (pid = 1) on?


Ah, I see … I modified the pmap above as follows:

r = reduce(vcat, pmap(x->getdetails(), WorkerPool(procs()), 1:nprocs()))

to go across all processes instead of workers …
lo and behold, I get

1   blah-cpu-spr-0693.blah
2   blah-cpu-spr-0697.blah
3   blah-cpu-spr-0697.blah
4   blah-cpu-spr-0697.blah
...
105 blah-cpu-spr-0697.blah
106 blah-cpu-spr-0693.blah
107 blah-cpu-spr-0693.blah
...
208 blah-cpu-spr-0693.blah

pids 2-105 are on node spr-0697, which is 104 processes;
pids 106-208 are on node spr-0693, which is 103 processes;
but process 1, the manager, is on a different node from process 2:
spr-0693 has 103 workers + 1 manager, which is 104 processes.

Excellent, so it is 104 on each node!

The discontinuous numbering threw me off: I had somehow expected pids 1-104 to be on the same node, not 2-105, and that made me jump nodes in my work division. All cleared up now. Maybe worth putting in the docs? Happy to do that if you point me to the right part of the docs.
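For anyone else hitting this, here’s roughly how I keep work node-local now: ask each worker for its hostname and build one WorkerPool per node. This is a sketch of my own (the names hostof, bynode, pools are just illustrative), not anything from the MPIClusterManagers docs:

using Distributed

# ask each worker which node it lives on
hostof = Dict(p => remotecall_fetch(gethostname, p) for p in workers())

# bucket worker pids by hostname
bynode = Dict{String,Vector{Int}}()
for (p, h) in hostof
    push!(get!(bynode, h, Int[]), p)
end

# one WorkerPool per node; pass a pool to pmap to keep that call on one node
pools = Dict(h => WorkerPool(sort(ps)) for (h, ps) in bynode)

Then something like pmap(f, pools[host], items) runs entirely on the node named host.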

Thanks for your help, cheers!