MPIClusterManagers task split across nodes

sparrowhawk · April 27, 2023, 4:46am

Hi all,

I’ve been using Julia on the cluster quite successfully with MPIClusterManagers.jl and MPI.jl with the MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL) option and using native julia constructs after that such as remotecall and friends. However, I just noticed something a little bit odd about how worker processes were getting distributed across cluster nodes where 1 node = 104 cores.

Specifically, if I ask for 208 cores through mpirun, i.e., across 2 nodes with 208 cpus, as expected, I get 207 workers and 1 manager process. However, the first 104 workers get put on the first node, and the next 103 get put on the second. This seems a little strange to me, as this means that if the manager in one sided communication is doing moderate work, then we are oversubscribing the first node (104+1 processes on 104 cpus) and undersubscribing the second (103 processes on 104 cpus). This is making my task parallelization a little wonky as I try not to cross nodes in parallel communication with minimal comms between nodes.

I’m on julia 1.8, MPI.jl v0.19.2, MPIClusterManagers v0.2.4

My question: Is there a way to get an even 104 processes on all nodes?

Here’s my MWE to figure out the node/task split

# mpitest.jl
## MPI Init
using MPIClusterManagers, Distributed
import MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
sz = MPI.Comm_size(MPI.COMM_WORLD)
if rank == 0
    @info "size is $sz"
end
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)
@info "there are $(nworkers()) workers"
@everywhere using Distributed
@everywhere begin
    function getdetails()
        [myid() gethostname()]
    end
end
using DelimitedFiles
r = reduce(vcat, pmap(x->getdetails(), 1:nworkers()))
idx = sortperm(r[:,1])
writedlm("test.txt", r[idx,:])
MPIClusterManagers.stop_main_loop(manager)
rmprocs(workers())
exit()

which I run from a login node on the cluster as

mpirun -np 208 julia mpitest.jl

The output is

size is 208
there are 207 workers

and test.txt looks like

2   blah-cpu-spr-0570.blah
3   blah-cpu-spr-0570.blah
4   blah-cpu-spr-0570.blah
5   blah-cpu-spr-0570.blah
6   blah-cpu-spr-0570.blah
7   blah-cpu-spr-0570.blah
.
104 blah-cpu-spr-0570.blah
105 blah-cpu-spr-0570.blah
106 blah-cpu-spr-0569.blah
107 blah-cpu-spr-0569.blah
.
207 blah-cpu-spr-0569.blah
208 blah-cpu-spr-0569.blah

simonbyrne · April 27, 2023, 5:56pm

What node is the main process (pid =1) on?

sparrowhawk · April 28, 2023, 2:02am

Ah, I see … I modified the pmap above as follows:

r = reduce(vcat, pmap(x->getdetails(), WorkerPool(procs()), 1:nprocs()))

to go across all processes instead of workers …
lo and behold, I get

1   blah-cpu-spr-0693.blah
2   blah-cpu-spr-0697.blah
3   blah-cpu-spr-0697.blah
4   blah-cpu-spr-0697.blah
.
105 blah-cpu-spr-0697.blah
106 blah-cpu-spr-0693.blah
107 blah-cpu-spr-0693.blah
.
208 blah-cpu-spr-0693.blah

pids 2-105 are on node spr-697, which are 104 processes in number
pids 106-208 are on node spr-693, which are 103 processes in number
but, process 1, the manager is on a different node from process 2.
spr-693 has 103 processes + 1 manager which is 104 processes in number

Excellent, so it is 104 on each node!

The discontinuous numbering threw me off, as I had somehow expected pids 1-104 to be on the same node, not 2-105. This made me jump nodes in my work division. All cleared up now, maybe worth putting in the docs? Happy to do that if you point me to the right part of the docs.

Thanks for your help, cheers!

Topic		Replies	Views
Julia on cluster, only MPI transport allowed General Usage parallel , distributed	7	1200	October 10, 2020
Distributing parallel tasks/workers over multiple nodes using SLURM General Usage hpc , distributed , pmap	10	2125	October 25, 2022
How to have each node in a cluster run a parallel job on its own CPUs, using SSH? Julia at Scale	4	725	October 14, 2024
MPI.jl tasks in multiple nodes General Usage parallel	1	62	April 27, 2025
MPIClusterManagers add more nodes interactive session New to Julia	0	172	October 27, 2023

MPIClusterManagers task split across nodes

Related topics