I want to run Julia MPI jobs on a cluster, but I am not sure whether I am doing it correctly.

The cluster is large, with hundreds of nodes and 128 cores per node, and it uses SLURM for resource management. I have implemented some scripts with MPI.jl and want to know the correct way to run jobs across multiple nodes.

For example, I have a simple script, *test_collectives.jl*, which times the `MPI.Allreduce!` collective on the cluster. To run it, I wrote and submitted the following sbatch script, *run_test_collectives.sub*. It uses *OpenMPI_jll* because I failed to configure the system-provided MPI library as the backend. I ran the job with 4, 16, and 64 processes (1 node each), 256 processes (2 nodes), 1024 processes (8 nodes), and 4096 processes (32 nodes).

```
#!/bin/tcsh
#SBATCH -p standard
#SBATCH -t 01:00:00
#SBATCH --mem=500000
#SBATCH -n 4096
#SBATCH -c 1
#SBATCH -N 32
#SBATCH -o results_test_collectives_p5m_np4096_openmpi.log
module load julia
julia-1.7 -e 'using MPIPreferences; MPIPreferences.use_jll_binary("OpenMPI_jll");'
mpiexec -n 4096 julia-1.7 test_collectives.jl
```

**However, the following plot shows that the execution time of `MPI.Allreduce!` does not scale in line with the theoretical complexity, O(log p), where p is the number of processes. There is a big jump from 1.75e-02 s at 256 processes to 1.91e-01 s at 1024 processes. Am I submitting and running the jobs correctly? Any input is welcome.**
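For reference, here is the rough estimate behind that expectation. It assumes a simple latency-bandwidth cost model with a tree-like or recursive-doubling allreduce, which is an assumption on my side and not something I verified for this MPI library. The symbols $\alpha$ (per-message latency), $\beta$ (per-byte transfer time), and $m$ (message size in bytes) are introduced only for this sketch.

$$
T(p) \approx \lceil \log_2 p \rceil \,(\alpha + m\beta)
$$

The script reduces first over one grid dimension and then over the other, each with $\sqrt{p}$ processes, so under the same model

$$
T_{\text{2D}}(p) \approx 2 \log_2\!\sqrt{p}\,(\alpha + m\beta) = \log_2 p \,(\alpha + m\beta),
\qquad
\frac{T_{\text{2D}}(1024)}{T_{\text{2D}}(256)} \approx \frac{10}{8} = 1.25 .
$$

The measured ratio is $1.91\times10^{-1} / 1.75\times10^{-2} \approx 10.9$, which is the gap I am asking about.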

```
# filename: test_collectives.jl
using MPI, LinearAlgebra, Printf

for code in 1:1
    repeats1 = 5
    MPI.Init()

    ## construct a 2D process grid
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    comm_size = MPI.Comm_size(comm)
    comm_size_sq = trunc(Int64, sqrt(comm_size))
    # ranks with the same quotient rank / comm_size_sq share comm_col
    comm_col = MPI.Comm_split(comm, trunc(Int64, rank / comm_size_sq), rank)
    rank_col = MPI.Comm_rank(comm_col)
    # ranks with the same remainder mod(rank, comm_size_sq) share comm_row
    comm_row = MPI.Comm_split(comm, mod(rank, comm_size_sq), rank)
    rank_row = MPI.Comm_rank(comm_row)
    MPI.Barrier(comm)

    ## reduce a 16-by-16 matrix K in all row communicators and then column communicators
    k = 16
    K = rand(k, k)
    walltime_allreduce = 0.0
    for t in 1:repeats1
        MPI.Barrier(comm)
        # @elapsed measures wall-clock time on this rank
        walltime_allreduce += @elapsed begin
            MPI.Allreduce!(K, +, comm_row)
            MPI.Allreduce!(K, +, comm_col)
        end
        MPI.Barrier(comm)
    end
    # mean over all repeats, including the first call
    walltime_allreduce /= repeats1

    if rank == 0
        @printf("#processes: %i k: %i \n", comm_size, k)
        @printf("walltime MPI.Allreduce!: %.2e \n", walltime_allreduce)
    end
    GC.gc()
    MPI.Finalize()
end
```
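The grid construction in the script above can be checked without MPI. The following is a minimal plain-Julia sketch that reproduces the two color computations passed to `MPI.Comm_split` and lists which world ranks end up in the same communicator. The helper `groups_by_color` and the variable names `consecutive` and `strided` are hypothetical and only serve this illustration; `div` is used in place of `trunc(Int64, rank / comm_size_sq)`, which gives the same value for non-negative ranks.

```julia
# Plain-Julia sketch (no MPI needed): reproduce the color arithmetic that the
# benchmark passes to MPI.Comm_split and list the ranks sharing each color.

# Hypothetical helper: map each color to the world ranks that share it.
function groups_by_color(color, p)
    groups = Dict{Int, Vector{Int}}()
    for rank in 0:p-1
        push!(get!(groups, color(rank), Int[]), rank)
    end
    return groups
end

p = 16            # number of processes (a perfect square in this example)
q = isqrt(p)      # side length of the q-by-q process grid

# Same colors as in test_collectives.jl, written with integer arithmetic.
consecutive = groups_by_color(r -> div(r, q), p)   # color used for comm_col
strided     = groups_by_color(r -> mod(r, q), p)   # color used for comm_row

println(consecutive[0])   # ranks 0, 1, 2, 3
println(strided[0])       # ranks 0, 4, 8, 12
```

If the launcher fills nodes with consecutive ranks (an assumption about the placement policy that I have not verified), then at 1024 processes each group of 32 consecutive ranks stays inside one 128-core node, while each strided group has ranks on all 8 nodes. Setting `p = 1024` in the sketch shows the corresponding rank lists.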