How to run MPI jobs on a cluster

I want to run Julia MPI jobs on a cluster, but I am not sure whether I am doing it correctly.

I have access to a large cluster with hundreds of nodes, where each node has 128 cores; the resource management system is SLURM. I have implemented some scripts using MPI.jl and want to know the correct way to run the jobs across multiple nodes.
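
For reference, the basic MPI.jl pattern I am following is the standard one; a minimal multi-node sanity check (the file name hello_mpi.jl is just an illustrative choice of mine) would look like:

# hello_mpi.jl -- minimal check that the ranks really span multiple nodes
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
## every rank reports its node, so the placement is visible in the job log
println("rank $rank of $nprocs on $(gethostname())")
MPI.Finalize()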

For example, I have a simple script, test_collectives.jl, which tests the MPI.Allreduce! collective on the cluster. To run the job, I wrote and submitted the following sbatch script, run_test_collectives.sub, where OpenMPI_jll is used because I failed to configure the system-provided MPI library as the backend. I ran the job with 4 processes (1 node), 16 processes (1 node), 64 processes (1 node), 256 processes (2 nodes), 1024 processes (8 nodes), and 4096 processes (32 nodes).

#!/bin/tcsh
#SBATCH -p standard
#SBATCH -t 01:00:00
#SBATCH --mem=500000
#SBATCH -n 4096
#SBATCH -c 1
#SBATCH -N 32
#SBATCH -o results_test_collectives_p5m_np4096_openmpi.log

module load julia

julia-1.7 -e 'using MPIPreferences; MPIPreferences.use_jll_binary("OpenMPI_jll");'

mpiexec -n 4096 julia-1.7 test_collectives.jl
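
One thing I am unsure about here: since MPI.jl is configured to use the OpenMPI_jll binary, the mpiexec found first on my PATH may belong to a different MPI installation, and a mismatched launcher can start many independent 1-rank runs instead of one big job. The MPI.jl documentation suggests installing its mpiexecjl wrapper for exactly this situation; my understanding of that setup (the destdir is my own choice, any directory on PATH should work) is:

## one-time setup, run once on a login node
using MPI
MPI.install_mpiexecjl(destdir = joinpath(homedir(), "bin"))

and then the last line of the sbatch script would become ~/bin/mpiexecjl -n 4096 julia-1.7 test_collectives.jl instead of the bare mpiexec.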

However, a plot of the timings shows that the execution time of MPI.Allreduce! does not scale like the theoretical O(log p) complexity. There is a big jump from 1.75e-02 s at 256 processes to 1.91e-01 s at 1024 processes. Am I submitting and running the jobs in the correct way? Any input?
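
To put numbers on the discrepancy, here is the quick check implied by the two measured times quoted above:

## compare the measured 256 -> 1024 jump with an O(log2 p) cost model
t256  = 1.75e-2                       # measured walltime at 256 processes (s)
t1024 = 1.91e-1                       # measured walltime at 1024 processes (s)
expected = log2(1024) / log2(256)     # = 1.25 if cost were c * log2(p)
observed = t1024 / t256               # ≈ 10.9
println("expected ratio: ", expected, "  observed ratio: ", round(observed, digits = 1))

so the observed jump is roughly an order of magnitude larger than the O(log p) model predicts.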

#filename test_collectives.jl
using MPI, LinearAlgebra, Printf

for code in 1:1   ## single-iteration loop; its body runs in a local scope

    repeats1 = 5   ## number of repetitions used to average the timing

    MPI.Init()
    
    ## construct a 2D process grid
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    comm_size = MPI.Comm_size(comm)
    comm_size_sq = trunc(Int64, sqrt(comm_size))

    comm_col = MPI.Comm_split(comm, trunc(Int64, rank/comm_size_sq), rank)
    rank_col = MPI.Comm_rank(comm_col)
    comm_row = MPI.Comm_split(comm, mod(rank, comm_size_sq), rank)
    rank_row = MPI.Comm_rank(comm_row)
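    ## together these form a comm_size_sq-by-comm_size_sq process grid:
    ## ranks with the same rank ÷ comm_size_sq share a comm_col,
    ## ranks with the same rank mod comm_size_sq share a comm_row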

    MPI.Barrier(comm)
    
    ## reduce a 16-by-16 matrix K in all row communicators and then column communicators
    k = 16

    K = rand(k, k)
    cputime_allreduce = 0.0
    for t in 1:repeats1
        MPI.Barrier(comm)
        cputime_allreduce += @elapsed begin
            MPI.Allreduce!(K, +, comm_row)
            MPI.Allreduce!(K, +, comm_col)
        end
        MPI.Barrier(comm)
    end
    cputime_allreduce /= repeats1
    if rank == 0
        @printf("#processes: %i N: %i k: %i \n", comm_size, n_samples, k)
        @printf("walltime MPI.Allreduce!: %.2e \n", cputime_allreduce)
    end


    GC.gc()
    MPI.Finalize()
end

Are you able to submit a working example? Running the current code gives

ERROR: LoadError: UndefVarError: n_samples not defined
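
The error comes from the final @printf on rank 0: n_samples is referenced but never assigned anywhere in the script. One minimal fix is to drop the undefined field and print only the values that exist:

    if rank == 0
        ## n_samples was never assigned; report only the defined values
        @printf("#processes: %i k: %i \n", comm_size, k)
        @printf("walltime MPI.Allreduce!: %.2e \n", cputime_allreduce)
    end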