I want to run Julia MPI jobs on a cluster, but I'm not sure whether I'm doing it correctly.
There is a large cluster with hundreds of nodes, each with 128 cores, managed by SLURM. I have implemented some scripts using MPI.jl and want to know the correct way to run jobs across multiple nodes.
For example, I have a simple script test_collectives.jl that benchmarks the MPI.Allreduce! collective on the cluster. To run the job, I wrote and submitted the following sbatch script run_test_collectives.sub. OpenMPI_jll is used as the backend because I failed to configure the system-provided MPI library. I ran the job with 4 processes (1 node), 16 processes (1 node), 64 processes (1 node), 256 processes (2 nodes), 1024 processes (8 nodes), and 4096 processes (32 nodes), respectively.
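For reference, the usual way to point MPI.jl at the cluster's own MPI library is `MPIPreferences.use_system_binary()`. A minimal sketch of what I tried (the module name `openmpi` is an assumption here; the actual name depends on what `module avail` shows on your cluster):

```shell
# Load the cluster's MPI module first (module name is an assumption)
module load openmpi
# Tell MPI.jl to use the system libmpi instead of the JLL-provided one
julia-1.7 -e 'using MPIPreferences; MPIPreferences.use_system_binary()'
```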
```tcsh
#!/bin/tcsh
#SBATCH -p standard
#SBATCH -t 01:00:00
#SBATCH --mem=500000
#SBATCH -n 4096
#SBATCH -c 1
#SBATCH -N 32
#SBATCH -o results_test_collectives_p5m_np4096_openmpi.log

module load julia
julia-1.7 -e 'using MPIPreferences; MPIPreferences.use_jll_binary("OpenMPI_jll");'
mpiexec -n 4096 julia-1.7 test_collectives.jl
```
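One thing I am unsure about: with SLURM, the launcher has to match the allocation, and a bare `mpiexec` in the batch script may not be the one belonging to OpenMPI_jll. A common alternative is to let SLURM launch the ranks directly with `srun`; whether this works with OpenMPI_jll depends on its PMI support, so treat this as a sketch to verify rather than a known-good setup:

```shell
# Let SLURM's own launcher start one Julia process per allocated task
srun -n 4096 julia-1.7 test_collectives.jl
```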
However, the following plot shows that the execution time of MPI.Allreduce! does not scale according to the theoretical O(log p) complexity. There is a big jump from 1.75e-02 s at 256 processes to 1.91e-01 s at 1024 processes. Am I submitting and running the jobs correctly? Any input?
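As a sanity check on the numbers: under a pure latency model T(p) ∝ log2(p), going from 256 to 1024 processes should cost only about 25% more time, while the measured jump is roughly 11x, so something other than the collective's algorithmic complexity (for example, crossing from mostly intra-node traffic at 2 nodes to inter-node network traffic at 8 nodes) is likely dominating. A quick back-of-the-envelope check:

```julia
# Expected slowdown under T(p) ∝ log2(p) when growing from 256 to 1024 ranks
ratio_model = log2(1024) / log2(256)   # 10 / 8 = 1.25
# Observed slowdown from the measurements quoted above
ratio_observed = 1.91e-1 / 1.75e-2     # ≈ 10.9
```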
```julia
# filename: test_collectives.jl
using MPI, LinearAlgebra, Printf

repeats1 = 5

MPI.Init()

## construct a 2D process grid
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
comm_size = MPI.Comm_size(comm)
comm_size_sq = trunc(Int64, sqrt(comm_size))
comm_col = MPI.Comm_split(comm, trunc(Int64, rank / comm_size_sq), rank)
rank_col = MPI.Comm_rank(comm_col)
comm_row = MPI.Comm_split(comm, mod(rank, comm_size_sq), rank)
rank_row = MPI.Comm_rank(comm_row)
MPI.Barrier(comm)

## reduce a 16-by-16 matrix K in all row communicators, then all column communicators
k = 16
K = rand(k, k)
cputime_allreduce = 0.0
for t in 1:repeats1
    MPI.Barrier(comm)
    global cputime_allreduce += @elapsed begin
        MPI.Allreduce!(K, +, comm_row)
        MPI.Allreduce!(K, +, comm_col)
    end
    MPI.Barrier(comm)
end
cputime_allreduce /= repeats1

if rank == 0
    @printf("#processes: %i k: %i \n", comm_size, k)
    @printf("walltime MPI.Allreduce!: %.2e \n", cputime_allreduce)
end

GC.gc()
MPI.Finalize()
```
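One benchmarking caveat with the script above: the first call to MPI.Allreduce! pays for JIT compilation and, with many MPI implementations, lazy connection establishment between ranks, and that one-time cost lands inside the timed loop. A common pattern is an untimed warm-up call before measuring. A sketch of the change, using the same `K`, `comm_row`, `comm_col`, and `comm` as in the script:

```julia
# Untimed warm-up: triggers JIT compilation of MPI.Allreduce! and any
# lazy connection setup before the measured repetitions begin
MPI.Allreduce!(K, +, comm_row)
MPI.Allreduce!(K, +, comm_col)
MPI.Barrier(comm)

# ... then run the timed loop over `repeats1` exactly as before
```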