I am quite new to this. I have a Slurm batch script that requests 2 nodes, 1 task per node, and 16 CPUs per task (each node of my cluster has 16 cores with 2 threads each):
#!/bin/bash
#SBATCH -J m_node
#SBATCH -t 0-04:00:00
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task=16
srun /home/userdir/julia-1.10.4/bin/julia /home/userdir/Work/julia_mnode.jl
I am looking for the correct way to initialise the processes in the Julia script:
using Distributed
addprocs(32)
println("Number of processes: ", nprocs())
println("Number of workers: ", nworkers())
@everywhere function inner(a, ij)
    sleep(5)
    println("Inside inner")
    return a * ij
end

function outer(a, N)
    tt0 = time()
    g(x) = ij -> inner(x, ij)
    arrsum = sum(pmap(g(a), 1:N))
    tt1 = time()
    println("outer time = $(tt1 - tt0)")
    return arrsum
end

function innerserial(a, ij)
    sleep(5)
    println("Inside innerserial")
    return a * ij
end

function outerserial(a, N)
    arrsum = 0
    tt0 = time()
    for ij in 1:N
        arrsum = arrsum + innerserial(a, ij)
    end
    tt1 = time()
    println("outerserial time = $(tt1 - tt0)")
    return arrsum
end

println("outer = ", outer(1, 5))
println("outerserial = ", outerserial(1, 5))
(a) What I require:
If the loop were distributed over the 2 nodes, with each node evaluating one iteration using all of its 16 cores, the run should take about 15 s: first the two nodes evaluate `inner` once each, then both do so again, and finally one node evaluates it once, for a total of 5 + 5 + 5 = 15 s. In this scenario, each evaluation of `inner` uses all 16 cores of its node.
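The scheduling described in (a) can be tried locally with a minimal sketch. It assumes that two local workers stand in for the two nodes, that each worker gets its threads through the `exeflags` keyword of `addprocs`, and that a 0.5 s pause replaces the 5 s of work. The names `inner_sketch`, `threads_per_worker` and `pause` are hypothetical and not part of the original script; starting the workers on different hosts is not shown.

```julia
using Distributed

# Assumption: two local workers stand in for the two nodes. On the cluster
# they would have to be started on different hosts (not shown here).
# 16 threads would match the node size; 2 keeps the sketch light.
threads_per_worker = 2
addprocs(2; exeflags = "--threads=$(threads_per_worker)")

# Hypothetical variant of `inner` that also reports how many threads the
# worker evaluating it has available.
@everywhere function inner_sketch(a, ij; pause = 0.5)
    sleep(pause)  # stand-in for the 5 s of work in `inner`
    return (value = a * ij, threads = Threads.nthreads())
end

N = 5
t0 = time()
results = pmap(ij -> inner_sketch(1, ij), 1:N)
elapsed = time() - t0

# 5 evaluations shared by 2 workers need cld(5, 2) = 3 rounds,
# which is where the 5 + 5 + 5 estimate comes from.
rounds = cld(N, nworkers())

println("sum = ", sum(r.value for r in results))
println("rounds = ", rounds, ", elapsed = ", round(elapsed; digits = 2), " s")
println("threads per worker = ", unique(r.threads for r in results))
```

With one worker per node, the number of rounds is set by the number of workers rather than the number of cores. Whether an evaluation actually uses its threads depends on the code inside it, since `sleep` does not.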
(b) What I am seeing:
Instead, the iterations are distributed over all cores, so the loop finishes in about 5 s. This means each evaluation of `inner` gets only 1 core. Also, everything is evaluated twice, almost as if both nodes are repeating the same thing.
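One reading consistent with this output: `srun` starts one copy of the command per task, and `addprocs(32)` with an integer argument starts its workers on the local machine. Two tasks would then each run the whole script with their own 32 local workers. The sketch below is a way to check this from inside the script. It assumes the standard Slurm environment variables `SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_CPUS_PER_TASK` and `SLURMD_NODENAME` are set for each task. The helpers `slurm_task_info` and `total_processes` are hypothetical names, not part of any library.

```julia
# Hypothetical helper: collect what Slurm tells this copy of the script.
# The defaults let it run outside a Slurm job as well.
function slurm_task_info(env = ENV)
    return (
        procid = parse(Int, get(env, "SLURM_PROCID", "0")),         # rank of this task
        ntasks = parse(Int, get(env, "SLURM_NTASKS", "1")),         # tasks in the job step
        cpus_per_task = parse(Int, get(env, "SLURM_CPUS_PER_TASK", "1")),
        node = get(env, "SLURMD_NODENAME", gethostname()),          # host running this task
    )
end

# Hypothetical helper: every copy of the script is 1 master plus its workers.
total_processes(ntasks, workers_per_copy) = ntasks * (workers_per_copy + 1)

info = slurm_task_info()
println("copy ", info.procid + 1, " of ", info.ntasks, " on ", info.node,
        ", ", info.cpus_per_task, " CPUs for this task")

# Job script above: 2 tasks, each calling addprocs(32).
println("Julia processes in the job: ", total_processes(2, 32))
```

If two different copy numbers show up in the job output, the duplicated lines come from two independent runs of the script rather than from the two nodes sharing one loop.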
Number of processes: 33
Number of workers: 32
Number of processes: 33
Number of workers: 32
From worker 11: Inside inner
From worker 4: Inside inner
From worker 23: Inside inner
From worker 13: Inside inner
From worker 27: Inside inner
From worker 15: Inside inner
From worker 24: Inside inner
outer time = 6.385792970657349
outer = 15
From worker 21: Inside inner
From worker 14: Inside inner
From worker 16: Inside inner
outer time = 6.389128923416138
outer = 15
Inside innerl
Inside innerl
Inside innerl
Inside innerl
Inside innerl
Inside innerl
Inside innerl
Inside innerl
Inside innerl
outerl time = 25.03065299987793
outerl = 15
Inside innerl
outerl time = 25.02852702140808
outerl = 15