I would have expected the job to be distributed over 4 nodes, but instead 4 worker processes are created on each node. The output file is pasted below:
Number of processes: 5
Number of processes: 5
Number of workers: 4
Number of workers: 4
Number of processes: 5
Number of workers: 4
Number of processes: 5
Number of workers: 4
From worker 5: 5 16450 node764.cluster
From worker 3: 3 16448 node764.cluster
From worker 4: 4 16449 node764.cluster
From worker 2: 2 16446 node764.cluster
From worker 3: 3 15963 node765.cluster
From worker 4: 4 15964 node765.cluster
From worker 5: 5 15965 node765.cluster
From worker 2: 2 15961 node765.cluster
From worker 4: 4 23528 node762.cluster
From worker 5: 5 23529 node762.cluster
From worker 3: 3 23527 node762.cluster
From worker 2: 2 23525 node762.cluster
From worker 3: 3 19785 node763.cluster
From worker 5: 5 19787 node763.cluster
From worker 4: 4 19786 node763.cluster
From worker 2: 2 19783 node763.cluster
Can someone help me fix the problem? I expected each iteration of the for loop to run on a separate node, but instead the whole loop seems to run on every node. What am I doing wrong here?
using Distributed
using ClusterManagers
addprocs(SlurmManager(4))
@sync @distributed for i in 1:4
    sleep(1)
    id, pid, host = myid(), getpid(), gethostname()
    println(id, " ", pid, " ", host)
end
That certainly sounds like a config problem. Try just addprocs(SlurmManager(4)), and also take a look at the arguments for SlurmManager() to make sure you are requesting the nodes/CPUs you expect.
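For example (just a sketch, not a confirmed fix): extra keyword arguments to addprocs with a SlurmManager are forwarded to srun as command-line flags, with underscores turned into dashes, so you can be explicit about the node layout. The partition and time values below are placeholders for whatever your cluster expects.

using Distributed
using ClusterManagers

# Keyword arguments here are forwarded to srun as flags
# (e.g. ntasks_per_node becomes --ntasks-per-node), so this asks
# for 4 tasks spread over 4 nodes. Partition/time are placeholders.
addprocs(SlurmManager(4);
         nodes = "4",
         ntasks_per_node = "1",
         partition = "debug",
         time = "00:10:00")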
I’ve had a similar issue with SlurmManager but have not had enough time to explore/diagnose it for bug reporting.
However, I’ve found that manually adding worker processes on the remote nodes of the cluster has worked well:
using Distributed
# Extract the hostname info from the Slurm job environment.
# On my cluster SLURM_JOB_NODELIST is a range like "kn[15-21]" and
# SLURM_TASKS_PER_NODE is something like "4(x7)".
node_range = ENV["SLURM_JOB_NODELIST"]
tasks_per_node = parse(Int64, split(ENV["SLURM_TASKS_PER_NODE"], '(')[1])

# First and last node numbers from the "kn[15-21]" range ...
node_nums = parse.(Int64, filter.(isdigit, split(node_range, "-")))
# ... then one (hostname, workers-per-node) tuple per node in the allocation.
nodes = [("kn$num", tasks_per_node) for num in range(node_nums...)]
addprocs(nodes)
You may need to check the format of the SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE variables on your cluster to make sure everything parses correctly. In my case, SLURM_JOB_NODELIST provides a name like kn[15-21] to show that I have been allocated nodes kn15 through kn21.
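If the node list on your cluster is formatted differently (comma-separated ranges, multiple prefixes, etc.), a rough alternative sketch is to let Slurm expand the list for you with scontrol instead of parsing the string by hand (assuming scontrol is available where the script runs):

using Distributed

# scontrol expands the compressed node list ("kn[15-21]", "kn[1,3-5]", ...)
# into one hostname per line, so no manual string parsing is needed.
hostnames = readlines(`scontrol show hostnames $(ENV["SLURM_JOB_NODELIST"])`)
tasks_per_node = parse(Int, split(ENV["SLURM_TASKS_PER_NODE"], '(')[1])

# One (hostname, workers-per-node) tuple per allocated node.
nodes = [(h, tasks_per_node) for h in hostnames]
addprocs(nodes)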
I should also mention that I don’t use srun but just submit the script with sbatch.
Thanks a lot. The method you suggested worked with some additional changes. I had to add the following to ~/.ssh/config
Host node*
StrictHostKeyChecking no
PubkeyAuthentication yes
ChallengeResponseAuthentication no
IdentityFile /groups/astro/shashank/.ssh/id_rsa
In addition, for some reason it only works on my cluster if the tunnel option is set to true:
addprocs(nodes, tunnel=true)
I also had to add the public key of the frontend to ~/.ssh/authorized_keys. This is not necessary when the connection is established by MPI.jl, but Distributed.jl launches its workers over SSH, so it needs passwordless SSH access from the frontend to the compute nodes.
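For reference, here is a rough sketch of how these pieces can be combined into a single addprocs call; the sshflags, exename, and dir keywords are optional, and the values shown are just assumptions about this particular setup:

using Distributed

# Sketch only: nodes is the (hostname, count) list built earlier.
addprocs(nodes;
         tunnel   = true,                                    # route worker connections through SSH
         sshflags = `-i /groups/astro/shashank/.ssh/id_rsa`, # the key referenced in ~/.ssh/config
         exename  = joinpath(Sys.BINDIR, "julia"),           # use the same julia binary on the nodes
         dir      = pwd())                                   # start workers in the current directory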