I am unable to run a simple Distributed.jl script on my Slurm cluster

I am trying to run a simple file that is pasted below:

using Distributed

addprocs(4)

println("Number of processes: ", nprocs())
println("Number of workers: ", nworkers())

@sync @distributed for i in 1:4
    sleep(1)
    id, pid, host = myid(), getpid(), gethostname()
    println(id, " " , pid, " ", host)
end

The slurm script I am using to submit the job is pasted below:

#!/bin/bash
#
#SBATCH --nodes=4
#SBATCH --partition=astro_devel
#SBATCH --ntasks-per-node=1
#SBATCH --time=0-01:00:00
#SBATCH --cpus-per-task=10
module load astro
module load intel
module load mpi/mpich-x86_64
srun /groups/astro/shashank/Julia/julia-1.10.0/bin/julia --project test.jl > test.txt

I would have expected the job to be distributed over the 4 nodes, but instead 4 worker processes are created on each node. The output file is pasted below:

Number of processes: 5
Number of processes: 5
Number of workers: 4
Number of workers: 4
Number of processes: 5
Number of workers: 4
Number of processes: 5
Number of workers: 4
      From worker 5:    5 16450 node764.cluster
      From worker 3:    3 16448 node764.cluster
      From worker 4:    4 16449 node764.cluster
      From worker 2:    2 16446 node764.cluster
      From worker 3:    3 15963 node765.cluster
      From worker 4:    4 15964 node765.cluster
      From worker 5:    5 15965 node765.cluster
      From worker 2:    2 15961 node765.cluster
      From worker 4:    4 23528 node762.cluster
      From worker 5:    5 23529 node762.cluster
      From worker 3:    3 23527 node762.cluster
      From worker 2:    2 23525 node762.cluster
      From worker 3:    3 19785 node763.cluster
      From worker 5:    5 19787 node763.cluster
      From worker 4:    4 19786 node763.cluster
      From worker 2:    2 19783 node763.cluster

Can someone help me fix the problem? I would have expected each iteration of the for loop to run on a separate node, rather than the whole loop running independently on every node. What am I doing wrong here?

You’re telling Slurm to run that Julia script on 4 nodes. And that Julia script starts 4 workers. So you end up with 16 workers.
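
You can see this directly by printing the Slurm task rank: with --nodes=4 and --ntasks-per-node=1, srun launches four independent copies of the script, each with its own rank. A quick check along these lines (just a diagnostic sketch, not part of the fix) would print four different ranks and hosts:

using Sockets   # for gethostname

# Each copy launched by srun sees its own task rank in SLURM_PROCID;
# with --nodes=4 --ntasks-per-node=1 you should see ranks 0..3 on four hosts.
println("Slurm task rank: ", get(ENV, "SLURM_PROCID", "unset"), " on ", gethostname())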

You need to use a ClusterManager in order to get Slurm and Distributed working together.

Take a look at my reply here as well as the entire thread.

Thanks a lot, both of you. I tried it as follows:

using Distributed
using ClusterManagers
addprocs(SlurmManager(4))

@sync @distributed for i in 1:4
    sleep(1)
    id, pid, host = myid(), getpid(), gethostname()
    println(id, " " , pid, " ", host)
end

I get the following error:

ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: host is unreachable (EHOSTUNREACH)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /lustre/hpc/astro/shashank/Julia/julia-1.10.0/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1093
     [2] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /lustre/hpc/astro/shashank/Julia/julia-1.10.0/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1090 [inlined]
     [3] remote_do
       @ /lustre/hpc/astro/shashank/Julia/julia-1.10.0/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:557 [inlined]
     [4] kill
       @ /lustre/hpc/astro/shashank/Julia/julia-1.10.0/share/julia/stdlib/v1.10/Distributed/src/managers.jl:726 [inlined]
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /lustre/hpc/astro/shashank/Julia/julia-1.10.0/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:604
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /lustre/hpc/astro/shashank/Julia/julia-1.10.0/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:545
     [7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed /lustre/hpc/astro/shashank/Julia/julia-1.10.0/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:501

Is this a problem with the configuration of the cluster I am using?

That certainly sounds like a config problem. Try just addprocs(SlurmManager(4)), and also take a look at the arguments for SlurmManager() to make sure you are requesting nodes/CPUs.
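
If I remember the ClusterManagers.jl interface correctly, keyword arguments to addprocs are forwarded to srun as command-line flags (with underscores turned into dashes), so you can spell out the node/CPU layout there. A rough sketch, with the values only guessed from your sbatch script above:

using Distributed, ClusterManagers

# Extra keyword arguments become srun flags, e.g. ntasks_per_node="1" -> --ntasks-per-node=1
addprocs(SlurmManager(4); partition="astro_devel", nodes="4", ntasks_per_node="1")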

I’ve had a similar issue with SlurmManager but have not had enough time to explore/diagnose it for bug reporting.

However, I’ve found that manually adding worker processes on the remote nodes of the cluster has worked well:

using Distributed

# Extract the hostname info from the Slurm job environment
node_range = ENV["SLURM_JOB_NODELIST"]                                     # e.g. "kn[15-21]"
tasks_per_node = parse(Int64, split(ENV["SLURM_TASKS_PER_NODE"], '(')[1])  # leading count, e.g. "4(x7)" -> 4
node_nums = parse.(Int64, filter.(isdigit, split(node_range, "-")))        # "kn[15-21]" -> [15, 21]
nodes = [("kn$num", tasks_per_node) for num in range(node_nums...)]        # [("kn15", tasks_per_node), ..., ("kn21", tasks_per_node)]

# Launch tasks_per_node worker processes on each allocated node (over ssh)
addprocs(nodes)

You may need to check the format of the SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE variables on your cluster to make sure everything parses correctly. In my case, SLURM_JOB_NODELIST provides a name like kn[15-21] to show that I have been allocated nodes kn15 through kn21.
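
For your cluster the worker hosts are named like node764.cluster, so if SLURM_JOB_NODELIST comes out as something like node[762-765] (please double-check; I am only guessing at the format), the parsing could be adapted roughly like this:

using Distributed

# Assumes a nodelist of the form "node[762-765]"; adjust to whatever
# `echo $SLURM_JOB_NODELIST` actually prints inside one of your jobs.
node_range = ENV["SLURM_JOB_NODELIST"]
first_node, last_node = parse.(Int, split(strip(node_range, ['n', 'o', 'd', 'e', '[', ']']), "-"))
workers_per_node = parse(Int, split(ENV["SLURM_TASKS_PER_NODE"], '(')[1])
nodes = [("node$num", workers_per_node) for num in first_node:last_node]

addprocs(nodes)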

I should also mention that I don’t use srun but just submit the script with sbatch.

Thanks a lot but it seems there is a problem with my cluster. I get the following error:

Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
ERROR: LoadError: TaskFailedException

    nested task error: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:

It seems I will need to ask the cluster admin about this.

Ah yeah, you will need to have ssh access from the login node to the remote nodes, but that needs to be set up on your cluster.

I am not sure it is possible on my cluster. Is there an alternative?

Try this:

Note that the script is launched from the head/dev node through sbatch:

sbatch -N 2 --ntasks-per-node=64 script.jl

And then it acquires the resources when launched by Slurm:

using Distributed, SlurmClusterManager
addprocs(SlurmManager())

Of course this assumes that julia is in your $PATH so you can launch it.

So, to be a bit more explicit: delete the srun from your sbatch script above, and replace addprocs(4) with addprocs(SlurmManager()).
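
Putting it together, test.jl would essentially be your original script with those two changes applied (a sketch assuming SlurmClusterManager.jl is installed in the active project; SlurmManager() reads the allocation from the Slurm environment and starts the workers with srun itself, so no ssh between nodes is needed):

# test.jl, launched exactly once by the sbatch script (no srun in front of julia)
using Distributed, SlurmClusterManager

addprocs(SlurmManager())   # one worker per Slurm task in the allocation

@sync @distributed for i in 1:nworkers()
    sleep(1)
    id, pid, host = myid(), getpid(), gethostname()
    println(id, " ", pid, " ", host)
end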

Best of luck…

Also check if SlurmAddAllocatedProcs.jl (https://github.com/jishnub/SlurmAddAllocatedProcs.jl), a Julia package to easily add workers while using Slurm in batch mode, helps. I had written this package when I was struggling with Slurm allocations.

Thanks a lot. The method you suggested worked with some additional changes. I had to add the following to ~/.ssh/config

Host node*
    StrictHostKeyChecking no
    PubkeyAuthentication yes
    ChallengeResponseAuthentication no
    IdentityFile /groups/astro/shashank/.ssh/id_rsa

In addition, for some reason it works on my cluster only if tunnel is set to true:

addprocs(nodes,tunnel=true)

I also had to add the public key of the frontend to ~/.ssh/authorized_keys. This is not necessary when the connection is established by MPI.jl, but it is required by Distributed.jl for some reason.