Hello,
While I was experimenting with the Distributed library I ran into an issue. I am trying to use multiple nodes on a cluster via Slurm, but so far I cannot get it working.
My script.jl is quite simple:
using Distributed
# gethostname lives in the Sockets stdlib, so load it on every process
@everywhere using Sockets
# Define id() on the master and on every worker
@everywhere id() = (myid(), gethostname())
# Run id() locally and on each worker
ids = [id(), [@fetchfrom i id() for i in workers()]...]
# Print one (pid, hostname) tuple per line
println.(ids)
#EOF
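For reference, the script can be sanity-checked on a single machine by starting Julia with local worker processes; the hostnames below are placeholders:
$ julia -p 2 script.jl
(1, "somehost")
(2, "somehost")
(3, "somehost")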
I'm submitting the job with the following submit.sh script:
#!/bin/bash
#SBATCH --job-name=julia-demo
#SBATCH --time=00:01:00
#SBATCH --nodes=2
#SBATCH --partition=testing
#SBATCH --output=log.out
#SBATCH --error=log.err
module load gcc/10.2.0 julia
export NODEFILE=`generate_pbs_nodefile`
srun julia --machine-file $NODEFILE ./script.jl
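As far as I can tell from the Julia docs, each line of a machine file has the form [count*][user@]host[:port] [bind_addr[:port]], and I expect the PBS-style nodefile that generate_pbs_nodefile emits, something like the following (placeholder hostnames), to match the simplest form:
node001
node002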
This didn't work as expected, so I thought that I should probably run the script from my login node instead and let Julia spawn the workers on the allocated compute nodes of the cluster.
So I tried this approach instead:
$ salloc --nodes=2 --partition=testing julia --machine-file machinefile script.jl
where machinefile is a file containing the hostnames of the allocated nodes.
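In case it matters, such a file can be generated from inside the allocation, for example:
$ scontrol show hostnames "$SLURM_JOB_NODELIST" > machinefile
(scontrol show hostnames expands the compact Slurm node list, e.g. node[001-002], into one hostname per line.)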
This approach yielded the following error:
Host key verification failed.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (gssapi-keyex,gssapi-with-mic,password).
ERROR: TaskFailedException
nested task error: Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
...
Could you please give me a simple example of how to get started with multi-node scaling, without using ClusterManagers?
Thank you for your time and help!