Multinode scaling

Hello,

While experimenting with the Distributed standard library I ran into an issue. I am trying to utilize multiple nodes on a cluster using Slurm, but so far I cannot get it working.

My script.jl is quite simple:

using Distributed

# Define id() on every process
@everywhere id() = (myid(), gethostname())

# Run id() on the master and all workers
ids = [id(), [@fetchfrom i id() for i in workers()]...]

# Print
println.(ids)

#EOF
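
With two allocated nodes I would expect output along these lines (the hostnames here are made up):

(1, "node01")
(2, "node01")
(3, "node02")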

And I’m submitting my job with the following submit.sh script:

#!/bin/bash
#SBATCH --job-name=julia-demo
#SBATCH --time=00:01:00
#SBATCH --nodes=2
#SBATCH --partition=testing
#SBATCH --output=log.out
#SBATCH --error=log.err

module load gcc/10.2.0 julia
export NODEFILE=$(generate_pbs_nodefile)
srun julia --machine-file "$NODEFILE" ./script.jl

This didn’t work as expected, so I thought I probably need to run the script from my entry node and let Julia spawn the workers on the allocated compute nodes of my cluster.

So I tried this approach instead:

$ salloc --nodes=2 --partition=testing julia --machine-file machinefile script.jl

where machinefile is a file containing the hostnames of the allocated nodes.
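
For reference, each line of a Julia machine-file has the form [count*][user@]host[:port] [bind_addr[:port]]; in the simplest case it is just one hostname per line, e.g. (placeholder names):

node01
node02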

This approach yielded the following error:

Host key verification failed.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (gssapi-keyex,gssapi-with-mic,password).
ERROR: TaskFailedException

    nested task error: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
    ...

Could you please give me a simple example to get started with multinode scaling, without using ClusterManagers?

Thank you for your time and help!

Try something like this:

jobscript (modify as necessary):

#!/bin/bash
#SBATCH --job-name=julia-demo
#SBATCH --time=00:01:00
#SBATCH --nodes=2
#SBATCH --output=log.out
#SBATCH --error=log.err

cd $SCRATCH/temp
julia=$SCRATCH/julia/julia-1.7.0-rc2/bin/julia
$julia script.jl

script.jl:

using Distributed, ClusterManagers

# The jobscript sets the environment variable SLURM_NNODES;
# we read it here to launch one worker per allocated node.
# In general, we need to know which variable the jobscript sets.
N = parse(Int, ENV["SLURM_NNODES"])
addprocs_slurm(N, N = N)  # the keyword N is passed through to srun (its -N/--nodes option)
@show workers()

# Define id() on every process
@everywhere id() = (myid(), gethostname())

# Run id() on the master and all workers
ids = [id(), [@fetchfrom i id() for i in workers()]...]

# Print
println.(ids)

rmprocs(workers())  # release the workers before exiting

output:

temp/ $ cat log.err
temp/ $ cat log.out
connecting to worker 1 out of 2
connecting to worker 2 out of 2
workers() = [2, 3]
(1, "compute-19-12.local")
(2, "compute-19-12.local")
(3, "compute-24-18.local")

Thank you for your response, but as I mentioned in my original question, I don’t want to use ClusterManagers, or any other library for that matter. Since this is a learning experience for me, I would like to do it as bare-bones as possible.
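
For reference, what I have in mind is something like the following minimal sketch using only the Distributed standard library. It is untested; it assumes passwordless key-based SSH between the allocated nodes (which the "Host key verification failed" error above suggests I still need to set up) and that scontrol is available where the script runs:

using Distributed

# Expand Slurm's compact nodelist (e.g. "node[01-02]") into one hostname per entry
hosts = readlines(`scontrol show hostnames $(ENV["SLURM_JOB_NODELIST"])`)

# Start one worker per allocated node via the built-in SSH launcher
# (the same mechanism --machine-file uses under the hood)
addprocs(hosts)

# Define id() on every process
@everywhere id() = (myid(), gethostname())

# Run id() on the master and all workers
ids = [id(), [@fetchfrom i id() for i in workers()]...]

println.(ids)

rmprocs(workers())

Since addprocs with a list of hostnames uses the SSH launcher built into Distributed, no extra package is involved; the machine-file approach should behave the same once SSH between the nodes is set up.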