Hello,
While I was experimenting with the Distributed library I ran into an issue. I am trying to use multiple nodes on a cluster via Slurm, but so far I cannot get it working.
My script.jl is quite simple:
using Distributed
# gethostname lives in the Sockets stdlib, so load it on every process
@everywhere using Sockets
# Define id() on the master and on every worker
@everywhere id() = (myid(), gethostname())
# Run id() locally and on each worker
ids = [id(), [@fetchfrom i id() for i in workers()]...]
# Print one (pid, hostname) tuple per line
println.(ids)
#EOF
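For reference, the script can be sanity-checked on a single machine by starting Julia with local worker processes; the hostnames below are placeholders:
$ julia -p 2 script.jl
(1, "somehost")
(2, "somehost")
(3, "somehost")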
I'm submitting the job with the following submit.sh script:
#!/bin/bash
#SBATCH --job-name=julia-demo
#SBATCH --time=00:01:00
#SBATCH --nodes=2
#SBATCH --partition=testing
#SBATCH --output=log.out
#SBATCH --error=log.err
module load gcc/10.2.0 julia
export NODEFILE=`generate_pbs_nodefile`
srun julia --machine-file $NODEFILE ./script.jl
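As far as I can tell from the Julia docs, each line of a machine file has the form [count*][user@]host[:port] [bind_addr[:port]], and I expect the PBS-style nodefile that generate_pbs_nodefile emits, something like the following (placeholder hostnames), to match the simplest form:
node001
node002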
This didn't work as expected, so I thought that I should probably run the script from my login node instead and let Julia spawn the workers on the allocated compute nodes of the cluster.
So I tried this approach instead:
$ salloc --nodes=2 --partition=testing julia --machine-file machinefile script.jl
where machinefile is a file containing the hostnames of the allocated nodes.
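In case it matters, such a file can be generated from inside the allocation, for example:
$ scontrol show hostnames "$SLURM_JOB_NODELIST" > machinefile
(scontrol show hostnames expands the compact Slurm node list, e.g. node[001-002], into one hostname per line.)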
This approach yielded the following error:
Host key verification failed.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (gssapi-keyex,gssapi-with-mic,password).
ERROR: TaskFailedException
nested task error: Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
...
Could you please give me a simple example of how to get started with multi-node scaling, without using ClusterManagers?
Thank you for your time and help!