While I was experimenting with the Distributed library, I ran into an issue. I am trying to use multiple nodes on a cluster managed by Slurm, but so far I cannot get it working.
script.jl is quite simple:
    using Distributed

    # Define what id() is
    @everywhere id() = (myid(), gethostname())

    # Run id() on all nodes
    ids = [id(), [@fetchfrom i id() for i in workers()]...]

    # Print
    println.(ids)

    #EOF
I’m submitting my job with the following batch script:
    #!/bin/bash
    #SBATCH --job-name=julia-demo
    #SBATCH --time=00:01:00
    #SBATCH --nodes=2
    #SBATCH --partition=testing
    #SBATCH --output=log.out
    #SBATCH --error=log.err

    module load gcc/10.2.0 julia
    export NODEFILE=`generate_pbs_nodefile`
    srun julia --machine-file $NODEFILE ./script.jl
This didn’t work as expected, so I thought I probably need to run the script from my entry node and let Julia spawn the workers on the allocated compute nodes of my cluster.
So I tried this approach instead:
$ salloc --nodes=2 --partition=testing julia --machine-file machinefile script.jl
Here, machinefile is a file containing the hostnames of the allocated nodes, one per line.
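For reference, a file like this could be generated along the following lines. This is only a sketch, not necessarily how the file above was produced: the helper name `make_machinefile` and the workers-per-node argument are hypothetical, and it relies on Julia's documented machine-file line format `[count*][user@]host[:port]`.

```shell
#!/bin/bash
# Hypothetical helper (name and argument are illustrative assumptions):
# reads hostnames from stdin, one per line, and prints Julia machine-file
# lines of the form "count*host", where count is the number of workers
# Julia should start on that host.
make_machinefile() {
    local workers_per_node="${1:-1}"
    local host
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip empty lines
        [ -n "$host" ] || continue
        printf '%s*%s\n' "$workers_per_node" "$host"
    done
}

# Assumed usage inside a Slurm allocation (not executed here):
#   scontrol show hostnames "$SLURM_JOB_NODELIST" | make_machinefile 2 > machinefile
```

Note that with `--machine-file` Julia starts the remote workers over SSH, so the hosts listed in the file must be reachable from the launching node without a password prompt.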
This approach yielded the following error:
    Host key verification failed.
    Permission denied, please try again.
    Permission denied, please try again.
    Permission denied (gssapi-keyex,gssapi-with-mic,password).
    ERROR: TaskFailedException
    nested task error: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
    ...
Please give me a simple example to get started with multinode scaling, without the use of
Thank you for your time and help!