Julia on Cluster with SSH Restriction

I’m attempting to use Julia on a cluster (using PBS) that blocks ssh connections to anything except the head node. Based on the documentation, it seems that Julia requires passwordless ssh to start workers on cluster nodes:

The base Julia installation has in-built support for two types of clusters:

  • A local cluster specified with the -p option as shown above.
  • A cluster spanning machines using the --machinefile option. This uses a passwordless ssh login to start Julia worker processes (from the same path as the current host) on the specified machines.
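As I understand it, --machinefile is roughly equivalent to calling addprocs with a list of host specifications, which goes through the ssh-based SSHManager under the hood. A sketch of what that amounts to (the hostnames are placeholders):

using Distributed  # built into Base as Base.Distributed on older Julia versions

# Roughly what --machinefile does: start 4 workers on each listed
# host over passwordless ssh.
addprocs([("node01", 4), ("node02", 4)])

So anything that blocks ssh to the compute nodes breaks this mechanism.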

I’ve tried the solutions presented in this thread, but the following errors occur: (1) ClusterManagers hangs when calling addprocs_pbs(), or (2) I get a permission denied error when the ssh connection is attempted.

Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::Base.Distributed.SSHManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::Base.Distributed.SSHManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::Base.Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{Base.Distributed.SSHManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335

My SysAdmin seems unwilling to allow ssh connections to worker nodes. Is there another option for using Julia on the cluster that bypasses this problem?

1 Like

@Brosetti I have installed and managed PBS clusters. I think what is happening is that the PAM module for PBS is installed. Yes, this stops a user from ssh-ing into the compute nodes. BUT when you are running a job you should be able to ssh into the compute nodes which are allocated to you.
You will find that there is an environment variable called PBS_NODEFILE which points to a file listing the compute nodes you can use.
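For example, from inside a job you can read that file to see which nodes you have been allocated. A rough sketch (assuming a standard PBS/Torque setup where PBS_NODEFILE is set):

# List the compute nodes allocated to the current PBS job.
# PBS_NODEFILE points to a file with one line per allocated CPU slot.
nodefile = get(ENV, "PBS_NODEFILE", nothing)
if nodefile === nothing
    println("PBS_NODEFILE is not set - are you inside a PBS job?")
else
    nodes = unique(readlines(nodefile))
    println("Allocated nodes: ", join(nodes, ", "))
end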

Please give me five minutes and I will confirm the PAM module behaviour.

It could be that you have some other type of restriction though?

There is a very old-style pbs_dsh utility which could be put into a wrapper script and substituted for ssh. But let’s not go there.

2 Likes

What I said is true for Torque
http://docs.adaptivecomputing.com/torque/3-0-5/3.4hostsecurity.php

My Google-fu is exhausted in looking for the PBSPro equivalent.
May I suggest that you start an interactive job using qsub -I? Once you have started a shell on the first compute node, see if you can ssh into the others.
I guess not, as you have already shown us this output…

Another option you may have is the MPI ClusterManager, see e.g. here:
https://github.com/JuliaParallel/MPI.jl/blob/master/test/test_cman_mpi.jl

This is the currently released version; there is also my version that uses one-sided MPI calls here, but it’s not merged into MPI.jl yet.

Either of these options should give you native Julia parallel calls using MPI as the communication layer, bypassing the need for ssh entirely. I have only tested this with some basic DistributedArrays stuff, so I’m not sure how well it holds up.

2 Likes

I would also like to ask: how are parallel jobs run on your cluster? I would guess by using an MPI with ‘munge’ authentication.
https://dun.github.io/munge/
Which leads me to open the discussion with the Julia developers: maybe a plugin-style mechanism is needed for Julia’s parallel machinery, rather than assuming that only ssh will be used.

It’s worth saying that the original PBS was invented well before ssh was around, and it used the r-protocols: rsh, rcp, rlogin. They have the wonderfulness of the original restriction to six-character user names.
In addition to relying on dotfiles for security!

Hi @barche But surely MPI needs to be launched somehow… and there are a variety of ways in which that is done. One of them being ssh!
I may well be barking up the wrong branch here (Git pun intended).

Yep, with any luck by simply doing mpirun julia mysimulation.jl

I tested this with Slurm; there mpirun gets the nodes to run on and so forth directly from the resource manager.

@barche Slurm should be using munge for the mpirun startup phase.
And yes, Slurm is pretty cute - it has good integration.
I will confess that the first time I met a Slurm cluster I did the usual job submission script / find the list of nodes / create a custom hostfile dance, until someone pointed out that all you have to do is use srun… doooh

Thanks for the quick replies, fellas. Let’s see if I can respond to your comments/questions:

Yes. I currently use this file in my PBS script. Here is what produced the errors that I reported above:

myjob.pbs

#!/bin/bash
#PBS -l nodes=12:ppn=4,walltime=00:05:00
#PBS -N test
#PBS -q batch

julia --machinefile $PBS_NODEFILE test.jl

test.jl

println("Hello from Julia")
addprocs(48)
np = nprocs()
println("Number of processes: $np")


@sync @parallel for i=1:48
  host = gethostname()
  pid = getpid()
  sleep(2)
  println("I am $host - $pid doing loop $i")
end

PAM is an interesting feature, but I don’t think this is what my SysAdmin is using. I believe he has globally blocked all users but root from ssh-ing to anything but the head node at all times. I’ll suggest that he look into PAM as an alternative.


The process hangs when I try interactive mode using qsub -I. It never gets past qsub: waiting for job to start. I suspect this has something to do with the global ssh block.


Thanks for this suggestion! I’m getting some build errors related to MPI_C. Once I sort them out, I’ll report back with my results.


This is my first time using the cluster, so I’ll have to look into the details.

1 Like

I suggest the following. HPC Admins portray themselves as Ogres. Heavy metal T-shirt? Combat boots? Black jeans?
Bring a packet of cookies. Even better, some local craft beer.

Actually, ssh restrictions like this are put there to stop users from doing stupid things. The admin will want you to use his/her system (*). Some explanation of how Julia works and how great it is may wake the Ogre.

(*) Women can wear combat boots and T-shirts. I refer you to Lady Fiona in Shrek.

1 Like

OK, feel free to post the error here or in an MPI.jl issue if you get stuck on it. Looking at your script, it should be sufficient to replace the current julia command with mpirun julia test.jl (assuming mpirun supports PBS, which it should), and enclose your test script in:

using MPI

mgr = MPI.start_main_loop(MPI.MPI_TRANSPORT_ALL)

# code

MPI.stop_main_loop(mgr)
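Concretely, applied to your test.jl it would look roughly like this (an untested sketch; the MPI ranks started by mpirun become the Julia workers, so the --machinefile flag and the addprocs(48) call are no longer needed):

using MPI

# The mpirun-launched ranks become workers; rank 0 runs this script.
mgr = MPI.start_main_loop(MPI.MPI_TRANSPORT_ALL)

println("Hello from Julia")
np = nprocs()
println("Number of processes: $np")

@sync @parallel for i = 1:48
  host = gethostname()
  pid = getpid()
  sleep(2)
  println("I am $host - $pid doing loop $i")
end

MPI.stop_main_loop(mgr)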
1 Like

Try doing:

export CC=mpicc
export FC=mpif90
export CXX=mpicxx

in the shell before building MPI.jl. How well CMake can find an MPI installation seems to vary widely with the CMake version, and setting these variables helps it.
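After setting these, rebuild the package from the same shell so the new compiler settings are picked up, e.g. something like (assuming Julia 1.x’s Pkg; on older versions Pkg.build is available without the import):

# Re-run MPI.jl's build step so it uses the MPI compiler wrappers.
using Pkg
Pkg.build("MPI")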

I know I’m a little late to this thread, but I had the same problem. I hope this helps others out. My cluster uses PBS, but no ssh communication is allowed between nodes. My Julia code uses DistributedArrays, @sync and @async blocks, and I didn’t want to modify it too much. So I did the following, as suggested by @barche.

using MyModules, MPI

# serial part of code
# here to set options for parallel run

# parallel start
mgr = MPI.start_main_loop(MPI.MPI_TRANSPORT_ALL)

addprocs(nworkers)
@info "workers are $(workers())"
@everywhere any(pwd() .== LOAD_PATH) || push!(LOAD_PATH, pwd())
@everywhere using Distributed, MyModules

# parallel code here using MyModules.foo(options, data)

rmprocs(workers())
MPI.stop_main_loop(mgr)

Crucially, I had to run the code as
mpirun -np 1 julia MyCode.jl

Here’s the whole PBS script requesting 32 workers:

#PBS -P blah
#PBS -q myque
#PBS -l ncpus=32
#PBS -l mem=256GB
#PBS -l walltime=00:15:00
#PBS -l wd
#PBS -N testJulia
#PBS -o grid.out
#PBS -e grid.err
#PBS -j oe

ulimit -s unlimited
ulimit -c unlimited
module load gcc/5.2.0 openmpi/3.0.1 julia/1.1.1
mpirun  -np 1 julia ./MyCode.jl > outfile.run

5 Likes

Kindly assist. I am new to parallel computing in Julia.
I have a similar problem to the one stated here.
I am able to ssh into the nodes assigned to me from outside the job.
However, in the submitted job I get a timeout error with addprocs().

zsh: command not found: YLY3y21Al0T71U3B
zsh: command not found: YLY3y21Al0T71U3B
ERROR: LoadError: TaskFailedException:
Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
 [1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1074
 [2] worker_from_id at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1071 [inlined]
 [3] #remote_do#154 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]
 [4] remote_do at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]
 [5] kill(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:603
 [6] create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:585
 [7] setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:526
 [8] (::Distributed.var"#41#44"{Distributed.SSHManager,Array{Int64,1},WorkerConfig})() at ./task.jl:356

...and 1 more exception(s).

Stacktrace:
 [1] sync_end(::Channel{Any}) at ./task.jl:314
 [2] macro expansion at ./task.jl:333 [inlined]
 [3] addprocs_locked(::Distributed.SSHManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :multiplex, :sshflags, :max_parallel, :topology),Tuple{Bool,Bool,Cmd,Int64,Symbol}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:480
 [4] addprocs(::Distributed.SSHManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :multiplex, :sshflags, :max_parallel, :topology),Tuple{Bool,Bool,Cmd,Int64,Symbol}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:444
 [5] #addprocs#241 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:120 [inlined]
 [6] top-level scope at /home/beno/Desktop/juliaPD/startup.jl:2
 [7] include(::Function, ::Module, ::String) at ./Base.jl:380
 [8] include at ./Base.jl:368 [inlined]
 [9] exec_options(::Base.JLOptions) at ./client.jl:279
 [10] _start() at ./client.jl:506

I also tried the suggestions from @sparrowhawk and @barche. However, the MPI.start_main_loop(MPI_TRANSPORT_ALL) and MPI.stop_main_loop(mgr) calls seem to be deprecated.
Thank you. It is a PBSPro setup.

Depending on your cluster setup, one of the solutions with a minimal working example here should work. Please note that MPI_TRANSPORT_ALL will give you one worker less than the number of MPI ranks requested, i.e., mpirun -np 5 julia myfile.jl will give you 4 workers.
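If those calls are gone from your MPI.jl version, the cluster manager functionality now lives in the MPIClusterManagers.jl package. A minimal sketch of what the script might look like with it (untested; it assumes MPIClusterManagers exports start_main_loop, stop_main_loop and MPI_TRANSPORT_ALL):

using MPIClusterManagers, Distributed

# All MPI ranks except rank 0 become Julia workers; rank 0 runs this script,
# so there is no need to call addprocs().
mgr = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

@info "workers: $(workers())"

# Ordinary Distributed code goes here, e.g.
results = pmap(i -> (gethostname(), getpid(), i), 1:4*nworkers())
foreach(println, results)

MPIClusterManagers.stop_main_loop(mgr)

As before, launch it with mpirun -np <n> julia myfile.jl, which should give n-1 workers.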

Thank you for the prompt response.
The PBS script is as follows.

#PBS -q somequeue
#PBS -l nodes=2:ppn=20
#PBS -j oe
cat $PBS_NODEFILE > pbs_nodes

mpirun -n 40 /path/to/julia test.jl > out </dev/null

It is not behaving as @sparrowhawk suggested. The error shows that only one of the two nodes (and only its processes) is being used for parallel execution.
Further, each action in the script is performed 20 times.
And providing -n 40 makes mpirun complain about oversubscription.

Sounds to me like mpirun is not getting the info from the PBS scheduler. What happens if you just run mpirun hostname? If mpirun properly reads the PBS environment, that should print the hostnames of the compute nodes the correct number of times, and then Julia should also work with the MPI cluster manager.

@barche, you are right. It is only printing one of the two hostnames, 20 times.

I used the -machinefile option. Now it seems to be reading all the relevant processes.

Thanks a ton. :slight_smile:

I have one more doubt: do I no longer have to use addprocs() to add the workers? Can I directly use the commands from the Distributed package?

1 Like

That’s correct, I think; at least I didn’t need addprocs when I tried this (in 2018).