Addprocs with SSH and a ProxyJump times out

In my SSH config file, I have

Host a
    User me
    HostName a.example.com

Host b
    User me
    HostName b.example.com
    ProxyJump a

I want to add b as a worker process from my local machine using addprocs.

I’m able to add a as a worker just fine with addprocs(["a"]), but addprocs(["b"]) hangs for a while then times out:

ERROR: IOError: connect: connection timed out (ETIMEDOUT)
try_yieldto(::typeof(Base.ensure_rescheduled), ::Base.RefValue{Task}) at ./event.jl:196
wait() at ./event.jl:255
wait(::Condition) at ./event.jl:46
stream_wait(::Sockets.TCPSocket, ::Condition) at ./stream.jl:47
wait_connected(::Sockets.TCPSocket) at ./stream.jl:330
connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Sockets/src/Sockets.jl:456 [inlined]
connect_to_worker(::SubString{String}, ::UInt16) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:499
connect(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:437
create_worker(::Distributed.SSHManager, ::WorkerConfig) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:501
setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:447
(::getfield(Distributed, Symbol("##47#50")){Distributed.SSHManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:226
 [2] macro expansion at ./task.jl:245 [inlined]
 [3] #addprocs_locked#44(::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Cmd,Int64,String,String}}}, ::Function, ::Distributed.SSHManager) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:401
 [4] #addprocs_locked at ./none:0 [inlined]
 [5] #addprocs#43(::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Cmd,Int64,String,String}}}, ::Function, ::Distributed.SSHManager) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:365
 [6] #addprocs at ./none:0 [inlined]
 [7] #addprocs#249 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:118 [inlined]
 [8] (::getfield(Distributed, Symbol("#kw##addprocs")))(::NamedTuple{(:dir, :exename),Tuple{String,String}}, ::typeof(addprocs), ::Array{String,1}) at ./none:0
 [9] top-level scope at none:0

Some things to note:

  • My RSA key is password protected. I’ve run ssh-add on my local machine though.
  • The path to the Julia executable differs between my local machine, a, and b. (I’m using the exename keyword argument in addprocs and setting it to the path on a.)
  • I’ve tried with and without tunnel=true.

Is there something fundamental I’m missing here? My knowledge of SSH in general, as well as Distributed and SSHManager, is lacking.


I had the same problem, also with multiplexing ssh connections.

I think I added a .hushlogin file to the home directory, and that seemed to fix the problem. Can you ssh b manually? Can you run ssh -T b true?
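If it helps, the same check can be run from Julia before calling addprocs. This is only a rough sketch: the helper names ssh_check_cmd and can_ssh are made up for illustration, and the BatchMode=yes option is my own addition (not something from this thread) so that ssh fails instead of waiting for a passphrase prompt.

```julia
# Hypothetical helper: build the non-interactive check command without running it.
# `-o BatchMode=yes` is an assumption added here so ssh never stops to prompt.
ssh_check_cmd(host::AbstractString) = `ssh -o BatchMode=yes -T $host true`

# Hypothetical helper: run the check and report whether ssh exited with status 0.
# This throws if the ssh executable itself cannot be started.
can_ssh(host::AbstractString) = success(ssh_check_cmd(host))
```

Calling can_ssh("b") should return true when ssh -T b true works from the terminal without any prompt.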

Also, the function ssh_tunnel(user, host, bind_addr, port, sshflags) in Distributed has a hardcoded sleep value of 60 seconds, which was preventing everything else from launching. Reducing it to 5 seemed to speed up addprocs and doesn’t seem to affect anything else.

Yep, no problem at all.

I don’t know what that does, but the command executes successfully.

So you’re saying you were able to get around this problem by reducing the waiting time for connections?

Yes, it is working for me. I’ve added ssh multiplexing:

Host Worker*
   ControlPath ~/.ssh/controlmasters/%C
   ControlMaster auto
   ControlPersist 10m
   ProxyJump Firewall

I also make a test ssh connection to each worker first, so that the ControlMaster connection already exists right before Julia starts. Then there’s no delay in establishing a connection in Julia with addprocs(workers, tunnel=true, topology=:master_worker, exename="worker.sh").
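Roughly, the warm-up plus addprocs can be sketched like this. The helper names reachable_hosts and add_ssh_workers are hypothetical, and the BatchMode=yes option is an assumption added so the test connection never waits for a prompt. The addprocs keywords are the ones from the call above.

```julia
using Distributed

# Hypothetical helper: make one test connection per host (this is what opens or
# reuses the ControlMaster socket) and keep only the hosts where it succeeded.
# `check` is a keyword so the ssh call can be swapped out when experimenting.
function reachable_hosts(hosts::Vector{String};
                         check = h -> success(`ssh -o BatchMode=yes -T $h true`))
    return filter(check, hosts)
end

# Hypothetical helper: warm up the connections, then add the workers through
# the tunnel with the same keywords as in the addprocs call above.
function add_ssh_workers(hosts::Vector{String}; exename = "worker.sh")
    ok = reachable_hosts(hosts)
    isempty(ok) && error("none of the hosts accepted a test ssh connection")
    return addprocs(ok; tunnel = true, topology = :master_worker, exename = exename)
end
```

Note that ssh does not create the ControlPath directory itself, so ~/.ssh/controlmasters has to exist before the first connection.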


@ararslan Did you get this to work? Unfortunately the suggested approach (control master) doesn’t work for me. A worker process gets started on the node (see htop in the gif below) but on the master I get

julia> addprocs(["l96"]; params...)
ERROR: TaskFailedException

    nested task error: unable to create SSH tunnel after 100 tries. No free port?

[gif: julia_ssh_tunnel]

Any ideas what I could do to fix this? I didn’t expect that it would be so hard to start a worker through an SSH tunnel that works just fine from the terminal. :frowning_face:

My parameters:

julia> using Distributed

julia> params = (exename=`nice -19 /home/bauer/bin/julia-1.6.1/bin/julia --project=/home/bauer/JuliaNRWSS21`, dir="/home/bauer", tunnel=true, topology=:master_worker)
(exename = `nice -19 /home/bauer/bin/julia-1.6.1/bin/julia --project=/home/bauer/JuliaNRWSS21`, dir = "/home/bauer", tunnel = true, topology = :master_worker)

julia> addprocs(["l96"]; params...)

You might need to use the “tunnel” option. I think normally the ssh connection is used to make initial contact and start things up, but after that the worker expects to be able to connect directly to others. If you are using ProxyJump, I think there is a tunneling option in addprocs that forces all traffic through the ssh tunnel.

Not sure what you mean. I’m using tunnel = true and topology = :master_worker (only the master is connected to the workers). Can you elaborate?

Ah, sorry, I was replying from my phone and hadn’t seen the other details. The tunnel and topology options were what I was suggesting, but it sounds like you’ve tried those already.
