Addprocs with SSH and a ProxyJump times out

In my SSH config file, I have

Host a
    User me
    HostName a.example.com

Host b
    User me
    HostName b.example.com
    ProxyJump a
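
Before involving Julia, it can help to confirm that the Host b entry resolves the way it is meant to. OpenSSH can print the effective configuration for a host alias without opening a connection. A minimal sketch, assuming a reasonably recent OpenSSH (ProxyJump needs 7.3 or later); the SSH_CMD variable and the show_jump helper are illustrative names, not part of the original setup:

```shell
# SSH_CMD is a hypothetical override so the helper can be dry-run with a stub.
SSH_CMD="${SSH_CMD:-ssh}"

# Print the hostname, user, and proxyjump settings ssh would use for an alias.
# `ssh -G` only evaluates the configuration; it does not connect.
show_jump() {
    "$SSH_CMD" -G "$1" | grep -i -E '^(hostname|user|proxyjump) '
}
```

With the config above, show_jump b should list the host name b.example.com, the user me, and a proxyjump entry pointing at a.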

I want to add b as a worker process from my local machine using addprocs.

I’m able to add a as a worker just fine with addprocs(["a"]), but addprocs(["b"]) hangs for a while then times out:

ERROR: IOError: connect: connection timed out (ETIMEDOUT)
try_yieldto(::typeof(Base.ensure_rescheduled), ::Base.RefValue{Task}) at ./event.jl:196
wait() at ./event.jl:255
wait(::Condition) at ./event.jl:46
stream_wait(::Sockets.TCPSocket, ::Condition) at ./stream.jl:47
wait_connected(::Sockets.TCPSocket) at ./stream.jl:330
connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Sockets/src/Sockets.jl:456 [inlined]
connect_to_worker(::SubString{String}, ::UInt16) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:499
connect(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:437
create_worker(::Distributed.SSHManager, ::WorkerConfig) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:501
setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:447
(::getfield(Distributed, Symbol("##47#50")){Distributed.SSHManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:226
 [2] macro expansion at ./task.jl:245 [inlined]
 [3] #addprocs_locked#44(::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Cmd,Int64,String,String}}}, ::Function, ::Distributed.SSHManager) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:401
 [4] #addprocs_locked at ./none:0 [inlined]
 [5] #addprocs#43(::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Cmd,Int64,String,String}}}, ::Function, ::Distributed.SSHManager) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:365
 [6] #addprocs at ./none:0 [inlined]
 [7] #addprocs#249 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:118 [inlined]
 [8] (::getfield(Distributed, Symbol("#kw##addprocs")))(::NamedTuple{(:dir, :exename),Tuple{String,String}}, ::typeof(addprocs), ::Array{String,1}) at ./none:0
 [9] top-level scope at none:0

Some things to note:

  • My RSA key is password protected. I’ve run ssh-add on my local machine though.
  • The path to the Julia executable differs between my local machine, a, and b. (I’m using the exename keyword argument in addprocs and setting it to the path on a.)
  • I’ve tried with and without tunnel=true.

Is there something fundamental I’m missing here? My knowledge of SSH in general, as well as Distributed and SSHManager, is lacking.


I had the same problem, also with multiplexing ssh connections.

I think I added .hushlogin to the home directory, and that seemed to fix the problem. Can you ssh b manually? Can you run ssh -T b true?
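
The two checks suggested here can be scripted. This is a rough sketch, assuming the alias b from the question's config; SSH_CMD, check_noninteractive, and add_hushlogin are hypothetical names. The -T flag disables pseudo-terminal allocation, and true is a remote command that simply exits with status 0, so the call succeeds only if login and remote command execution work without a terminal. BatchMode=yes is an addition of this sketch: it makes ssh fail instead of prompting, which is closer to how a non-interactive launch behaves.

```shell
# SSH_CMD is a hypothetical override so the helpers can be dry-run with a stub.
SSH_CMD="${SSH_CMD:-ssh}"

# Run a no-op command on the host without a TTY. BatchMode=yes makes ssh fail
# rather than prompt, and ConnectTimeout bounds the wait (10 s is arbitrary).
check_noninteractive() {
    if "$SSH_CMD" -T -o BatchMode=yes -o ConnectTimeout=10 "$1" true; then
        echo "ok: $1"
    else
        echo "failed: $1"
        return 1
    fi
}

# Create an empty ~/.hushlogin on the remote host to silence the login banner.
add_hushlogin() {
    "$SSH_CMD" -T "$1" 'touch ~/.hushlogin'
}
```

Running check_noninteractive b from the local machine exercises the same path through the jump host a that a worker launch would need.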

Also, the function ssh_tunnel(user, host, bind_addr, port, sshflags) in Distributed has a hardcoded sleep value of 60 seconds, which was preventing everything else from launching. Reducing it to 5 seemed to speed up addprocs and doesn’t seem to affect anything else.

Yep, ssh b works manually with no problem at all.

I don’t know what ssh -T b true does, but the command executes successfully.

So you’re saying you were able to get around this problem by reducing the waiting time for connections?

Yes, it is working for me. I’ve also added SSH multiplexing:

Host Worker*
   ControlPath ~/.ssh/controlmasters/%C
   ControlMaster auto
   ControlPersist 10m
   ProxyJump Firewall

I also make a test SSH connection to each worker so that the ControlMaster connection already exists right before Julia starts. Then there’s no delay in establishing a connection in Julia with addprocs(workers, tunnel=true, topology=:master_worker, exename="worker.sh").
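
The warm-up step could be scripted roughly as follows. This is a sketch, assuming the Host Worker* block above is in the local SSH config and that the ControlPath directory may not exist yet. The names SSH_CMD, CONTROL_DIR, and warm_up are hypothetical, and Worker1 and Worker2 in the usage note are placeholder aliases.

```shell
# Hypothetical overrides so the helper can be dry-run without real hosts.
SSH_CMD="${SSH_CMD:-ssh}"
CONTROL_DIR="${CONTROL_DIR:-$HOME/.ssh/controlmasters}"

# Open (or reuse) a master connection to each host, then confirm it is alive.
# With ControlMaster auto and ControlPersist set, the first `ssh host true`
# leaves a master running in the background after the command exits.
warm_up() {
    mkdir -p "$CONTROL_DIR"   # directory that ControlPath points into
    for host in "$@"; do
        if "$SSH_CMD" -T "$host" true && "$SSH_CMD" -O check "$host" 2>/dev/null; then
            echo "ready: $host"
        else
            echo "not ready: $host"
        fi
    done
}
```

Calling warm_up Worker1 Worker2 shortly before starting Julia should leave a live master for each alias, as long as it is within the ControlPersist window of 10 minutes.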
