Addprocs with ssh does not work on 0.6.1


#1

Hi there!

My code works well on 0.6.0.
However, after upgrading to 0.6.1, the addprocs function no longer works when I create a process on a remote machine.
I tried adding the keyword argument tunnel=true, i.e. addprocs(["remotehost"]; tunnel=true).
That works if I add only one remote machine.
But when I then try addprocs(["newremotehost"], tunnel=true), it fails as well…
Here "not working" means that Julia only prints an error and then hangs.

What has changed in 0.6.1 regarding addprocs?


#2

addprocs works for me on 0.6.1. More specifically, I tested it on a Linux cluster with the following Julia install:

Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

I ran addprocs(["host1","host2","host3","host1","host2","host3"]) where the hostnames are accessible via ssh and everything works as expected. In general I haven’t noticed any regressions running parallel jobs in 0.6.1. What error are you getting? And how exactly are you calling addprocs?
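
For comparison, a minimal sketch of that kind of call, followed by a check of where each worker actually started. It assumes Julia 0.6.x (where addprocs is part of Base) and passwordless ssh access; "host1" through "host3" are placeholder hostnames, and the hostname loop is only an illustration, not part of the test described above.

```julia
# Julia 0.6.x: addprocs and remotecall_fetch live in Base, no extra import needed.
# The hostnames are placeholders for machines reachable via passwordless ssh.
hosts = ["host1", "host2", "host3", "host1", "host2", "host3"]

# addprocs returns the ids of the newly launched workers.
pids = addprocs(hosts)

# Ask every new worker for its hostname to confirm where it is running.
for p in pids
    println(p, " => ", remotecall_fetch(gethostname, p))
end
```

If the call hangs or throws before the loop runs, the failure is in launching or connecting to the workers rather than in anything executed on them.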


#3

xref: https://github.com/JuliaLang/julia/issues/24722


#4

I am calling it like this:

addprocs(["ssh_host1"], tunnel=true)
addprocs(["ssh_host2"], tunnel=true)

In this case, the second line generates an error message.

I also tried the call you suggested and got an error as well, as follows:

try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
wait() at ./event.jl:234
wait(::Condition) at ./event.jl:27
stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
wait_connected(::TCPSocket) at ./stream.jl:258
connect at ./stream.jl:983 [inlined]
connect_to_worker(::SubString{String}, ::UInt16) at ./distributed/managers.jl:493
connect(::Base.Distributed.SSHManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:431
create_worker(::Base.Distributed.SSHManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::Base.Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{Base.Distributed.SSHManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335

...and 1 more exception(s).

Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:344
 [4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::Base.Distributed.SSHManager) at ./<missing>:0
 [5] #addprocs#29(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:319
 [6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::Base.Distributed.SSHManager) at ./<missing>:0
 [7] #addprocs#239(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{String,1}) at ./distributed/managers.jl:114
 [8] addprocs(::Array{String,1}) at ./distributed/managers.jl:113
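
The "...and 1 more exception(s)" line suggests the error is a CompositeException, so the output above shows only part of it. A rough sketch for printing every wrapped exception, assuming Julia 0.6.x and the placeholder host names used above (whether this reveals anything useful for this particular failure is not known):

```julia
# Julia 0.6.x sketch: print each exception collected by addprocs.
# "ssh_host1" and "ssh_host2" are placeholder host names.
try
    addprocs(["ssh_host1"], tunnel=true)
    addprocs(["ssh_host2"], tunnel=true)
catch err
    if isa(err, CompositeException)
        # A CompositeException keeps its wrapped errors in the `exceptions` field.
        for e in err.exceptions
            showerror(STDERR, e)   # STDERR is the 0.6 name (stderr in later versions)
            println(STDERR)
        end
    else
        rethrow()
    end
end
```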