Addprocs with ssh does not work on 0.6.1


Hi, there!

My code works well on 0.6.0.
However, after upgrading to 0.6.1, the addprocs function does not work when I create a process on a remote machine.
I tried adding the keyword argument tunnel=true, i.e. addprocs(["remotehost"]; tunnel=true).
It works if I add only one remote machine.
But when I then try addprocs(["newremotehost"], tunnel=true), it does not work either.
Here, "not working" means that Julia prints an error and then gets stuck.

What has changed in 0.6.1 regarding addprocs?


addprocs works for me on 0.6.1. More specifically, I tested it on a Linux cluster with the following Julia install:

Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

I ran addprocs(["host1","host2","host3","host1","host2","host3"]) where the hostnames are accessible via ssh and everything works as expected. In general I haven’t noticed any regressions running parallel jobs in 0.6.1. What error are you getting? And how exactly are you calling addprocs?
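
If it helps, here is a minimal sketch of one way to capture that information in a single run. It assumes Julia 0.6 syntax (addprocs available from Base, STDERR in upper case), and the host names are placeholders rather than real machines, so adjust them to your setup.

```julia
# Sketch for Julia 0.6.x; "ssh_host1" and "ssh_host2" are placeholder host names.
hosts = ["ssh_host1", "ssh_host2"]

# Print the Julia version and platform details so they can be posted with the report.
versioninfo()

try
    # Same call as in the question, but with both hosts passed in one vector.
    pids = addprocs(hosts; tunnel=true)
    println("added workers: ", pids)
catch err
    # Print the exception together with its backtrace instead of only the summary.
    showerror(STDERR, err, catch_backtrace())
    println(STDERR)
end
```

If the call hangs instead of throwing, the catch branch will not run, so it is still worth noting how long it was left before being interrupted.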




I am calling it like this:

addprocs(["ssh_host1"], tunnel=true)
addprocs(["ssh_host2"], tunnel=true)

In this case, the second line generates an error message.

I also tried what you suggested and got an error as well:

try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
wait() at ./event.jl:234
wait(::Condition) at ./event.jl:27
stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
wait_connected(::TCPSocket) at ./stream.jl:258
connect at ./stream.jl:983 [inlined]
connect_to_worker(::SubString{String}, ::UInt16) at ./distributed/managers.jl:493
connect(::Base.Distributed.SSHManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:431
create_worker(::Base.Distributed.SSHManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::Base.Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{Base.Distributed.SSHManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335

...and 1 more exception(s).

 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:344
 [4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::Base.Distributed.SSHManager) at ./<missing>:0
 [5] #addprocs#29(::Array{Any,1}, ::Function, ::Base.Distributed.SSHManager) at ./distributed/cluster.jl:319
 [6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::Base.Distributed.SSHManager) at ./<missing>:0
 [7] #addprocs#239(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{String,1}) at ./distributed/managers.jl:114
 [8] addprocs(::Array{String,1}) at ./distributed/managers.jl:113