addprocs() on remote machines failing

I would like to spawn some jobs on a cluster of Linux machines. The machines are configured a little differently from my own, so I need to specify the username and the location of the julia executable when ssh’ing into them.

    using Printf
    p = map(x -> (@sprintf("cs4115@cs305l-%02d.csis.ul.ie", x), :auto), [8, 11])

When I make the addprocs() call

    addprocs(p, topology=:master_worker, exename="/bin/julia", dir="/home/ug2018/cs4115")

I get the error

ERROR: IOError: connect: host is unreachable (EHOSTUNREACH)

I’m running v1.1.1 and the remote machines are running v1.0.4.

Thanks for any suggestions.

I’m guessing this is the problem. I think you need workers to run the same Julia version.

Yes, I’ve gotten this error a few times before. Indeed, you should run the same Julia version on all worker nodes/machines. Also, you should have passwordless ssh set up from the head node to the worker nodes.
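Both conditions can be checked from the head node before calling addprocs(). The following is only a rough sketch: the helper names `passwordless_ok`, `parse_version`, `remote_version` and `report` are made up for illustration, and it assumes an OpenSSH client (for the `BatchMode` and `ConnectTimeout` options) and a `julia` binary on the remote PATH unless `exename` is given.

```julia
# BatchMode=yes makes ssh fail instead of prompting for a password,
# so a zero exit status means key-based login works for this host.
function passwordless_ok(host)
    cmd = `ssh -o BatchMode=yes -o ConnectTimeout=5 $host true`
    return success(pipeline(cmd; stdout=devnull, stderr=devnull))
end

# Turn the output of `julia --version` ("julia version 1.3.0\n") into a VersionNumber.
parse_version(out) = VersionNumber(strip(replace(out, "julia version " => "")))

# Ask the remote julia for its version over ssh.
function remote_version(host; exename="julia")
    return parse_version(read(`ssh -o BatchMode=yes $host $exename --version`, String))
end

# Print one line per host: login status and remote vs. local version.
function report(hosts; exename="julia")
    for host in hosts
        if passwordless_ok(host)
            println(host, ": remote ", remote_version(host; exename=exename),
                    ", local ", VERSION)
        else
            println(host, ": passwordless ssh failed")
        end
    end
end
```

With the hosts from the first post this would be called as `report(["cs4115@cs305l-08.csis.ul.ie", "cs4115@cs305l-11.csis.ul.ie"]; exename="/bin/julia")`.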

OK, thanks a lot. That’s easy to fix.

But is there some deep reason why the versions must align? Just curious.

Unfortunately, this doesn’t solve it for me:

julia> using Printf, Distributed

julia> versioninfo()
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
  JULIAPATH = .:/home/healyp/src/julia-1.3.0/bin/
  JULIA_PKGDIR = /home/healyp/.julia/

julia> run(`ssh cs4115@cs305l-05.csis.ul.ie julia --version`)
julia version 1.3.0
Process(`ssh cs4115@cs305l-05.csis.ul.ie julia --version`, ProcessExited(0))

julia> params= (exename="/home/ug2018/cs4115/src/julia-1.3.0/bin/julia", dir="/home/ug2018/cs4115/pH")
(exename = "/home/ug2018/cs4115/src/julia-1.3.0/bin/julia", dir = "/home/ug2018/cs4115/pH")

julia> addprocs([("cs4115@cs305l-05.csis.ul.ie", 5)]; params...)
ERROR: TaskFailedException:
IOError: connect: host is unreachable (EHOSTUNREACH)
Stacktrace:
 [1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1059
 [2] worker_from_id at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1056 [inlined]
 [3] #remote_do#156 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
 [4] remote_do at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
 [5] kill at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:534 [inlined]
 [6] create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:581
 [7] setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:523
 [8] (::Distributed.var"#43#46"{Distributed.SSHManager,Array{Int64,1},WorkerConfig})() at ./task.jl:333
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] macro expansion at ./task.jl:319 [inlined]
 [3] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel, :exename, :dir),Tuple{Bool,Cmd,Int64,String,String}}}, ::typeof(Distributed.addprocs_locked), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:477
 [4] #addprocs_locked at ./none:0 [inlined]
 [5] #addprocs#39(::Base.Iterators.Pairs{Symbol,Any,NTuple{5,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel, :exename, :dir),Tuple{Bool,Cmd,Int64,String,String}}}, ::typeof(addprocs), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
 [6] #addprocs at ./none:0 [inlined]
 [7] #addprocs#243 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:118 [inlined]
 [8] (::Distributed.var"#kw##addprocs")(::NamedTuple{(:exename, :dir),Tuple{String,String}}, ::typeof(addprocs), ::Array{Tuple{String,Int64},1}) at ./none:0
 [9] top-level scope at REPL[9]:1

julia> 

Given that I can successfully ssh into the machine it seems like a strange error…

Thanks for any suggestions.

I posted previously about my difficulties with addprocs() on remote machines. I thought, perhaps, that the problem was that I was ssh’ing in a) as a different user and b) to a different directory structure. However, as the following demonstrates, even attempting to add workers using ssh with the default login fails. Yet ssh itself appears to succeed. Does Distributed.SSHManager need some other information?

Thanks.

julia> using Distributed

julia> q=["cs144l-08", "cs144l-10"]
2-element Array{String,1}:
 "cs144l-08"
 "cs144l-10"

julia> map(p->run(`ssh $p julia --version`), q)
julia version 1.3.0
julia version 1.3.0
2-element Array{Base.Process,1}:
 Process(`ssh cs144l-08 julia --version`, ProcessExited(0))
 Process(`ssh cs144l-10 julia --version`, ProcessExited(0))

julia> addprocs(q) # fails
ERROR: TaskFailedException:
IOError: connect: host is unreachable (EHOSTUNREACH)
Stacktrace:
 [1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1059
 [2] worker_from_id at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1056 [inlined]
 [3] #remote_do#156 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
 [4] remote_do at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
 [5] kill at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:534 [inlined]
 [6] create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:581
 [7] setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:523
 [8] (::Distributed.var"#43#46"{Distributed.SSHManager,Array{Int64,1},WorkerConfig})() at ./task.jl:333

...and 1 more exception(s).

Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] macro expansion at ./task.jl:319 [inlined]
 [3] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::typeof(Distributed.addprocs_locked), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:477
 [4] #addprocs_locked at ./none:0 [inlined]
 [5] #addprocs#39(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::typeof(addprocs), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
 [6] #addprocs at ./none:0 [inlined]
 [7] #addprocs#243(::Bool, ::Cmd, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(addprocs), ::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:118
 [8] addprocs(::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:117
 [9] top-level scope at REPL[8]:1

julia> 

I believe that I have solved my particular issue.

Although I didn’t think it applied at first, by calling addprocs() with tunnel=true I successfully set up a number of remote workers. It’s all there in the documentation :wink:
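For anyone landing here later, a minimal sketch of what that call can look like, reusing the host and paths from the earlier post. The helper name `start_tunneled` is made up for illustration. My reading of the documentation, not something confirmed in this thread, is that without tunnel=true the master opens a direct TCP connection to the port each worker listens on, which can be blocked even when ssh itself works; with tunnel=true that traffic goes through SSH port forwarding instead.

```julia
using Distributed

# Host and paths copied from the earlier post; adjust for your cluster.
machines = [("cs4115@cs305l-05.csis.ul.ie", 5)]   # (host, number of workers)
params = (tunnel  = true,   # connect to workers through an SSH tunnel
          exename = "/home/ug2018/cs4115/src/julia-1.3.0/bin/julia",
          dir     = "/home/ug2018/cs4115/pH")

# Hypothetical helper: start the workers, then ask each one for its hostname
# to confirm that the master can actually talk to it.
function start_tunneled(machines; params...)
    pids = addprocs(machines; params...)
    return Dict(pid => remotecall_fetch(gethostname, pid) for pid in pids)
end
```

Calling `start_tunneled(machines; params...)` should return a dictionary mapping worker ids to hostnames if the workers came up.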
