Problems with addprocs connection to second machine

question

#1

I am having problems connecting to Julia on a second machine via SSH. It is odd, because a plain run call lets me connect fine, but the addprocs call always times out. Here is my example:

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

julia> run(`ssh rentsScPo "./getshell.sh && julia -e 'println(pwd());println(versioninfo());println(gethostname());exit()'"`)
**************************************************

              WELCOME TO THE 

          SciencesPo Rent server

This system runs:
Ubuntu 16.04.1 LTS


/bin/bash
/home/floswald
Julia Version 0.5.1
Commit 6445c82 (2017-03-05 13:25 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
nothing
scpo-rents

julia> addprocs(["rentsScPo"],dir="/home/floswald",exename="julia")
**************************************************

              WELCOME TO THE 

          SciencesPo Rent server

This system runs:
Ubuntu 16.04.1 LTS

Master process (id 1) could not connect within 60.0 seconds.
exiting.

exiting.
ERROR: connect: connection timed out (ETIMEDOUT)
 in yieldto(::Task, ::ANY) at ./event.jl:136
 in yieldto(::Task, ::ANY) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in wait() at ./event.jl:169
 in wait(::Condition) at ./event.jl:27
 in wait(::Condition) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
 in wait_connected(::TCPSocket) at ./stream.jl:265
 in connect at ./stream.jl:960 [inlined]
 in connect_to_worker(::SubString{String}, ::Int16) at ./managers.jl:483
 in connect(::Base.SSHManager, ::Int64, ::WorkerConfig) at ./managers.jl:425
 in create_worker(::Base.SSHManager, ::WorkerConfig) at ./multi.jl:1786
 in setup_launched_worker(::Base.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./multi.jl:1733
 in (::Base.##649#653{Base.SSHManager,Array{Int64,1}})() at ./task.jl:360
 in sync_end() at ./task.jl:311
 in macro expansion at ./task.jl:327 [inlined]
 in #addprocs_locked#645(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1688
 in #addprocs_locked#645(::Array{Any,1}, ::Function, ::Base.SSHManager) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in (::Base.#kw##addprocs_locked)(::Array{Any,1}, ::Base.#addprocs_locked, ::Base.SSHManager) at ./<missing>:0
 in #addprocs#644(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1658
 in #addprocs#644(::Array{Any,1}, ::Function, ::Base.SSHManager) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in (::Base.#kw##addprocs)(::Array{Any,1}, ::Base.#addprocs, ::Base.SSHManager) at ./<missing>:0
 in #addprocs#744(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{String,1}) at ./managers.jl:112
 in (::Base.#kw##addprocs)(::Array{Any,1}, ::Base.#addprocs, ::Array{String,1}) at ./<missing>:0

the same thing happens if i specify the full path to the executable:

addprocs(["rentsScPo"],dir="/home/floswald",exename="/home/floswald/apps/julia-0.5/bin/julia")

#2

Dear @floswald,
For Julia parallel clusters to work, the following two conditions need to be met:

  1. The cluster nodes must be able to connect to each other via the SSH protocol (TCP/IP port 22).
  2. Passwordless SSH needs to be configured between the cluster nodes.

The first point is purely a network configuration issue.

To set up passwordless SSH, run the following on the master node (assuming Ubuntu Linux):

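# Name and location of the new key pair (created without a passphrase).
KEY_NAME=mykey
KEY_FILE=~/.ssh/$KEY_NAME
ssh-keygen -P "" -t rsa -f "$KEY_FILE"
# Append client options; "User ubuntu" assumes the remote account is named ubuntu.
printf "\nUser ubuntu\nPubkeyAuthentication yes\nStrictHostKeyChecking no\nIdentityFile %s\n" "$KEY_FILE" >> ~/.ssh/config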

On slave nodes:

KEY_NAME=mykey
KEY_FILE=~/.ssh/$KEY_NAME
# this key should be copied from master
printf "User ubuntu\nPubKeyAuthentication yes\nIdentityFile ${KEY_FILE}\nStrictHostKeyChecking no" > ~/.ssh/config

Additionally, the public key generated above (~/.ssh/mykey.pub on the master) needs to be appended as a line to the ~/.ssh/authorized_keys file on each slave.

Hope that helps.
Przemek.

P.S.
Since many people have exactly the same problem, I have developed a tool for quickly deploying Julia parallel clusters (currently on the Amazon Web Services cloud, other cloud vendors planned). It comes with a detailed tutorial on how to set up Julia parallel on AWS. Have a look at https://github.com/pszufe/KissCluster.


#3

Hi!
I don’t think you looked at my example long enough. Both of those conditions are clearly met, which is evident from my first call actually working. Any idea why that works, but addprocs does not?


#4

i like your toolbox though!


#5

Hi floswald,

You get the error:
ERROR: connect: connection timed out (ETIMEDOUT)
This seems to be the first point from my list (a network connectivity problem).

You are trying to connect to the rentsScPo machine and cannot. Maybe try this bash command (assuming it is Ubuntu) and post the results:

ssh ubuntu@rentsScPo

If you get a result similar to the one from Julia, it means one of the following:

  1. The server name rentsScPo does not get properly resolved to an IP address.
  2. The server name rentsScPo is properly resolved, but the target host cannot be found.
  3. A firewall on rentsScPo does not have port 22 open for the SSH protocol.
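
The cases in this list can also be probed from within Julia. What follows is a minimal sketch, assuming Julia 0.7 or later, where the socket functions live in the Sockets standard library (on 0.5 they are in Base). The name port_reachable is a hypothetical helper, not part of Julia.

```julia
using Sockets  # Julia 0.7+; on 0.5 these functions are exported from Base

# Hypothetical helper (not part of Julia): true if a TCP connection to
# host:port can be opened within `timeout` seconds, false otherwise.
function port_reachable(host::AbstractString, port::Integer; timeout::Real=5.0)
    result = Channel{Bool}(1)
    @async begin
        ok = try
            close(connect(host, port))  # open the connection, then drop it
            true
        catch
            false                       # refused, unreachable, or unresolved name
        end
        put!(result, ok)
    end
    # Poll until the attempt has finished or `timeout` seconds have passed.
    status = timedwait(() -> isready(result), Float64(timeout))
    return status == :ok && take!(result)
end

# Case 1: does the name resolve? getaddrinfo throws if it does not.
# "rentsScPo" may be an alias from ~/.ssh/config, which Julia cannot see;
# in that case use the real hostname or IP address here.
host = "rentsScPo"
try
    println(host, " resolves to ", getaddrinfo(host))
catch err
    println("cannot resolve ", host, ": ", err)
end

# Cases 2 and 3: can the SSH port be reached at all?
println("port 22 reachable: ", port_reachable(host, 22))
```

A reachable port 22 says nothing about any other port, so this check can pass while a connection on a different port between the same two machines still times out.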

Hope that helps.


#6

i told you ssh works. i can even start a julia process on the remote. please look back at my example!


#7

Do you have a firewall on the master? The master connects to the worker via ssh to start the julia process, but then the worker julia process connects to the master process via a TCP socket.
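
If a firewall or routing problem blocks that extra socket, one option is to route it through the SSH connection that already works. The sketch below reuses the host, dir and exename values from the first post and assumes nothing else about the setup. The SSH form of addprocs accepts a tunnel keyword, which the Julia documentation describes as using SSH tunneling to connect to the worker from the master process. The connect_to_worker frame in the stack trace above suggests that the connection timing out is the one the master opens to a port on the worker, which is the connection this option replaces.

```julia
# Julia 0.7 and later: addprocs lives in the Distributed standard library.
# On Julia 0.5 (as used in this thread) it is part of Base, so omit this line.
using Distributed

# tunnel=true asks the SSH cluster manager to reach the worker through an
# SSH tunnel instead of a direct TCP connection to the worker's port.
addprocs(["rentsScPo"]; tunnel=true, dir="/home/floswald", exename="julia")

# If the worker was added, it reports its own hostname.
for p in workers()
    println(p, " => ", remotecall_fetch(gethostname, p))
end
```

This only helps when SSH from the master to the worker works, as the run call in the first post shows it does here.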


#8

ha. good point! I’ve got no idea. I’ll check tomorrow. what do I need to verify here? being able to SSH from worker to master is not the right thing? these are just 2 workstations on my company network.

thanks!


#9

And this connection mentioned by avik must be configured both ways (both master and slave must have identical passwordless configurations).

Still, I would not rule out some ugly name-resolution issue. Maybe try IP addresses instead of names.

I have never used the SciencesPo Rent server mentioned in your logs, but some cloud providers (e.g. AWS) offer both “public” and “private” IP addresses for instances. In that case you want to use the private IP to configure your cluster.


#10

of course, that’s it! i didn’t think about the connection back from the worker. thanks. will try tomorrow!
SciencesPo Rent server is just the fancy name I gave the second workstation under my desk. :)

cheers guys


#11

@floswald just out of curiosity - did it work in the end?


#12

Arg, I hit a wall. One computer is on a wired connection and the other on wifi. I can ssh from wifi to wired, but the other way round says route not found. Seems to be a network issue at my company. But I’m sure your advice would have worked!