Connection timed out problem on SSHManager of Distributed.jl

Hi everyone, I ran into a problem when adding another Windows machine as a worker using SSHManager.

Background information:

  • I can ssh directly into the remote machine, which is on the same local network as the host machine.
  • I also configured exename and dir correctly, so I can see a new julia process being launched on the remote machine in its Task Manager.

However, after that, the local process and the remote process seem unable to establish a connection. The error is:

addprocs([(machine_spec,1)], shell=:wincmd, exename=exename, dir=dir, cmdline_cookie=true)
ERROR: TaskFailedException

    nested task error: IOError: connect: connection timed out (ETIMEDOUT)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1092
     [2] worker_from_id
       @ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1089 [inlined]
     [3] #remote_do#170
       @ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:557 [inlined]
     [4] remote_do
       @ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:557 [inlined]
     [5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:692
     [6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:603
     [7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:544
     [8] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed .\task.jl:484

    caused by: IOError: connect: connection timed out (ETIMEDOUT)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Sockets\src\Sockets.jl:529
     [2] connect
       @ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Sockets\src\Sockets.jl:564 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:651
     [4] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:578
     [5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:599
     [6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:544
     [7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed .\task.jl:484
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:436
 [2] macro expansion
   @ .\task.jl:455 [inlined]
 [3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:shell, :exename, :dir, :cmdline_cookie), Tuple{Symbol, String, String, Bool}}})
   @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:490
 [4] addprocs(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:shell, :exename, :dir, :cmdline_cookie), Tuple{Symbol, String, String, Bool}}})
   @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:450
 [5] #addprocs#255
   @ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:146 [inlined]
 [6] top-level scope
   @ d:\OneDrive\Documents\Awesome_obsidian\Awesome\distributed_test_1.jl:11

After spending some time searching the discussions, I noticed an unanswered question that was similar to my case.

I decided to investigate further and modified the source code of Distributed.jl to see what the underlying issue is.

As far as I can tell, the workflow is roughly as follows:

  • The machine vector is split into individual machines, each with cnt (the core count).
  • The other parameters are used to construct the proper ssh ... command.
  • The ssh channel and other information are passed on to the worker config struct wconfig.
  • wconfig is handled by setup_launched_worker and create_worker.
  • create_worker returns the process id (pid) for further instructions.

There are two issues.

  • First, the program tries to obtain the binding address bind_addr for wconfig directly from the io stream, but on my machine this fails.
    • Since this should be the same as the IP address of the remote machine, I was able to get around it by manually supplying bind_addr to wconfig.
  • Second, after doing that, I was able to get something of type TCPSocket using Distributed.connect_to_worker(bind_addr, port).
    • The information From worker 7: julia_worker:9888#192.168.50.192 seems to be as expected.
    • I was then able to construct a worker from the sockets r_s, w_s and wconfig with the command w = Distributed.Worker(w.id, r_s, w_s, manager; config=wconfig)
    • However, create_worker then needs to execute Distributed.process_messages(w.r_stream, w.w_stream, false), which breaks everything.
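
For anyone who wants to reproduce the manual workaround, here is a minimal sketch of how the port and address can be pulled out of a line like the one printed by worker 7. The helper name parse_worker_announcement and the regular expression are my own assumptions for illustration; the parser inside Distributed.jl may well look different.

```julia
# Hypothetical helper (not part of Distributed.jl): split a worker
# announcement such as "julia_worker:9888#192.168.50.192" into host and port.
function parse_worker_announcement(line::AbstractString)
    # Assumed format: "julia_worker:<port>#<address>", possibly with a prefix
    # like "From worker 7:" in front of it.
    m = match(r"julia_worker:(\d+)#(.+)$", line)
    m === nothing && return nothing
    port = parse(Int, m.captures[1])
    host = String(strip(m.captures[2]))
    return (host = host, port = port)
end

info = parse_worker_announcement("      From worker 7:    julia_worker:9888#192.168.50.192")
println(info)
```

The resulting host and port are the values I passed by hand as bind_addr and port to Distributed.connect_to_worker.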

The error message is

Unhandled Task ERROR: UndefVarError: oldstate not defined
Stacktrace:
 [1] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:244
 [2] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [3] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed .\task.jl:484

caused by: ArgumentError: invalid version string: SSH-2.0-OpenSSH_
Stacktrace:
 [1] parse
   @ .\version.jl:140 [inlined]
 [2] VersionNumber
   @ .\version.jl:144 [inlined]
 [3] process_hdr(s::Sockets.TCPSocket, validate_cookie::Bool)
   @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:277
 [4] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:158
 [5] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [6] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed .\task.jl:484

I think the problem is that process_hdr tries to read the Julia version information from the socket,
but what it actually receives is a string of OpenSSH information.
I could not find a way to get past this, and the worker eventually exited:

From worker 8:     Master process (id 1) could not connect within 100.0 seconds.
From worker 8:    exiting.
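
The parsing failure itself can be reproduced without any network, using only the string from the error message. The sketch below assumes nothing about Distributed.jl internals; try_version and looks_like_ssh_banner are hypothetical helpers written for this post. "SSH-2.0-" is how an SSH server starts its identification string, so one possible reading (a guess, not a confirmed diagnosis) is that the socket was talking to an SSH server rather than to the Julia worker.

```julia
# The string reported in the ArgumentError above.
received = "SSH-2.0-OpenSSH_"

# Hypothetical helper: true if the text starts like an SSH identification string.
looks_like_ssh_banner(s::AbstractString) = startswith(s, "SSH-")

# Hypothetical helper: return a VersionNumber, or nothing if the string
# is not a valid version string (VersionNumber throws ArgumentError then).
function try_version(s::AbstractString)
    try
        return VersionNumber(s)
    catch err
        err isa ArgumentError || rethrow()
        return nothing
    end
end

println(try_version("1.8.2"))            # a valid version string parses
println(try_version(received))           # the SSH banner does not
println(looks_like_ssh_banner(received))
```

If that reading is right, it would be worth double-checking which host and port the manually created socket was actually connected to.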

I’m not familiar with networking, io, and sockets, so I cannot figure out what is going wrong.
Hopefully the information I provided is useful for investigating the issue.

By the way, a Windows to Ubuntu connection seems to work out of the box. However, a Windows to WSL2 Ubuntu connection experiences exactly the same issue, so I don’t know whether it is platform specific. I guess it’s not a Windows Firewall issue, because a TCP socket is actually established; it is Distributed.process_messages that fails for some reason.

Thanks for your attention!
