Hi everyone, I ran into a problem adding another Windows machine as a worker using SSHManager.
Background information:
- I am able to ssh directly into the remote machine, and it is on the same local network as the host machine.
- I also configured the exename and directory correctly, so I can actually see a new julia process launched on the remote machine in its Task Manager.
However, after that, the local process and the remote process seem unable to establish a connection. The call and the resulting error are:
addprocs([(machine_spec,1)], shell=:wincmd, exename=exename, dir=dir, cmdline_cookie=true)
ERROR: TaskFailedException
nested task error: IOError: connect: connection timed out (ETIMEDOUT)
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1092
[2] worker_from_id
@ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1089 [inlined]
[3] #remote_do#170
@ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:557 [inlined]
[4] remote_do
@ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:557 [inlined]
[5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:692
[6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:603
[7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:544
[8] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed .\task.jl:484
caused by: IOError: connect: connection timed out (ETIMEDOUT)
Stacktrace:
[1] wait_connected(x::Sockets.TCPSocket)
@ Sockets C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Sockets\src\Sockets.jl:529
[2] connect
@ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Sockets\src\Sockets.jl:564 [inlined]
[3] connect_to_worker(host::String, port::Int64)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:651
[4] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:578
[5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:599
[6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:544
[7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed .\task.jl:484
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base .\task.jl:436
[2] macro expansion
@ .\task.jl:455 [inlined]
[3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:shell, :exename, :dir, :cmdline_cookie), Tuple{Symbol, String, String, Bool}}})
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:490
[4] addprocs(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:shell, :exename, :dir, :cmdline_cookie), Tuple{Symbol, String, String, Bool}}})
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:450
[5] #addprocs#255
@ C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\managers.jl:146 [inlined]
[6] top-level scope
@ d:\OneDrive\Documents\Awesome_obsidian\Awesome\distributed_test_1.jl:11
After spending some time searching the discussions, I noticed an unanswered question that was similar to my case.
I decided to investigate further and modified the source code in Distributed.jl to see what the underlying issue is.
As far as I can tell, the workflow is roughly as follows:
- The machine vector is split into an individual machine and cnt (= core count).
- Other parameters are passed on to construct the proper ssh ... command.
- The ssh channel and other information are passed on to the worker config struct wconfig.
- wconfig is handled by setup_launched_worker and create_worker.
- create_worker returns the process id (pid) for further instructions.
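To make the first step concrete, here is a minimal sketch of splitting a machine specification. It assumes the documented spec form "[user@]host[:port] [bind_addr[:port]]"; the helper split_machine_spec is hypothetical and is not the parser Distributed uses internally, and the user name is a placeholder.

```julia
# Hypothetical helper, NOT Distributed's own parser: split a spec of the documented
# form "[user@]host[:port] [bind_addr[:port]]" into its two whitespace-separated parts.
function split_machine_spec(spec::AbstractString)
    parts = split(strip(spec))
    1 <= length(parts) <= 2 || throw(ArgumentError("unexpected machine spec: $spec"))
    ssh_target = String(parts[1])
    bind = length(parts) == 2 ? String(parts[2]) : nothing
    return (ssh_target = ssh_target, bind_addr = bind)
end

# Each entry handed to addprocs can be a (machine_spec, count) tuple.
# "user" is a placeholder; the address is the one from the worker output in this post.
machines = [("user@192.168.50.192 192.168.50.192", 1)]

for (spec, cnt) in machines
    s = split_machine_spec(spec)
    println("ssh target: ", s.ssh_target, ", bind_addr: ", s.bind_addr, ", workers: ", cnt)
end
```

If the spec format behaves as documented, putting the bind address into the spec string might be a way to supply bind_addr without editing the source, but I have not verified that it helps in this case.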
There are two issues.
- First, the program tries to obtain the binding address bind_addr for wconfig directly from the io stream, but on my machine it failed to do so.
  - Since this should be the same as the IP address of the remote machine, I was able to get around it by manually supplying bind_addr to wconfig.
- Second, after doing that, I was able to get something of type TCPSocket using Distributed.connect_to_worker(bind_addr, port).
  - The information From worker 7: julia_worker:9888#192.168.50.192 seems to be as expected.
  - I was then able to construct a worker from the sockets r_s, w_s and the wconfig with the command w = Distributed.Worker(w.id, r_s, w_s, manager; config=wconfig)
  - However, the create_worker function then needs to execute the command Distributed.process_messages(w.r_stream, w.w_stream, false), which breaks everything.
The error message is
Unhandled Task ERROR: UndefVarError: oldstate not defined
Stacktrace:
[1] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:244
[2] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
[3] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed .\task.jl:484
caused by: ArgumentError: invalid version string: SSH-2.0-OpenSSH_
Stacktrace:
[1] parse
@ .\version.jl:140 [inlined]
[2] VersionNumber
@ .\version.jl:144 [inlined]
[3] process_hdr(s::Sockets.TCPSocket, validate_cookie::Bool)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:277
[4] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:158
[5] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed C:\Users\Xiangting\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
[6] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed .\task.jl:484
I think the problem is that process_hdr tries to obtain the julia version information, but what it actually got was a string of OpenSSH information.
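The parsing failure itself can be reproduced in isolation. This is only a sketch: is_julia_version is a hypothetical helper, not part of Distributed. SSH servers open a connection by sending an identification string that begins with "SSH-2.0-", so my guess (an assumption, not something I have confirmed) is that the socket was talking to the SSH daemon rather than to the worker's listening port. The quoted string is exactly 16 characters long, which would be consistent with a fixed-length header read.

```julia
# Hypothetical helper: does the string parse as a Julia version number?
is_julia_version(s::AbstractString) = tryparse(VersionNumber, strip(s)) !== nothing

# The 16 characters quoted in the ArgumentError above.
hdr = "SSH-2.0-OpenSSH_"

println("header length:  ", length(hdr))
println("parses as version? ", is_julia_version(hdr))

# For comparison, a string like the one a Julia 1.8.2 process would report.
println("parses as version? ", is_julia_version("1.8.2"))
```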
Since I could not do anything to solve this at that point, the worker eventually gave up and exited:
From worker 8: Master process (id 1) could not connect within 100.0 seconds.
From worker 8: exiting.
I’m not familiar with networking, IO, and sockets, so I cannot figure out what is going wrong.
Hopefully the information I provided can be useful for investigating the issue.
By the way, the Windows to Ubuntu connection seems to work out of the box. However, the Windows to WSL2 Ubuntu connection experiences exactly the same issue, so I don’t know whether it is platform specific. I guess it’s not a Windows Firewall issue, because a TCP socket is actually established; it is Distributed.process_messages that fails for some reason.
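One way to check what an established TCP socket is actually talking to is to read the first bytes the remote side sends. Below is a minimal sketch of such a check. The helpers peek_greeting and looks_like_ssh are hypothetical names, and a local stand-in server with a made-up banner replaces the remote machine so that the snippet runs on its own.

```julia
using Sockets

# Read the first `n` bytes the remote side sends after the TCP connection opens.
# Note: this blocks if the remote side does not send anything first.
function peek_greeting(host, port; n::Integer = 16)
    sock = connect(host, port)
    try
        return String(read(sock, n))
    finally
        close(sock)
    end
end

# SSH servers start with an identification string beginning with "SSH-".
looks_like_ssh(greeting::AbstractString) = startswith(greeting, "SSH-")

# Local stand-in for an SSH server, so the sketch runs without a remote machine.
port, server = listenany(ip"127.0.0.1", 9000)
@async begin
    client = accept(server)
    write(client, "SSH-2.0-OpenSSH_demo\r\n")  # made-up banner text
    close(client)
end

greeting = peek_greeting(ip"127.0.0.1", port)
println(repr(greeting), " looks like SSH: ", looks_like_ssh(greeting))
close(server)
```

Against the real setup, the host and port would be the bind_addr and port passed to Distributed.connect_to_worker. If the greeting starts with "SSH-", the connection reached an SSH server instead of the julia worker.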
Thanks for your attention!