When adding remote workers over an ssh connection using Distributed.jl, the SSH manager is able to connect, but a command sent over ssh appears to be improperly formatted and setting up the remote workers fails. The error I receive is "The syntax of the command is incorrect."; more details below.
I am running Julia 1.10.3 on an M1 MacBook Pro and am connecting to a Windows 10 device using ssh (OpenSSH server) on my local network. The ssh connection uses certificate-based authentication, which I know works because I can connect from a terminal and the ssh debug output in Julia shows that the connection is established. Specifically, I run the command:
addprocs([("user@hostname", :auto)], shell=:wincmd, tunnel=false, exename="julia", sshflags=`-vvv`)
The debug output shows (some output omitted):
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug3: Ignored env MallocNanoZone
debug3: Ignored env USER
debug3: Ignored env SECURITYSESSIONID
debug3: Ignored env COMMAND_MODE
debug3: Ignored env __CFBundleIdentifier
debug3: Ignored env PATH
debug3: Ignored env HOME
debug3: Ignored env SHELL
debug3: Ignored env LaunchInstanceID
debug3: Ignored env __CF_USER_TEXT_ENCODING
debug3: Ignored env XPC_SERVICE_NAME
debug3: Ignored env SSH_AUTH_SOCK
debug3: Ignored env XPC_FLAGS
debug3: Ignored env LOGNAME
debug3: Ignored env TMPDIR
debug3: Ignored env ORIGINAL_XDG_CURRENT_DESKTOP
debug3: Ignored env SHLVL
debug3: Ignored env PWD
debug3: Ignored env OLDPWD
debug3: Ignored env HOMEBREW_PREFIX
debug3: Ignored env HOMEBREW_CELLAR
debug3: Ignored env HOMEBREW_REPOSITORY
debug3: Ignored env MANPATH
debug3: Ignored env INFOPATH
debug3: Ignored env LDFLAGS
debug3: Ignored env CPPFLAGS
debug3: Ignored env PKG_CONFIG_PATH
debug3: Ignored env _
debug3: Ignored env JULIA_EDITOR
debug3: Ignored env TERM_PROGRAM
debug3: Ignored env TERM_PROGRAM_VERSION
debug1: channel 0: setting env LANG = "en_US.UTF-8"
debug2: channel 0: request env confirm 0
debug3: send packet: type 98
debug3: Ignored env COLORTERM
debug3: Ignored env VSCODE_ENV_REPLACE
debug3: Ignored env VSCODE_ENV_PREPEND
debug3: Ignored env VIRTUAL_ENV
debug3: Ignored env GIT_ASKPASS
debug3: Ignored env VSCODE_GIT_ASKPASS_NODE
debug3: Ignored env VSCODE_GIT_ASKPASS_EXTRA_ARGS
debug3: Ignored env VSCODE_GIT_ASKPASS_MAIN
debug3: Ignored env VSCODE_GIT_IPC_HANDLE
debug3: Ignored env TERM
debug3: Ignored env OPENBLAS_MAIN_FREE
debug3: Ignored env OPENBLAS_DEFAULT_NUM_THREADS
debug1: Sending command: pushd "/Julia" && julia --worker
debug2: channel 0: request exec confirm 1
debug3: send packet: type 98
debug3: client_repledge: enter
debug2: channel_input_open_confirmation: channel 0: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug3: receive packet: type 81
debug1: client_global_hostkeys_prove_confirm: server used untrusted RSA signature algorithm ssh-rsa for key 0, disregarding
debug3: client_repledge: enter
debug1: pledge: fork
debug2: channel 0: rcvd adjust 2097152
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
debug2: channel 0: rcvd ext data 41
The syntax of the command is incorrect.
debug2: channel 0: written 41 to efd 9
debug3: receive packet: type 96
debug2: channel 0: rcvd eof
debug2: channel 0: output open -> drain
debug2: channel 0: obuf empty
debug2: chan_shutdown_write: channel 0: (i0 o1 sock -1 wfd 8 efd 9 [write])
debug2: channel 0: output drain -> closed
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug3: receive packet: type 97
debug2: channel 0: rcvd close
debug2: chan_shutdown_read: channel 0: (i0 o3 sock -1 wfd 7 efd 9 [write])
debug2: channel 0: input open -> closed
debug3: channel 0: will not send data after close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug3: send packet: type 97
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
#0 client-session (t4 [session] r0 i3/0 o3/0 e[write]/0 fd -1/-1/9 sock -1 cc -1 io 0x00/0x00)
debug3: send packet: type 1
Transferred: sent 3252, received 2932 bytes, in 0.2 seconds
Bytes per second: sent 19759.6, received 17815.3
debug1: Exit status 1
ERROR: TaskFailedException
nested task error: Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1093
[2] worker_from_id
@ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1090 [inlined]
[3] remote_do
@ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:557 [inlined]
[4] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/managers.jl:731
[5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:604
[6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:545
[7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:501
caused by: Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
[1] read_worker_host_port(io::Base.PipeEndpoint)
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:330
[2] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/managers.jl:575
[3] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:600
[4] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:545
[5] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:501
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:448
[2] macro expansion
@ ./task.jl:480 [inlined]
[3] addprocs_locked(manager::Distributed.SSHManager; kwargs::@Kwargs{…})
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:490
[4] addprocs_locked
@ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:456 [inlined]
[5] addprocs(manager::Distributed.SSHManager; kwargs::@Kwargs{…})
@ Distributed /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:450
[6] addprocs
@ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:443 [inlined]
[7] #addprocs#255
@ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Distributed/src/managers.jl:159 [inlined]
[8] top-level scope
@ clustermanagement.jl:18
Some type information was truncated. Use `show(err)` to see complete types.
My current guess is that Distributed.jl is not formatting commands correctly for communication between macOS and Windows 10. A previously reported issue notes that the shell on Windows 10 is non-POSIX, but support for this has since been added to Distributed.jl through the shell=:wincmd option. I am not sure whether the problem lies in how I have configured my ssh connection, in the Distributed.jl implementation, or in Julia's handling of ssh communication between macOS and Windows 10. Any thoughts would be helpful.
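One detail in the debug output above may be relevant: the command sent is pushd "/Julia" && julia --worker, and "/Julia" looks like a POSIX-style path. The Distributed documentation says the dir keyword of addprocs defaults to the local pwd(), so it may be that the local macOS working directory is being handed to cmd.exe, which could explain the syntax error. This is a guess, not a confirmed diagnosis. A minimal sketch of a way to check it, assuming a directory such as C:\Users\user exists on the Windows machine (that path and the helper name add_windows_workers are hypothetical):

```julia
using Distributed

# Hypothetical helper: same options as the original call, plus an explicit
# `dir` so the remote working directory is a Windows path rather than the
# local macOS pwd().
function add_windows_workers(host::AbstractString; dir::AbstractString)
    return addprocs([(host, :auto)];
                    shell = :wincmd,
                    tunnel = false,
                    exename = "julia",
                    dir = dir)
end

# Example call (requires the ssh host to be reachable; the path is an assumption):
# add_windows_workers("user@hostname"; dir = raw"C:\Users\user")
```

If the pushd part of the command is the cause, the "Sending command" line in the -vvv output should then show the Windows path instead of "/Julia".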