What does this error mean?

I am trying to use the Distributed package from macOS to Ubuntu, with Julia 1.5.2 on both machines. The remote machine runs ssh on port 2222 rather than the usual 22. At one point I tried providing a bind address of uno:2222, but took it out; I am not clear about its use. I have also tried passing an sshflags value of -p 2222, as I do when calling ssh from the command line. Again, I am not clear whether I need it.

Here is my most recent attempt, followed by the error in question. I am not sure what it is telling me.

julia> using Distributed

julia> workervec = ["haz@uno:2222", 3]

2-element Array{Any,1}:

"haz@uno:2222"

3

julia> addprocs(workervec; dir="/home/haz", exename="/usr/bin/julia")

exception launching on machine 3 : MethodError(Distributed.launch_on_machine, (SSHManager(machines=Dict{Any,Any}(3 => 1,"haz@uno:2222" => 1)), 3, 1, Dict{Symbol,Any}(:lazy => true,:tunnel => false,:topology => :all_to_all,:multiplex => false,:sshflags => ``,:max_parallel => 10,:exeflags => ``,:enable_threaded_blas => false,:exename => "/usr/bin/julia",:dir => "/home/haz"), WorkerConfig[], Base.GenericCondition{Base.AlwaysLockedST}(Base.InvasiveLinkedList{Task}(Task (runnable) @0x0000000116868010, Task (runnable) @0x0000000116868010), Base.AlwaysLockedST(1))), 0x0000000000006ca4)

I think you want [("haz@uno:2222", 3)] (addprocs expects a vector of names or (name, count) tuples).
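
For reference, here is a minimal sketch of the two accepted shapes, using the host name and port from the question. As far as I recall from the `addprocs` documentation, a machine specification string has the form `[user@]host[:port] [bind_addr[:port]]`, where the first port is the ssh port and the optional bind address is where other workers connect to that worker. Nothing below opens a connection; it only builds the vectors, and the variable names are made up for illustration.

```julia
# Machine specs only; no connection is attempted here.
# "haz@uno:2222" is the spec from the question: user haz, host uno, ssh port 2222.
names_only  = ["haz@uno:2222"]        # a vector of names
with_counts = [("haz@uno:2222", 3)]   # a vector of (name, count) tuples

# The original attempt mixed a String and an Int in one vector, so the
# element type widens to Any and the 3 is treated as a machine of its own.
mixed = ["haz@uno:2222", 3]

println(eltype(names_only))
println(eltype(with_counts))
println(eltype(mixed))
```

Since the port is already part of the spec, passing -p 2222 through sshflags as well should not be needed, as far as I can tell.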

I tried the tuple. It looks like this error is on the macOS side. I still don’t know how to understand the error.

Here’s what happened:

julia> workervec = [("haz@uno:2222", 3)]

1-element Array{Tuple{String,Int64},1}:

("haz@uno:2222", 3)

julia> addprocs(workervec; dir="/home/haz", exename="/usr/bin/julia")

ERROR: TaskFailedException:

IOError: connect: connection timed out (ETIMEDOUT)

Stacktrace:

[1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1074

[2] worker_from_id at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1071 [inlined]

[3] #remote_do#154 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]

[4] remote_do at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]

[5] kill(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:603

[6] create_worker(::Distributed.SSHManager, ::WorkerConfig) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:585

[7] setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:526

[8] (::Distributed.var"#41#44"{Distributed.SSHManager,Array{Int64,1},WorkerConfig})() at ./task.jl:356

Stacktrace:

[1] sync_end( ::Channel{Any} ) at ./task.jl:314

[2] macro expansion at ./task.jl:333 [inlined]

[3] addprocs_locked( ::Distributed.SSHManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{6,Symbol},NamedTuple{(:tunnel, :multiplex, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Bool,Cmd,Int64,String,String}}} ) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:480

[4] addprocs( ::Distributed.SSHManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{6,Symbol},NamedTuple{(:tunnel, :multiplex, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Bool,Cmd,Int64,String,String}}} ) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:444

[5] #addprocs#241 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:120 [inlined]

[6] top-level scope at REPL[24]:1

Is the reason for the 2222 port that you’re trying to connect into WSL on a Windows machine?

The reason for 2222 is that I am trying to connect from a macOS host to a VirtualBox guest running Ubuntu. I have an RSA public key set up and can ssh into the guest without a password.

addprocs(3), for example, works on both host and guest without error.

My problem is that I can’t figure out how to troubleshoot, because I can’t interpret the error message above.

One thing I have noticed is that the path /Users/julia/buildbot appears repeatedly in the error. I can find nothing on my host that matches it. Do we have a problem with the macOS version of Julia?

I don’t think the paths in the error stacktrace are related to the problem, since I often see buildbot paths for errors inside the standard library (running on Windows).
I’ve had similar problems trying to run remote workers on WSL, which seemed to be due to WSL having a different IP address than the host machine (merely forwarding the ssh port was not enough). Maybe this is the case here?
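
If the port the worker listens on is not reachable from the host (which a timeout would be consistent with), one thing to try is the `tunnel` keyword of `addprocs`. Per the documentation, it makes the master connect to the worker through an SSH tunnel instead of directly. Below is a sketch, assuming the host spec, directory and executable path from the question. The wrapper function name is made up for illustration, and whether this fixes the VirtualBox case is untested.

```julia
using Distributed

# Hypothetical wrapper; dir and exename are the values used in the question.
function add_tunneled_workers(spec::AbstractString, n::Integer)
    machines = [(String(spec), Int(n))]
    # tunnel=true routes master-to-worker traffic over ssh rather than
    # connecting straight to the port the worker listens on.
    return addprocs(machines; tunnel=true, dir="/home/haz", exename="/usr/bin/julia")
end

# Example call (needs a reachable ssh server, so it is left commented out):
# add_tunneled_workers("haz@uno:2222", 3)
```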

The error may not be related to this, and a quick look at the docs did not turn up a direct statement, but I am almost sure that cross-OS distribution is not supported. I think so because it depends on https://docs.julialang.org/en/v1/stdlib/Serialization/#Serialization.serialize
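
For context, the mechanism being referred to looks roughly like this. The `serialize` docs, as I read them, warn about reading and writing with different Julia versions or system images rather than about operating systems as such. This sketch only demonstrates a same-machine round trip through an in-memory buffer, with a made-up payload, and does not settle the cross-OS question.

```julia
using Serialization

# Round-trip a value through an in-memory buffer on a single machine.
payload = (name = "job", values = [1.0, 2.5, 3.0])
io = IOBuffer()
serialize(io, payload)
seekstart(io)              # rewind so deserialize reads from the beginning
restored = deserialize(io)

println(restored == payload)
```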

By now, I suspect tistzamo might well be right: distribution only works between like OSes. I shall test this and let you all know. If it is true, I’ll have to launch any distributed process by other means. I am thinking of running it as a microservice; a Julia HTTP server would be nice. Failing that, it’ll be more polyglot.

I said I would test using two like OSes, and I did: Ubuntu to Ubuntu. Both machines have Julia 1.5.2 and are called uno and haz00 respectively. I can sign in from uno to haz00 over ssh with an RSA public key, no problem.

When I run the following (note the -v flag), all seems to go well at first: the debug output indicates a connection. After that, as you can see below, an error occurs and the remote is inaccessible.

I am still not sure what is going wrong.

julia> addprocs(workervec; sshflags=`-v`, dir="/home/gcr", exename=`julia`)

OpenSSH_7.9p1 Ubuntu-10, OpenSSL 1.1.1b 26 Feb 2019

debug1: Reading configuration data /etc/ssh/ssh_config

debug1: /etc/ssh/ssh_config line 19: Applying options for *

debug1: Connecting to haz00 [10.185.3.197] port 22.

debug1: Connection established.

debug1: identity file /home/haz/.ssh/id_rsa type 0

debug1: identity file /home/haz/.ssh/id_rsa-cert type -1

debug1: identity file /home/haz/.ssh/id_dsa type -1

debug1: identity file /home/haz/.ssh/id_dsa-cert type -1

debug1: identity file /home/haz/.ssh/id_ecdsa type -1

debug1: identity file /home/haz/.ssh/id_ecdsa-cert type -1

debug1: identity file /home/haz/.ssh/id_ed25519 type -1

debug1: identity file /home/haz/.ssh/id_ed25519-cert type -1

debug1: identity file /home/haz/.ssh/id_xmss type -1

debug1: identity file /home/haz/.ssh/id_xmss-cert type -1

debug1: Local version string SSH-2.0-OpenSSH_7.9p1 Ubuntu-10

debug1: Remote protocol version 2.0, remote software version OpenSSH_7.6p1 Ubuntu-4ubuntu0.3

debug1: match: OpenSSH_7.6p1 Ubuntu-4ubuntu0.3 pat OpenSSH_7.0*,OpenSSH_7.1*,OpenSSH_7.2*,OpenSSH_7.3*,OpenSSH_7.4*,OpenSSH_7.5*,OpenSSH_7.6*,OpenSSH_7.7* compat 0x04000002

debug1: Authenticating to haz00:22 as 'gcr'

debug1: SSH2_MSG_KEXINIT sent

debug1: SSH2_MSG_KEXINIT received

debug1: kex: algorithm: curve25519-sha256

debug1: kex: host key algorithm: ecdsa-sha2-nistp256

debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none

debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none

debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

debug1: Server host key: ecdsa-sha2-nistp256 SHA256:S98DQb8FCMuzIi1cjMBn8lvX/J5bRBVVj7QM9AMWL7I

debug1: Host 'haz00' is known and matches the ECDSA host key.

debug1: Found key in /home/haz/.ssh/known_hosts:2

debug1: rekey after 134217728 blocks

debug1: SSH2_MSG_NEWKEYS sent

debug1: expecting SSH2_MSG_NEWKEYS

debug1: SSH2_MSG_NEWKEYS received

debug1: rekey after 134217728 blocks

debug1: Will attempt key: /home/haz/.ssh/id_rsa RSA SHA256:GoGcdDxfYU6S1GLEHNPjaaaus0Fbhg84Rupe4kg2h5c

debug1: Will attempt key: /home/haz/.ssh/id_dsa

debug1: Will attempt key: /home/haz/.ssh/id_ecdsa

debug1: Will attempt key: /home/haz/.ssh/id_ed25519

debug1: Will attempt key: /home/haz/.ssh/id_xmss

debug1: SSH2_MSG_EXT_INFO received

debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>

debug1: SSH2_MSG_SERVICE_ACCEPT received

debug1: Authentications that can continue: publickey

debug1: Next authentication method: publickey

debug1: Offering public key: /home/haz/.ssh/id_rsa RSA SHA256:GoGcdDxfYU6S1GLEHNPjaaaus0Fbhg84Rupe4kg2h5c

debug1: Server accepts key: /home/haz/.ssh/id_rsa RSA SHA256:GoGcdDxfYU6S1GLEHNPjaaaus0Fbhg84Rupe4kg2h5c

debug1: Authentication succeeded (publickey).

Authenticated to haz00 ([10.185.3.197]:22).

debug1: channel 0: new [client-session]

debug1: Requesting no-more-sessions@openssh.com

debug1: Entering interactive session.

debug1: pledge: network

debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0

debug1: Sending environment.

debug1: Sending env LANG = en_US.UTF-8

debug1: Sending command: sh -l -c 'cd -- /home/gcr

julia --worker'

ERROR: TaskFailedException:

IOError: connect: connection refused (ECONNREFUSED)

Stacktrace:

[1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1074

[2] worker_from_id at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1071 [inlined]

[3] #remote_do#154 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]

[4] remote_do at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]

[5] kill(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:603

[6] create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:585

[7] setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:526

[8] (::Distributed.var"#41#44"{Distributed.SSHManager,Array{Int64,1},WorkerConfig})() at ./task.jl:356

Stacktrace:

[1] sync_end( ::Channel{Any} ) at ./task.jl:314

[2] macro expansion at ./task.jl:333 [inlined]

[3] addprocs_locked( ::Distributed.SSHManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{6,Symbol},NamedTuple{(:tunnel, :multiplex, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Bool,Cmd,Int64,String,Cmd}}} ) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:480

[4] addprocs( ::Distributed.SSHManager; kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{6,Symbol},NamedTuple{(:tunnel, :multiplex, :sshflags, :max_parallel, :dir, :exename),Tuple{Bool,Bool,Cmd,Int64,String,Cmd}}} ) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:444

[5] #addprocs#241 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:120 [inlined]

[6] top-level scope at REPL[31]:1
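
The ssh log shows the login succeeding and `julia --worker` being started, so the refused connection seems to come from the next step, where the master opens a plain TCP connection to the port the worker reports. A rough way to check reachability from the master side is sketched below. The helper name and the port number in the commented example are placeholders, not values taken from the log.

```julia
using Sockets

# Hypothetical helper: returns true if a TCP connection to host:port succeeds,
# false if the attempt fails for any reason (refused, timed out, name lookup).
function can_connect(host::AbstractString, port::Integer)
    try
        sock = connect(host, port)
        close(sock)
        return true
    catch
        return false
    end
end

# Example (placeholder port; substitute the port the worker actually uses):
# can_connect("haz00", 9009)
```

If this returns false for the worker's port while ssh itself works, a firewall or a worker bound to an unreachable interface would be the first things I would look at.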

I’ve used cross-OS distribution many times before (macOS → Windows, macOS → Linux, Windows → Linux). Last time I checked was in a workshop I gave in March, see here (macOS → Linux).

Having said that, I’m also currently trying to start workers on an Ubuntu VM from macOS, and it doesn’t work for me. Even more surprising, my old examples don’t work anymore (again, they worked fine in March). I don’t know what’s going on.