Julia Distributed can't find ssh key, can't use remote machine

Hi all

I have the following script, 06_test_nuc12.jl, which does not seem to find the private key for ssh:

using Distributed

procs = addprocs(["dbp@79.152.67.28:8620", 4], 
    dir="/home/dbp", 
    exename="/home/dbp/julia/bin/julia", 
    sshflags="-i /Users/dbuchaca/.ssh/nuc12"
)
println("procs: ",procs)

Nevertheless, ssh -i /Users/dbuchaca/.ssh/nuc12 -p 8620 dbp@79.152.67.28 works fine.

The Julia script produces the error below. It seems ssh can’t find the private key: it reports /Users/dbuchaca/.ssh/nuc12 as not accessible.

Error message:

julia 06_test_nuc12.jl 

Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 79.152.67.28 port 8620:2: Too many authentication failures
Disconnected from 79.152.67.28 port 8620
ERROR: LoadError: TaskFailedException

    nested task error: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1093
     [2] worker_from_id
       @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1090 [inlined]
     [3] remote_do
       @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:557 [inlined]
     [4] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/managers.jl:736
     [5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:604
     [6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:545
     [7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:501
    
    caused by: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] read_worker_host_port(io::Base.PipeEndpoint)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:330
     [2] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/managers.jl:580
     [3] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:600
     [4] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:545
     [5] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:501
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:466
 [2] macro expansion
   @ ./task.jl:499 [inlined]
 [3] addprocs_locked(manager::Distributed.SSHManager; kwargs::@Kwargs{dir::String, exename::String, sshflags::String, tunnel::Bool})
   @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:490
 [4] addprocs_locked
   @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:456 [inlined]
 [5] addprocs(manager::Distributed.SSHManager; kwargs::@Kwargs{dir::String, exename::String, sshflags::String, tunnel::Bool})
   @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:450
 [6] addprocs
   @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:443 [inlined]
 [7] #addprocs#255
   @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/managers.jl:159 [inlined]
 [8] top-level scope
   @ ~/personal/git_stuff/julia_tutorials/basics/distributed/06_test_nuc12.jl:4

Any hints on what I might be doing wrong? I should mention that the machine is remote, not on my local network (but I would expect the script to work in that case as well).

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 10 default, 0 interactive, 5 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = 10

I have very little experience with windows, and this is a long shot. In the error message there seems to be a space in front of the filename, i.e. two spaces between “file” and “/Users…”. Could it be that some magic is done, and things fail? Have you tried sshflags="-i/Users/...", i.e. no space after -i?
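
The effect of that space can be checked locally, without connecting to anything. The sketch below assumes that Distributed splices sshflags into the ssh command the same way Julia interpolates any value into a Cmd; I have not verified that against the Distributed source. The host name "example-host" is a placeholder, and the commands are only built, never run. A String is interpolated as one argument, so ssh would receive "-i /Users/..." as a single word and would look for a key file whose name starts with a space. A Cmd literal is split into separate arguments, which is also the form the addprocs documentation uses for sshflags.

```julia
# Compare how a String and a Cmd are interpolated into a command.
key = "/Users/dbuchaca/.ssh/nuc12"

flags_string = "-i $key"   # one String, space included
flags_cmd    = `-i $key`   # a Cmd holding two separate arguments

# "example-host" is a placeholder; nothing is executed here.
cmd_from_string = `ssh $flags_string example-host`
cmd_from_cmd    = `ssh $flags_cmd example-host`

println(cmd_from_string.exec)  # ["ssh", "-i /Users/dbuchaca/.ssh/nuc12", "example-host"]
println(cmd_from_cmd.exec)     # ["ssh", "-i", "/Users/dbuchaca/.ssh/nuc12", "example-host"]
```

If that assumption holds, passing sshflags as a backtick Cmd should behave like the command typed in the shell, with or without the space.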

Now that you mention this… I am trying this code on a mac with apple silicon and my remote machine is X86. Could this be a problem?

I tried without the space with sshflags="-i/Users/dbuchaca/.ssh/nuc12"

Now I don’t get the same error message; it no longer complains about not finding the private key!

julia 06_test_nuc12.jl 
exception launching on machine 4 :MethodError(Distributed.launch_on_machine, 
(SSHManager(machines=Dict{Any, Any}(4 => 1, "dbp@79.152.67.28:8620" => 1)),
 4, 1, Dict{Symbol, Any}(:ssh => "ssh", :cmdline_cookie => false, :env => Any[],
 :multiplex => false, :sshflags => "-i/Users/dbuchaca/.ssh/nuc12", :max_parallel => 10, 
:exeflags => ``, :enable_threaded_blas => false, :lazy => true, :tunnel => false, :topology =>
 :all_to_all, :shell => :posix, :exename => "/home/dbp/julia/bin/julia", :dir => "/home/dbp"), 
WorkerConfig[], Condition(Base.IntrusiveLinkedList{Task}(Task (runnable, started) @0x0000000108670010, Task (runnable, started) @0x0000000108670010), Base.AlwaysLockedST(1))), 0x0000000000006801)


ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection timed out (ETIMEDOUT)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1093
     [2] worker_from_id
       @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1090 [inlined]
     [3] remote_do
       @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:557 [inlined]
     [4] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/managers.jl:736
     [5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:604
     [6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:545
     [7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:501
    
    caused by: IOError: connect: connection timed out (ETIMEDOUT)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Sockets/src/Sockets.jl:528
     [2] connect
       @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Sockets/src/Sockets.jl:563 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/managers.jl:695
     [4] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/managers.jl:622
     [5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:600
     [6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:545
     [7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:501
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:466
 [2] macro expansion
   @ ./task.jl:499 [inlined]
 [3] addprocs_locked(manager::Distributed.SSHManager; kwargs::@Kwargs{dir::String, exename::String, sshflags::String})
   @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:490
 [4] addprocs_locked
   @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:456 [inlined]
 [5] addprocs(manager::Distributed.SSHManager; kwargs::@Kwargs{dir::String, exename::String, sshflags::String})
   @ Distributed /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:450
 [6] addprocs
   @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:443 [inlined]
 [7] #addprocs#255
   @ /Applications/Julia-1.11.app/Contents/Resources/julia/share/julia/stdlib/v1.11/Distributed/src/managers.jl:159 [inlined]
 [8] top-level scope
   @ ~/personal/git_stuff/julia_tutorials/basics/distributed/06_test_nuc12.jl:3

If the architecture is a problem it would show up later, I think. The ssh connection should succeed.

Btw, it is possible to avoid some clutter in the julia script by adding your remote machine to ~/.ssh/config as:

host jserver
  hostname 79.152.67.28
  user dbp
  port 8620
  identityfile ~/.ssh/nuc12

then connect from Julia with

julia> procs = addprocs(["jserver"], dir="...", exename="...")

or from the shell with $ ssh jserver.

Ok, now the ssh succeeds, but it seems that julia either does not start on the remote, or you can’t connect properly to the workers. It might be a problem with firewalls?

I’m not very experienced with Distributed, but if you add a tunnel=true, i.e. addprocs(..., tunnel=true), it should use the ssh-channel to communicate rather than creating new connections. This will bypass firewalls. But it might slow down the connection, I’m not sure.

Many thanks @sgaure !

using Distributed

procs = addprocs(["dbp@79.152.67.28:8620", 4], 
    dir="/home/dbp", 
    exename="/home/dbp/julia/bin/julia", 
    sshflags="-i/Users/dbuchaca/.ssh/nuc12",
    tunnel=true,
)
println("\n\nprocs: ",procs)

This indeed runs:

julia 06_test_nuc12.jl 
exception launching on machine 4 : MethodError(Distributed.launch_on_machine, (SSHManager(machines=Dict{Any, Any}(4 => 1, "dbp@79.152.67.28:8620" => 1)), 4, 1, Dict{Symbol, Any}(:ssh => "ssh", :cmdline_cookie => false, :env => Any[], :multiplex => false, :sshflags => "-i/Users/dbuchaca/.ssh/nuc12", :max_parallel => 10, :exeflags => ``, :enable_threaded_blas => false, :lazy => true, :tunnel => true, :topology => :all_to_all, :shell => :posix, :exename => "/home/dbp/julia/bin/julia", :dir => "/home/dbp"), WorkerConfig[], Condition(Base.IntrusiveLinkedList{Task}(Task (runnable, started) @0x000000010816c010, Task (runnable, started) @0x000000010816c010), Base.AlwaysLockedST(1))), 0x0000000000006801)


procs: [2]
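
One thing that stands out in the output above: it still prints "exception launching on machine 4", the manager shows machines=Dict(4 => 1, "dbp@79.152.67.28:8620" => 1), and only one worker comes back. That suggests the 4 in ["dbp@79.152.67.28:8620", 4] is being read as a second machine rather than as a worker count. The addprocs documentation describes a machine specification as either a string or a (machine_spec, count) tuple, so a sketch of the call, assuming the intent was four workers on the one host, might look like this. The function name add_remote_workers is made up for illustration, and it is only defined here, not called, because calling it needs the remote machine.

```julia
using Distributed

# As written in the script: a two-element vector, so `4` counts as its own machine spec.
as_written = ["dbp@79.152.67.28:8620", 4]

# Documented form for "count workers on this host": a (machine_spec, count) tuple.
as_tuple = [("dbp@79.152.67.28:8620", 4)]

println(length(as_written))  # 2 machine specs
println(length(as_tuple))    # 1 machine spec

# Hypothetical wrapper around the call from the script (assumption: 4 workers wanted).
# sshflags is given as a Cmd, the form shown in the addprocs documentation.
function add_remote_workers()
    return addprocs(as_tuple;
        dir="/home/dbp",
        exename="/home/dbp/julia/bin/julia",
        sshflags=`-i /Users/dbuchaca/.ssh/nuc12`,
        tunnel=true,
    )
end
```

With the tuple form I would expect procs to list four worker ids instead of one, but I have not been able to test that against this machine.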