Question about cluster setup on Windows

I am learning how to build a computer cluster for computation. I have started the SSH service on a server running Windows 10 and installed the same Julia version (1.9.3, the latest) at the same path on both machines. I can successfully SSH into the server from my laptop running Windows 11. However, when I call addprocs(["user@remote_IP_address"]), the following error is thrown:

ERROR: TaskFailedException

nested task error: Unable to read host:port string from worker. Launch command exited with error?

What did I miss in the setup above that causes this failure? Or, preferably, could you share a setup that worked for you?

I would suggest considering a Linux-based cloud service:

AWS: AWS ParallelCluster (Amazon Web Services)
Azure: High-performance computing (HPC) on Azure (Azure Architecture Center, Microsoft Learn)

I have been building HPC clusters for over 20 years. Go with the flow.

Really cool! But it doesn't actually resolve my problem :rofl:


Does it help if you use the SSH server in MobaXterm?
https://mobaxterm.mobatek.net/features.html

@lionisxn Did you ever solve this?

I have a Linux host that can log in to a Windows host over key-based SSH and start julia.exe there.

However, on the Linux machine:

julia> addprocs(["sob@win10-work"],shell=:wincmd)
The syntax of the command is incorrect.
ERROR: TaskFailedException
 
    nested task error: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1093
     [2] worker_from_id
       @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1090 [inlined]
     [3] remote_do
       @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:557 [inlined]
     [4] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/managers.jl:736
     [5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:604
     [6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:545
     [7] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:501
    
    caused by: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] read_worker_host_port(io::Base.PipeEndpoint)
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:330
     [2] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/managers.jl:580
     [3] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:600
     [4] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:545
     [5] (::Distributed.var"#45#48"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:501
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:466
 [2] macro expansion
   @ ./task.jl:499 [inlined]
 [3] addprocs_locked(manager::Distributed.SSHManager; kwargs::@Kwargs{shell::Symbol})
   @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:490
 [4] addprocs_locked
   @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:456 [inlined]
 [5] addprocs(manager::Distributed.SSHManager; kwargs::@Kwargs{shell::Symbol})
   @ Distributed ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:450
 [6] addprocs
   @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:443 [inlined]
 [7] #addprocs#255
   @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/managers.jl:159 [inlined]
 [8] top-level scope
   @ REPL[16]:1
 
julia> 

I have tried setting exename explicitly, but that doesn't change the outcome. I should also note that I can create a worker on a remote Linux host without issue (also with key-based SSH auth).

It turns out I only needed to add dir=nothing to the addprocs() call above for it to work. Now I successfully get:

julia> @show remotecall_fetch(Sys.windows_version,26)
remotecall_fetch(Sys.windows_version, 26) = v"10.0.19045"
v"10.0.19045"
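For anyone landing here later, here is a minimal sketch of the call that worked, wrapped in a small helper. The helper name add_windows_worker is made up for illustration and is not part of Distributed. The sketch assumes a Linux master with key-based SSH access to a Windows host whose SSH login shell is cmd.exe. A plausible reason dir=nothing helps is that addprocs defaults dir to the master's current directory (pwd()), and a Linux path is unlikely to exist on the Windows side. That explanation is a guess and has not been confirmed as the source of "The syntax of the command is incorrect."

```julia
using Distributed

# Hypothetical helper (not part of Distributed): launch one worker on a
# Windows host over SSH from a non-Windows master.
function add_windows_worker(host::AbstractString; kwargs...)
    # shell=:wincmd -> build the remote launch command for cmd.exe
    # dir=nothing   -> do not try to change into the master's pwd() on the worker
    # Extra keyword arguments (e.g. exename) are passed through to addprocs.
    return addprocs([host]; shell=:wincmd, dir=nothing, kwargs...)
end

# Usage (needs a reachable host; the host name is the one from the post above):
# pids = add_windows_worker("sob@win10-work")
# remotecall_fetch(Sys.windows_version, first(pids))
```

If julia.exe is not on the remote PATH, passing exename through the helper's keyword arguments should still work, since they are forwarded to addprocs unchanged.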