PBS cluster: "Syntax error: redirection unexpected" and "nested task error: Unable to read host:port string from worker."

j_u · December 5, 2021, 10:16pm

Hi, I am new to Distributed computing.

Everything works fine with Distributed on one machine. MPI.jl also seems to work on two machines (it appears to be slower than Distributed on one machine, which I still have to investigate).

However, when I try to scale to two nodes on a PBS cluster (each node with 12 physical CPUs) with Distributed, I receive "sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected" and "nested task error: Unable to read host:port string from worker." Below I provide as much detail as I can. Can anyone help with this?

My shell is shown below. (I do not know whether it makes any difference, but some posts I read suggested checking it, so I am including it.)
ls -l $(which sh) returns:
lrwxrwxrwx 1 root root 4 Jan 28 2020 /bin/sh -> dash
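
Since /bin/sh resolves to dash here, one thing worth checking is whether /etc/profile.d/add-local-path.sh relies on bash-only redirections. Here-strings (`<<<`) and process substitution (`< <(...)`) are bash extensions that dash does not implement, and dash typically reports them as "redirection unexpected". A minimal sketch of such a check follows; `find_bash_redirections` is a hypothetical helper name, and the grep pattern is a heuristic that covers only these two constructs.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): print the numbered lines of a
# script that use bash-only redirections which a POSIX sh such as dash rejects.
find_bash_redirections() {
    # '<<<'    here-string
    # '< <('   input redirection from a process substitution
    grep -nE '<<<|<[[:space:]]+<\(' "$1"
}

# Example (path taken from the error message):
#   find_bash_redirections /etc/profile.d/add-local-path.sh
```

The error message points at line 7 of that file, so a hit on line 7 would support this explanation.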

I submit the job with qsub launch_distributed_nodes.sh, where the script is:

#!/bin/bash
#PBS -d . -l nodes=2:ppn=2 -l walltime=24:00:00
julia --machine-file=$PBS_NODEFILE -p 24 distributed_nodes.jl

with $PBS_NODEFILE as:

node1
node1
node2
node2

and distributed_nodes.jl containing:

using Distributed
@everywhere begin
    using LinearAlgebra
    using MKL
    BLAS.set_num_threads(1)
    using PackageName  # placeholder for my own package
end

I receive the following error:

sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected
sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected
ERROR: TaskFailedException

    nested task error: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
     [2] worker_from_id
       @ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
     [3] #remote_do#166
       @ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:544 [inlined]
     [4] remote_do
       @ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:544 [inlined]
     [5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/managers.jl:673
     [6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
     [7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [8] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:423

    caused by: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] read_worker_host_port(io::Base.PipeEndpoint)
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:327
     [2] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/managers.jl:517
     [3] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
     [4] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [5] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:423

...and 1 more exception.

Stacktrace:
  [1] sync_end(c::Channel{Any})
    @ Base ./task.jl:381
  [2] macro expansion
    @ ./task.jl:400 [inlined]
  [3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Cmd, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{Cmd}}})
    @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:487
  [4] addprocs(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Cmd, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{Cmd}}})
    @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:447
  [5] #addprocs#249
    @ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/managers.jl:143 [inlined]
  [6] process_opts(opts::Base.JLOptions)
    @ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1339
  [7] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
  [8] invokelatest
    @ ./essentials.jl:714 [inlined]
  [9] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:252
 [10] _start()
    @ Base ./client.jl:495
/usr/sbin/kill-illegit-procs: line 86: kill: (1463198) - No such process
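
The stack trace shows the workers being started through Distributed.SSHManager. As far as I know, in Julia 1.7 this manager launches the remote julia through a login-mode `sh -l -c ...` command by default, which would make dash source /etc/profile and the scripts in /etc/profile.d; treat that as an assumption to verify. A rough way to reproduce the shell part without Julia is sketched below; `probe_login_sh` is a hypothetical helper name, and `node2` stands for any hostname from $PBS_NODEFILE.

```shell
#!/bin/sh
# Hypothetical helper: start a login-mode POSIX sh, as an SSH-launched worker
# would (assumption), and show everything it prints, including profile errors.
probe_login_sh() {
    if [ "$#" -eq 0 ]; then
        # No argument: probe the current machine.
        sh -l -c 'echo login-shell-ok' 2>&1
    else
        # One argument: probe a remote host over ssh (needs passwordless ssh).
        ssh "$1" "sh -l -c 'echo login-shell-ok'" 2>&1
    fi
}

# Example:
#   probe_login_sh node2
```

If the same "redirection unexpected" message shows up, and especially if login-shell-ok is missing from the output, the problem lies in the node's profile scripts rather than in the Julia code.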

j_u · December 6, 2021, 2:50pm

May I ask whether any part of my workflow looks incorrect? If so, please let me know. Otherwise, I will assume that it is best to consult the cluster's admins. I'd still appreciate any advice on what to change, as Julia is not natively supported there.
