Hi, I am new to distributed computing.
Everything works fine with Distributed on one machine. MPI.jl also seems to work on two machines (it appears to be slower than Distributed on one machine, but I still have to investigate that).
However, when I try to scale to two nodes on a PBS cluster (each node has 12 physical CPUs) with Distributed, I get "sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected" and "nested task error: Unable to read host:port string from worker."
Below I provide as much detail as I can. Can anyone help with this?
My shell is shown below (I do not know whether it makes any difference, but some posts I read suggested checking it, so I am including it):
ls -l $(which sh)
returns:
lrwxrwxrwx 1 root root 4 Jan 28 2020 /bin/sh -> dash
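Since sh is dash here and the message points at line 7 of /etc/profile.d/add-local-path.sh, my guess (an assumption, I have not confirmed it) is that the script uses a bash-only redirection such as a here-string or process substitution, which dash rejects with this kind of syntax error. A minimal sketch of how I would look for such lines; find_bash_redirections is a name I made up, and the pattern is only a rough heuristic:

```shell
#!/bin/sh
# find_bash_redirections is a hypothetical helper, not a standard tool.
# It lists lines of a script that use redirections dash does not understand.
find_bash_redirections() {
    # -n prints line numbers; -E enables extended regular expressions.
    # '<<<' matches here-strings; '[<>]\(' matches process substitution
    # such as <(cmd) or >(cmd). This is a heuristic and may report
    # false positives.
    grep -n -E '<<<|[<>]\(' "$1"
}
```

It would be called as find_bash_redirections /etc/profile.d/add-local-path.sh, and printing only the reported line with sed -n '7p' on the same file should show what dash is complaining about.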
When I submit the job with qsub launch_distributed_nodes.sh, which contains:
#!/bin/bash
#PBS -d . -l nodes=2:ppn=2 -l walltime=24:00:00
julia --machine-file=$PBS_NODEFILE -p 24 distributed_nodes.jl
with $PBS_NODEFILE containing:
node1
node1
node2
node2
and distributed_nodes.jl containing:
using Distributed
@everywhere begin
    using LinearAlgebra
    using MKL
    BLAS.set_num_threads(1)
    using "PackageName"
end
I receive the following error:
sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected
sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected
ERROR: TaskFailedException
nested task error: Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
[2] worker_from_id
@ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
[3] #remote_do#166
@ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:544 [inlined]
[4] remote_do
@ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:544 [inlined]
[5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/managers.jl:673
[6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
[7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[8] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:423
caused by: Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
[1] read_worker_host_port(io::Base.PipeEndpoint)
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:327
[2] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/managers.jl:517
[3] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
[4] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[5] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:423
...and 1 more exception.
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:381
[2] macro expansion
@ ./task.jl:400 [inlined]
[3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Cmd, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{Cmd}}})
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:487
[4] addprocs(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Cmd, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{Cmd}}})
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:447
[5] #addprocs#249
@ ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/managers.jl:143 [inlined]
[6] process_opts(opts::Base.JLOptions)
@ Distributed ~/packages/julias/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1339
[7] #invokelatest#2
@ ./essentials.jl:716 [inlined]
[8] invokelatest
@ ./essentials.jl:714 [inlined]
[9] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:252
[10] _start()
@ Base ./client.jl:495
/usr/sbin/kill-illegit-procs: line 86: kill: (1463198) - No such process
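One more thing I can double-check is how many worker slots each host gets from the machine file, since I pass both --machine-file and -p 24. A minimal sketch, assuming one plain hostname per line as in the $PBS_NODEFILE shown above; count_slots is a name I made up:

```shell
#!/bin/sh
# count_slots is a hypothetical helper name, not part of PBS or Julia.
# Assumes the machine file lists one plain hostname per line.
count_slots() {
    # Tally how often each hostname appears, skipping blank lines,
    # then sort so the output order is stable.
    awk 'NF { n[$1]++ } END { for (h in n) print h, n[h] }' "$1" | sort
}
```

Calling count_slots "$PBS_NODEFILE" inside the job script prints one line per host with its slot count; for the node file above that is "node1 2" and "node2 2".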