julia> using Distributed, ClusterManagers
julia> addprocs(SGEManager(1,""), qsub_env="",res_list="")
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /lmb/home/alewis && /lmb/home/alewis/local/julia-1.0.0/bin/julia --worker=EiIxQRHNaEkyxSBA'`, ProcessExited(0)), Process(`qsub -N julia-47120 -terse -j y -R y -t 1-1 -V`, ProcessExited(0))], Base.DevNull(), Pipe(RawFD(0xffffffff) closed => RawFD(0x00000014) open, 0 bytes waiting), Base.TTY(RawFD(0x0000000f) open, 0 bytes waiting)),), 0x00000000000061b3)
0-element Array{Int64,1}
julia> addprocs_sge(1)
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /lmb/home/alewis && /lmb/home/alewis/local/julia-1.0.0/bin/julia --worker=EiIxQRHNaEkyxSBA'`, ProcessRunning), Process(`qsub -N julia-47120 -terse -j y -R y -t 1-1 -V`, ProcessRunning)], Base.DevNull(), Pipe(RawFD(0xffffffff) closed => RawFD(0x00000015) open, 0 bytes waiting), Base.TTY(RawFD(0x0000000f) open, 0 bytes waiting)),), 0x00000000000061b3)
0-element Array{Int64,1}
I’ve twiddled with a few things, like checking that `qsub -N julia-47120 -terse -j y -R y -t 1-1 -V` itself runs without error, but I don’t really know how to go about debugging this further, short of reading up on each of the functions named in the error or the `launch` method. Any suggestions?
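One way to narrow this down without touching the queue: the error text shows `iterate` being called on a `Base.ProcessChain`, which is what tuple destructuring of a non-iterable value does. The sketch below reproduces that pattern with stand-in commands (`echo` and `cat` instead of the real `echo ... | qsub ...` pipeline). That the `launch` method destructures the result of `open` in this way is an assumption, not something confirmed from the ClusterManagers source here.

```julia
# Stand-in for the `echo '...' | qsub ...` pipeline from the error message;
# `echo` and `cat` are placeholders so this runs without SGE.
cmd = pipeline(`echo hello`, `cat`)

# Opening the pipeline for reading returns one process-like object
# whose output can be read directly.
p = open(cmd)
output = read(p, String)

# Destructuring the result as a (stream, process) pair calls `iterate` on it.
# If the object is not iterable, this raises a MethodError like the one above.
err = try
    stream, proc = open(cmd)
    nothing
catch e
    e
end

println("pipeline output: ", repr(output))
println("destructuring raised: ", typeof(err))
```

If the second step raises a `MethodError` on your Julia version too, the failure is likely in how the launch code consumes the result of `open`, rather than in `qsub` itself.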
As a workaround, I’ve been queueing myself with `qrsh` inside a `tmux` session and then launching `julia` once I’m out of the queue, or writing Julia scripts that can be called from a `qsub` script. In doing so I’ve noticed that adding processes with the `addprocs(machines)` syntax fails for any hostname not in my `.ssh/known_hosts`. This is, I expect, unrelated to the main issue, but I’ve included the error message below just in case.
Error for an unknown host:
julia> addprocs(["fmg01"])
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
ERROR: Unable to read host:port string from worker. Launch command exited with error?
error(::String) at ./error.jl:33
read_worker_host_port(::Pipe) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:273
connect(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:397
create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:505
setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:451
(::getfield(Distributed, Symbol("##47#50")){Distributed.SSHManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
[1] sync_end(::Array{Any,1}) at ./task.jl:226
[2] #addprocs_locked#44(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at ./task.jl:266
[3] #addprocs_locked at ./none:0 [inlined]
[4] #addprocs#43(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:369
[5] #addprocs at ./none:0 [inlined]
[6] #addprocs#251(::Bool, ::Cmd, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:118
[7] addprocs(::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:117
[8] top-level scope at none:0
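The `ssh_askpass` and `Host key verification failed` lines suggest that `ssh` tried to prompt for host-key confirmation with no terminal available. A minimal sketch of one possible way around that is to pass extra options through the `sshflags` keyword of `addprocs`. The helper names below are hypothetical, and `StrictHostKeyChecking=no` makes `ssh` accept unknown host keys without asking, so this is only reasonable on a trusted cluster network.

```julia
using Distributed

# Hypothetical helper: ssh options that never prompt.
# BatchMode=yes disables interactive prompts (password, passphrase);
# StrictHostKeyChecking=no accepts host keys missing from known_hosts.
noninteractive_sshflags() = `-o BatchMode=yes -o StrictHostKeyChecking=no`

# Hypothetical wrapper around the `addprocs(machines)` form used above.
function add_ssh_workers(machines::Vector{String})
    return addprocs(machines; sshflags=noninteractive_sshflags())
end

# Usage (requires passwordless ssh to the host, so it is not run here):
# add_ssh_workers(["fmg01"])
```

A stricter alternative is to connect to each node once by hand with `ssh`, so that its key is recorded in `.ssh/known_hosts` before calling `addprocs`.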
`julia` here is the precompiled 1.0.0 binary for Linux, running on a shared-filesystem Scientific Linux 7.4 cluster with SGE 6.2u3 installed.