Error with addprocs_sge()

thielehhi · June 16, 2017, 1:08pm

Has anyone run into the same issue and found a solution?
I started using Julia together with the ClusterManagers package on our HPC cluster. I am trying to start workers with the following lines:

## parameters taken from the command-line arguments
np = parse(Int, ARGS[1])
queue = ARGS[2]
if nprocs()>1 || workers()[1] != myid()
  rmprocs(workers())
end
print("example script for HPC on SGE... \n")
## needed packages
using ClusterManagers
## initial parameters for calculation
n = 100
num_workers = np # number of workers to request
sleep(1)
# you can check which workers are currently active
print("current internal IDs of workers (without SGE): ", workers()," \n")
print("Now $num_workers workers are added (SGE), this can take some time... \n")
@time addprocs_sge(num_workers, queue=queue, topology=:master_slave)
@everywhere using HDF5
## hdf-write
@everywhere function foo(n)
  a = randn(n,n)
  id = myid()
  h5write("task_id_$id.h5", "a", a)
end

If num_workers is small (< 200), everything works. If num_workers is 600 or larger, however, I get the following error message:

fatal: error thrown and no exception handler available.
UndefRefError()
unknown function (ip: 0x2b9d7fe6b6f7)
unknown function (ip: 0x2b9d7fe43811)
jl_throw at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
showerror at ./replutil.jl:254
#showerror#919 at ./replutil.jl:210
unknown function (ip: 0x2b9e0e784f19)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
#showerror at ./<missing>:0
unknown function (ip: 0x2b9e0e784d42)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
showerror at ./task.jl:23
unknown function (ip: 0x2b9e0e784a06)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
showerror at ./task.jl:39
#showerror#919 at ./replutil.jl:210
unknown function (ip: 0x2b9e0e784659)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
#showerror at ./<missing>:0
unknown function (ip: 0x2b9e0e7843d2)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
#showerror#920 at ./replutil.jl:218
unknown function (ip: 0x2b9e0e784079)
#939 at ./client.jl:100
unknown function (ip: 0x2b9e0e783e22)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
with_output_color at ./util.jl:303
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
unknown function (ip: 0x2b9d8aae0d87)
unknown function (ip: 0x2b9d8aae17b8)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.0.5 (unknown line)
unknown function (ip: 0x4019ee)
unknown function (ip: 0x401399)
__libc_start_main at /usr/bin/../lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4013df)
WARNING: Error trying to reuse client port number, falling back to plain socket : cannot obtain socket name: operation not permitted (EPERM)
ERROR: LoadError: 
 in connect_to_worker(::SubString{String}, ::Int16) at ./managers.jl:482
 in connect(::ClusterManagers.SGEManager, ::Int64, ::WorkerConfig) at ./managers.jl:425
 in create_worker(::ClusterManagers.SGEManager, ::WorkerConfig) at ./multi.jl:1786
 in setup_launched_worker(::ClusterManagers.SGEManager, ::WorkerConfig, ::Array{Int64,1}) at ./multi.jl:1733
 in (::Base.##669#673{ClusterManagers.SGEManager,Array{Int64,1}})() at ./task.jl:360
 in macro expansion at ./task.jl:327 [inlined]
 in #addprocs_locked#665(::Array{Any,1}, ::Function, ::ClusterManagers.SGEManager) at ./multi.jl:1688
 in (::Base.#kw##addprocs_locked)(::Array{Any,1}, ::Base.#addprocs_locked, ::ClusterManagers.SGEManager) at ./<missing>:0
 in #addprocs#664(::Array{Any,1}, ::Function, ::ClusterManagers.SGEManager) at ./multi.jl:1658
 in (::Base.#kw##addprocs)(::Array{Any,1}, ::Base.#addprocs, ::ClusterManagers.SGEManager) at ./<missing>:0
 in (::ClusterManagers.#kw##addprocs_sge)(::Array{Any,1}, ::ClusterManagers.#addprocs_sge, ::Int64) at ./<missing>:0UndefRefError()
example script for HPC on SGE... 
current internal IDs of workers (without SGE): [1] 
Now 610 workers are added (SGE), this can take some time... 
job id is 86009, waiting for job to start .................................Error launching workers
could not spawn `tail -f /data/cluster/users/test_user/.julia_logs/julia-2805.o86009.481`: too many open files (EMFILE)
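
The last line (EMFILE) suggests that the master process may be running out of file descriptors, since a `tail -f` process is spawned for each worker. A minimal sketch for checking the soft open-file limit from within Julia, assuming a POSIX `sh` is available on the master node; `open_file_limit` is a hypothetical helper name used here for illustration and is not part of ClusterManagers:

```julia
# Hypothetical helper: ask a child shell for the soft limit on open files.
# The child process inherits the resource limits of the Julia process that spawns it.
function open_file_limit()
    s = readchomp(`sh -c "ulimit -n"`)
    # `ulimit -n` prints either a number or the word "unlimited"
    return s == "unlimited" ? typemax(Int) : parse(Int, s)
end

limit = open_file_limit()
println("soft open-file limit of the master process: ", limit)
```

If the reported limit is in the same range as the requested number of workers, that would be consistent with the failure appearing only for large num_workers, but this is only a guess.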

Any suggestions?
