Hello everyone,
I am currently trying to run hybrid shared/distributed-memory code on a large computing cluster that is managed by SLURM. My code is roughly structured like this:
using Distributed
using ClusterManagers
addprocs(SlurmManager(number_of_nodes), topology = :master_worker)
include(joinpath(@__DIR__, "src.jl"))
main()
Here, “src.jl” is the code that I want to use; it loads several packages (e.g. HDF5, ParallelDataTransfer, …) and subfiles first on the master process and then shares them with the workers using @everywhere. main() is the function to be run from that code. Oddly, when I submit jobs I sometimes get IOError: unlink: no such file or directory (ENOENT), with the stack trace pointing to the addprocs line, and the job is terminated. If I resubmit, however, it sometimes works flawlessly. Is there any paradigm I am missing when using ClusterManagers on a SLURM system?
EDIT:
The full error message is:
ERROR: LoadError: TaskFailedException:
IOError: unlink: no such file or directory (ENOENT)
Stacktrace:
[1] uv_error at ./libuv.jl:97 [inlined]
[2] unlink(::String) at ./file.jl:781
[3] #rm#9(::Bool, ::Bool, ::typeof(rm), ::String) at ./file.jl:261
[4] rm at ./file.jl:253 [inlined]
[5] iterate at ./generator.jl:47 [inlined]
[6] _collect(::Array{String,1}, ::Base.Generator{Array{String,1},typeof(rm)}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:635
[7] collect_similar at ./array.jl:564 [inlined]
[8] map at ./abstractarray.jl:2073 [inlined]
[9] launch(::SlurmManager, ::Dict{Symbol,Any}, ::Array{WorkerConfig,1}, ::Base.GenericCondition{Base.AlwaysLockedST}) at /p/scratch/chku27/kiese1/julia_packages/packages/ClusterManagers/7pPEP/src/slurm.jl:39
[10] (::Distributed.var"#41#44"{SlurmManager,Dict{Symbol,Any},Array{WorkerConfig,1},Base.GenericCondition{Base.AlwaysLockedST}})() at ./task.jl:333
Stacktrace:
[1] wait at ./task.jl:251 [inlined]
[2] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}, ::typeof(Distributed.addprocs_locked), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:494
[3] #addprocs_locked at ./none:0 [inlined]
[4] #addprocs#39(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}, ::typeof(addprocs), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
[5] (::Distributed.var"#kw##addprocs")(::NamedTuple{(:topology,),Tuple{Symbol}}, ::typeof(addprocs), ::SlurmManager) at ./none:0
[6] top-level scope at /p/scratch/chku27/kiese1/fcc_ml/num_param/lattice/fcc_L6_W44_N125_1l.jl:5
[7] include at ./boot.jl:328 [inlined]
[8] include_relative(::Module, ::String) at ./loading.jl:1105
[9] include(::Module, ::String) at ./Base.jl:31
[10] exec_options(::Base.JLOptions) at ./client.jl:287
[11] _start() at ./client.jl:460