addprocs() fails with IOError on SLURM

Hello everyone,

I am currently trying to run hybrid shared/distributed-memory code on a large computing cluster that is managed with SLURM. My code is roughly structured like this:

using Distributed
using ClusterManagers

addprocs(SlurmManager(number_of_nodes), topology = :master_worker)
include(joinpath(@__DIR__, "src.jl"))

Here, “src.jl” is the code I want to run. It first loads several packages (e.g. HDF5, ParallelDataTransfer … ) and subfiles on the master process, then shares them with the workers using @everywhere. main() is the function to be run from that code. Oddly, when I submit jobs I sometimes get IOError: unlink: no such file or directory (ENOENT), with the stack trace pointing to the addprocs line, and the job is terminated. If I resubmit, however, it sometimes works flawlessly. Is there any paradigm I am missing when using ClusterManagers on a SLURM system?


The full error message is

ERROR: LoadError: TaskFailedException:
IOError: unlink: no such file or directory (ENOENT)
 [1] uv_error at ./libuv.jl:97 [inlined]
 [2] unlink(::String) at ./file.jl:781
 [3] #rm#9(::Bool, ::Bool, ::typeof(rm), ::String) at ./file.jl:261
 [4] rm at ./file.jl:253 [inlined]
 [5] iterate at ./generator.jl:47 [inlined]
 [6] _collect(::Array{String,1}, ::Base.Generator{Array{String,1},typeof(rm)}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:635
 [7] collect_similar at ./array.jl:564 [inlined]
 [8] map at ./abstractarray.jl:2073 [inlined]
 [9] launch(::SlurmManager, ::Dict{Symbol,Any}, ::Array{WorkerConfig,1}, ::Base.GenericCondition{Base.AlwaysLockedST}) at /p/scratch/chku27/kiese1/julia_packages/packages/ClusterManagers/7pPEP/src/slurm.jl:39
 [10] (::Distributed.var"#41#44"{SlurmManager,Dict{Symbol,Any},Array{WorkerConfig,1},Base.GenericCondition{Base.AlwaysLockedST}})() at ./task.jl:333
 [1] wait at ./task.jl:251 [inlined]
 [2] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}, ::typeof(Distributed.addprocs_locked), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:494
 [3] #addprocs_locked at ./none:0 [inlined]
 [4] #addprocs#39(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}, ::typeof(addprocs), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
 [5] (::Distributed.var"#kw##addprocs")(::NamedTuple{(:topology,),Tuple{Symbol}}, ::typeof(addprocs), ::SlurmManager) at ./none:0
 [6] top-level scope at /p/scratch/chku27/kiese1/fcc_ml/num_param/lattice/fcc_L6_W44_N125_1l.jl:5
 [7] include at ./boot.jl:328 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:1105
 [9] include(::Module, ::String) at ./Base.jl:31
 [10] exec_options(::Base.JLOptions) at ./client.jl:287
 [11] _start() at ./client.jl:460

I am not completely sure, since I do not have access to my cluster these days, but on SLURM you usually have to use absolute paths when working with files, because the workers can start out with a different working directory, so relative paths resolve differently there.
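To make the absolute-path idea concrete, here is a rough sketch rather than a tested SLURM setup. It uses local workers via addprocs(2) as a stand-in for addprocs(SlurmManager(...)). The temporary file and the function double are made up for illustration and only play the role of src.jl.

```julia
using Distributed

# Assumption: local workers stand in for addprocs(SlurmManager(...)),
# so the sketch can run without a cluster.
addprocs(2)

# Hypothetical stand-in for src.jl, written to a temporary directory.
dir = mktempdir()
src = joinpath(dir, "src.jl")
write(src, "double(x) = 2x\n")

# abspath resolves a relative path against the master's current working
# directory, so every process receives the same complete path.
src_abs = abspath(src)

# `$` interpolates the master's value into the expression sent to the workers;
# without it, each worker would look for a variable `src_abs` of its own.
@everywhere include($src_abs)

# Call the included function on each worker to check that it was loaded there.
results = [remotecall_fetch(double, w, 21) for w in workers()]

rmprocs(workers())
```

On a real cluster the file also has to live on a filesystem that all nodes can see. An absolute path alone does not guarantee that.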

I hope this helps.

Thanks for the reply. That is actually how I do it, and I have edited my post accordingly. In any case, the error seems to occur already when the ClusterManager tries to set up the workers, so no files should have been loaded at that point.
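Reading the trace again, frames [2] to [9] show rm being mapped over an Array{String,1} inside launch (slurm.jl:39), and unlink failing with ENOENT. One possible reading is that one of those paths no longer exists by the time rm is called. The following is only a sketch of that failure mode in plain Julia, not the ClusterManagers code, and the file name is made up:

```julia
# Hypothetical path that is never created, to mimic a file that is already gone.
dir = mktempdir()
stale = joinpath(dir, "job0000.out")

# Plain rm on a missing path throws an IOError (ENOENT), as in the trace above.
threw = try
    rm(stale)
    false
catch err
    err isa Base.IOError
end

# With force = true, a non-existent path is not treated as an error.
rm(stale; force = true)
```

If that reading is right, it would explain why the failure is intermittent, since it would depend on whether the files are present at that moment. I have not verified which files are involved.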