Hello everyone,
I am currently trying to run hybrid shared/distributed-memory code on a large computing cluster that is managed by SLURM. My code is roughly structured like this:
using Distributed
using ClusterManagers
addprocs(SlurmManager(number_of_nodes), topology = :master_worker)
include(joinpath(@__DIR__, "src.jl"))
main()
Here, “src.jl” is the code that I want to use; it loads several packages (e.g. HDF5, ParallelDataTransfer, …) and subfiles first on the master process and then shares them with the workers using @everywhere. main() is the function to be run from that code. Oddly, when I submit jobs I sometimes get IOError: unlink: no such file or directory (ENOENT), with the stack trace pointing to the addprocs line, and the job is terminated. If I resubmit, however, it sometimes works flawlessly. Is there any paradigm I am missing when using ClusterManagers on a SLURM system?
EDIT:
The full error message is:
ERROR: LoadError: TaskFailedException:
IOError: unlink: no such file or directory (ENOENT)
Stacktrace:
[1] uv_error at ./libuv.jl:97 [inlined]
[2] unlink(::String) at ./file.jl:781
[3] #rm#9(::Bool, ::Bool, ::typeof(rm), ::String) at ./file.jl:261
[4] rm at ./file.jl:253 [inlined]
[5] iterate at ./generator.jl:47 [inlined]
[6] _collect(::Array{String,1}, ::Base.Generator{Array{String,1},typeof(rm)}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:635
[7] collect_similar at ./array.jl:564 [inlined]
[8] map at ./abstractarray.jl:2073 [inlined]
[9] launch(::SlurmManager, ::Dict{Symbol,Any}, ::Array{WorkerConfig,1}, ::Base.GenericCondition{Base.AlwaysLockedST}) at /p/scratch/chku27/kiese1/julia_packages/packages/ClusterManagers/7pPEP/src/slurm.jl:39
[10] (::Distributed.var"#41#44"{SlurmManager,Dict{Symbol,Any},Array{WorkerConfig,1},Base.GenericCondition{Base.AlwaysLockedST}})() at ./task.jl:333
Stacktrace:
[1] wait at ./task.jl:251 [inlined]
[2] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}, ::typeof(Distributed.addprocs_locked), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:494
[3] #addprocs_locked at ./none:0 [inlined]
[4] #addprocs#39(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}, ::typeof(addprocs), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
[5] (::Distributed.var"#kw##addprocs")(::NamedTuple{(:topology,),Tuple{Symbol}}, ::typeof(addprocs), ::SlurmManager) at ./none:0
[6] top-level scope at /p/scratch/chku27/kiese1/fcc_ml/num_param/lattice/fcc_L6_W44_N125_1l.jl:5
[7] include at ./boot.jl:328 [inlined]
[8] include_relative(::Module, ::String) at ./loading.jl:1105
[9] include(::Module, ::String) at ./Base.jl:31
[10] exec_options(::Base.JLOptions) at ./client.jl:287
[11] _start() at ./client.jl:460