'Stale File Handle' Error When Submitting Job Array on SLURM

I’m attempting to submit a large number of jobs on an HPC cluster running SLURM. The code runs fine when I submit a single job at a time, but it fails when I submit 4000 jobs at once. The error is as follows:

ERROR: SystemError: close: Stale file handle
Stacktrace:
[1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
@ Base ./error.jl:168
[2] #systemerror#62
@ ./error.jl:167 [inlined]
[3] systemerror
@ ./error.jl:167 [inlined]
[4] close
@ ./iostream.jl:63 [inlined]
[5] open(::Pkg.API.var"#248#249"{Vector{Pkg.Types.PackageSpec}}, ::String, ::Vararg{String, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Base ./io.jl:332
[6] open
@ ./io.jl:328 [inlined]
[7] save_precompile_state()
@ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1279
[8] precompile(ctx::Pkg.Types.Context; internal_call::Bool, strict::Bool, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1210
[9] _auto_precompile(ctx::Pkg.Types.Context)
@ Pkg /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Pkg.jl:596
[10] instantiate(ctx::Pkg.Types.Context; manifest::Nothing, update_registry::Bool, verbose::Bool, platform::Base.BinaryPlatforms.Platform, allow_build::Bool, allow_autoprecomp::Bool, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1357
[11] instantiate
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1325 [inlined]
[12] #instantiate#252
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321 [inlined]
[13] instantiate()
@ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:1321
[14] top-level scope
@ none:1

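I suspect this is a race condition: the stack trace shows the failure inside Pkg.instantiate(), where save_precompile_state() closes a file that many jobs are presumably touching at the same time. Following some other advice found on Discourse, I’ve tried calling the code as follows, to no avail: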

~/julia-1.6.2/bin/julia --project=@. -e 'using Pkg; Pkg.instantiate(); Pkg.precompile(); include(PATH_TO_SCRIPT)'

I’ve also tried adding a random sleep before the command above, which lets roughly 70% of my jobs start successfully. I am looking for a fully reliable solution, though: computing resources are costly, and I don’t want to pay for idle sleep time.


The issue was resolved using @simonbyrne's PkgLock.jl (https://github.com/simonbyrne/PkgLock.jl).

After installing PkgLock.jl into a shared environment (named pkglock, to match the "@pkglock" entry pushed onto LOAD_PATH below), I changed the submission code to the following:

~/julia-1.6.2/bin/julia --project -e 'push!(LOAD_PATH, "@pkglock"); using PkgLock; PkgLock.instantiate_precompile()'
~/julia-1.6.2/bin/julia --project -e 'include(PATH_TO_SCRIPT)'
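For anyone wondering why splitting the setup into its own locked step helps: as the name suggests, PkgLock appears to make the jobs take turns running the instantiate/precompile step instead of all writing to the depot at once. I have not looked at its internals, so the following is only a generic shell sketch of that idea, not PkgLock's actual implementation. The function name run_locked and the lock path in the usage comment are hypothetical, and it assumes mkdir is atomic on the shared filesystem.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PkgLock.jl): run a command while holding
# a directory-based lock, so concurrent jobs execute it one at a time.
run_locked() {
    local lockdir="$1"
    shift
    # mkdir either creates the directory or fails, so only one job at a
    # time gets past this loop; the others poll once per second.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    # Run the wrapped command and remember its exit status.
    local status=0
    "$@" || status=$?
    # Release the lock so the next waiting job can proceed.
    rmdir "$lockdir"
    return "$status"
}

# Possible usage inside each array task (lock path is an assumption):
#   run_locked "$HOME/.julia/setup.lock" \
#       ~/julia-1.6.2/bin/julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
```

With something like this, the first job to get the lock does the real work and the remaining jobs should find the environment already set up when their turn comes. If a job is killed while holding the lock, the directory has to be removed by hand, which is one reason a maintained package is the safer choice.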