Running batch jobs on a cluster

I am trying to run many instances of a program on a cluster. The job submission script is as follows:

#!/bin/bash
#SBATCH --job-name=open_array
#SBATCH --array=1-96
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=7G
#SBATCH --time=48:00:00
#SBATCH --output=slurm_files_1/job_open_%a.out
#SBATCH --account=open
#SBATCH --export=ALL

module load mkl
julia sampler_qps.jl $SLURM_ARRAY_TASK_ID open --heap-size-hint=7000MB

I get the following error:

error in running finalizer: Base.IOError(msg="stat(RawFD(17)): Unknown system error -116 (Unknown system error -116)", code=-116)
uv_error at ./libuv.jl:106 [inlined]
stat at ./stat.jl:176
stat at ./filesystem.jl:356 [inlined]
close at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:341
jfptr_close_49814.1 at /storage/home/my_user_id/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
run_finalizer at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:303
jl_gc_run_finalizers_in_list at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:395
run_finalizers at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:439
run_finalizers at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:420 [inlined]
ijl_gc_collect at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:3915
maybe_collect at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:926 [inlined]
jl_gc_pool_alloc_inner at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/gc.c:1319
jl_gc_alloc_ at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia_internal.h:523 [inlined]
_new_genericmemory_ at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/genericmemory.c:56 [inlined]
jl_alloc_genericmemory at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/genericmemory.c:99
ijl_array_grow_end at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/array.c:229
ijl_module_names at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/module.c:1001
#unsorted_names#9 at ./reflection.jl:96 [inlined]
unsorted_names at ./reflection.jl:96 [inlined]
make_typealias at ./show.jl:624
show_typealias at ./show.jl:805
_show_type at ./show.jl:970
show at ./show.jl:965
jfptr_show_49354.1 at /storage/home/my_user_id/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
#sprint#592 at ./strings/io.jl:112
sprint at ./strings/io.jl:107 [inlined]
#print_type_bicolor#657 at ./show.jl:2718 [inlined]
print_type_bicolor at ./show.jl:2717
unknown function (ip: 0x14a8359c11cd)
#show_tuple_as_call#651 at ./show.jl:2599
show_tuple_as_call at ./show.jl:2552 [inlined]
show_spec_sig at ./stacktraces.jl:260
show_spec_linfo at ./stacktraces.jl:232
print_stackframe at ./errorshow.jl:762
print_stackframe at ./errorshow.jl:729
#show_full_backtrace#1045 at ./errorshow.jl:628
show_full_backtrace at ./errorshow.jl:621 [inlined]
show_backtrace at ./errorshow.jl:823
#showerror#1024 at ./errorshow.jl:99
showerror at ./errorshow.jl:95
unknown function (ip: 0x14a8359c7802)
show_exception_stack at ./errorshow.jl:996
display_error at ./client.jl:117
unknown function (ip: 0x14a8359c4266)
display_error at ./client.jl:120
jfptr_display_error_73226.1 at /storage/home/my_user_id/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1055 [inlined]
invokelatest at ./essentials.jl:1052 [inlined]
exec_options at ./client.jl:326
_start at ./client.jl:531
jfptr__start_73430.1 at /storage/home/my_user_id/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
true_main at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/jlapi.c:1059
main at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Error in Timer:
SystemError: futimes: Stale file handle
Stacktrace:
 [1] systemerror(p::Symbol, errno::Int32; extrainfo::Nothing)
   @ ERROR: LoadError: IOError: stat(RawFD(17)): Unknown system error -116 (Unknown system error -116)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:106 [inlined]
  [2] stat(fd::RawFD)
    @ Base.Filesystem ./stat.jl:176
  [3] stat
    @ ./filesystem.jl:356 [inlined]
  [4] close(lock::FileWatching.Pidfile.LockMonitor)
    @ FileWatching.Pidfile ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:341
  [5] mkpidlock(f::Base.var"#1110#1111"{Base.PkgId}, at::String, pid::Int32; kwopts::@Kwargs{stale_age::Int64, wait::Bool})
    @ FileWatching.Pidfile ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:97
  [6] #mkpidlock#6
    @ ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:90 [inlined]
  [7] trymkpidlock(::Function, ::Vararg{Any}; kwargs::@Kwargs{stale_age::Int64})
    @ FileWatching.Pidfile ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:116
  [8] #invokelatest#2
    @ ./essentials.jl:1057 [inlined]
  [9] invokelatest
    @ ./essentials.jl:1052 [inlined]
 [10] maybe_cachefile_lock(f::Base.var"#1110#1111"{Base.PkgId}, pkg::Base.PkgId, srcpath::String; stale_age::Int64)
    @ Base ./loading.jl:3698
 [11] maybe_cachefile_lock
    @ ./loading.jl:3695 [inlined]
 [12] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:2565
 [13] __require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:2388
 [14] #invoke_in_world#3
    @ ./essentials.jl:1089 [inlined]
 [15] invoke_in_world
    @ ./essentials.jl:1086 [inlined]
 [16] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:2375
 [17] macro expansion
    @ ./loading.jl:2314 [inlined]
 [18] macro expansion
    @ ./lock.jl:273 [inlined]
 [19] __require(into::Module, mod::Symbol)
    @ Base ./loading.jl:2271
 [20] #invoke_in_world#3
    @ ./essentials.jl:1089 [inlined]
 [21] invoke_in_world
    @ ./essentials.jl:1086 [inlined]
 [22] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:2260
in expression starting at /storage/work/my_user_id/Julia_reinstall/sampler_qps.jl:4
 
caused by: Failed to precompile TensorOperations [6aa20fa7-93e2-5fca-9bc0-fbd0db3c71a2] to "/storage/home/my_user_id/.julia/compiled/v1.11/TensorOperations/jl_nxgLIB".
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool; flags::Cmd, cacheflags::Base.CacheFlags, reasons::Dict{String, Int64}, loadable_exts::Nothing)
    @ Base ./loading.jl:3174
  [3] (::Base.var"#1110#1111"{Base.PkgId})()
    @ Base ./loading.jl:2579
  [4] mkpidlock(f::Base.var"#1110#1111"{Base.PkgId}, at::String, pid::Int32; kwopts::@Kwargs{stale_age::Int64, wait::Bool})
    @ FileWatching.Pidfile ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:95
  [5] #mkpidlock#6
    @ ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:90 [inlined]
  [6] trymkpidlock(::Function, ::Vararg{Any}; kwargs::@Kwargs{stale_age::Int64})
    @ FileWatching.Pidfile ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:116
  [7] #invokelatest#2
    @ ./essentials.jl:1057 [inlined]
  [8] invokelatest
    @ ./essentials.jl:1052 [inlined]
  [9] maybe_cachefile_lock(f::Base.var"#1110#1111"{Base.PkgId}, pkg::Base.PkgId, srcpath::String; stale_age::Int64)
    @ Base ./loading.jl:3698
 [10] maybe_cachefile_lock
    @ ./loading.jl:3695 [inlined]
 [11] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:2565
 [12] __require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:2388
 [13] #invoke_in_world#3
    @ ./essentials.jl:1089 [inlined]
 [14] invoke_in_world
    @ ./essentials.jl:1086 [inlined]
 [15] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:2375
 [16] macro expansion
    @ ./loading.jl:2314 [inlined]
 [17] macro expansion
    @ ./lock.jl:273 [inlined]
 [18] __require(into::Module, mod::Symbol)
    @ Base ./loading.jl:2271
 [19] #invoke_in_world#3
    @ ./essentials.jl:1089 [inlined]
 [20] invoke_in_world
    @ ./essentials.jl:1086 [inlined]
 [21] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:2260
Base ./error.jl:176
 [2] systemerror
   @ ./error.jl:175 [inlined]
 [3] touch
   @ ./filesystem.jl:361 [inlined]
 [4] #2
   @ ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:75 [inlined]
 [5] (::Base.var"#816#817"{FileWatching.Pidfile.var"#2#4"{Base.Filesystem.File}, Timer})()
   @ Base ./asyncevent.jl:313
slurmstepd: error: Detected 1 oom_kill event in StepId=38460595.batch. Some of the step tasks have been OOM Killed.

Is there a way to precompile the packages only once instead of for every job? I hit a similar error (for the same TensorOperations package) when running the program on a single node; deleting the pidfile fixed it and the program then ran correctly. I don’t know how to make it work on the cluster. The error output also differs somewhat between instances, but they all seem to identify the failed precompilation of TensorOperations as the cause.

Yes, just do what you said: precompile the packages once before submitting the jobs. If you use Julia v1.11+, you can also make a job fail early when an environment isn’t fully precompiled by passing the flag --compiled-modules=strict.
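
A minimal sketch of that advice, assuming a Slurm cluster where jobs can be chained: a one-off job fills the precompile cache, and the array starts only if that job succeeds. The helper names task_cmd and precompile_cmd and the file name job_array.sh are hypothetical; --parsable and --dependency=afterok are standard sbatch options.

```shell
#!/bin/bash
# Hypothetical submit helper: precompile once, then start the array.

# Build the Julia command line for one array task. Julia options must come
# before the script name; anything after it is passed to the script as ARGS.
task_cmd() {
    local task_id=$1
    echo "julia --compiled-modules=strict sampler_qps.jl ${task_id} open"
}

# One-off step that fills the shared precompile cache in the Julia depot.
precompile_cmd() {
    echo "julia -e 'using Pkg; Pkg.precompile()'"
}

if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print the job ID instead of a full sentence.
    pre_id=$(sbatch --parsable --account=open --wrap "$(precompile_cmd)")
    # afterok: release the array only if the precompile job exited with code 0.
    # job_array.sh (hypothetical name) is the array script from the question,
    # with its julia line replaced by the output of task_cmd.
    sbatch --dependency=afterok:"${pre_id}" job_array.sh
else
    # No scheduler on this machine: just show the commands.
    echo "sbatch not available; commands that would be used:"
    precompile_cmd
    task_cmd '$SLURM_ARRAY_TASK_ID'
fi
```

With strict mode, a task whose cache is rejected stops with an error at load time instead of starting its own precompilation alongside the other tasks.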

I do precompile and test a single instance before running the batch job, but for some reason it tries to precompile again on every node.

OK, so you did precompile the packages, but for some reason the cache wasn’t hit. You then have to figure out why, because in principle this should work. A couple of comments:

  • is this a heterogeneous system where different nodes have CPUs with different ISAs (microarchitectures)? If so, read the JuliaHPC notes on setting JULIA_CPU_TARGET appropriately
  • if the above doesn’t apply or isn’t enough, you’ll need to find out why the cache is rejected. At the top of your script, put
    ENV["JULIA_DEBUG"] = "loading"
    
    This doesn’t solve your problem, but it prints information about why the cache is rejected.
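
Both suggestions can also be applied from the shell rather than from inside the Julia script. The fragment below is a sketch, not a tested recipe: the target string is only an illustrative value in Julia's multi-target format and should be adapted to the CPUs actually present in the cluster. It is assumed that the same value is exported both in the shell that precompiles and in the job script, since caches built for a different target may be rejected.

```shell
# Fragment for the top of the job script and of the shell used to precompile.

# Illustrative multi-target value (an assumption, not taken from the thread):
# a generic baseline plus clones optimised for two newer microarchitectures.
export JULIA_CPU_TARGET="generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"

# Shell equivalent of ENV["JULIA_DEBUG"] = "loading": package loading then
# logs why a cache file is accepted or rejected.
export JULIA_DEBUG=loading

# Record which node ran the task, so logs from different nodes can be compared.
echo "node=${HOSTNAME:-unknown} JULIA_CPU_TARGET=${JULIA_CPU_TARGET}"

# Print the CPU name Julia detects on this node (skipped if julia is not on PATH).
if command -v julia >/dev/null 2>&1; then
    julia -e 'println("cpu = ", Sys.CPU_NAME)'
fi
```

Comparing the cpu line across the job_open_%a.out files shows whether the array landed on nodes with different microarchitectures.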