isfile() for JLD2 appears to be giving segmentation faults / bus errors on a computer cluster

I’m posting a question because I’m not sure if I’m making a silly mistake, or if there’s something a bit more subtle going wrong.

I am running a Julia script on an HPC cluster. Before doing any work, the script checks whether it has already completed the job and saved a .jld2 file, which acts as a checkpoint.

file = output_loc * string(input) * "_final.jld2"
if isfile(file)
    println("Output file already exists at $(file)")
else
    # ... run the job and save the result to `file` ...
end

I have also tried ispath(). Somewhere between 10% and 20% of the submitted jobs crash. Restarted jobs run fine, although the 10-20% failure rate stays consistent across resubmissions until eventually all the jobs have run.
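A minimal sketch of the checkpoint pattern described above, using only Base functions. The helper name `run_with_checkpoint` and the plain-text payload are hypothetical stand-ins (the real script saves a .jld2 file). The idea is to write to a temporary file in the same directory and rename it at the end, so that `isfile` only sees the final name once the file has been fully written.

```julia
# Hypothetical helper: call `compute()` only if `file` does not exist yet.
# Returns true if the job ran, false if the checkpoint was already there.
function run_with_checkpoint(compute, file::AbstractString)
    if isfile(file)
        println("Output file already exists at $(file)")
        return false
    end
    result = compute()
    # Write to a temporary file in the same directory as the target...
    tmp, io = mktemp(dirname(abspath(file)))
    try
        write(io, result)
    finally
        close(io)
    end
    # ...then move it into place under its final name.
    mv(tmp, file; force=true)
    return true
end

# Usage with a throwaway directory; a String stands in for the real result.
dir = mktempdir()
file = joinpath(dir, "demo_final.txt")
first_run  = run_with_checkpoint(() -> "result", file)   # runs the job
second_run = run_with_checkpoint(() -> "result", file)   # finds the checkpoint
```

Whether a partially written file plays any role in the crashes reported here is not established; the sketch only illustrates the check-then-save logic.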

The error I receive points to the isfile() line.

[29188] signal (7.2): Bus error
in expression starting at /gpfs/fs1/home/a/t_J.jl/VUMPS/run.jl:40
unsafe_store! at ./pointer.jl:146 [inlined]
unsafe_store! at ./pointer.jl:146 [inlined]
jlunsafe_store! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/JLD2.jl:51 [inlined]
jlunsafe_store! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/misc.jl:15 [inlined]
_write at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/mmapio.jl:190 [inlined]
jlwrite at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/misc.jl:27 [inlined]
write_object_header_and_dataspace_message at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:596
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:574
jfptr_write_dataset_12158 at /scratch/a/aparamek/andykh/julia-depot-x86_64/compiled/v1.10/JLD2/O1EyT_F9htY.so (unknown line)
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:653
write_ref at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:656 [inlined]
WriteDataspace at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/dataspaces.jl:53
unknown function (ip: 0x2ab8f46bf9e1)
... 

This error message seems to imply that it’s trying to open the JLD2 file, but I do not understand how this is the case, as there’s no reading or writing involved.

I assume you mean that run.jl:40 is the isfile line?

What happens before this line? Are you using JLD2 at that point or is this part of a loop?

I suspect this is a delayed I/O error from the library. One thing that particularly stands out to me is the use of mmapio (memory-mapped I/O). It is reminiscent of the bus errors that occur when using shared arrays.

Do you have multiple jobs running on the same node that may be trying to use the same /dev/shm? How large is /dev/shm?
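One way to answer the size question from within Julia is sketched below. It assumes Julia 1.8 or later, where `Base.Filesystem.diskstat` is available; the helper name `shm_report` is hypothetical.

```julia
# Report the size of the filesystem behind `path` in GiB.
# Returns `nothing` if the path is not a directory (e.g. on a non-Linux machine).
function shm_report(path::AbstractString="/dev/shm")
    isdir(path) || return nothing
    st = Base.Filesystem.diskstat(path)    # byte counts for that filesystem
    gib(x) = round(x / 2^30; digits=2)
    return (total_gib=gib(st.total), used_gib=gib(st.used), available_gib=gib(st.available))
end

report = shm_report()
println(report === nothing ? "/dev/shm not found" : report)
```

Running this at the start of each job would show whether the available space differs between the jobs that crash and the ones that do not.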

These are all jobs running on different nodes, but it seems to be an error that occurs when two jobs try to touch the same piece of memory or something. I am not using any shared arrays or anything complicated. The only overlap is that they're all looking in the same directory. The preceding lines just define strings, so they don't seem to be the problem.

output_loc = params["output"]
input = split(input_file[begin:end-4], "/")[end]
file = output_loc * string(input) * "_final.jld2"
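For comparison, the same path can be built with Base's path functions, which avoids hard-coding a four-character extension and a trailing slash on the output directory. This is only a sketch: the helper name `final_path` and the example paths are hypothetical.

```julia
# Hypothetical helper mirroring the three lines above.
function final_path(output_loc::AbstractString, input_file::AbstractString)
    # basename drops the directories; splitext drops the extension (e.g. ".txt").
    stem, _ext = splitext(basename(input_file))
    return joinpath(output_loc, stem * "_final.jld2")
end

file = final_path("/scratch/out", "inputs/run_01.txt")
println(file)
```

Either way, these lines only manipulate strings and do not touch the filesystem.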

JLD2 uses a memory-mapping mechanism that may draw on the same resources as /dev/shm. My theory is that some of the jobs are running out of /dev/shm, possibly because more than one of them is running on the same machine at once.

Did you tell the HPC cluster not to schedule jobs on the same node? My cluster will assign multiple jobs per node if they fit within the memory and CPU constraints.

I’m really confused about how you could be receiving errors from lines of code that you did not run. Could you include the full unabridged error?

[143001] signal (7.2): Bus error
in expression starting at /gpfs/fs1/home/a/aparamek/andykh/t_J.jl/VUMPS/run.jl:40
unsafe_store! at ./pointer.jl:146 [inlined]
unsafe_store! at ./pointer.jl:146 [inlined]
jlunsafe_store! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/JLD2.jl:51 [inlined]
jlunsafe_store! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/misc.jl:15 [inlined]
store_vlen! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/data/writing_datatypes.jl:329 [inlined]
h5convert! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/data/writing_datatypes.jl:418
unknown function (ip: 0x2b4d62f2e9d5)
write_data at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/dataio.jl:96
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:581
jfptr_write_dataset_12158 at /scratch/a/aparamek/andykh/julia-depot-x86_64/compiled/v1.10/JLD2/O1EyT_F9htY.so (unknown line)
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:581
jfptr_write_dataset_12158 at /scratch/a/aparamek/andykh/julia-depot-x86_64/compiled/v1.10/JLD2/O1EyT_F9htY.so (unknown line)
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:653
write_ref at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:656 [inlined]
WriteDataspace at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/dataspaces.jl:53
unknown function (ip: 0x2b4d62f2fa41)
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/inlineunion.jl:49
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/inlineunion.jl:36 [inlined]
write_ref at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:656 [inlined]
h5convert! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/data/writing_datatypes.jl:298 [inlined]
macro expansion at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/data/writing_datatypes.jl:237 [inlined]
h5convert! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/data/writing_datatypes.jl:237 [inlined]
h5convert! at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/data/custom_serialization.jl:30 [inlined]
write_data at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/dataio.jl:96
write_dataset at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/datasets.jl:653
#write#110 at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/compression.jl:137
write at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/compression.jl:125 [inlined]
#write#109 at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/compression.jl:121 [inlined]
write at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/compression.jl:121
#89 at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/fileio.jl:14
unknown function (ip: 0x2b4d62f25c55)
#jldopen#69 at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/loadsave.jl:4
jldopen at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/loadsave.jl:1 [inlined]
#fileio_save#88 at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/fileio.jl:6 [inlined]
fileio_save at /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/JLD2/VWinU/src/fileio.jl:5
unknown function (ip: 0x2b4d62f226b9)
jl_apply at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/julia.h:1982 [inlined]
jl_f__call_latest at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/builtins.c:812
#invokelatest#2 at ./essentials.jl:887 [inlined]
invokelatest at ./essentials.jl:884 [inlined]
#action#33 at /home/a/aparamek/andykh/.julia/packages/FileIO/jMf68/src/loadsave.jl:219
action at /home/a/aparamek/andykh/.julia/packages/FileIO/jMf68/src/loadsave.jl:196 [inlined]
#action#32 at /home/a/aparamek/andykh/.julia/packages/FileIO/jMf68/src/loadsave.jl:185 [inlined]
action at /home/a/aparamek/andykh/.julia/packages/FileIO/jMf68/src/loadsave.jl:185 [inlined]
#save#20 at /home/a/aparamek/andykh/.julia/packages/FileIO/jMf68/src/loadsave.jl:129
save at /home/a/aparamek/andykh/.julia/packages/FileIO/jMf68/src/loadsave.jl:125
unknown function (ip: 0x2b4d62f04d79)
#run_vumps#1 at /gpfs/fs1/home/a/aparamek/andykh/t_J.jl/VUMPS/vumps.jl:64
run_vumps at /gpfs/fs1/home/a/aparamek/andykh/t_J.jl/VUMPS/vumps.jl:21
unknown function (ip: 0x2b4d62ebc979)
jl_apply at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/julia.h:1982 [inlined]
do_call at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/interpreter.c:126
eval_value at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/interpreter.c:223
eval_stmt_value at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/interpreter.c:174 [inlined]
eval_body at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/interpreter.c:621
jl_interpret_toplevel_thunk at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/interpreter.c:775
jl_toplevel_eval_flex at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/toplevel.c:934
jl_toplevel_eval_flex at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/toplevel.c:877
ijl_toplevel_eval_in at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2070
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46289 at /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/julia/1.10.0/lib/julia/sys.so (unknown line)
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_82647 at /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/julia/1.10.0/lib/julia/sys.so (unknown line)
jl_apply at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/julia.h:1982 [inlined]
true_main at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/jlapi.c:582
jl_repl_entrypoint at /tmp/ebuser/avx512/Julia/1.10.0/GCCcore-12.3-gentoo/julia-1.10.0/src/jlapi.c:731
main at julia (unknown line)
unknown function (ip: 0x2b4d4d932949)
__libc_start_main at /cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/lib64/libc.so.6 (unknown line)
_start at julia (unknown line)

This cluster explicitly only runs jobs on entire nodes.