Intermittent JLD2 Save Error on HPC

Gattu_Mytraya · February 27, 2025, 8:15pm

Hi,
I routinely use JLD2 to save the results of simulations to the disk on my university’s HPC. However, about 1 in 100 times, I get the following error:

ERROR: LoadError: IOError: stat("/storage/home/mvg6042/scratch/qps_2_5/data_40_particles_2_5_filling_factor_2_0_qps_68_chain_number.jld2"): Unknown system error -116 (Unknown system error -116)
Stacktrace:
 [1] uv_error
   @ ./libuv.jl:100 [inlined]
 [2] stat(path::String)
   @ Base.Filesystem ./stat.jl:152
 [3] isdir
   @ ./stat.jl:461 [inlined]
 [4] checkpath_save(file::String)
   @ FileIO ~/work/.julia/packages/FileIO/PtqMQ/src/loadsave.jl:173
 [5] save(file::String, args::Dict{String, Any}; options::@Kwargs{})
   @ FileIO ~/work/.julia/packages/FileIO/PtqMQ/src/loadsave.jl:126
 [6] save
   @ ~/work/.julia/packages/FileIO/PtqMQ/src/loadsave.jl:125 [inlined]
 [7] gibbs_sampler(filename::var"#filename#20"{String, Int64, Int64, Int64, Int64, Int64}, chain_number::Int64, Qstar::Rational{Int64}, l_m_list::Vector{Tuple{Rational{Int64}, Rational{Int64}}}, p::Int64, num_thermalization::Int64, num_steps::Int64)
   @ Main /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:227
 [8] sample_qps(folder_name::String, chain_number::Int64, N::Int64, n::Int64, p::Int64, num_qps_1::Int64, num_qps_2::Int64, num_thermalization::Int64, num_steps::Int64)
   @ Main /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:263
 [9] top-level scope
   @ /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:308
in expression starting at /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:299

I am unsure what causes this. The issue is sporadic, making it difficult to diagnose. I am using the following Julia version:

Julia Version 1.10.7
Commit 4976d05258e (2024-11-26 15:57 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)

And this is the relevant code block, which calls JLD2's save:

            data["number of steps"] = monte_carlo_iter
            data["acceptance rate"] = num_samples_accepted/monte_carlo_iter
            data["monte carlo duration"] = time() - t0
            data["pair densities"] = accumulated_pair_density ./ monte_carlo_iter
            data["r grid"] = 0.50 .* (rgrid[1:end-1] .+ rgrid[2:end])
            data["density"] = accumulated_density ./ monte_carlo_iter ./ Agrid
            data["theta grid"] = 0.50 .* (θmesh[1:end-1] .+ θmesh[2:end])

            save(filename(chain_number), data)

Here, data is a Dict.
Any insights into potential causes or debugging strategies would be greatly appreciated.
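
A hedged aside on the error code itself: on Unix, libuv reports system errors as negated errno values, so -116 corresponds to errno 116, which on Linux is ESTALE (stale file handle), an error typically associated with network file systems. A minimal sketch for decoding such a code follows. `describe_uv_code` is a hypothetical helper, not part of Julia or JLD2, and the exact message text depends on the system's C library.

```julia
# Hypothetical helper: translate a negated errno, as printed in the IOError
# above, into the C library's message for that error number.
describe_uv_code(code::Integer) = Libc.strerror(abs(code))

# On Linux this prints the message for ESTALE (stale file handle).
println(describe_uv_code(-116))
```

If the message mentions a stale file handle, that points at the file system rather than at JLD2 or the simulation code.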

Alexander-Barth · February 28, 2025, 3:56pm

On the clusters that I have access to, the /home directories are on a network file system, while /tmp is a local file system (or an in-memory file system).
If this is your case too (check with the shell command mount), I would try saving the data to /tmp, and moving it to your home directory once the results are complete, to debug this issue.
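
The suggestion above could be sketched roughly as follows: write the file to a local temporary directory first, then move the finished file to its final location. `save_via_local` is a hypothetical helper, not part of JLD2 or FileIO, and it assumes `tempdir()` points at a local file system such as /tmp. The temporary file keeps the original file name so that FileIO can still detect the format from the `.jld2` extension.

```julia
# Hypothetical helper: run `writer(tmpfile)` against a file on the local file
# system, then move the finished file to `dest` (e.g. on the network file system).
function save_via_local(writer::Function, dest::AbstractString;
                        local_root::AbstractString = tempdir())
    staging = mktempdir(local_root)               # unique local staging directory
    tmpfile = joinpath(staging, basename(dest))   # same name, so the extension is kept
    try
        writer(tmpfile)                           # all writes happen locally
        mkpath(dirname(abspath(dest)))            # make sure the target directory exists
        mv(tmpfile, dest; force = true)           # falls back to copy + delete across file systems
    finally
        rm(staging; recursive = true, force = true)
    end
    return dest
end

# Demonstration with a plain text file; with JLD2 the body would instead be
#     save_via_local(filename(chain_number)) do tmp
#         save(tmp, data)
#     end
demo_dest = joinpath(mktempdir(), "results", "chain_1.txt")
save_via_local(demo_dest) do tmp
    write(tmp, "demo payload")
end
```

With this pattern the network file system is touched only once per file, after the data has been written completely.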

If you are using multiple threads or processes, make sure that different processes/threads are not trying to create a JLD2 file with the same name (that would be a race condition).
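
One way to rule out name collisions, as a sketch under assumptions: build the file name from values that differ between concurrently running jobs. The naming scheme below (chain number, host name, process ID) and the helper `unique_output_name` are hypothetical and only illustrate the idea.

```julia
# Hypothetical naming scheme: two processes running at the same time cannot
# share both a host name and a process ID, so their file names differ.
unique_output_name(dir::AbstractString, chain_number::Integer) =
    joinpath(dir, "chain_$(chain_number)_$(gethostname())_pid$(getpid()).jld2")

name_a = unique_output_name("results", 1)
name_b = unique_output_name("results", 2)
```

If each chain number is only ever launched once, a name that already contains the chain number is unique on its own and this extra step is unnecessary.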

Gattu_Mytraya · February 28, 2025, 4:52pm

Hi,

I’m currently storing my data in /scratch, which is on a network file system.

Would you recommend incrementally moving the data from /tmp to /scratch during execution using Julia, or should I wait until the program fully completes and then move the files using a shell script? I want to ensure that my files are not lost if /tmp is cleared unexpectedly.

Alexander-Barth · March 1, 2025, 11:15am

I would first try just copying the data at the end of your computation. On the Linux systems that I work with, /tmp is cleared only on reboot.
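
Copying everything at the end could look roughly like this. `collect_results` is a hypothetical helper, and the directory names in the demonstration are placeholders created with `mktempdir()`, not real cluster paths. It copies rather than moves, so the local files remain available until the copies have been checked.

```julia
# Hypothetical helper: copy every file with the given extension from a local
# working directory to the final destination, and report what was copied.
function collect_results(local_dir::AbstractString, dest_dir::AbstractString;
                         ext::AbstractString = ".jld2")
    mkpath(dest_dir)
    copied = String[]
    for name in readdir(local_dir)            # readdir returns names in sorted order
        endswith(name, ext) || continue
        src = joinpath(local_dir, name)
        isfile(src) || continue               # skip directories that happen to match
        cp(src, joinpath(dest_dir, name); force = true)
        push!(copied, name)
    end
    return copied
end

# Demonstration with placeholder directories and empty files.
local_dir = mktempdir()
dest_dir = joinpath(mktempdir(), "scratch_copy")
touch(joinpath(local_dir, "chain_1.jld2"))
touch(joinpath(local_dir, "notes.txt"))
copied = collect_results(local_dir, dest_dir)
```

In a real run, the call could go at the very end of the script, or be registered with `atexit` so that it also runs when the script exits early.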

Gattu_Mytraya · March 1, 2025, 1:06pm

I’ll try this and get back. Thanks.

Gattu_Mytraya · March 2, 2025, 3:30pm

This works. Thanks.
