Intermittent JLD2 Save Error on HPC

Hi,
I routinely use JLD2 to save the results of simulations to the disk on my university’s HPC. However, about 1 in 100 times, I get the following error:

ERROR: LoadError: IOError: stat("/storage/home/mvg6042/scratch/qps_2_5/data_40_particles_2_5_filling_factor_2_0_qps_68_chain_number.jld2"): Unknown system error -116 (Unknown system error -116)
Stacktrace:
 [1] uv_error
   @ ./libuv.jl:100 [inlined]
 [2] stat(path::String)
   @ Base.Filesystem ./stat.jl:152
 [3] isdir
   @ ./stat.jl:461 [inlined]
 [4] checkpath_save(file::String)
   @ FileIO ~/work/.julia/packages/FileIO/PtqMQ/src/loadsave.jl:173
 [5] save(file::String, args::Dict{String, Any}; options::@Kwargs{})
   @ FileIO ~/work/.julia/packages/FileIO/PtqMQ/src/loadsave.jl:126
 [6] save
   @ ~/work/.julia/packages/FileIO/PtqMQ/src/loadsave.jl:125 [inlined]
 [7] gibbs_sampler(filename::var"#filename#20"{String, Int64, Int64, Int64, Int64, Int64}, chain_number::Int64, Qstar::Rational{Int64}, l_m_list::Vector{Tuple{Rational{Int64}, Rational{Int64}}}, p::Int64, num_thermalization::Int64, num_steps::Int64)
   @ Main /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:227
 [8] sample_qps(folder_name::String, chain_number::Int64, N::Int64, n::Int64, p::Int64, num_qps_1::Int64, num_qps_2::Int64, num_thermalization::Int64, num_steps::Int64)
   @ Main /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:263
 [9] top-level scope
   @ /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:308
in expression starting at /storage/work/mvg6042/qps_2_5/sampler_single_state.jl:299

I am unsure what causes this. The issue is sporadic, making it difficult to diagnose. I am using the following Julia version:

Julia Version 1.10.7
Commit 4976d05258e (2024-11-26 15:57 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)

And this is the relevant code block, which calls save on a .jld2 file:

            data["number of steps"] = monte_carlo_iter
            data["acceptance rate"] = num_samples_accepted/monte_carlo_iter
            data["monte carlo duration"] = time() - t0
            data["pair densities"] = accumulated_pair_density ./ monte_carlo_iter
            data["r grid"] = 0.50 .* (rgrid[1:end-1] .+ rgrid[2:end])
            data["density"] = accumulated_density ./ monte_carlo_iter ./ Agrid
            data["theta grid"] = 0.50 .* (θmesh[1:end-1] .+ θmesh[2:end])

            save(filename(chain_number), data)

Here, data is a Dict.
Any insights into potential causes or debugging strategies would be greatly appreciated.

On the clusters that I have access to, the /home directories are on a network file system, while /tmp is a local file system (or an in-memory file system).
If this is your case too (check with the shell command mount), I would try to save the data on /tmp (and move the data to your home directory once the results are done), to debug this issue.
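For context, on Linux error number 116 is ESTALE ("stale file handle"), which usually points at a network file system rather than at JLD2 itself. A minimal sketch of the /tmp idea, assuming node-local storage is reachable through tempdir(): save_via_local is a hypothetical helper (not part of JLD2 or FileIO) that writes into a local staging directory and then moves the finished file to its final path. The final directory is assumed to exist already.

```julia
# Hypothetical helper: write to node-local storage first, then move the
# finished file to its final (possibly network-mounted) location.
# `writer` is any function that takes a path and writes the file there.
function save_via_local(writer, final_path::AbstractString;
                        local_root::AbstractString = tempdir())
    # Fresh staging directory under the local root (assumed to be local disk).
    staging_dir = mktempdir(local_root)
    # Keep the original file name so the extension (e.g. .jld2) is preserved.
    local_path = joinpath(staging_dir, basename(final_path))
    writer(local_path)
    # mv falls back to copy-and-delete when source and target are on
    # different file systems.
    mv(local_path, final_path; force = true)
    return final_path
end
```

With the code from the question, the call would look roughly like save_via_local(path -> save(path, data), filename(chain_number)). The move still touches the network file system once, so this narrows the window for the error rather than ruling it out.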

If you are using multiple threads or processes, make sure that different processes/threads are not trying to create a JLD2 file with the same name (that would be a race condition).
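One way to rule out a name collision is to build the file name from values that differ between processes. This is only a sketch: unique_output_name is a hypothetical helper, and the SLURM_JOB_ID environment variable is an assumption about your scheduler (it falls back to "nojob" when unset).

```julia
# Hypothetical helper: build an output name that differs between chains,
# jobs, hosts, and processes, so no two writers target the same file.
function unique_output_name(folder::AbstractString, chain_number::Integer)
    # Assumption: the scheduler exports SLURM_JOB_ID; otherwise use a placeholder.
    job_id = get(ENV, "SLURM_JOB_ID", "nojob")
    name = "chain_$(chain_number)_job_$(job_id)_$(gethostname())_pid_$(getpid()).jld2"
    return joinpath(folder, name)
end
```

If your existing filename(chain_number) already gets a distinct chain_number in every process, that is enough on its own.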


Hi,

I’m currently storing my data in /scratch, which is on a network file system.

Would you recommend incrementally moving the data from /tmp to /scratch during execution using Julia, or should I wait until the program fully completes and then move the files using a shell script? I want to ensure that my files are not lost if /tmp is cleared unexpectedly.

I would first try just copying the data at the end of your computation. On the Linux systems that I work with, /tmp is cleared only on reboot.
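Copying at the end could look roughly like this. copy_results is a hypothetical helper and the directory names are placeholders. It copies rather than moves, so the local files stay in place until you have checked the copies.

```julia
# Hypothetical helper: copy every .jld2 file from a local staging directory
# to the destination directory and return the names that were copied.
function copy_results(staging_dir::AbstractString, dest_dir::AbstractString)
    mkpath(dest_dir)  # create the destination if it does not exist yet
    copied = String[]
    for name in readdir(staging_dir)  # readdir returns names in sorted order
        endswith(name, ".jld2") || continue
        cp(joinpath(staging_dir, name), joinpath(dest_dir, name); force = true)
        push!(copied, name)
    end
    return copied
end

# Possible usage (placeholder paths), with the copy in a finally block so
# that finished results are still copied if the run errors out:
#
#   staging_dir = mktempdir()
#   try
#       # ... run the simulation, saving into staging_dir ...
#   finally
#       copy_results(staging_dir, "/path/to/scratch/output")
#   end
```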

I’ll try this and get back. Thanks.

This works. Thanks.