Best practice for channels that use HDF5 (seems like race is still an issue with lock)

mkarikom · August 10, 2020, 1:07am

I’m about to submit some jobs that use the following “emergency backup” scheme (see the MWE testBackup.jl below).
I learned this the hard way after losing some data from a job submitted to our cluster, due to a failing checkpoint server.

Per the recommendation in the manual, data-race freedom is supposed to be provided by the function runSimulation, which acquires a lock on the channel chn1 that controls access to JLD.@save. But if I save too often (by lowering saveinterval), bad things happen during local testing, such as the REPL crashing (see below) or, rarely, Nautilus crashing.

The csize=10 argument to the constructor for chn1 should be unnecessary; I only put it in to see whether it made things more reliable (it didn’t).

The data being backed up will typically be the serialized representation of the model state (very large), so JLD seemed like the way to go, as opposed to, say, writing some values to a database. Is there a better way to do this?
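One dependency-free option to compare against is the Serialization standard library. The sketch below is only an illustration under assumptions: save_checkpoint and load_checkpoint are hypothetical helper names, the .jls extension is arbitrary, and the state is a small stand-in Dict rather than a real model. Note that the Serialization format is not guaranteed to be readable across Julia versions, so this suits short-lived emergency backups more than long-term storage.

```julia
using Serialization  # Julia standard library

# Hypothetical helper: serialize `state` to a temporary file, then move it into place.
# If serialization fails partway, the previous checkpoint at `path` is left untouched.
function save_checkpoint(path::AbstractString, state)
    tmp = path * ".tmp"
    serialize(tmp, state)
    mv(tmp, path; force=true)
    return path
end

# Hypothetical helper: read a checkpoint back into memory.
load_checkpoint(path::AbstractString) = deserialize(path)

# Stand-in for a (much larger) model state.
state = Dict("step" => 50, "weights" => rand(3, 3))
path = joinpath(mktempdir(), "foo_data.jls")

save_checkpoint(path, state)
restored = load_checkpoint(path)
```

Like JLD, this still needs each file to be written by one task at a time; it only changes the on-disk format, not the concurrency story.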

Job finishes (saveinterval = 50):

Job crashes (saveinterval = 5):

testBackup.jl:

using JLD

# HDF5, and therefore JLD, is not thread-safe, so run this to control access to the HDF5 library
function writeJLD(c::Channel)
  while true
    data = take!(c)
    JLD.@save data["fn"] data["state"]
  end
end

function runSimulation(a,b,c::Channel)
    for i in 1:1000
        # do stuff
        saveinterval = 5
        if mod(i,saveinterval) == 0
            fn = string(a,"_data.jld")
            print("\n saving to ",fn)
            lock(c)
            try
                put!(c,Dict("fn"=>fn,"state"=>b*2))
            finally
                unlock(c)
            end
        end
    end
end

chn1 = Channel(writeJLD;csize=10)
inits = rand(10)
for i in 1:length(inits)
    print(string("\n running foo ",i))
    if i < 0.5
        sleep(2)
    end
    Base.Threads.@spawn runSimulation(string("/tmp/foo",i),inits[i],chn1)
end
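
For comparison, here is a minimal sketch of the same single-writer pattern with two changes: the producers call put! without an explicit lock(c)/unlock(c) (as far as I understand, Channel operations do their own locking in Julia 1.3 and later), and the main task waits for every producer and for the writer before returning. Everything here is an assumption for illustration: writer, simulate and run_all are hypothetical names, and Serialization stands in for JLD so the block runs without extra packages. Whether these changes remove the crash seen with JLD is untested.

```julia
using Serialization  # stdlib stand-in for JLD so the sketch has no external dependencies

# The single consumer: the only task that ever calls `save`, so writes never overlap.
function writer(c::Channel, save=serialize)
    for (fn, state) in c        # loop ends once the channel is closed and drained
        save(fn, state)
    end
end

# A producer: no explicit lock around put!; the channel synchronizes access itself.
function simulate(prefix, x, c::Channel; nsteps=100, saveinterval=5)
    for i in 1:nsteps
        # ... do work ...
        if mod(i, saveinterval) == 0
            put!(c, (string(prefix, "_data.jls"), x * 2))
        end
    end
end

function run_all(dir, inits)
    c = Channel{Tuple{String,Float64}}(10)   # buffered channel of (filename, state) pairs
    w = @async writer(c)                     # writer task stays on the thread that created it
    tasks = [Threads.@spawn simulate(joinpath(dir, "foo$i"), inits[i], c)
             for i in eachindex(inits)]
    foreach(wait, tasks)                     # do not exit before the producers finish
    close(c)                                 # lets the writer's loop end after draining the buffer
    wait(w)
end

dir = mktempdir()
inits = rand(4)
run_all(dir, inits)
```

The waits matter for a batch job: without them the script can reach its end while checkpoints are still queued in the channel.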
