Best practice for channels that use HDF5 (a race still seems to be an issue even with a lock)

I’m about to submit some jobs that use the following “emergency backup” scheme (see the MWE testBackup.jl below).
I learned to do this the hard way, after losing some data on a job submitted to our cluster because of a failing checkpoint server.

Per the recommendation in the manual, the function runSimulation guards against data races by acquiring a lock on the channel chn1, which controls access to JLD.@save. But if I save too often (by lowering saveinterval), bad things happen during local testing: the REPL crashes (see below) or, rarely, Nautilus crashes.

The csize=10 argument to the constructor for chn1 should be unnecessary; I only added it to see whether it made things more reliable (it didn’t).
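For reference, here is a minimal sketch of what the buffer size changes, using only Base. The function names are hypothetical and only illustrate the blocking behaviour of put!; they say nothing about whether buffering affects the crash.

```julia
# Buffered channel: up to 10 items can be queued before put! blocks.
function fill_buffered()
    c = Channel{Int}(10)
    for i in 1:10
        put!(c, i)          # returns immediately, no consumer is needed yet
    end
    close(c)                # a closed channel can still be drained
    return collect(c)
end

# Unbuffered channel: put! blocks until some task calls take!.
function unbuffered_blocks()
    c = Channel{Int}(0)
    t = @async put!(c, 1)   # producer task
    yield()                 # let the producer run until it blocks in put!
    was_done = istaskdone(t)
    value = take!(c)        # this releases the producer
    wait(t)
    return was_done, value
end
```

So a buffer only lets producers run ahead of the writer; it does not change which task performs the writes.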

The data being backed up will typically be the serialized representation of the model state (very large), so JLD seemed like the way to go, versus, say, writing some values to a database. Is there a better way to do this?

Job finishes (saveinterval = 50):

Job crashes (saveinterval = 5):

testBackup.jl:

using JLD

# HDF5, and therefore JLD, is not thread-safe, so run this to control access to the HDF5 library
function writeJLD(c::Channel)
  while true
    data = take!(c)
    JLD.@save data["fn"] data["state"]
  end
end

function runSimulation(a,b,c::Channel)
    for i in 1:1000
        # do stuff
        saveinterval = 5
        if mod(i,saveinterval) == 0
            fn = string(a,"_data.jld")
            print("\n saving to ",fn)
            lock(c)
            try
                put!(c,Dict("fn"=>fn,"state"=>b*2))
            finally
                unlock(c)
            end
        end
    end
end

chn1 = Channel(writeJLD;csize=10)
inits = rand(10)
for i in 1:length(inits)
    print(string("\n running foo ",i))
    if inits[i] < 0.5  # i is an integer >= 1, so `i < 0.5` was never true
        sleep(2)
    end
    Base.Threads.@spawn runSimulation(string("/tmp/foo",i),inits[i],chn1)
end
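
For comparison, here is a minimal sketch of the single-writer pattern the MWE is aiming for, written so it runs with the standard library alone. It makes several assumptions that are not in the original: Serialization stands in for JLD, the helper names (write_backup, writer, run_simulation, main) and the ".bin" extension are hypothetical, and the script waits for the simulations and the writer before exiting. It does not demonstrate that the same structure is safe with JLD/HDF5.

```julia
using Serialization  # stdlib; stands in for JLD so the sketch has no external dependencies

# Hypothetical helper: write to a temporary file, then rename it, so a crash
# in the middle of a write does not leave a truncated backup behind.
function write_backup(fn::AbstractString, state)
    tmp = fn * ".tmp"
    open(io -> serialize(io, state), tmp, "w")
    mv(tmp, fn; force=true)
end

# Single consumer: the only task that touches the disk.
# Iterating a Channel ends once it has been closed and drained.
function writer(c::Channel)
    for (fn, state) in c
        write_backup(fn, state)
    end
end

function run_simulation(prefix, x, c::Channel; steps=100, saveinterval=5)
    for i in 1:steps
        # do stuff
        if i % saveinterval == 0
            # put! on a Channel is safe to call from several tasks at once
            put!(c, (string(prefix, "_data.bin"), x * 2))
        end
    end
end

function main(dir=mktempdir())
    chn = Channel{Tuple{String,Float64}}(10)
    wtask = Threads.@spawn writer(chn)
    inits = rand(10)
    sims = [Threads.@spawn run_simulation(joinpath(dir, string("foo", i)), inits[i], chn)
            for i in eachindex(inits)]
    foreach(wait, sims)  # all simulations have queued their last backup
    close(chn)           # lets the writer loop finish after draining the queue
    wait(wtask)          # do not exit before the final write has completed
    return dir, inits
end
```

Calling main() returns the backup directory and the initial values, and deserialize(joinpath(dir, "foo1_data.bin")) reads one backup back. The explicit wait calls matter: without them the script can reach its end while backups are still queued.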