I’m about to submit some jobs that use the following “emergency backup” scheme (see the MWE testBackup.jl below).
I learned to do this the hard way, after losing some data on a job submitted to our cluster when the checkpoint server failed.
Per the recommendation in the manual, data-race freedom is provided by the function runSimulation, which acquires a lock on the channel chn1 controlling access to JLD.@save. But if I save too often (by lowering saveinterval), bad things happen during local testing, like the REPL crashing (see below) or, rarely, Nautilus crashing.
The csize=10 argument to the constructor for chn1 should be unnecessary; I only put it in to see whether it made things more reliable (it didn’t).
The data being backed up will typically be the serialized representation of the model state (very large), so JLD seemed like the way to go, versus, say, writing some values to a database. Is there a better way to do this?
Job finishes (saveinterval = 50):
Job crashes (saveinterval = 5):
testBackup.jl:
using JLD

# HDF5, and therefore JLD, is not thread-safe, so this function is run to control access to the HDF5 library
function writeJLD(c::Channel)
    while true
        data = take!(c)  # block until a simulation hands over something to save
        JLD.@save data["fn"] data["state"]
    end
end

function runSimulation(a, b, c::Channel)
    for i in 1:1000
        # do stuff
        saveinterval = 5
        if mod(i, saveinterval) == 0
            fn = string(a, "_data.jld")
            print("\n saving to ", fn)
            lock(c)
            try
                put!(c, Dict("fn" => fn, "state" => b * 2))
            finally
                unlock(c)
            end
        end
    end
end

chn1 = Channel(writeJLD; csize=10)
inits = rand(10)
for i in 1:length(inits)
    print(string("\n running foo ", i))
    if i < 0.5
        sleep(2)
    end
    Base.Threads.@spawn runSimulation(string("/tmp/foo", i), inits[i], chn1)
end
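
For comparison, here is a minimal sketch of the same single-writer idea, not a diagnosis of the crash above. It rests on a few assumptions: the Serialization standard library stands in for JLD so the sketch runs without extra packages; the names writer, simulate, and main and the ".bin"/".tmp" file names are hypothetical; and no explicit lock is taken, because put! and take! on a Channel are already thread-safe. Each state is written to a temporary file and then moved into place, so a partially written file never sits under the final name.

```julia
using Serialization  # stdlib; assumption: used here in place of JLD

# Single consumer: the only task that ever writes backup files.
# Iterating a Channel ends once it has been closed and drained.
function writer(c::Channel)
    for (fn, state) in c
        tmp = fn * ".tmp"            # hypothetical temp-file naming
        serialize(tmp, state)        # write the full state to the temp file first
        mv(tmp, fn; force=true)      # then move it over the previous backup
    end
end

# Producer: hands (filename, state) pairs to the writer; no lock/unlock needed.
function simulate(prefix, x, c::Channel; steps=100, saveinterval=5)
    for i in 1:steps
        # do stuff
        if i % saveinterval == 0
            put!(c, (string(prefix, "_data.bin"), x * 2))
        end
    end
end

function main(dir=mktempdir())
    inits = rand(4)
    jobs = Channel{Tuple{String,Float64}}(10)   # typed, buffered channel
    wtask = Threads.@spawn writer(jobs)
    sims = [Threads.@spawn simulate(joinpath(dir, "foo$i"), inits[i], jobs)
            for i in eachindex(inits)]
    foreach(wait, sims)   # wait until every producer has finished
    close(jobs)           # lets the writer drain the buffer and return
    wait(wtask)
    return dir, inits
end
```

Calling main() runs four short simulations and returns the backup directory together with the initial values. Closing the channel after the producers finish gives the writer a clean way to stop, and waiting on the writer task means any error it raised is reported instead of being lost.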