I’m about to submit some jobs that use the following “emergency backup” scheme (see the MWE testBackup.jl below).
I learned this the hard way after losing data from a job submitted to our cluster when the checkpoint server failed.
Per the recommendation in the manual, data-race freedom is provided by the function runSimulation, which acquires a lock on the channel chn1 controlling access to JLD.@save. However, if I save too often (by lowering saveinterval), bad things happen during local testing, such as the REPL crashing (see below) or, rarely, Nautilus crashing.
The csize=10 argument to the constructor for chn1 should be unnecessary; I only added it to see whether it would make things more reliable (it didn’t).
The data being backed up will typically be the serialized representation of the model state (very large), so JLD seemed like the way to go, as opposed to, say, writing some values to a database. Is there a better way to do this?
Job finishes (saveinterval = 50):
Job crashes (saveinterval = 5):
    using JLD

    # HDF5, and therefore JLD, is not thread-safe, so run this to control access to the HDF5 library
    function writeJLD(c::Channel)
        while true
            data = take!(c)
            JLD.@save data["fn"] data["state"]
        end
    end

    function runSimulation(a, b, c::Channel)
        for i in 1:1000
            # do stuff
            saveinterval = 5
            if mod(i, saveinterval) == 0
                fn = string(a, "_data.jld")
                print("\n saving to ", fn)
                lock(c)
                try
                    put!(c, Dict("fn" => fn, "state" => b * 2))
                finally
                    unlock(c)
                end
            end
        end
    end

    chn1 = Channel(writeJLD; csize=10)
    inits = rand(10)
    for i in 1:length(inits)
        print(string("\n running foo ", i))
        if i < 0.5
            sleep(2)
        end
        Base.Threads.@spawn runSimulation(string("/tmp/foo", i), inits[i], chn1)
    end
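For comparison, here is a minimal sketch of the same single-writer idea that I could test against. It is not a confirmed fix for the crash. It makes several assumptions: the Serialization standard library stands in for JLD so the example has no external dependencies, Julia is version 1.3 or later, and the names write_backups, run_simulation, and backup_dir are hypothetical. As far as I understand, put! on a Channel is already thread-safe, so the explicit lock is dropped here.

```julia
using Serialization  # stdlib; swapped in for JLD purely to keep this sketch dependency-free

# Hypothetical writer: the only task that ever touches the disk.
function write_backups(c::Channel)
    for (fn, state) in c             # loop ends once the channel is closed and drained
        tmp = fn * ".tmp"
        serialize(tmp, state)        # write to a temporary file first
        mv(tmp, fn; force=true)      # then rename over the previous backup
    end
end

# Hypothetical worker: the simulation only enqueues snapshots.
function run_simulation(prefix, x, c::Channel; steps=100, saveinterval=5)
    fn = string(prefix, "_data.jls")
    for i in 1:steps
        # do stuff
        if i % saveinterval == 0
            put!(c, (fn, x * 2))     # no explicit lock around put!
        end
    end
end

backup_dir = mktempdir()                       # assumption: any writable directory works
chn = Channel{Tuple{String,Float64}}(10)       # buffered channel of (filename, state)
writer = Threads.@spawn write_backups(chn)

inits = rand(4)
workers = [Threads.@spawn run_simulation(joinpath(backup_dir, "foo$i"), inits[i], chn)
           for i in eachindex(inits)]

foreach(wait, workers)   # wait for every producer to finish
close(chn)               # lets the writer's loop terminate after the buffer is drained
wait(writer)             # all backups are on disk once this returns
```

Waiting on the workers, closing the channel, and then waiting on the writer is meant to ensure that no queued snapshot is dropped when the script exits. Writing to a temporary file and renaming it is intended to keep the last good backup intact if a write is interrupted.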