Today's post by @Xing_Shi_Cai prompts me to ask: has anyone worked with checkpointing in Julia?
A bit of Googling suggests that DMTCP might work: http://dmtcp.sourceforge.net/
I guess the real answer is: why don't I give it a try instead of asking?
Can you give a brief description of the kind of checkpointing you're referring to, and maybe a small example of a situation in which it's beneficial? The Sourceforge page assumes prior knowledge of what checkpointing is.
Sure. When running on an HPC cluster you run a job on many compute nodes. Julia makes this easy. Quite often compute jobs will run for days or even weeks. If a compute node crashes during that time then all your work is lost.
Applications will often write out an intermediate file following every N time steps.
So you can use this file to restart your computation from the last saved point before the failure, and you do not have to start from the beginning.
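As a rough illustration of the "write an intermediate file every N time steps" pattern described above, here is a minimal sketch using Julia's standard `Serialization` library. The `SimState` type, the `advance!` step, and the function names are all hypothetical placeholders, not part of any package mentioned in this thread; a real application would save whatever state it actually needs to resume.

```julia
using Serialization

# Hypothetical simulation state; the fields are placeholders for illustration.
mutable struct SimState
    step::Int
    values::Vector{Float64}
end

# Write to a temporary file first, then rename it, so that a crash
# during the write does not leave a truncated checkpoint behind.
function save_checkpoint(path::AbstractString, state::SimState)
    tmp = path * ".tmp"
    serialize(tmp, state)
    mv(tmp, path; force=true)
end

# Resume from the checkpoint if one exists, otherwise start fresh.
load_or_init(path::AbstractString, n::Int) =
    isfile(path) ? deserialize(path)::SimState : SimState(0, zeros(n))

# Placeholder for one time step of the real computation.
function advance!(state::SimState)
    state.values .+= 1.0
    state.step += 1
    return state
end

# Run up to `nsteps`, saving a checkpoint every `every` steps.
function run!(path::AbstractString; nsteps::Int=100, every::Int=10, n::Int=4)
    state = load_or_init(path, n)
    while state.step < nsteps
        advance!(state)
        state.step % every == 0 && save_checkpoint(path, state)
    end
    return state
end
```

Calling `run!("sim.ckpt")` a second time after an interruption picks up from the last saved step rather than from step 0; any steps taken after the last checkpoint are simply recomputed. Note that the `Serialization` format is not guaranteed to be stable across Julia versions, so for long-lived checkpoints a dedicated file format may be the safer choice.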
If the application does not write an intermediate save file, checkpointing instead means that the state of the application process on each compute node is written to disk. You can then restart the application from that point.
With exascale compute clusters almost ready to go, the probability of a node failure during a run becomes very high. So there is huge interest in writing out these checkpoints quickly to 'burst buffer' filesystems.
Oh, interesting. I never even thought to look for something like this. The one time I needed it, I did it the old-fashioned way: I just wrote out several large txt files every so often with the critical information.
Dagger.jl recently gained checkpoint/restore support, which lets you specify how and where to save data for each “thunk” (unit of work) in your computation. Of course, Dagger also has decent fault tolerance for when workers spontaneously die, and is designed for distributed and heterogeneous computing.
One day in the far future, CRIU support would be very cool to have.
Now that basic Erlang/OTP-like error handling is implemented in Actors.jl (see issue #16 there and the description in the manual), I have an issue about checkpointing and want to develop it in the coming weeks:
I would appreciate some comments or any help on that. Thank you in advance!