I guess the real answer is: why don’t I give it a try instead of asking?
Can you give a brief description of the kind of checkpointing you’re referring to, and maybe a small example of a situation in which it’s beneficial? The SourceForge page assumes prior knowledge of what checkpointing is.
If you happen to try it out, please let us know how it works for you and whether you run into any bugs.
Sure. When running on an HPC cluster you run a job on many compute nodes. Julia makes this easy. Quite often compute jobs will run for days or even weeks. If a compute node crashes during that time, all your work is lost.
Applications will often write out an intermediate file following every N time steps.
So you can use this file to restart your computation from the point before it failed; you do not have to start from the beginning.
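To make the "write a file every N time steps, then restart from it" idea concrete, here is a minimal sketch. It is in Python rather than Julia, and everything in it (the `run`, `save_checkpoint` and `load_checkpoint` names, the `fail_at` parameter, the toy update rule) is made up for illustration, not taken from any particular library. In Julia the `Serialization` standard library could play roughly the role that `pickle` plays here.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then atomically rename it over the old checkpoint.

    This way a crash in the middle of a write never leaves a half-written checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if there is no checkpoint yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, every_n, fail_at=None):
    """Toy time-stepping loop that checkpoints every `every_n` steps.

    `fail_at` is a hypothetical knob that simulates a node crash at that step.
    """
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    state = load_checkpoint(path) or {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        # Stand-in for one real time step of the computation.
        state["value"] += 0.5 * (state["step"] + 1)
        state["step"] += 1
        if state["step"] % every_n == 0:
            save_checkpoint(path, state)
    return state
```

Calling `run("ckpt.pkl", 10, 5, fail_at=7)` raises after step 7, but the file written at step 5 survives, so calling `run("ckpt.pkl", 10, 5)` afterwards resumes at step 5 instead of step 0. Only the work done since the last checkpoint is lost, which is the trade-off behind choosing N.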
If the application does not write such an intermediate save file itself, a checkpoint means that the state of the application on each compute node is written to disk. You can then restart the application from that point.
With exascale compute clusters almost ready to go, the probability of a node failure during a run becomes very high. So there is huge interest in writing out these checkpoints quickly to ‘burst buffer’ filesystems.
Oh, interesting. I never even thought to look for something like this. The one time I needed it, I did it the old-fashioned way: I just wrote out several large txt files every so often with the critical information.
Nice description! I’m also wondering if there’s any way to auto-recover, so that the failed parts are restarted on another healthy node.
I’m curious whether the state of this has evolved in the past two years. I found hildebrandmw/Checkpoints.jl on GitHub (“Don't mess up notebooks with long running functions”), which looks promising, although it doesn’t seem to be maintained.
Edit: a few other resources:
- Linux / container userspace approach: CRIU (Checkpoint/Restore in Userspace), Red Hat Customer Portal
- Python approach (not widely adopted): a-rahimi/python-checkpointing2 on GitHub (“Checkpoint the state of Python programs using Pythonic setjmp and longjmp”)
Dagger.jl recently gained checkpoint/restore support, which lets you specify how and where to save data for each “thunk” (unit of work) in your computation. Of course, Dagger also has decent fault tolerance for when workers spontaneously die, and is designed for distributed and heterogeneous computing.
One day in the far future, CRIU support would be very cool to have.