I guess the real answer is: why don’t I just give it a try instead of asking?
Can you give a brief description of the kind of checkpointing you’re referring to, and maybe a small example of a situation in which it’s beneficial? The SourceForge page assumes prior knowledge of what checkpointing is.
If you happen to try it out, please let us know how it works for you and whether you run into any bugs.
Sure. When running on an HPC cluster you run a job on many compute nodes. Julia makes this easy. Quite often compute jobs will run for days or even weeks. If a compute node crashes during that time then all your work is lost.
Applications will often write out an intermediate file after every N time steps.
You can then use this file to restart your computation from the last point saved before the failure, so you do not have to start from the beginning.
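To make that concrete, here is a minimal sketch of the "save every N steps, restart from the file" pattern. It is written in Python for brevity, and everything in it is made up for illustration: the names `save_checkpoint`, `load_checkpoint` and `run`, the interval of 100 steps, and the toy state dictionary. A real application would save its own arrays and counters instead.

```python
import os
import pickle
import tempfile

CHECKPOINT_EVERY = 100  # hypothetical N: save after every 100 time steps


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash during the write cannot leave a
    half-written checkpoint in place of the last good one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "value": 0.0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, crash_at=None):
    """Run the toy computation, resuming from the checkpoint if present.

    crash_at is only there to simulate a node failure in this sketch.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["value"] += state["step"]  # stand-in for one real time step
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    return state
```

If the first call to `run` dies at step 250, the next call picks up from the checkpoint written at step 200, so only 50 steps are repeated rather than all 250. In Julia the same idea should carry over using the `Serialization` standard library in place of `pickle`.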
If the application does not write an intermediate save file of its own, a checkpoint means that the state of the application on each compute node is written to disk. You can then restart the application from that point.
With exascale compute clusters almost ready to go, the probability of a node failure during a run becomes very high. So there is huge interest in writing these checkpoints out quickly to ‘burst buffer’ filesystems.
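A rough back-of-the-envelope sketch of why node count matters so much. It assumes node failures are independent and that every node has the same failure probability over the run, which is a simplification; the function name `p_any_failure` and the numbers in the usage note are invented for illustration, not measured failure rates.

```python
def p_any_failure(p_node, n_nodes):
    """Probability that at least one of n_nodes fails during the run.

    Assumes independent failures, each with probability p_node.
    """
    return 1.0 - (1.0 - p_node) ** n_nodes
```

With a made-up per-node failure probability of 1e-4 over the length of a job, `p_any_failure(1e-4, 10)` is about 0.001, while `p_any_failure(1e-4, 100_000)` is very close to 1. The same per-node reliability that is negligible on a small cluster makes a failure almost certain at exascale node counts.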
Oh, interesting. I never even thought to look for something like this. The one time I needed it, I did it the old-fashioned way: I just wrote out several large text files every so often with the critical information.
Nice description! I’m also wondering if there’s any way to auto-recover, so that the failed parts are restarted on another healthy node.