Checkpointing with Julia

The post today by @Xing_Shi_Cai leads me to ask: has anyone worked with checkpointing in Julia?
A bit of Googling suggests that DMTCP might work: http://dmtcp.sourceforge.net/

I guess the real answer is - why don’t I give it a try instead of asking?


Can you give a brief description on the kind of checkpointing you’re referring to, and maybe a small example of a situation in which it’s beneficial? The Sourceforge page assumes prior knowledge of what checkpointing is.

Nextjournal offers similar checkpointing as an experimental feature via CRIU :slight_smile:

If you happen to try it out, please let us know how it works for you and if you run into any bugs :wink:


Sure. When running on an HPC cluster you run a job on many compute nodes. Julia makes this easy. Quite often compute jobs will run for days or even weeks. If a compute node crashes during that time then all your work is lost.
Applications will often write out an intermediate file every N time steps.
You can then use this file to restart your computation from the last saved point before the failure, so you do not have to start from the beginning.
If the application does not write such an intermediate file itself, a checkpoint means that the state of the application on each compute node is written to disk. You can then restart the application from that point.
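
To make the "intermediate file every N time steps" idea concrete, here is a minimal sketch of application-level checkpointing using only the `Serialization` standard library. The names `save_checkpoint`, `load_checkpoint` and `run_simulation`, the toy state, and the `fail_at` switch are all made up for illustration; this is not DMTCP or CRIU, which snapshot the whole process from outside.

```julia
using Serialization

# Hypothetical helper: write to a temporary file first, then rename it,
# so a crash in the middle of a write never leaves a truncated checkpoint.
function save_checkpoint(path, state)
    tmp = path * ".tmp"
    serialize(tmp, state)
    mv(tmp, path; force=true)
end

# Hypothetical helper: resume from the checkpoint if one exists, else start fresh.
load_checkpoint(path, initial) = isfile(path) ? deserialize(path) : initial

# Toy "simulation": add up the step numbers, saving the state every `every` steps.
# `fail_at` only exists to simulate a node crash at a given step.
function run_simulation(path; nsteps=10, every=3, fail_at=nothing)
    state = load_checkpoint(path, (step=0, total=0))
    step, total = state.step, state.total
    while step < nsteps
        step + 1 == fail_at && error("simulated node failure")
        step += 1
        total += step
        step % every == 0 && save_checkpoint(path, (step=step, total=total))
    end
    return total
end

path = joinpath(mktempdir(), "state.jls")
try
    run_simulation(path; fail_at=8)   # dies before step 8; last checkpoint is step 6
catch err
    println("job died: ", err)
end
total = run_simulation(path)          # picks up again at step 7, not step 1
```

One caveat with this approach: the `Serialization` format is not guaranteed to be readable across Julia versions, so files like this are best treated as short-lived restart files rather than long-term storage.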

With exascale compute clusters almost ready to go, the probability of a node failure during a job becomes very high. So there is huge interest in writing out these checkpoints quickly to ‘burst buffer’ filesystems.


Oh interesting. I never even thought to look for something like this. The one time I needed it, I did it the old-fashioned way: I just wrote out several large txt files every so often with the critical information.


Nice description! I’m also wondering if there’s any way to auto-recover, so that the failed parts are restarted on another healthy node.

I’m curious whether the state of this has evolved in the past two years. I found https://github.com/hildebrandmw/Checkpoints.jl which looks promising, although it does not seem to be maintained.

Edit: a few other resources:


Dagger.jl recently gained checkpoint/restore support, which lets you specify how and where to save data for each “thunk” (unit of work) in your computation. Of course, Dagger also has decent fault tolerance for when workers spontaneously die, and is designed for distributed and heterogeneous computing.

One day in the far future, CRIU support would be very cool to have.


Now that basic Erlang/OTP-like error handling is implemented in Actors.jl (see issue #16 there and the description in the manual), I have an issue about checkpointing and want to work on it in the coming weeks:

I would appreciate some comments or any help on that. Thank you in advance!
