Checkpointing with Julia

johnh · May 9, 2019, 12:32pm

The post today by @Xing_Shi_Cai leads me to ask - has anyone worked with checkpointing in Julia?
A bit of Googling suggests that DMTCP would work: http://dmtcp.sourceforge.net/

I guess the real answer is - why don’t I give it a try instead of asking?

jpsamaroo · May 9, 2019, 3:50pm

Can you give a brief description of the kind of checkpointing you’re referring to, and maybe a small example of a situation in which it’s beneficial? The Sourceforge page assumes prior knowledge of what checkpointing is.

sdanisch · May 9, 2019, 3:56pm

Nextjournal offers similar checkpoints as an experimental feature via CRIU.

If you happen to try it out, please let us know how it works for you and whether you run into any bugs.

johnh · May 9, 2019, 4:24pm

Sure. When running on an HPC cluster you run a job on many compute nodes. Julia makes this easy. Quite often compute jobs will run for days or even weeks. If a compute node crashes during that time then all your work is lost.
Applications will often write out an intermediate file every N time steps.
You can then use this file to restart your computation from the last saved point before the failure - you do not have to start from the beginning.
If the application does not write an intermediate save file itself, a checkpoint means that the state of the application on each compute node is written to disk. You can then restart the application from that point.
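
The application-level approach described above (write an intermediate file every N time steps, then restart from it) might look roughly like this in Julia. This is only a minimal sketch using the `Serialization` standard library; the function `run_simulation`, its keyword arguments `ckpt_file` and `every`, and the toy update rule are all hypothetical stand-ins for a real computation.

```julia
using Serialization

# Hypothetical example: `run_simulation`, `ckpt_file` and `every` are
# illustrative names, not part of any package mentioned in this thread.
function run_simulation(nsteps; ckpt_file="state.ckpt", every=100)
    # Resume from the last saved state if a checkpoint exists, else start fresh.
    step, x = isfile(ckpt_file) ? deserialize(ckpt_file) : (0, 0.0)
    while step < nsteps
        step += 1
        x += 1.0 / step              # stand-in for one time step of real work
        if step % every == 0
            # Write to a temporary file first, then rename it, so a crash
            # in the middle of a write does not clobber the last good checkpoint.
            tmp = ckpt_file * ".tmp"
            serialize(tmp, (step, x))
            mv(tmp, ckpt_file; force=true)
        end
    end
    return x
end
```

Calling `run_simulation(10_000)` again after a crash picks up from the most recent multiple of `every` steps instead of from step 0. Note that the `Serialization` format is not guaranteed to be readable across Julia versions, so for long-lived checkpoints a dedicated file format may be a better choice.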

With exascale compute clusters being almost ready to go, the probability of a node failure during a job becomes very high. So there is huge interest in writing out these checkpoints quickly to ‘burst buffer’ filesystems.

tbeason · May 9, 2019, 4:31pm

Oh interesting. I never even thought to look for something like this. The one time I needed it, I did it the old-fashioned way: I just wrote out several large txt files every so often with the critical information.

tk3369 · May 9, 2019, 10:47pm

Nice description! I’m also wondering if there’s any way to auto-recover so that the failed parts are restarted on another healthy node.

tbenst · February 11, 2021, 6:45pm

I’m curious whether the state of this has evolved in the past two years. I found https://github.com/hildebrandmw/Checkpoints.jl which looks promising, although it does not seem to be maintained.

Edit: a few other resources:

linux / container userspace approach: CRIU - Checkpoint/Restore in user space - Red Hat Customer Portal
python approach (not widely adopted): https://github.com/a-rahimi/python-checkpointing2

jpsamaroo · February 11, 2021, 10:14pm

Dagger.jl recently gained checkpoint/restore support, which lets you specify how and where to save data for each “thunk” (unit of work) in your computation. Of course, Dagger also has decent fault tolerance for when workers spontaneously die, and is designed for distributed and heterogeneous computing.

One day in the far future, CRIU support would be very cool to have.

pbayer · February 15, 2021, 7:50am

Now that basic Erlang/OTP-like error handling is implemented in Actors.jl (see issue #16 there and the description in the manual), I have opened an issue about checkpointing and want to develop it over the next few weeks:

github.com/JuliaActors/Actors.jl

Implement basic checkpointing

opened 05:31PM - 13 Feb 21 UTC

pbayer

enhancement error handling

Now with basic error handling (see issue #16 and [description in the manual](https://juliaactors.github.io/Actors.jl/dev/errors/)) there is still an issue of maintaining/saving and restoring actor state at termination and restart. For actor and task restart (by supervisors) `checkpoint` and `restore` is an important option. Thus actor state can be restored at restart.

## Actor initialization and termination with user defined callback functions

- [x] develop `init!` functionality,
- [x] develop `term!` functionality,
- [x] implement the restart strategy described below.

## User-defined checkpointing:

- [x] basic `checkpointing` actor,
- [x] `checkpoint` call,
- [x] `restore` call,
- [x] checkpointing hierarchy,
- [x] checkpointing interval for 2nd level.

## Integration

- [ ] tests,
- [ ] documentation,
- [ ] examples

I would appreciate some comments or any help on that. Thank you in advance!
