Hi everyone,
(I know this question has been asked in various forms many times, but I still could not find a good solution.)
I am currently porting my large-scale Quantum Monte Carlo framework to Julia, but I hit a roadblock when it comes to implementing checkpointing. Let me summarize the problem.
Requirements
In short, for long calculations, I want to save the state of my simulation in a checkpoint file. This includes the state of the random number generator. The key reason for doing this is reproducibility. If I run the same simulation with the same seed, it should give the same result whether it finishes in 10 checkpoints or 11.
Often people here seem to argue reproducibility is something I should not rely on in the first place. In my case, there are two killer reasons to have it though:
- Quantum Monte Carlo algorithms can be complicated. Sometimes you have bugs that only appear extremely rarely. I want reproducibility on the same machine with the same version of everything, so that I can simply rerun the same seed and reliably reproduce such bugs.
- Reproducibility of published results is nice. Of course, bit-for-bit reproducibility across different machines is infeasible and not worth it, but if someone with a different machine but the same version of everything can reproduce my results, that would make me feel a lot better.
Most importantly of all, even if reproducibility fails, resuming from checkpoint files should always work, no matter the version of the environment.
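To make the requirement concrete, here is a minimal sketch of the property I am after (illustrative only, not my production code). It assumes a hand-written xoshiro256++ generator, here called OwnedXoshiro, whose four state words are ordinary fields, so saving it amounts to writing four UInt64 values to HDF5. All names (OwnedXoshiro, save_state, load_state, run!) are made up for this post, and the generator has not been checked against reference output, nor is it meant to match the stream of Julia's built-in Xoshiro.

```julia
# Hypothetical user-owned generator (xoshiro256++). The state is four plain
# UInt64 fields, so a checkpoint never has to touch the internals of Random.
mutable struct OwnedXoshiro
    s0::UInt64
    s1::UInt64
    s2::UInt64
    s3::UInt64
end

# Rotate left; valid for 0 < k < 64.
rotl(x::UInt64, k::Int) = (x << k) | (x >> (64 - k))

# One SplitMix64 step; only used to expand a single seed into four words.
function splitmix64(x::UInt64)
    x += 0x9e3779b97f4a7c15
    z = x
    z = (z ⊻ (z >> 30)) * 0xbf58476d1ce4e5b9
    z = (z ⊻ (z >> 27)) * 0x94d049bb133111eb
    return x, z ⊻ (z >> 31)
end

# Seed constructor (seed must be non-negative).
function OwnedXoshiro(seed::Integer)
    x = UInt64(seed)
    x, a = splitmix64(x)
    x, b = splitmix64(x)
    x, c = splitmix64(x)
    x, d = splitmix64(x)
    return OwnedXoshiro(a, b, c, d)
end

# Advance the generator and return the next 64-bit output.
function next_u64!(r::OwnedXoshiro)
    result = rotl(r.s0 + r.s3, 23) + r.s0
    t = r.s1 << 17
    r.s2 ⊻= r.s0
    r.s3 ⊻= r.s1
    r.s1 ⊻= r.s2
    r.s0 ⊻= r.s3
    r.s2 ⊻= t
    r.s3 = rotl(r.s3, 45)
    return result
end

# Uniform Float64 in [0, 1) built from the top 53 bits.
next_float!(r::OwnedXoshiro) = (next_u64!(r) >> 11) * 2.0^-53

# Checkpoint interface: a plain integer vector that HDF5 can store as-is.
save_state(r::OwnedXoshiro) = UInt64[r.s0, r.s1, r.s2, r.s3]
load_state(w::AbstractVector{UInt64}) = OwnedXoshiro(w[1], w[2], w[3], w[4])

# Toy stand-in for the simulation: accumulate random numbers.
function run!(r::OwnedXoshiro, acc::Float64, nsteps::Int)
    for _ in 1:nsteps
        acc += next_float!(r)
    end
    return acc
end

# Uninterrupted run of 1000 steps.
reference = run!(OwnedXoshiro(42), 0.0, 1000)

# Same run, interrupted by a checkpoint after 400 steps.
rng = OwnedXoshiro(42)
partial = run!(rng, 0.0, 400)
words = save_state(rng)            # this vector would go into the HDF5 file
resumed = run!(load_state(words), partial, 600)

@assert resumed === reference      # the invariant: bit-identical result
```

The final assertion is the whole point: where the checkpoints fall must not change the result.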
All of these problems were solved in the C++ version of my code. I even used a stable RNG implementation to get reproducibility of my random number stream across different versions.
Now in Julia land, I face the following problems:
- The number stream of Random is documented as unstable across Julia versions.
- The state of the RNGs is hidden in private struct fields.
- Packages like Serialization or JLD2 seem very brittle. Their documentation mentions that if a struct contains Int type members or pointers or function pointers, things can fail. What if the RNGs add members of those types in the future? I do not even want to use those for my checkpoints. I want to use vanilla HDF5.
I would be okay with pinning everything to a version, but what happens if I start a simulation in Julia 1.8 and want to continue it using 1.9? What if the things I have saved in my checkpoint files have become garbage and I cannot continue the simulation at all? This is the worst-case scenario.
Commonly suggested solutions that do not work for me
- StableRNGs: It is an LCG unsuitable for production runs
- Save all the random numbers and attach them to your paper: A simulation will generate terabytes of random numbers.
- Pin all the versions: Okay, but how do I save the RNG state to a checkpoint file in a way that will at least recover gracefully if I do update the RNG at some point or continue the simulation on a different machine?
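For the graceful-recovery part, what I have in mind is roughly the following sketch. It assumes the checkpoint stores a format tag next to the raw state words (both would be plain HDF5 datasets or attributes; a struct stands in for them here). The tag string and the names RngRecord, restore_rng and fresh_words are hypothetical. If the tag is not recognized, the run continues from a state derived deterministically from the seed and the checkpoint index, which gives up bit-for-bit reproducibility but keeps the simulation resumable.

```julia
# Hypothetical tag identifying the generator and layout of the stored words.
const RNG_FORMAT = "owned-xoshiro256pp-v1"

# Stand-in for what would be read back from the HDF5 checkpoint.
struct RngRecord
    format::String
    words::Vector{UInt64}
end

# Derive four state words from (seed, checkpoint index) with SplitMix64-style
# mixing. Only used when the stored state cannot be interpreted.
function fresh_words(seed::Integer, checkpoint_index::Integer)
    x = UInt64(seed) ⊻ (UInt64(checkpoint_index) * 0x9e3779b97f4a7c15)
    words = Vector{UInt64}(undef, 4)
    for i in 1:4
        x += 0x9e3779b97f4a7c15
        z = x
        z = (z ⊻ (z >> 30)) * 0xbf58476d1ce4e5b9
        z = (z ⊻ (z >> 27)) * 0x94d049bb133111eb
        words[i] = z ⊻ (z >> 31)
    end
    return words
end

# Return (state_words, exact). `exact == true` means the stored state was
# understood and the stream continues bit-for-bit; `false` means the run is
# still resumable but uses a freshly derived state.
function restore_rng(rec::RngRecord, seed::Integer, checkpoint_index::Integer)
    usable = rec.format == RNG_FORMAT &&
             length(rec.words) == 4 &&
             any(!iszero, rec.words)
    if usable
        return copy(rec.words), true
    end
    @warn "Unrecognized RNG state in checkpoint; deriving a fresh state" rec.format
    return fresh_words(seed, checkpoint_index), false
end

# Usage: a checkpoint in the known format, and one from some other format.
known = RngRecord(RNG_FORMAT, UInt64[1, 2, 3, 4])
words_known, exact_known = restore_rng(known, 42, 3)

other = RngRecord("some-other-format", UInt64[7, 8])
words_other, exact_other = restore_rng(other, 42, 3)
```

Storing the `exact` flag in the output file would at least make it visible after the fact which runs lost reproducibility at a restart.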
I hope I was able to describe my problem. In short, while I like Julia as a language a lot, the random number ecosystem has caveats that make me feel like I am building my castle on sand at the moment. Is there one clear path out of this that allows me to build a code where I can be sure it won’t break in two minor releases, or should I wait another x years before switching?
Feel free to tell me why I am wrong about everything.