Saving RNG state/reproducibility in large scale Monte Carlo simulations

Hi everyone,

(I know this question has been asked in different ways a bunch of times, but for me, I still could not find a good solution)

I am currently porting my large scale Quantum Monte Carlo framework to Julia but I hit a kind of road block when it comes to implementing checkpointing. Let me summarize the problem.

In short, for long calculations, I want to save the state of my simulation in a checkpoint file. This includes the state of the random number generator. The key reason for doing this is reproducibility. If I run the same simulation, same seed, and it finishes in 10 checkpoints or 11 checkpoints, it should give the same result.

Often people here seem to argue reproducibility is something I should not rely on in the first place. In my case, there are two killer reasons to have it though:

  1. Quantum Monte Carlo algorithms can be complicated. Sometimes you have bugs that only appear extremely rarely. I want same-machine and same version of everything reproducibility to just restart the same seed and have a reliable way to reproduce such bugs.
  2. Reproducibility of published results is nice. Of course, bit-by-bit is just infeasible and not worth it on different machines, but if someone with a different machine but the same version of everything can reproduce my results, that would make me feel a lot better.

Most importantly of all, even if reproducibility fails, resuming from checkpoint files should always work, no matter the version of the environment.

All of these problems were solved in my c++ version of the code. I even used a stable RNG implementation to get reproducibility of my random number stream across different versions.

Now in Julia land, I face the following problems:

  1. The number stream of Random is documented unstable.
  2. The state of the RNGs is hidden in private struct fields.
  3. Packages like Serialization or JLD2 seem very brittle. Their documentation mentions that if a struct contains Int type members or pointers or function pointers, things can fail. What if the RNGs add members of those types in the future? I do not even want to use those for my checkpoints. I want to use vanilla HDF5.

I would be okay with pinning everything to a version but what happens if I start a simulation in Julia 1.8 and want to continue it using 1.9? What if the things I have saved in my checkpoint files have become garbage and I will not be able to continue the simulation at all? This is the worst case scenario.

Commonly suggested solutions that do not work for me

  1. StableRNGs: It is an LCG unsuitable for production runs
  2. Save all the random numbers and attach them to your paper: A simulation will generate terabytes of random numbers.
  3. Pin all the versions: Okay, but how do I save the RNG state to a checkpoint file in a way that will at least recover gracefully if I do update the RNG at some point or continue the simulation on a different machine.

I hope I was able to describe my problem. In short, while I like Julia as a language a lot, the random number ecosystem contains these caveats that make me feel like I am building my castle on sand at the moment. Is there one clear path out of this that allows me to build a code where I can be sure it won’t break in two minor releases or should I wait another x years before switching?

Feel free to tell me why I am wrong about everything :slight_smile:

Could we get the xoroshiro RNG into its own StableXoroRNG package, then you have a suitable RNG that works well?

it doesn’t need to be a separate package. it can just be a separate rng from the same package.

Right just depends on if the StableRNG maintainers want it in there or not.

Having a stable xoroshiro would solve almost everything. If that package could then define a stable interface to export the state as a some form of byte array or whatever is the Julian way of doing it, I am a happy camper.

You should be able to rip the code out of Julia and make that package yourself. The xoroshiro page also links to other generators. Maybe the marsaglia multiply w carry?

In the very beginning, I considered ripping the entire Random package out, but it was a bit large for me to chew. Focusing on a single RNG may be more manageable.

one can always just use

1 Like

Right, I saw that package. It does not make stability guarantees as far as I know, but maybe it’s worth asking there.

I have been meaning to add StableXoshiro to the StableRNGs package. Was just waiting initially to see if bugs would pop up in Random’s implementation, and now waiting to find time for importing a version of it in StableRNGs. It will be slower in that making random vectors will repeatedly call the scalar version instead of using an optimized routine.

1 Like

There isn’t one state. I mean, AFAIK, there’s one stream/state per thread. It seems like you would then just want to save the state for them all (or use single-threaded), and that might be possible (I understand even HPC has check-pointing, that would also complicate, but I’m not sure how or if it works for RNG state).

It’s not possible with threading in general (though in some situations it might work), i.e. scheduling is non-deterministic, even on the same system.

Given threading, you would have to read your checkpoint file, and start with as many (or more, just disabling those extra threads?), still seemingly complications. And (for Monte Carlo) you likely want multi-threaded.

Is exact reproducibility worth it, given the complexity? I can se you want restarting, and even better on a larger machine with more threads and actually using those extra.

Was it single-threaded?

[I wouldn’t call Serialization “brittle” (at least if you pin your version, or otherwise prevent possible changes of a “definition of a type in an external package”, or at least know of such a change), you just have to use it right, and know the limitations. If you could serialize and read back, even for pointer code (see though below on , that’s not a reason it will break in the future.]

I think Int64 and Int32 is ok, it’s just that Int can mean either, but if you don’t use (outdated) 32-bit, then it’s always Int64. I don’t know of other issue (at least in your case, or by avoiding Int, note literal integers are Int…), and that one seems like a non-issue if you just specify non-32-bit used.

Pointer (Ptr) code is very rare in Julia, and even if pointers may may be used behind the scenes, even explicitly then may not be a problem. I’m not sure, but I think at least some code makes it work.

I guess that applies to function pointers too. I’m not even sure how you do that in Julia, nor that it’s a real worry, would you serialize such? nd for all of this there are alternative serialisation formats, mostly likely Arrow.jl applies for you, and I think no problems (also HDF5.jl and others). JLD[2].jl is actually built on HDF5 file format, and may not have problems.

I’m not sure I understand the question. They generate e.g. Float64 (the default), Float32, Int64 or Int32, what you ask for. You can ask for rand(Int) but I would avoid that, it’s going to be less uniformly random if you assumed Int64 on 32-bit…

All of these types have their streams unstable as you mention, but I think the issue is overblown. It would only possibly happen in a major minor upgrade, e.g. 1.10, 1.11 etc. (but not in 1.x.y, for higher y only) but likely at best/worst very far into the future. Well really took our time deciding the new default for 1.7, and you’re not likely to ever see a faster RNG. You will know when it’s changed, and if your code isn’t deterministic anyway because of threads then why bother worrying? Even if we change we would support the old RNG, just as we did from the last (and only) change. Should be a very localized change in your code.

I’ve considered 1.x upgrades for higher x a major upgrade, in the docs it’s described as “minor”:

The read-back value will be as identical as possible to the original, but note that Ptr values are serialized as all-zero bit patterns (NULL). […]
The data format can change in minor (1.x) Julia releases, but files written by prior 1.x versions will remain readable. The main exception to this is when the definition of a type in an external package changes. If that occurs, it may be necessary to specify an explicit compatible version of the affected package in your environment. Renaming functions, even private functions, inside packages can also put existing files out of sync. Anonymous functions require special care: because their names are automatically generated, minor code changes can cause them to be renamed. Serializing anonymous functions should be avoided in files intended for long-term storage.

In some cases, the word size (32- or 64-bit) of the reading and writing machines must match. In rarer cases the OS or architecture must also match, for example when using packages that contain platform-dependent code.

I’m guessing that “In some cases, the word size (32- or 64-bit) […] must match” only applies to Int, and only going to 32-bit, or vise versa, as mentioned, so not a real worry.

In rarer cases the OS or architecture must also match

I’m guessing non-little-endian (or referring to something else [too]?. The world has standardized on little-endian, I think even for PowerPC on Julia, at least this is least of my worries.

Please do not (or really think about it).

Is it worth it, to confuse users having two+ options there, i.e. that one redundantly there and slower than Base, which isn’t likely to change soon. Cross that bridge when we get there (would it belong in Compat.jl or elsewhere, and is even right now elsewhere)? What’s in the package is good enough for tests, what it’s meant for. I wouldn’t want people tempted by the “stable” package for wrong reasons.

Thanks for the detailed answer!

My c++ code is MPI parallelized and different single-threaded Monte Carlo runs have their own RNG. Each run has its own parameter set. They don’t really communicate other than reporting progress. It’s basically just a scheduler that averages results in the end. If I want to reproduce a bug, I typically run a job with one worker and the exact parameter set/seed I am interested in. Reproducing everything in a project is possible, also using extra workers, as long as there are not more workers than the number of parameter sets ever calculated. But I don’t mind. In C++ this was not so complex and it worked, at least on the same machine, where I used it for debugging and (I admit, lazy) testing.

I agree with everything about the using Serialization and JLD2 correctly. For my own structures I would have no problem, but what I would be doing is to save an external type I have no control of. I can assume it does not use any of the things that may cause issues. I can pin it to a version, but then what if that pinned version becomes incompatible with the newest Julia release? Will I be forced to break compatibility in my checkpoint file format because of a Julia update? Maybe it is a non-issue at the moment and in this case, but I think when the goal is to write stable software, it is helpful to avoid these kinds of assumptions from the start. Reading lines like

In rarer cases the OS or architecture must also match

just makes me anxious. What is a rare case? How should I as a newcomer judge that when you also have to guess.

Knowing the RNG will not change for a long time is useful information in any case. Using Serialization & Co on any external types that are not documented to be stable under it does not sit right with me though. Even for my own stuff I write custom serializers that will stay stable if I change internals.

In C++, I used to rolled my own RNG implementation for that level of control, but both the MKL and std::random RNGs had functions to save and load their state. How guaranteed they are to be stable, I do not know.

Edit: Actually, I can just write my own serializer for some Random RNG… it would require maintaining whenever something in the internals changes (which is unlikely in the near future, I assume?), but if it does, I have one layer where I can try to keep compatibility and fail gracefully if it’s not possible. Does that sound like an okay idea?

That’s not likely to happen, since Julia has (syntax and documented API) compatibility. You might use some other package (or that one) you want to upgrade, and pinning would block it. But Julia upgrade shouldn’t force that, unless some newer upgraded package is only compatible with later Julia.

If you upgrade by choice or accident, I think it would be pretty obvious if serialization would fail. Or I hope so, I couldn’t rule it out undetected (for a while) do be careful or just use e.g. Arrow.jl.

I would look into HPC snapshotting (I’m not too familiar), best case it does this automatically for you, it’s plausible that could work, dumping memory (even the program and packages used with it?):

That’s a good question. I think they are covering their bases, and I am too. I’m pretty sure I know the limitations, but great if others would confirm. Serialization, is an easy concept, it’s not magic, but there are some pros and cons to different ways. You can look into other posts on it.

Big- vs little-endian is an obvious difference between computers, and if you serialize in little-endian, and read in a big-endian machine then all your numbers will be off. One way is to support little-endian format for both then big-endian will be second-class (reading/writing slightly slower).

I’m not even sure Julia supports any big-endian CPUs. I think they just want their options open. This “rare” would only apply if you opted into such a non-standard CPU/OS.

Right now I think fully compiled code isn’t stored in packages, but it will change. “contain platform-dependent code” might apply to that, or something already existing. I don’t think you need to worry about it. You’re not going to be serializing such code. The Pkg manager would do it and make sure it just works, or recompiled or something.

I don’t know if the following is helpful:

# to save rng state
rng_state = copy(Random.default_rng())
using BSON: @save, @load
@save "rngstate.bson" rng_state

# to set rng to loaded state
@load "rngstate.bson" rng_state
copy!(Random.default_rng(), rng_state)

As others have mentioned you might need to do this per thread if you are using that. You don’t have to use BSON either, or the detault rng, swap that bit out for whatever you want.

1 Like

That’s essentially what I did a few years ago for my (D)QMC code, see dqmc/helpers.jl at 0797999881ca21013da0a6c44ff852017ac2127f · carstenbauer/dqmc · GitHub. Never had issues with it since the RNG only changed once since I started to seriously user Julia (about 6-7) years ago. (On the same machine, my code was resumable to the extent that you could randomly kill it and still get exactly the same numbers out.)

Btw, I assume you are aware of GitHub - carstenbauer/MonteCarlo.jl: Classical and quantum Monte Carlo simulations in Julia?

1 Like

Hi, very cool! Then that’s what I will do. I actually did check your package for how you did it before, but I failed to find that snippet somehow.

My plan is to bring a Stochastic Series Expansion code to Julia, along with an algorithm-agnostic Monte Carlo scheduler (sort of like a minimalist version of what ALPS had) capable of parallel tempering. I also saw your BinningAnalysis.jl.

P.S.: I think we’ve met in Cologne when I was visiting a few years ago. I was a PhD student of Stefan Wessel.