Parallel Mersenne Twister

Like many people, I’m pretty excited about the new multithreading capabilities in Julia 1.3. But there’s one thing in the announcement that I’m not sure about, so it seems worth some discussion.

It says,

The approach we’ve taken with Julia’s default global random number generator (rand() and friends) is to make it thread-specific. On first use, each thread will create an independent instance of the default RNG type (currently MersenneTwister) seeded from system entropy.

I haven’t worked much with parallel random number generation, but ever since Hellekalek’s Don’t Trust Parallel Monte Carlo I’ve had a strong prior of suspicion when things sound too easy.

There are some ways to address this, for example O’Neill, 2014.

If there really is a problem with the current parallel RNG approach (is there?), what criteria should be most important when considering what approach to make the default? Which alternatives should be considered?

MersenneTwister has an incredible period length of

julia> big(2)^19937-1


So by using randjump to make sure the RNGs are spaced far enough apart, their streams will not overlap.
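
As a minimal sketch of that spacing idea, the following builds explicit, jumped generators instead of touching the default RNGs. The helper name `spaced_rngs` is hypothetical; `Future.randjump` and `MersenneTwister` are from the standard library, and the default of `big(10)^20` steps is only an illustrative choice.

```julia
using Random, Future

# Hypothetical helper: build `n` MersenneTwister instances from one seed,
# each one jumped `steps` ahead of the previous one.
function spaced_rngs(seed::Integer, n::Integer; steps::Integer = big(10)^20)
    rngs = [MersenneTwister(seed)]
    for _ in 2:n
        # randjump returns a new generator whose state is `steps` ahead of its argument.
        push!(rngs, Future.randjump(rngs[end], steps))
    end
    return rngs
end

rngs = spaced_rngs(1, 4)
# One draw per generator, e.g. one generator handed to each worker.
draws = [rand(r) for r in rngs]
```

Each worker would then be handed its own element of `rngs`, so no two workers share generator state.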

The approach I use in VectorizedRNG is that I have a large number of different multipliers for the LCG component of the PCG. Each thread (or process if multiprocessing) will be given its own unique multiplier, and thus its own unique stream.

Here is an example that initializes the default RNGs using randjump (from my comment in Julia’s issue tracker):

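using Random   # needed for Random.seed! and Random.THREAD_RNGs
using Future   # provides randjump

Random.seed!(1)
# Space each thread's RNG 10^20 steps after the previous thread's RNG.
# Random.THREAD_RNGs is an internal (non-public) array.
for i in 2:Threads.nthreads()
    Random.THREAD_RNGs[i] = Future.randjump(Random.THREAD_RNGs[i - 1], big(10)^20)
end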

(Note: this code snippet is not “thread safe”)

I think a related and more challenging question is how to make parallel algorithms composable and reproducible. Since Julia schedules tasks dynamically, executing the above snippet at the beginning of your script no longer guarantees reproducibility if you use multithreaded algorithms that touch the default RNG (or any other parallel RNG using thread-local state).

Oh that’s interesting. Seems this could be easily avoided with thread-local RNG state. I’m sure there is recent work in this direction (on my phone currently or I’d check).

IIUC, Threads.@spawn can run the task in any thread. So, if your computation relies on thread-local state, it relies on exactly how the tasks are scheduled (which is not deterministic).

It cannot be avoided with thread-local RNG state. It could be partially mitigated by using Task-local RNG state that uses seed!(rand(parent_task.RNG)) on each @spawn / @async. That way, the random streams are as predictable as the computational graph (i.e. are still non-deterministic with channels or with tasks that check whether others have computed; but code that only uses wait for synchronization has a chance).

Mersenne Twister is too large for that. If we used something like AES128-ctr or xoroshiro, then we could reasonably do that: < 64 bytes of state, <60 cycles to spin off a new instance from an old (overhead paid on every single task creation).
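
The fork-on-spawn idea above can be sketched roughly as follows. This is a minimal sketch, assuming Julia 1.7 or later for the small-state `Xoshiro` generator; `fork_rng` and `parallel_sum` are hypothetical names, and seeding a child from a single `UInt64` draw is an illustrative choice, not a vetted scheme.

```julia
using Random

# Hypothetical helper: derive a child generator from the parent by drawing a seed.
fork_rng(parent::AbstractRNG) = Xoshiro(rand(parent, UInt64))

function parallel_sum(rng::AbstractRNG, ntasks::Integer, n::Integer)
    # Fork every child in program order *before* spawning, so the seeds
    # depend only on the parent RNG and not on how tasks are scheduled.
    children = [fork_rng(rng) for _ in 1:ntasks]
    tasks = [Threads.@spawn sum(rand(child, n)) for child in children]
    # Combine results in task-creation order, so the floating-point sum is fixed.
    return sum(fetch(t) for t in tasks)
end

a = parallel_sum(Xoshiro(42), 8, 1_000)
b = parallel_sum(Xoshiro(42), 8, 1_000)
```

Under these assumptions `a` and `b` agree regardless of which thread runs which task, because each task only touches its own forked generator.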

Regarding randjump: Is there any performance reason with Mersenne Twister for using randjump instead of just having entirely separate RNGs?

Is it OK to do this? My understanding is that randjump is the only way to get “independent” RNG streams. Or at least that’s the one recommended by RNG researchers. But I have no deep understanding of this issue.

Statistical RNG research is all about using terribly bad fast PRNGs for Monte Carlo simulations in a way that hopefully does not distort results too much. As such, the answer depends on the specifics of how the RNG is implemented and of how the random numbers are used.

I come from a crypto background. There, our standard assumption is that a CSPRNG is indistinguishable from true random, within reasonable [1] limits on the amount of computation and without knowing the secret key / initial internal state. Under that assumption I can give you a simple proof that this procedure is ok: If you could run a computation where this is not OK, i.e. where the results differ from true random, then you would have demonstrated a distinguishing attack. This would lead to the world retiring the specific CSPRNG.

This is a very powerful and simple assumption, and is a useful heuristic even if it is known to be wrong. If such a procedure turns out to be not OK for a real statistical PRNG and real computation, then this would teach us something deep about the specific statistical PRNG and computation.

I don’t know whether that is OK for Mersenne Twister, nor can I say whether that is OK for xoroshiro, under reasonable assumptions on what kind of computations are done on the random numbers. Both generators are known to not be cryptographically secure, i.e. we know extremely fast computations that regenerate the internal state from a few observed outputs of the PRNG.

In other words, unqualified “use randjump” is bullshit. “use randjump instead of new states for Mersenne Twister and monte carlo simulations” or “use randjump instead of new states for xoroshiro and monte carlo simulations” might be good advice, though.

[1] “reasonable limits” vary from “not possible on realistic hardware in realistic timeframes” to “not possible between big bang and heat death of the universe, assuming current scientific understanding of crypto / complexity theory, fundamental physics and cosmology”. The claimed period of 2^20000 for Mersenne Twister is ridiculous; the visible universe between big bang and heat death is probably too small for 2^1000 computations, so the claimed period is “effectively never”. We can get “effectively never” for much fewer bits of internal state, though.

I appreciate the detailed explanation. So, using a task-local CSPRNG and combining seed!(rand(parent_task.RNG)) with @spawn sounds like a very powerful pattern for composable parallel algorithms (although I suppose it requires that the ratio of the single-thread speed of the non-CS PRNG to that of the CSPRNG is much smaller than the number of CPU cores, or that the RNG is not the bottleneck).

@foobar_lv2 Thanks for the detail, I agree this is very helpful context.

Crypto applications are important, but do you think it should be a priority for general use? I expect there’s some high cost to pay, or every application would use crypto-secure PRNGs. Crypto-oriented users are likely to check details of the PRNG, while most users “just want some random numbers”. I’d think the best approach could be to have secure PRNGs available, but for the default to be faster, non-secure, but with good statistical properties. But then, that’s my application area so I do have some bias :slight_smile:

I’d be interested to hear anyone’s experience or opinions on PCG. On paper it seems the best of everything (save crypto). Are there reasons to prefer others over it?

Maybe relevant discussion.

There are the newer RNGs, like PCG and xoroshiro, that look strictly superior to Mersenne Twister, but I am not an expert on that. If one has hardware support (for AES), then there are very fast cryptographic RNGs as well. Unfortunately, some architectures lack the hardware acceleration (ancient x86, some ARM, maybe some Atom). Vagaries of the cross-compilation build process might force generic binaries to use slow software implementations. But all modern CPUs where you would realistically like to run non-trivial computations have hardware acceleration.

A fork-on-task model of RNG states would incur an overhead on every single task creation. Hence, this can be done at most for the default RNG, and only when using an RNG with fast seeding and small state, i.e. not Mersenne Twister (it can work for PCG, xoroshiro, etc., and most cryptographic RNGs).

It is arguable whether the price of a default CSPRNG is worth the gain of mitigating some security flaws. The main price would be a small slow-down of rand-heavy code that uses the default RNG on sane CPUs, and a big slow-down on a few specific CPUs. The main gain would be to mitigate an entire class of security bugs (a secondary gain is that reasoning about randomness becomes easier).

Speed is the more common concern, but security incidents have way bigger impact than a handful of wasted cycles. Really good code would be unaffected either way (if optimized for speed and random generation is a bottleneck, then the code should already use a faster RNG than mersenne twister; if security is relevant, then the code should already use a CSPRNG).

I personally think that adopting a fast CSPRNG as default, and exporting an even faster non-cryptographic RNG, is a no-brainer. Most security people would agree, and some other modern languages agree as well, e.g. Swift. But reasonable people can disagree on this point; and the Julia user-base is more “technical computing” than Swift’s “general purpose”, so the decision is much harder for us than for Swift.

I would go for a fast, non-crypto RNG as the default, given that Julia is mostly targeted at scientific/numerical computing.

Regardless of the defaults, my experience is that it really pays off to keep the RNG explicit as an argument to all functions I write. It may sound a bit tedious, but I have always regretted not doing this, as it makes parallelization and unit testing much easier later on.
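
A minimal sketch of what this pattern can look like; `estimate_pi` is a hypothetical example function, and the fallback method assumes `Random.default_rng()` (available from Julia 1.3).

```julia
using Random

# The generator is the first argument, following the convention of `rand(rng, ...)`.
function estimate_pi(rng::AbstractRNG, n::Integer)
    # Count points drawn uniformly from the unit square that land inside the quarter circle.
    inside = count(_ -> rand(rng)^2 + rand(rng)^2 <= 1, 1:n)
    return 4 * inside / n
end

# Convenience method that falls back to the default generator.
estimate_pi(n::Integer) = estimate_pi(Random.default_rng(), n)

# In a unit test, pass a freshly seeded generator to get repeatable results.
x = estimate_pi(MersenneTwister(1), 100_000)
y = estimate_pi(MersenneTwister(1), 100_000)
```

With the generator passed in, a test can pin the seed without touching global state, and a parallel caller can hand each task its own generator.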

This is excellent best-practice :+1:

This is a great point, I need to update Soss to do this.

Counter-based RNGs (e.g. from Random123.jl, the Julia implementation of Random123, at JuliaRandom/Random123.jl on GitHub) also help a lot, in my experience, since they make it very easy to partition the RNG while still using a common seed.
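
To illustrate why counter-based generators partition so easily, here is a toy sketch. It does not use Random123.jl or its API, and it is not Philox or Threefry: `mix64` is a SplitMix64-style mixing function standing in for a real keyed bijection, and `cbrng` and `block` are hypothetical names. Its statistical quality is untested, so treat it purely as an illustration of the structure.

```julia
# SplitMix64-style finalizer used as the mixing function (toy stand-in only).
function mix64(z::UInt64)
    z = (z ⊻ (z >> 30)) * 0xbf58476d1ce4e5b9
    z = (z ⊻ (z >> 27)) * 0x94d049bb133111eb
    return z ⊻ (z >> 31)
end

# Hypothetical counter-based draw: the output depends only on (seed, stream, counter),
# so there is no mutable generator state to share or synchronize.
function cbrng(seed::UInt64, stream::UInt64, counter::UInt64)
    key = mix64(seed ⊻ mix64(stream))
    return mix64(key + counter * 0x9e3779b97f4a7c15)
end

seed = UInt64(2019)
# Each worker gets its own stream id under the common seed.
block(stream, n) = [cbrng(seed, UInt64(stream), UInt64(i)) for i in 1:n]

s1 = block(1, 5)
s2 = block(2, 5)
```

Because any draw can be recomputed from its indices alone, the result does not depend on which task or thread asks for it, or in what order.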

I think I agree with most of what foobar_lv2 has said above. But from my amateur-level reading of the literature (e.g. the Wikipedia article on Mersenne Twister), seed!(rand(parent_task.RNG)) is specifically what you must avoid ever doing. IIUC, this will give you N separate but correlated random number sequences, whereas randjump will give you uncorrelated (but limited) streams. Alternatively, initialization with a CSPRNG will give you unlimited uncorrelated streams (per foobar_lv2’s “powerful and simple assumption” above). One way to achieve the latter may be to write this call as seed!(aes_hash(rand(parent_task.RNG))), as mentioned at https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator#Designs_based_on_cryptographic_primitives, though I’ve done relatively little research here so I won’t promise that’s sufficient either.

I am not sure that is a concern in practice for non-crypto applications.

The period of MT is approximately 10^{6000}. Assume a single draw takes 1 ns, so you get no more than 10^{8 + 9} draws a year. With judicious use of jumps, you should be fine for even massively parallel workloads.
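
Spelled out, the estimate above reads as follows. The figure of about 3.15 × 10^7 seconds per year is added here, and the last line treats one randjump step as at least one draw, which is an assumption made only to get a lower bound.

```latex
\begin{align*}
\text{draws per year} &\approx 10^{9}\,\tfrac{\text{draws}}{\text{s}} \times 3.15 \times 10^{7}\,\tfrac{\text{s}}{\text{year}}
  \approx 3.2 \times 10^{16} < 10^{8+9} = 10^{17}, \\
\text{period} &= 2^{19937} - 1 \approx 4.3 \times 10^{6001}, \\
\text{years to exhaust a jump of } 10^{20} \text{ steps} &\gtrsim \frac{10^{20}}{3.2 \times 10^{16}} \approx 3 \times 10^{3}.
\end{align*}
```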

That’s fine, and randjump is cool, but note that our current method requires less bookkeeping, is conceptually simpler, and doesn’t have that little asterisk about “judicious use” (what is that?). I fail to see any reason to actually use randjump unless you’re in some alternative universe where cryptographic-quality hashes don’t exist.

And the fact that there exist papers / discussions on this kind of thing makes my point: Ensure that the PRNG is indistinguishable from true random, and bury the entire academic discourse on this kind of stuff. If one chooses a weak PRNG, i.e. a PRNG that fails for certain types of computations, then all uses / schemes will forever need to ask themselves and document “Am I doing a forbidden computation? How is the set of forbidden computations propagated to downstream users of my output?”. If one instead adopts the random oracle model, then nothing is forbidden, and the amount of scientific literature that people need to read fits on a napkin (“you may mentally model the RNG as true random; if this model turns out incorrect, then you will be compensated by having discovered the cryptographic sensation of the decade”).

In order to reap these benefits (mental clarity), we don’t even need a strong CSPRNG with adequate security margins; we only need one that passes the laugh test. E.g. reduced round number / simplified key schedule AES-CTR variants are fine for that. If one forks global RNG-state on task creation, one pays <15 throughput cycles + 16 bytes on semi-modern hardware (one could e.g. have a single key shared by all threads/tasks, and only fork the counter, since 2^128 \approx \infty).
