Reproducible multithreaded monte-carlo / Task local random

foobar_lv2 · February 28, 2020, 12:51am

This is a companion post to https://github.com/JuliaLang/julia/pull/34852 , for a more user-centric discussion.

Reproducibility of randomized computations is a very nice property (when Random.seed! is used). This is good for testing, debugging, and generally for reproducing results. Unfortunately, reproducibility is currently somewhat at odds with performance: If you spawn any @async tasks, or use multithreading, then you normally don’t get reproducibility (even if there are no race conditions in your user code). This is

because the speed in which tasks finish, and how the scheduler switches between tasks, is fundamentally unpredictable (depends on randomized algorithm in the OS kernel and microarchitectural state of your CPU cores). Schedulers are inherently racy.
because of shared global state – the random generator – that is mutated by all your tasks (julia tasks, also called coroutines). For performance reasons, this is not how the julia default RNG works: Instead of one global RNG, there are nthreads() many, and each task mutates the RNG that belongs to the thread where the task happened to be scheduled. This, however, does not improve the reproducibility situation.

One obvious solution is to hand each computational Task its own RNG instance (discarded on task termination), such that there is no shared state; this RNG instance gets seeded from the parent RNG, which gets advanced in the process. With this, one can write computations that are independent of scheduler decisions.

Doing this by hand is somewhat cumbersome: User code would be responsible for all this book-keeping and passing-around of RNG instances. An alternative is to make the julia runtime do this work. This incurs a small (negligible?) overhead on task creation, and necessitates a change in the RNG algorithm (Mersenne twister takes too much memory). In the proof-of-concept, one uses rand(Random.TaskLocal()) to access the multithreading-reproducible random stream.

So, my question for the people here on discourse who run or maintain infrastructure for randomized computations: How do you currently deal with reproducibility vis-a-vis multithreading? Is this something only a tiny minority cares about? Is this a killer feature that you didn’t think was possible?

Ping @rfourquet @ChrisRackauckas @trappmartin

stephenll · February 28, 2020, 3:48am

At my company, our simulations are written in a combination of c#, c++, and assembly. Yes, there is assembly in there. We divide our simulations into “chunks” which is usually around 10-100 simulations. Each chunk gets its own seed. The scheduler we use is dynamic like the one in Julia and by having each chuck get its own seed we get reproducibility no matter if we are running on any number of cores. 1 core will produce the same results as 6 cores.

I am currently trying to do the same with Julia. Re-seeding the MT algorithm is very expensive time wise. Right now it is faster to keep my algorithm single threaded.

For me, this is a killer feature.

Tamas_Papp · February 28, 2020, 8:02am

In testing my own MCMC libraries, I gradually transitioned from treating any kind of bit-by-bit reproducibility of “random” results as a very desirable feature to considering it a red flag that I am
taking shortcuts in understanding my problem, figuring out the invariants, and finding the right tests.

Any kind of MC reproducibility is very fragile for nontrivial code. Non-deterministic multithreaded code can break it, but also changes in the RNG code (Random does not guarantee this kind of reproducibility when versions change), the environment (architecture, how external libraries were compiled).

Instead I prefer to break up my code into deterministic and random units, figure out invariants that hold for both, and devote time to devising powerful tests with a low false positive rate. This is a costly and time consuming process, but led me to uncover some bugs in my code that were not caught by reproducible MC results — I was getting the same bits as yesterday, but did not realize that they were the wrong bits.

So I think that this kind of reproducibility is overrated and the wrong thing to focus on.

pbayer · February 28, 2020, 8:17am

From my view (Discrete Event Systems) tasks need their own RNG only if they have their own time base. The RNG is always related to time. What you want to have is reproducibility of event sequences along a time line.

Two scenarios:

in the first scenario a task represents an entire simulation. Each such task needs its own time line and RNG.
In the second scenario a task branches into multiple actors (tasks) each representing an entity (e.g. a car, a person, a server …) collaborating on a timeline. Then they need a common clock/RNG to which they can refer in order to organise their actions on the time line. In this scenario those actors don’t need a task specific RNG. This would be counterproductive.

As I see it, there is a tradeoff involved: do we want to have generally a task specific RNG which would simplify things for people working on the first scenario. Or do we want maximum speed in those cases (2nd scenario) where this is not needed.

I definitely would prefer speed since I consider the 2nd scenario more relevant.

Putting it in more general terms: Tasks not always implement stochastic timed automata. Maybe they implement only simple state machines or timers listening to a signal … Therefore the RNG should remain an option.

rfourquet · March 22, 2020, 8:30am

Do you mean that you would then pass explicitly an RNG between those tasks, or that you would rely on the global RNG in the scenario where only one thread is used?

I don’t fully understand if your only concern with a task-local RNG is speed, but when the RNG is not used, the overhead would be negligible (thanks to the expert work from @foobar_lv2).
On the other hand, when it is used, I believe this would be quite faster, as – since Julia 1.3 – using the global RNG has a significant overhead due to thread-safety (plus, the new proposed RNG algorithm is already faster than MersenneTwister in itself). In other words, @foobar_lv2’s PR provides both more speed and reproducibility! (if I got it right)

rfourquet · March 22, 2020, 8:35am

I would tend to agree with @Tamas_Papp that reproducibility for tests is not that important for the reasons he gave, although it can come in handy occasionally.

On the other hand, although I didn’t have the chance yet to do multithreading in Julia, I find that reproducibility for debugging would be really nice. An example would the factorization algorithm in Primes.jl, which can easily be made multithreaded. Performance tuning/debugging would also be improved: the factorization speed can depend a lot on the initial seed, and reproducibility can help understanding patterns in the algorithm.

pbayer · March 22, 2020, 9:28am

Yes, in my 2nd scenario generally those tasks referring to and coordinating on the same timeline must share a common RNG. Sharing a RNG between tasks won’t go away in many use cases.

Yes, speed is my main concern. If speed is not compromised – in those cases where we don’t use a task-local RNG –, I’m ok with it. If it is faster for those cases where we use a local RNG, it is even better.

foobar_lv2 · March 24, 2020, 8:36pm

My PR does actually three things:

Implement a faster PRNG than mersenne twister
Make it task-local
Provide possibly lower overhead access to single random numbers in multi-threaded environments

Details:

The faster PRNG could be done in packages, with the same approach and code (xoshiro with state-splitting and avx2 for bulk generation) and is imo strictly (pareto) superior to mersenne twister. Note that the state-split-on-bulk gives up on the following reproducibility feature: Currently seed!(rng, 23); A = [rand(rng) for i=1:N] is the same as seed!(rng, 23); A = zeros(N); rand!(rng, A). With this PR, both are different (rand!(array) does not generate N random numbers, it fills a single random buffer). If you look at the linked PR, you see a 2x speedup for single and 3x speedup for bulk generation of UInt64. In fact, my laptop generates random at 12 gbps, and fill!(arr, 0) manages only 18 gbps. Within spitting distance of being bandwidth bound instead of cpu bound.
Task-locality cannot be done by packages, it must be done deep inside the julia runtime. It gives the nice multi-threaded reproducibility. It costs a handful of bytes and cycles for every task-spawn, regardless of whether random numbers are used in the program at all. The costs are imo borderline negligible, but this is an ugly precedent (first logging wanted language-level support, then random, who’s next?).
The codegen parts of my PR are obsolete, because an independent PR by @jeff.bezanson got merged in the meantime which provides the same fast access to current_task(). Yay! Otherwise, it is unfortunately kinda annoying to get proper fast access to thread-local variables in julia (proper: the exact desired instruction sequence with proper tbaa tags for good LICM). Task-local doesn’t have this problem if we cheat and embed it into the runtime.

rfourquet · March 25, 2020, 10:52am

This is currently not guarranteed, the reproducibility you observe here is very dependent on the specific operation. Change it a little bit and the reproducibility disappears:

julia> N = 400; rng = MersenneTwister();

julia> Random.seed!(rng, 23); rand(rng); extrema([rand(rng) for i=1:N])
(0.0005358737997744889, 0.9999865794616105)

julia> Random.seed!(rng, 23); rand(rng); extrema(rand!(rng, zeros(N)))
(0.00015700695341003268, 0.9930342736482414)

The reason is the same as for your PR, rand!(array) fills directly the array buffer (this array randomization is the native operation for MersenneTwister; the apparently innocuous call to rand(rng) in my example above will fill up a whole internal cache buffer in rng dedicated to scalar generation and to filling up arrays which are too small for the “native buffer randomization”).