Why is `TaskLocalRNG` faster than `Xoshiro` with multiple threads?

I was a bit confused by the docs:

> In a multi-threaded program, you should generally use different RNG objects from different threads or tasks in order to be thread-safe. However, the default RNG is thread-safe as of Julia 1.3 (using a per-thread RNG up to version 1.6, and per-task thereafter).

They generally recommend using an explicit RNG for each task, but don’t give a reason why. Since the default RNG is now thread-safe, what is the reason? Performance?
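For concreteness, here’s the kind of pattern I understand the docs to be recommending (just a sketch; the seeding scheme is mine, not from the docs):

```julia
using Random

# One explicitly constructed RNG per task, seeded deterministically so the
# results are reproducible no matter how the tasks get scheduled.
# (The base seed 1234 is purely illustrative.)
tasks = map(1:Threads.nthreads()) do i
    Threads.@spawn begin
        rng = Xoshiro(1234 + i)
        sum(rand(rng) for _ in 1:10^6)
    end
end
total = sum(fetch.(tasks))
```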

I ran a simple test on a Julia 1.8 nightly and was also confused by the results: using an explicit `Xoshiro` RNG is slower on average than using an explicit `TaskLocalRNG`. (The explicit `TaskLocalRNG` performs the same as not passing any RNG at all and using the default.)

```julia
julia> using Random, BenchmarkTools

julia> function f(rng_call, n)
           # Each task constructs its own RNG (a fresh Xoshiro, or a handle to
           # the task-local RNG) and draws n random numbers from it.
           Threads.@threads for _ in 1:Threads.nthreads()
               rng = rng_call()
               sum(rand(rng) for _ in 1:n)
           end
       end
f (generic function with 1 method)

julia> Threads.nthreads()
8
```

```julia
julia> @benchmark f(Random.Xoshiro, 10^6)
BenchmarkTools.Trial: 2151 samples with 1 evaluation.
 Range (min … max):  981.468 μs … 15.590 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       2.306 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     2.313 ms ±  1.369 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▃▂▁▁   ▁█▆▃▅▁▁▂ ▁▄▃                                         ▁
  ████████████████████▆▄▅▄▆▃▄▃▁▃▃▃▁▃▃▁▃▁▄▁▁▄▄▄▁▃▃▃▁▄▄▃▃▄▄▃▄▃▁▄ █
  981 μs        Histogram: log(frequency) by time      9.15 ms <

 Memory estimate: 5.00 KiB, allocs estimate: 81.
```

```julia
julia> @benchmark f(Random.TaskLocalRNG, 10^6)
BenchmarkTools.Trial: 2902 samples with 1 evaluation.
 Range (min … max):  1.026 ms …  16.076 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.088 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.716 ms ± 998.001 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▁                         ▂▅▅▃▃   ▁▃▂▁                  ▁ ▁
  ███▆▅▃▃▄▄▄▄▁▃▄▅▃▃▃▄▁▁▁▁█▇▇▄▆████████████▅▆▆▃▄▅▆▆▇▄▁▃▆▅▆▆▇██ █
  1.03 ms      Histogram: log(frequency) by time      3.76 ms <

 Memory estimate: 4.50 KiB, allocs estimate: 65.
```

Seems like you answered your own question in the title?


From https://github.com/JuliaLang/julia/pull/32407:

> The global random number generator (`GLOBAL_RNG`) is now thread-safe (and thread-local) ([#32407]).
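And since 1.7 the default RNG is the per-task `TaskLocalRNG`, which you can check directly (a minimal sketch, assuming Julia ≥ 1.7):

```julia
using Random

# Seeding the default RNG and seeding TaskLocalRNG() touch the same per-task
# state, so the two draws below should match on Julia ≥ 1.7.
Random.seed!(42)
a = rand(5)

Random.seed!(42)
b = rand(TaskLocalRNG(), 5)

a == b  # expected to be true: rand() draws from the task-local state
```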

My benchmark actually shows the opposite of what I expected: the default `TaskLocalRNG` is faster than using a separate `Xoshiro` RNG for each task.

it IS task-local

Sorry, but I don’t follow your point.

I wasn’t suggesting that it isn’t task-local. In my two examples I create each RNG explicitly inside each loop iteration, so those RNGs are also task-local. Tasks are sticky by default, so there shouldn’t be any thread migration either. So I don’t see any reason why using an explicit `Xoshiro` should be slower.
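If it helps, this is roughly how I’d check for migration (just a sketch; `migration_check` is an illustrative name, not from the benchmark above):

```julia
using Random
using Base.Threads

# Record the thread id at the start and end of each iteration's work.
# If any entry is true, that task moved between threads mid-iteration.
function migration_check(n)
    moved = zeros(Bool, nthreads())
    @threads for i in 1:nthreads()
        t0 = threadid()
        rng = Xoshiro()                 # fresh RNG, local to this task
        sum(rand(rng) for _ in 1:n)
        moved[i] = threadid() != t0
    end
    any(moved)
end
```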

They are basically the same, and by the way I don’t think `TaskLocalRNG` is faster: if you look at the minimum time, the explicit `Xoshiro` is faster.

They’re not the same. You can run the benchmark for more samples if you don’t believe those results are significant. On my system the average runtime was consistently about 20% slower for `Xoshiro` when using threads.

When I run the test with just a single thread, `Xoshiro` is slightly faster, as expected. The main reason I posted this question is that I’m wondering whether something inefficient is going on with the threading that I’m not aware of.
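Something like this is what I mean by a single-threaded comparison (just a sketch; results omitted since they vary by machine, and the RNGs are constructed up front so only the draws are timed):

```julia
using Random, BenchmarkTools

# Construct each RNG up front so only the random-number draws are timed.
xo = Xoshiro(1)
tl = TaskLocalRNG()

@btime sum(rand($xo) for _ in 1:10^6)
@btime sum(rand($tl) for _ in 1:10^6)
```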

Hmm, actually, after killing all other active processes on my server and rerunning, I’m unable to reproduce a consistent difference. Sorry for the noise. Closing this.