Thread Local Storage in Julia

Lately I have begun to tackle the RNG thread-safety issue; you can find the full story here.

After a couple of unsuccessful attempts, I checked out how Rust implements a multi-threaded RNG. It works in a LocalKey fashion, like thread-specific data in pthreads.

I was thinking about writing a Julia wrapper for pthread_setspecific(), but it seems that it would go against the premise of getptls() that there is a unique pthread key, since each user-allocated value would require its own static key.
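
For concreteness, here is a rough ccall-based sketch of what such a wrapper could look like (hypothetical names, POSIX only, assuming Julia 0.7 naming and that pthread_key_t is a Cuint as on Linux). Note that every value stored this way needs its own key, which is exactly the static-key problem above:

# Hypothetical thread-specific-data wrapper via ccall (POSIX only).
function tls_key_create()
    key = Ref{Cuint}(0)
    # The second argument is an optional destructor; C_NULL means none.
    rc = ccall(:pthread_key_create, Cint, (Ref{Cuint}, Ptr{Cvoid}), key, C_NULL)
    rc == 0 || error("pthread_key_create failed: $rc")
    return key[]
end

tls_set(key::Cuint, value::Ptr{Cvoid}) =
    ccall(:pthread_setspecific, Cint, (Cuint, Ptr{Cvoid}), key, value)

tls_get(key::Cuint) =
    ccall(:pthread_getspecific, Ptr{Cvoid}, (Cuint,), key)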

The point is: has anybody ever actually needed TLS? Because if not, writing a general wrapper would be unnecessary.


See also https://github.com/JuliaLang/julia/issues/10441; it is explained there (and in https://github.com/JuliaLang/julia/pull/26562) how to get thread-local storage of the RNG.

Two difficulties: how to do this by default without degrading single-thread performance, and what parallel RNG algorithm to use in order to get statistically independent streams.

As Jeff commented in one of the issues I linked, thread local storage is much more efficient than task local for this kind of thing, mainly because Julia uses a fixed pool of threads created when Julia starts.

Hi @stevengj, thanks for your response! I wrote a small script to test a similar version of an RNG stored in a thread-indexed array:

using Random, BenchmarkTools

# gpt = generators per thread; returns a gpt × nthreads array of
# generators obtained by jumping the base RNG to disjoint streams.
function init_TRNG(rng = MersenneTwister(0), gpt = 1)
    n = Threads.nthreads()
    rngjmp = randjump(rng, big(10)^20, n * (gpt + 1))
    threadRNG = reshape(rngjmp, (gpt + 1, n))[1:gpt, :]
    return threadRNG
end

TRNG = init_TRNG()

# Draw from this thread's generator (indexed by thread id).
trand() = rand(@inbounds(TRNG[Threads.threadid()]))

function f(n::Int)  # baseline: shared default RNG
    a = zeros(n)
    Threads.@threads for i in 1:n
        a[i] = rand()
    end
    return a
end

function g(n::Int)  # per-thread RNGs via trand()
    a = zeros(n)
    Threads.@threads for i in 1:n
        a[i] = trand()
    end
    return a
end
f(400)  # warm-up calls to trigger compilation before benchmarking
g(400)

Using the perf patch, I noticed that g() spends more time in ti_threadgroup_fork (called by @threads) during thread synchronization.

This gives a 5-10% overhead. It also seems that randjump is quite slow, which seriously skews the benchmarks, especially for small loop counts.

Also, the most interesting part is that the extra threadid() call adds less than 1% overhead, so any attempt to circumvent it (e.g. pthread TLS initialization or task-local generators) probably won’t make a significant difference.
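
For what it’s worth, the lookup cost can be sanity-checked in isolation with BenchmarkTools:

using BenchmarkTools

# Cost of the thread-id lookup alone; typically around a nanosecond,
# which is why trying to circumvent it buys very little.
@btime Threads.threadid()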

Regarding the two difficulties:

  1. What about having two separate functions, so the default single-threaded rand() stays untouched?
  2. Using randjump gives independent streams, though finding a faster solution could be interesting; see the sketch below.
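
For reference, here is a rough sketch (on a 0.7 build, using the Future stdlib’s randjump; jumped_rngs is just an illustrative name) of how independent per-thread streams can be set up by chaining jumps, where big(10)^20 is the jump size with a precomputed polynomial:

using Random, Future

# Chain jumps of 10^20 steps so that each generator occupies a
# disjoint region of the MersenneTwister sequence.
function jumped_rngs(n::Int, seed = 0)
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = MersenneTwister(seed)
    for i in 2:n
        rngs[i] = Future.randjump(rngs[i-1], big(10)^20)
    end
    return rngs
end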

Thanks again for your reply!

Don’t use non-constant globals; they will kill performance. Do e.g.

const TRNG = [MersenneTwister(0) for i = 1:Threads.nthreads()]
function trand()
    @inbounds rng = TRNG[Threads.threadid()]
    return rand(rng)
end

Using BenchmarkTools, I find that @btime rand() takes 5.7 ns on my machine, whereas @btime trand() takes 10.6 ns, a fairly significant slowdown, unfortunately (this is with Julia 0.6.3). (This is the sort of thing that might benefit from compiler support for thread-local variables, maybe.)

On the other hand, the overhead will be much less if you do something like rand!(array), in which the overhead of looking up the RNG only occurs once.
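
For instance, a minimal sketch of that bulk-fill pattern (reusing the const TRNG table above; trand! is just an illustrative name) could be:

using Random  # rand! lives in Random on 0.7

# Look up this thread's RNG once, then amortize the cost over the array.
function trand!(a::AbstractArray)
    @inbounds rng = TRNG[Threads.threadid()]
    return rand!(rng, a)
end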

Doh! Sorry, yeah, that was penalizing my benchmarks.

Don’t you have a 0.7 build? Because initializing TRNG with randjump I get:

@btime trand()
5.057 ns
@btime rand()
4.707 ns

which seems much more acceptable!
The interesting part shows up in the loop benchmarks:

julia> @benchmark g(400)
BenchmarkTools.Trial:
  memory estimate:  3.28 KiB
  allocs estimate:  2
  --------------
  minimum time:     3.276 μs (0.00% GC)
  median time:      3.741 μs (0.00% GC)
  mean time:        9.390 μs (43.47% GC)
  maximum time:     34.850 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> @benchmark f(400)
BenchmarkTools.Trial:
  memory estimate:  3.28 KiB
  allocs estimate:  2
  --------------
  minimum time:     6.676 μs (0.00% GC)
  median time:      7.487 μs (0.00% GC)
  mean time:        19.037 μs (43.10% GC)
  maximum time:     73.441 ms (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     4

so the trand() loop in g is twice as fast as the rand() loop in f!
This even looks like a genuine performance improvement!