Lately I have begun tackling the RNG thread-safety issue; you can find the full story here.
After a couple of unsuccessful attempts, I looked at how Rust implements a multi-threaded RNG. It works in a LocalKey fashion, like thread-specific data in pthreads.
I was thinking about writing a Julia wrapper for pthread_setspecific(), but it seems this would go against the premise of getptls() that there is a single pthread key, since each user-allocated value would require its own static key.
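For concreteness, a rough sketch of what such a wrapper might look like via ccall (this is hypothetical, not a proposed API; the exact C type of pthread_key_t is platform-dependent, and Culong below is just an assumption for illustration):

# Hypothetical sketch of wrapping POSIX thread-specific data from Julia.
# Note the static-key problem mentioned above: every distinct user value
# would need its own pthread_key_t.
const TLS_KEY = Ref{Culong}(0)   # pthread_key_t (C type varies by platform)

function tls_init()
    # int pthread_key_create(pthread_key_t *key, void (*destructor)(void *));
    rc = ccall(:pthread_key_create, Cint, (Ref{Culong}, Ptr{Cvoid}), TLS_KEY, C_NULL)
    rc == 0 || error("pthread_key_create failed with code $rc")
end

# int pthread_setspecific(pthread_key_t key, const void *value);
tls_set(p::Ptr{Cvoid}) = ccall(:pthread_setspecific, Cint, (Culong, Ptr{Cvoid}), TLS_KEY[], p)

# void *pthread_getspecific(pthread_key_t key);
tls_get() = ccall(:pthread_getspecific, Ptr{Cvoid}, (Culong,), TLS_KEY[])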
The point is: has anybody ever actually needed TLS? If not, writing a general wrapper would be unnecessary.
Two difficulties: how to do this by default without degrading single-thread performance, and what parallel RNG algorithm to use in order to get statistically independent streams.
As Jeff commented in one of the issues I linked, thread-local storage is much more efficient than task-local storage for this kind of thing, mainly because Julia uses a fixed pool of threads created when Julia starts.
Hi @stevengj, thanks for your response! I wrote a small script to test a similar version of an RNG stored in a thread-indexed array:
using Random, BenchmarkTools

# Build a gpt × nthreads array of MersenneTwisters whose states are spaced
# apart with randjump, so each thread can draw from its own stream.
function init_TRNG(rng = MersenneTwister(0), gpt = 1)
    n = Threads.nthreads()
    rngjmp = randjump(rng, big(10)^20, n*(gpt+1))
    threadRNG = reshape(rngjmp, (gpt+1, n))[1:gpt, :]
    return threadRNG
end

TRNG = init_TRNG()

# Draw from the generator belonging to the calling thread.
trand() = rand(@inbounds(TRNG[Threads.threadid()]))

# Baseline: plain rand() inside a threaded loop.
function f(n::Int)
    a = zeros(n)
    Threads.@threads for i in 1:n
        a[i] = rand()
    end
    return a
end

# Same loop, drawing from the thread-indexed generators instead.
function g(n::Int)
    a = zeros(n)
    Threads.@threads for i in 1:n
        a[i] = trand()
    end
    return a
end

f(400)
g(400)
Using the perf patch, I noticed that g() spends more time than f() in ti_threadgroup_fork (called by @threads) during thread synchronization.
This gives a 5-10% overhead. It also seems that randjump is quite slow, which seriously affects the benchmarks, especially for short loops.
The most interesting part is that the extra threadid() call adds less than 1% overhead, so any attempt to circumvent it (e.g. pthread TLS initialization or task-local generators) is unlikely to make a significant difference.
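For reference, the comparison can be reproduced with something like the following (timings will of course vary by machine; f, g, and init_TRNG are the definitions above):

using BenchmarkTools

@btime f(400)        # plain rand() in the threaded loop
@btime g(400)        # rand() via the thread-indexed TRNG array
@btime init_TRNG()   # setup cost, dominated by randjump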
Regarding the two difficulties:
1. What about having two separate functions, i.e. leaving rand() untouched and providing something like trand() for threaded code?
2. Using randjump gives independent streams, although finding a faster way to set them up could be interesting.
Don’t use non-constant globals; they will kill performance. Do e.g.

const TRNG = [MersenneTwister(0) for i = 1:Threads.nthreads()]   # seeds are placeholders; use randjump/distinct seeds in practice

function trand()
    @inbounds rng = TRNG[Threads.threadid()]   # const global, so this lookup is type-stable
    return rand(rng)
end
Using BenchmarkTools, I find that @btime rand() takes 5.7ns on my machine, whereas @btime trand() takes 10.6ns, a fairly significant slowdown unfortunately (this is with Julia 0.6.3). (This is the sort of thing that might benefit from compiler support for thread-local variables, maybe.)
On the other hand, the overhead will be much smaller if you do something like rand!(array), in which case the cost of looking up the RNG is paid only once per call rather than once per random number.
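To make that concrete, here is a minimal sketch (the name tfill! and the chunking scheme are mine, not from this thread) where each thread looks up its generator once and then fills its slice of the output in bulk with rand!:

using Random

# Assumes the const TRNG vector above (one MersenneTwister per thread).
function tfill!(a::Vector{Float64})
    n  = length(a)
    nt = Threads.nthreads()
    Threads.@threads for t in 1:nt
        rng = TRNG[Threads.threadid()]   # single RNG lookup per chunk
        lo  = div((t - 1) * n, nt) + 1
        hi  = div(t * n, nt)
        rand!(rng, view(a, lo:hi))       # bulk generation for this thread's slice
    end
    return a
end

tfill!(zeros(400))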