# Random number and parallel execution

As a fork of this thread, I created an MWE that illustrates a behaviour I have seen frequently and which, clearly, I do not understand.

First, let us compute, serially and in parallel, a simple sum that does not involve random numbers:

``````
function sum_notrand_serial(n)
    s = 0.
    for i in 1:n
        s += exp(1/i)
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += exp(1/((i-1)*n_per_thread + j))
        end
    end
    sum(s)
end
``````

``````
julia> @btime sum_notrand_serial(10_000)
128.227 μs (0 allocations: 0 bytes)
10010.865744767296

julia> @btime sum_notrand_parallel(10_000)
44.106 μs (22 allocations: 3.11 KiB)
10010.865744767287

``````

Thus a reasonable speedup.

Now let us compute another sum, but instead of that `exp` function, I will just sum random numbers:

``````
function sum_rand_serial(n)
    s = 0.
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += rand()
        end
    end
    sum(s)
end
``````

Results in:

``````
julia> @btime sum_rand_serial(10_000)
33.078 μs (0 allocations: 0 bytes)
4984.942412847785

julia> @btime sum_rand_parallel(10_000)
47.776 μs (22 allocations: 3.11 KiB)
4998.453249343397

``````

The parallel version is slower than the serial one.

What is going on? When does this happen, and how can it be avoided?


You hit memory bandwidth. There is a limited speed at which data can move from memory to the CPU, and your code is processing data too fast. There is nothing you can do.


I am not convinced. Why does that happen with `rand()` and not with other functions? What is special about it?

No, it happens not only with `rand`.
A year ago I asked a similar question: How to run tasks in parallel?

To verify, I implemented a C version of the multithreaded app, and it has the same properties.


I may be wrong, but I recall it being written somewhere that it is not safe to call `rand()` from more than one thread at a time.

I wanted to do that at some point a few months back, but now I cannot find the place in the docs where such a restriction is written.

OK, I confirm that it is a memory bandwidth problem (although there is something we can do about it). If we change the inner loop from:

``````
for j in 1:n_per_thread
    s[i] += rand(Float64)
end
``````

to

``````
ts = 0.
for j in 1:n_per_thread
    ts += rand(Float64)
end
s[i] = ts
``````

Time goes from `~40` to `~10` microseconds. That proves that the problem is the access to the memory position of `s[i]`, which is limiting the calculation.

However, for some reason, the same limit is not reached for many non-`rand` functions. For example, in the initial example, if one replaces `exp` with `sin`, the overall function is faster than the `rand` version, and it still scales well even without explicitly avoiding the access to `s[i]` in the inner loop. In fact, that modification has no effect on performance there.

It seems that the compiler figures out that `s[i]` can be accumulated in a processor register or stack slot when the function being summed is not `rand()`, but it does not do that automatically when it is.

Unfortunately, the LLVM code is too complicated for me to tell…

false sharing?


That’s odd. The ticket was closed well before I started with Julia. I may have read some old forum post, or maybe there was some documentation somewhere that wasn’t updated.

I’d really like old Julia search results to go away from Google… and Stack Overflow.


Note that `sum_rand_serial` is already 4x faster than your `sum_notrand_serial` example. Maybe just try increasing the number of loop iterations to make them similarly expensive? Overhead is more problematic for smaller problems.

Actually, you can’t just increase the number of loop iterations, because that will also increase the number of writes to `s[i]`. Instead, try e.g. `s[i] += rand() + rand() + rand() + rand()` to make it 4x more expensive without changing the number of `s[i]` accesses.
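
Applied to the parallel loop pattern used earlier in this thread, that suggestion might look like the following sketch (the function name and thread layout are illustrative, not from the original post):

``````
# Four rand() calls per write to s[i]: the arithmetic cost per
# accumulator access quadruples, while the number of s[i] accesses
# (and hence the cacheline traffic) stays the same.
function sum_rand_parallel_4x(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:(n_per_thread ÷ 4)
            s[i] += rand() + rand() + rand() + rand()
        end
    end
    sum(s)
end
``````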


Exactly, that is why I mentioned the `sin` function above. With the code below, in which the non-random function is simplified even further, I get:

Code
``````
function sum_rand_serial(n)
    s = 0.
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += rand()
        end
    end
    sum(s)
end

function sum_notrand_serial(n)
    s = 0.
    for i in 1:n
        s += (1/i)^2
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += (1/((i-1)*n_per_thread + j))^2
        end
    end
    sum(s)
end
``````
``````
julia> @btime sum_notrand_serial(10_000)
10.056 μs (0 allocations: 0 bytes)
1.6448340718480652

julia> @btime sum_notrand_parallel(10_000)
5.489 μs (22 allocations: 3.11 KiB)
1.6448340718480643

julia> @btime sum_rand_serial(10_000)
33.064 μs (0 allocations: 0 bytes)
5053.117071086374

julia> @btime sum_rand_parallel(10_000)
43.689 μs (22 allocations: 3.11 KiB)
5008.220633345147

``````

Now the non-random case is 3 times faster than the random case, and its scaling is still much better.

But, as I mentioned, that is fixed if one explicitly modifies the loop of the random version to avoid accessing `s[i]` in the inner loop, accumulating into a temporary local variable instead. That makes clear that the problem is the memory access.

What is not clear is why the same does not happen with some (many?) non-random functions. Apparently the compiler is smarter with those and does that automatically, because explicitly avoiding the access to `s[i]` has no effect on performance in these cases (I have seen other cases where it does):

Code with temporary variable to accumulate
``````
function sum_rand_serial(n)
    s = 0.
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        ts = 0.
        for j in 1:n_per_thread
            ts += rand()
        end
        s[i] = ts
    end
    sum(s)
end

function sum_notrand_serial(n)
    s = 0.
    for i in 1:n
        s += (1/i)^2
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        ts = 0.
        for j in 1:n_per_thread
            ts += (1/((i-1)*n_per_thread + j))^2
        end
        s[i] = ts
    end
    sum(s)
end
``````
``````
julia> @btime sum_notrand_parallel(10_000)
5.470 μs (22 allocations: 3.11 KiB)
1.6448340718480643

julia> @btime sum_rand_parallel(10_000)
14.686 μs (22 allocations: 3.11 KiB)
5041.680075280128

``````

Ah, @stevengj, indeed, increasing the cost of the inner operation reduces this relative lag. I think it is quite clear what is happening. What is not clear to me is why the compiler is smarter in one case than in the other and avoids that memory access.

Side note: if you care about speed, you should always explicitly pass the RNG around (`rand(rng, ...)`), which also makes tests completely reproducible, if necessary.
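
For illustration, a minimal sketch of the two call styles (the function names and the seed are made up for this example):

``````
using Random

# rand() looks up the default global generator on every call;
# rand(rng) uses the generator you hand it, and a fixed seed makes
# the whole stream reproducible.
sum_default(n)       = sum(rand() for _ in 1:n)
sum_explicit(rng, n) = sum(rand(rng) for _ in 1:n)

rng = MersenneTwister(1234)
sum_default(10_000)         # not reproducible across runs
sum_explicit(rng, 10_000)   # reproducible given the seed
``````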


This blog post might be relevant:


Thank you for pointing that out. The serial code above becomes twice as fast if one passes the generator to the function. But how that relates to threading is less clear to me; I could not come to a conclusion from a few tests.

There are a few separate but possibly confounding issues here:

1. `rand()` is now thread-safe, but this has an overhead (see The need for rand speed | Blog by Bogumił Kamiński).

2. You can explicitly pass an RNG to `rand()` to bypass this overhead, but beware: if you do this, you also bypass thread-safety.

``````
using Random

function pi_serial(rng, n::Int)
    s = 0
    for _ in 1:n
        s += rand(rng)^2 + rand(rng)^2 < 1
    end
    return 4 * s / n
end

# Threaded version sharing a single rng across threads (NOT thread-safe):
function pi_threaded(rng, n::Int)
    s = 0
    Threads.@threads for _ in 1:n
        s += rand(rng)^2 + rand(rng)^2 < 1   # data race on rng and s
    end
    return 4 * s / n
end

const n = 10^8
const rng = MersenneTwister()

pi_serial(rng, n)       # correct
pi_threaded(rng, n)     # incorrect: unsynchronized access to the shared rng
``````
3. To gain the full benefit from threading in the OP’s example:
• explicitly supply a separate RNG for each thread
• accumulate into a thread-local variable, rather than accumulating directly into the array
``````
using Random, Future, BenchmarkTools

function sum_rand_serial(rng, n)
    s = 0.0
    for i in 1:n
        s += rand(rng)
    end
    s
end

function sum_rand_parallel(rngs, n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        rng = rngs[i]
        si = 0.0
        for j in 1:n_per_thread
            si += rand(rng)
        end
        s[i] = si
    end
    sum(s)
end

function parallel_rngs(rng::MersenneTwister, n::Integer)
    step = big(10)^20
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = copy(rng)
    for i = 2:n
        rngs[i] = Future.randjump(rngs[i-1], step)
    end
    return rngs
end

const N = 10^8   # problem size (value not shown in the original snippet)
const rng = MersenneTwister();
const rngs = parallel_rngs(rng, Threads.nthreads());
@btime sum_rand_serial(rng, N)
@btime sum_rand_parallel(rngs, N)
``````

On my machine using 12 threads, I get ~11x speed up.


Thank you. That function to generate the parallel RNGs seems like black magic to me.

I will try to understand it, but I wonder if it could somehow be incorporated into the implementation of `rand` in a more user-friendly way.
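
For comparison, a cruder but easier-to-read alternative to `randjump` is simply seeding one generator per thread; unlike `randjump`, nothing *guarantees* the streams don’t overlap, though for a `MersenneTwister` the chance is negligible in practice (the function below is a sketch, not from the thread):

``````
using Random

# One independently seeded MersenneTwister per thread.
rngs = [MersenneTwister(i) for i in 1:Threads.nthreads()]

function sum_rand_per_thread(rngs, n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    Threads.@threads for i in 1:nthreads
        si = 0.0                     # thread-local accumulator
        for _ in 1:(n ÷ nthreads)
            si += rand(rngs[i])
        end
        s[i] = si
    end
    sum(s)
end

sum_rand_per_thread(rngs, 10_000)
``````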

Note that this overhead only matters if the limiting factor in your computation is the generation of random numbers, i.e. if you are only doing trivial calculations with the random values. In a more realistic large-scale computational problem I’m guessing you typically do enough computation per `rand()` call that the overhead is negligible.


Apart from all other issues: Your code disrespects the memory system. It should be

``````
julia> function sum_rand_parallel2(n)
           nthreads = Threads.nthreads()
           s = zeros(nthreads << 3)   # 8 Float64s = one 64-byte cacheline per thread
           n_per_thread = n ÷ nthreads
           Threads.@threads for i in 1:nthreads
               for j in 1:n_per_thread
                   s[((i-1)<<3)+1] += rand()
               end
           end
           sum(s)
       end

julia> @btime sum_rand_parallel2(N)
617.204 ms (85 allocations: 12.94 KiB)
5.3686410968052244e8

julia> @btime sum_rand_parallel(N)
4.603 s (84 allocations: 11.97 KiB)
5.3686676919503814e8
``````

Explanation: Cores are very fast, and very far away from each other (just like main memory – think like 100ns roundtrip as an effective 150m distance). Once cores start communicating with each other, everything grinds down to a halt. If one core writes something, and another core wants to read (or write) it, then both cores must communicate in order to ensure consistency. The important thing is that, for a CPU, “something” is not an object or address; it is a cacheline, i.e. 64 contiguous aligned bytes.

By using a dense array for your accumulators, all your cores want to access and modify the same cachelines. You may know that these writes/reads don’t overlap/race, but your CPU needs to run expensive coherence protocols to figure that out. If you instead space your accumulators out, then you guarantee that they don’t end up in the same cacheline.
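
To see that this is about cachelines rather than about `rand` itself, the effect can be reproduced with the same accumulator pattern and adjustable spacing; a minimal sketch (assuming Julia was started with more than one thread; names are illustrative):

``````
# Each thread repeatedly adds into its own accumulator.
# With stride 1, the accumulators are packed into the same cacheline(s),
# so the cores fight over them (false sharing); with stride 8, each
# Float64 accumulator occupies its own 64-byte cacheline.
function accumulate!(s, stride, iters)
    Threads.@threads for i in 1:Threads.nthreads()
        for _ in 1:iters
            s[(i - 1) * stride + 1] += rand()
        end
    end
    return s
end

tight  = zeros(Threads.nthreads())       # contiguous accumulators
padded = zeros(8 * Threads.nthreads())   # one cacheline per accumulator

@time accumulate!(tight, 1, 10^7)    # typically slow when nthreads > 1
@time accumulate!(padded, 8, 10^7)   # typically faster: no cacheline ping-pong
``````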


I think I became a little obsessed with that because I had one problem where it appeared to be an issue (I had to generate hundreds of thousands of random orientations of molecules to normalize a sort of orientation-dependent distribution function). I finally used that `Future` stuff (which I do not understand), as someone suggested back then. But my implementation has changed a lot since then; I should probably check whether it still makes any difference.

@greg_plowman and @foobar_lv2, thank you very much for your insights. I learned something else, as usual in this great forum.

One final thing: is the option of generating the random-number generator explicitly inside each thread bad for any reason?

I mean this:

``````
function sum_rand_parallel2(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        rng = MersenneTwister()   # one generator created inside each thread
        si = 0.0
        for j in 1:n_per_thread
            si += rand(rng)
        end
        s[i] = si
    end
    sum(s)
end
``````

Which results in the following (the `greg` functions are Greg’s from above), with 4 threads:

``````
julia> @btime sum_rand_serial_greg($rng,2_000_000)
3.181 ms (0 allocations: 0 bytes)
1.0002939045077431e6

julia> @btime sum_rand_parallel_greg($rngs,2_000_000)
912.778 μs (22 allocations: 3.11 KiB)
1.0003557432746999e6

julia> @btime sum_rand_parallel2(2_000_000)
880.487 μs (50 allocations: 80.92 KiB)
1.0002736053868829e6

``````

For smaller `N` the time of creating the generator inside the loop appears to make a difference:

``````
julia> @btime sum_rand_parallel_greg($rngs,200_000)
88.488 μs (22 allocations: 3.11 KiB)
100052.98751356374

julia> @btime sum_rand_parallel2(200_000)
102.415 μs (50 allocations: 80.92 KiB)
99880.45913126941

``````

but for the larger `N`, apparently, the locality of the generator becomes an advantage (edit: probably not really meaningful, given how I am testing here).