Random numbers and parallel execution

As a fork of this thread, I created an MWE that illustrates a behaviour I have seen frequently and clearly do not understand.

First, let us sum serially and in parallel some simple computation that does not involve random numbers:

function sum_notrand_serial(n)
    s = 0.0
    for i in 1:n
        s += exp(1 / i)
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += exp(1 / ((i - 1) * n_per_thread + j))
        end
    end
    sum(s)
end

Using 4 threads, I get:

julia> @btime sum_notrand_serial(10_000)
  128.227 μs (0 allocations: 0 bytes)
10010.865744767296

julia> @btime sum_notrand_parallel(10_000)
  44.106 μs (22 allocations: 3.11 KiB)
10010.865744767287

Thus a reasonable speedup.

Now let us compute another sum, but instead of that exp function, I will just sum random numbers:


function sum_rand_serial(n)
    s = 0.0
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += rand()
        end
    end
    sum(s)
end

Results in:

julia> @btime sum_rand_serial(10_000)
  33.078 μs (0 allocations: 0 bytes)
4984.942412847785

julia> @btime sum_rand_parallel(10_000)
  47.776 μs (22 allocations: 3.11 KiB)
4998.453249343397

The parallel version is slower than the serial one.

What is going on? When does this happen, and how can it be avoided?


You hit memory bandwidth. There is a limited speed at which data can get from memory to the CPU, and your code is processing data too fast. There is nothing you can do.


I am not convinced :smile: . Why does that happen with rand() and not with other functions? What is special about it?

No, it happens not only with rand.
A year ago I asked a similar question: How to run tasks in parallel?

To verify, I implemented a C version of the multithreaded app, and it has the same properties.


I may be wrong, but I recall reading somewhere that it's not safe to call rand() from more than one thread at a time.

I wanted to do that at some point a few months back. Now I can't find the place in the docs where such a restriction is stated.

Ok, I confirm that it is a memory bandwidth problem (although there is something we can do about it). If we change the inner loop from:

    for j in 1:n_per_thread
        s[i] += rand(Float64)
    end

to

    ts = 0.0
    for j in 1:n_per_thread
        ts += rand(Float64)
    end
    s[i] = ts

the time goes from ~40 to ~10 microseconds. That shows that the problem is the repeated access to the memory position of s[i], which is limiting the calculation.

However, for some reason the same limit is not reached for many non-rand functions. For example, in the initial example, if one replaces exp with sin, the overall function is faster than the rand version and still scales well, even without explicitly avoiding the access to s[i] in the inner loop. Actually, that modification has no effect on performance there.

It seems that the compiler figures out that s[i] can be accumulated in a processor register or stack slot when the function being summed is not rand(), but does not do that automatically when it is.

Unfortunately the LLVM code is too complicated for me to tell…
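For anyone who wants to dig in, here is a sketch (names hypothetical, not from this thread) of how one might isolate the two inner loops and compare their optimized IR with @code_llvm, looking for load/store instructions on s[i] inside the loop body:

```julia
using InteractiveUtils   # provides @code_llvm (loaded by default in the REPL)

# Hypothetical helpers isolating the two inner loops from the examples above.
function acc_notrand!(s::Vector{Float64}, i, n)
    for j in 1:n
        s[i] += (1 / j)^2   # the compiler may keep s[i] in a register here
    end
    return s
end

function acc_rand!(s::Vector{Float64}, i, n)
    for j in 1:n
        s[i] += rand()      # the opaque rand() call may prevent that
    end
    return s
end

# Compare the loop bodies: look for `load`/`store` inside the loop.
@code_llvm acc_notrand!(zeros(4), 1, 10)
@code_llvm acc_rand!(zeros(4), 1, 10)
```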

false sharing?

This was fixed in Random: support threads for GLOBAL_RNG access by vtjnash · Pull Request #32407 · JuliaLang/julia · GitHub


That’s odd. The ticket was closed way before I started with Julia. I may have read some old forum post, or maybe there was some documentation somewhere that wasn’t updated.

I’d really like old Julia search results to go away from Google… and Stack Overflow.


This is 4x faster than your sum_notrand_serial example. Maybe just try increasing the number of loop iterations to make it similarly expensive? Overhead is more problematic for smaller problems.

Actually, you can’t just increase the number of loop iterations, because that will also increase the number of writes to s[i]. Instead, try e.g. s[i] += rand() + rand() + rand() + rand() to make it 4x more expensive without changing the number of s[i] accesses.
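A sketch of that suggestion applied to the parallel version (hypothetical name; it assumes, as in the original code, that n divides evenly across the threads):

```julia
function sum_rand_parallel_4x(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            # 4x the work per iteration, but still only one write to s[i]:
            s[i] += rand() + rand() + rand() + rand()
        end
    end
    sum(s)
end
```

Note the sum now has expectation roughly 2n instead of n/2, so compare timings, not values.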


Exactly, that is why I mentioned the sin function above. With the code below, in which the non-random function is simplified even further, I get:

Code

function sum_rand_serial(n)
    s = 0.0
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += rand()
        end
    end
    sum(s)
end

function sum_notrand_serial(n)
    s = 0.0
    for i in 1:n
        s += (1 / i)^2
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += (1 / ((i - 1) * n_per_thread + j))^2
        end
    end
    sum(s)
end

julia> @btime sum_notrand_serial(10_000)
  10.056 μs (0 allocations: 0 bytes)
1.6448340718480652

julia> @btime sum_notrand_parallel(10_000)
  5.489 μs (22 allocations: 3.11 KiB)
1.6448340718480643

julia> @btime sum_rand_serial(10_000)
  33.064 μs (0 allocations: 0 bytes)
5053.117071086374

julia> @btime sum_rand_parallel(10_000)
  43.689 μs (22 allocations: 3.11 KiB)
5008.220633345147

Now the non-random case is three times faster than the random case, and its scaling is still much better.

But, as I mentioned, that is fixed if one explicitly modifies the loop of the random version to avoid accessing s[i] in the inner loop, using a temporary local variable. That makes it clear that the problem is the memory access.

What is not clear is why the same does not happen with some (many?) non-random functions. Apparently the compiler is smarter with those and does this automatically, because explicitly avoiding the access to s[i] has no effect on performance in these cases (I have seen other cases where it does):

Code with temporary variable to accumulate
function sum_rand_serial(n)
    s = 0.0
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        ts = 0.0
        for j in 1:n_per_thread
            ts += rand()
        end
        s[i] = ts
    end
    sum(s)
end

function sum_notrand_serial(n)
    s = 0.0
    for i in 1:n
        s += (1 / i)^2
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        ts = 0.0
        for j in 1:n_per_thread
            ts += (1 / ((i - 1) * n_per_thread + j))^2
        end
        s[i] = ts
    end
    sum(s)
end

julia> @btime sum_notrand_parallel(10_000)
  5.470 μs (22 allocations: 3.11 KiB)
1.6448340718480643

julia> @btime sum_rand_parallel(10_000)
  14.686 μs (22 allocations: 3.11 KiB)
5041.680075280128

Ah, @stevengj, indeed, increasing the cost of the inner operation reduces this relative lag. I think it is quite clear what is happening. What is not clear to me is why the compiler is smarter in one case than in the other and avoids that memory access.

Side note: if you care about speed, you should always explicitly pass the RNG around (rand(rng, ...)), which also makes tests completely reproducible, if necessary
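A minimal sketch of this (hypothetical name; the fixed seed is purely for illustration):

```julia
using Random

function sum_rand_serial_rng(rng::AbstractRNG, n)
    s = 0.0
    for i in 1:n
        s += rand(rng)   # draws from the explicitly supplied generator
    end
    s
end

# The same seed yields the same stream, hence a reproducible result:
sum_rand_serial_rng(MersenneTwister(1234), 10_000)
```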


This blog post might be relevant:


Thank you for pointing that out. The serial code above becomes twice as fast if one passes the generator to the function. But how that relates to threading is more confusing; I could not come to a conclusion in a few tests.

There are a few separate but possibly confounding issues here:

  1. rand() is now thread-safe, but this has an overhead
    (see The need for rand speed | Blog by Bogumił Kamiński)

  2. You can explicitly pass an RNG to rand() to bypass this overhead, but beware: if you do this, you also bypass thread-safety.

unsafe threaded code
using Random

function pi_serial(rng, n::Int)
    s = 0
    for _ in 1:n
        s += rand(rng)^2 + rand(rng)^2 < 1
    end
    return 4 * s / n
end

function pi_threaded(rng, n::Int)
    s = 0
    Threads.@threads for _ in 1:n
        s += rand(rng)^2 + rand(rng)^2 < 1
    end
    return 4 * s / n
end

const n = 10^8
const rng = MersenneTwister()
println("Num threads: ", Threads.nthreads())

pi_serial(rng, n)       # correct
pi_threaded(rng, n)     # incorrect
  3. To gain the full benefit from threading with the OP's example:
    • explicitly supply a separate RNG for each thread
    • accumulate into a thread-local variable, rather than directly into the shared array
safe threaded code
using Random, Future, BenchmarkTools

function sum_rand_serial(rng, n)
    s = 0.0
    for i in 1:n
        s += rand(rng)
    end
    s
end

function sum_rand_parallel(rngs, n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        rng = rngs[i]
        si = 0.0
        for j in 1:n_per_thread
            si += rand(rng)
        end
        s[i] = si
    end
    sum(s)
end

function parallel_rngs(rng::MersenneTwister, n::Integer)
    step = big(10)^20
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = copy(rng)
    for i = 2:n
        rngs[i] = Future.randjump(rngs[i-1], step)
    end
    return rngs
end

println("Num threads: ", Threads.nthreads())
const N = Threads.nthreads() * 10^8
const rng = MersenneTwister();
const rngs = parallel_rngs(MersenneTwister(), Threads.nthreads());
@btime sum_rand_serial(rng, N)
@btime sum_rand_parallel(rngs, N)

On my machine using 12 threads, I get a ~11x speedup.


Thank you. That function to generate the parallel RNGs seems like black magic to me.

I will try to understand it, but I wonder if it could somehow be incorporated into the implementation of rand in a more user-friendly way.

Note that this overhead only matters if the limiting factor in your computation is the generation of random numbers, i.e. if you are only doing trivial calculations with the random values. In a more realistic large-scale computational problem I’m guessing you typically do enough computation per rand() call that the overhead is negligible.


Apart from all the other issues: your code disrespects the memory system. It should be

julia> function sum_rand_parallel2(n)
         nthreads = Threads.nthreads()
         s = zeros(nthreads<<3)
         n_per_thread = n ÷ nthreads
         Threads.@threads for i in 1:nthreads
            for j in 1:n_per_thread
              s[((i-1)<<3)+1] += rand()
            end
         end
         sum(s)
       end

julia> @btime sum_rand_parallel2(N)
  617.204 ms (85 allocations: 12.94 KiB)
5.3686410968052244e8

julia> @btime sum_rand_parallel(N)
  4.603 s (84 allocations: 11.97 KiB)
5.3686676919503814e8

Explanation: Cores are very fast, and very far away from each other and from main memory – think of a 100 ns roundtrip as an effective 150 m distance. Once cores start communicating with each other, everything grinds to a halt. If one core writes something, and another core wants to read (or write) it, then both cores must communicate in order to ensure consistency. The important thing is that, for a CPU, “something” is not an object or address; it is a cacheline, i.e. 64 contiguous aligned bytes.

By using a dense array for your accumulators, all your cores want to access and modify the same cachelines. You may know that these writes/reads don’t overlap/race, but your CPU needs to run expensive coherence protocols to figure that out. If you instead space your accumulators out, then you guarantee that they don’t end up in the same cacheline.


I think I became a little obsessed with this because I had one problem where it appeared to be an issue (I had to generate hundreds of thousands of random orientations of molecules to normalize a sort of orientation-dependent distribution function). I finally used that Future stuff (which I do not understand), as someone suggested back then. But my implementation has changed a lot since then, so I should probably check whether it still makes any difference.

@greg_plowman and @foobar_lv2, thank you very much for your insights. I learned something else, as usual in this great forum.

One final thing: is explicitly creating a random-number generator inside each thread bad for any reason?

I mean this:

function sum_rand_parallel2(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        rng = MersenneTwister()
        si = 0.0
        for j in 1:n_per_thread
            si += rand(rng)
        end
        s[i] = si
    end
    sum(s)
end

Which results in the following (the _greg functions are Greg’s versions above), with 4 threads:

julia> @btime sum_rand_serial_greg($rng,2_000_000)
  3.181 ms (0 allocations: 0 bytes)
1.0002939045077431e6

julia> @btime sum_rand_parallel_greg($rngs,2_000_000)
  912.778 μs (22 allocations: 3.11 KiB)
1.0003557432746999e6

julia> @btime sum_rand_parallel2(2_000_000)
  880.487 μs (50 allocations: 80.92 KiB)
1.0002736053868829e6

For smaller N, the time spent creating the generator inside the loop appears to make a difference:


julia> @btime sum_rand_parallel_greg($rngs,200_000)
  88.488 μs (22 allocations: 3.11 KiB)
100052.98751356374

julia> @btime sum_rand_parallel2(200_000)
  102.415 μs (50 allocations: 80.92 KiB)
99880.45913126941

but for the larger N, apparently, the locality of the generator becomes an advantage (edit: probably not really meaningful, the way I am testing here).