Random number and parallel execution

Apart from all other issues: Your code disrespects the memory system. It should be

julia> function sum_rand_parallel2(n)
         nthreads = Threads.nthreads()
         s = zeros(nthreads<<3)           # 8 Float64s (64 bytes) per thread
         n_per_thread = n ÷ nthreads
         Threads.@threads for i in 1:nthreads
            for j in 1:n_per_thread
              s[((i-1)<<3)+1] += rand()   # stride of 8: one cacheline per accumulator
            end
         end
         sum(s)
       end

julia> @btime sum_rand_parallel2(N)
  617.204 ms (85 allocations: 12.94 KiB)
5.3686410968052244e8

julia> @btime sum_rand_parallel(N)
  4.603 s (84 allocations: 11.97 KiB)
5.3686676919503814e8

Explanation: Cores are very fast, and very far away from each other, just like main memory (think of a 100 ns round trip as an effective distance of 150 m). Once cores start communicating with each other, everything grinds to a halt. If one core writes something and another core wants to read (or write) it, then both cores must communicate in order to ensure consistency. The important thing is that, for a CPU, “something” is not an object or an address; it is a cacheline, i.e. 64 contiguous aligned bytes.

By using a dense array for your accumulators, all your cores want to access and modify the same cachelines. You may know that these writes/reads don’t overlap/race, but your CPU needs to run expensive coherence protocols to figure that out. If you instead space your accumulators out (here, by a stride of 8 Float64s, i.e. 64 bytes), then you guarantee that no two of them end up in the same cacheline.
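An even simpler way to sidestep the problem is to not write to shared memory inside the hot loop at all: accumulate into a thread-local variable and store it back once per chunk. This is a sketch of that idea (function name and the reuse of the padded layout are my own choices, not from the post above):

```julia
using Base.Threads

function sum_rand_local(n)
    nt = nthreads()
    partial = zeros(nt << 3)          # padded as before, but now written only once per chunk
    n_per_thread = n ÷ nt
    @threads for i in 1:nt
        local_s = 0.0                 # thread-local accumulator, stays in a register
        for j in 1:n_per_thread
            local_s += rand()
        end
        partial[((i - 1) << 3) + 1] = local_s  # single write; no per-iteration sharing
    end
    sum(partial)
end
```

With a local accumulator the padding is nearly irrelevant, since each slot is touched only once, but it costs nothing to keep. Note that `n ÷ nt` truncates, so up to `nt - 1` samples are dropped when `n` is not divisible by the thread count.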
