Apart from all other issues: Your code disrespects the memory system. It should be
julia> function sum_rand_parallel2(n)
           nthreads = Threads.nthreads()
           s = zeros(nthreads<<3)    # 8 Float64 slots (= 64 bytes) per thread
           n_per_thread = n ÷ nthreads
           Threads.@threads for i in 1:nthreads
               for j in 1:n_per_thread
                   s[((i-1)<<3)+1] += rand()    # each accumulator starts its own cacheline
               end
           end
           sum(s)
       end
julia> @btime sum_rand_parallel2(N)
617.204 ms (85 allocations: 12.94 KiB)
5.3686410968052244e8
julia> @btime sum_rand_parallel(N)
4.603 s (84 allocations: 11.97 KiB)
5.3686676919503814e8
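(For readers skimming the thread: sum_rand_parallel is the original version under discussion. A minimal sketch of what it presumably looks like, reconstructed from context so the exact original may differ: the same structure, but with the per-thread accumulators packed into adjacent array slots.)
julia> function sum_rand_parallel(n)
           nthreads = Threads.nthreads()
           s = zeros(nthreads)    # adjacent accumulators: 8 of them share one cacheline
           n_per_thread = n ÷ nthreads
           Threads.@threads for i in 1:nthreads
               for j in 1:n_per_thread
                   s[i] += rand()    # every thread hammers the same cacheline
               end
           end
           sum(s)
       end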
Explanation: Cores are very fast, and very far away from each other (just like main memory: a 100ns roundtrip corresponds to an effective distance of roughly 15m at signal speed). Once cores start communicating with each other, everything grinds to a halt. If one core writes something, and another core wants to read (or write) it, then both cores must communicate in order to ensure consistency. The important thing is that, for a CPU, “something” is not an object or an address; it is a cacheline, i.e. 64 contiguous, aligned bytes.
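That is where the <<3 in sum_rand_parallel2 comes from: a Float64 is 8 bytes, so spacing the accumulators 8 slots apart puts each one 64 bytes from the next, i.e. on its own cacheline (assuming the common 64-byte cacheline size):
julia> stride = 64 ÷ sizeof(Float64)    # one 64-byte cacheline holds 8 Float64s
8
julia> [(i-1)*stride + 1 for i in 1:4]    # accumulator indices for threads 1..4
4-element Vector{Int64}:
  1
  9
 17
 25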
With a dense array of accumulators, all your cores want to access and modify the same cachelines. You may know that these reads and writes don’t overlap or race, but your CPU needs to run expensive coherence protocols to figure that out; this effect is known as false sharing. If you instead space your accumulators out, you guarantee that no two of them end up in the same cacheline.
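Padding is one way out; another is to avoid shared writes in the hot loop altogether. As a sketch (sum_rand_parallel3 is my name for it, not from the thread): let each thread accumulate into a local variable, which can live in a register, and touch the shared array exactly once at the end.
julia> function sum_rand_parallel3(n)
           nthreads = Threads.nthreads()
           s = zeros(nthreads)    # dense is fine again: only one write per thread
           n_per_thread = n ÷ nthreads
           Threads.@threads for i in 1:nthreads
               acc = 0.0    # thread-local accumulator, no coherence traffic
               for j in 1:n_per_thread
                   acc += rand()
               end
               s[i] = acc    # the single shared write
           end
           sum(s)
       end
With one write per thread, false sharing on s becomes irrelevant, so the dense zeros(nthreads) no longer hurts.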