Exactly, that is why I mentioned the `sin` function above. With the code below, in which the *non-random* function is simplified even further, I get:

## Code

```
function sum_rand_serial(n)
    s = 0.
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += rand()
        end
    end
    sum(s)
end

function sum_notrand_serial(n)
    s = 0.
    for i in 1:n
        s += (1/i)^2
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        for j in 1:n_per_thread
            s[i] += (1/((i-1)*n_per_thread + j))^2
        end
    end
    sum(s)
end
```

```
julia> @btime sum_notrand_serial(10_000)
  10.056 μs (0 allocations: 0 bytes)
1.6448340718480652

julia> @btime sum_notrand_parallel(10_000)
  5.489 μs (22 allocations: 3.11 KiB)
1.6448340718480643

julia> @btime sum_rand_serial(10_000)
  33.064 μs (0 allocations: 0 bytes)
5053.117071086374

julia> @btime sum_rand_parallel(10_000)
  43.689 μs (22 allocations: 3.11 KiB)
5008.220633345147
```

Now the *non-random* case is about 3 times faster than the random case, and it still scales much better: the parallel non-random version runs nearly twice as fast as the serial one, while the parallel random version is actually slower than its serial counterpart.

But, as I mentioned, that is fixed if one explicitly modifies the loop of the random version to avoid accessing the `s[i]` variable in the inner loop, using a temporary local variable instead. That makes it clear that the problem is the access to memory.

What is not clear is why the same does not happen with some (many?) *non-random* functions. Apparently the compiler is smarter with those and does this automatically, because explicitly avoiding the access to `s[i]` has no effect on performance in these cases (I have seen other cases where it does):

## Code with temporary variable to accumulate

```
function sum_rand_serial(n)
    s = 0.
    for i in 1:n
        s += rand()
    end
    s
end

function sum_rand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        ts = 0.
        for j in 1:n_per_thread
            ts += rand()
        end
        s[i] = ts
    end
    sum(s)
end

function sum_notrand_serial(n)
    s = 0.
    for i in 1:n
        s += (1/i)^2
    end
    s
end

function sum_notrand_parallel(n)
    nthreads = Threads.nthreads()
    s = zeros(nthreads)
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        ts = 0.
        for j in 1:n_per_thread
            ts += (1/((i-1)*n_per_thread + j))^2
        end
        s[i] = ts
    end
    sum(s)
end
```

```
julia> @btime sum_notrand_parallel(10_000)
  5.470 μs (22 allocations: 3.11 KiB)
1.6448340718480643

julia> @btime sum_rand_parallel(10_000)
  14.686 μs (22 allocations: 3.11 KiB)
5041.680075280128
```

Ah, @stevengj, indeed, increasing the cost of the inner operation reduces this relative lag. I think it is quite clear what is happening. What is not clear to me is why the compiler is smarter in one case than in the other and avoids that memory access.
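For completeness, another way to attack the same problem, besides the temporary local variable, is to pad the accumulator array so that each thread's slot lives on its own cache line. This is a sketch I have not benchmarked here, and it assumes 64-byte cache lines (so 8 `Float64`s per line); the function name `sum_rand_parallel_padded` is just illustrative:

```
# Sketch: avoid false sharing by spacing the per-thread accumulators
# one cache line apart, instead of using a local temporary.
# Assumes 64-byte cache lines: 8 * sizeof(Float64) == 64 bytes.
function sum_rand_parallel_padded(n)
    nthreads = Threads.nthreads()
    pad = 8                          # Float64 slots per cache line
    s = zeros(pad * nthreads)        # only every `pad`-th slot is used
    n_per_thread = n ÷ nthreads
    Threads.@threads for i in 1:nthreads
        k = pad * (i - 1) + 1        # first slot of thread i's own line
        for j in 1:n_per_thread
            s[k] += rand()
        end
    end
    sum(s)
end
```

The trade-off is wasted memory (one cache line per thread) in exchange for keeping the convenient `s[i] += ...` pattern; the local-temporary version is usually preferable anyway, since it avoids the memory traffic entirely.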