what confuses me is that “scope” seems quite different when moving from a simple single thread to multiple threads… and I cannot find relevant documentation talking specifically about this topic…
Not really in that case. Each iteration of the loop creates a new scope, and variables newly defined inside a loop iteration are local to it. The variables of the scope enclosing the loop are shared by all loop iterations. This is the same in both cases.
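For the single-threaded case that looks like this (a minimal sketch just to illustrate the scoping rules; the function name `scope_demo` is made up here):

function scope_demo()
    a = 0.0          # defined in the enclosing scope: shared by all iterations
    for i in 1:3
        b = sin(i)   # a new local `b` is created in each iteration
        a += b       # the same outer `a` is reused and updated across iterations
    end
    return a
end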
The problem with multi-threading there is that the iterations of the loop won’t necessarily run in sequence, and may try to update `a` concurrently, so you can get a wrong result and need to make access to `a` safe.
It means that two threads may try to update the variable at the same time, both reading the same value from memory; the two updates then overlap, one of them is lost, and the result is wrong.
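For example, the naive translation of the serial loop would look something like this (a sketch only; `racy_sum` is a made-up name, and the result will typically be wrong and change from run to run):

function racy_sum(n)
    a = 0.0
    Threads.@threads for i in 1:n
        b = sin(i)   # `b` is local to each iteration: fine
        a += b       # unsynchronized read-modify-write of the shared `a`: data race
    end
    return a
end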
This is a simple pattern that does the correct thing:
julia> ntasks = Threads.nthreads();

julia> at = zeros(ntasks);  # one accumulator per thread

julia> Threads.@threads for it in 1:ntasks
           for i in it:ntasks:100 # simple splitter
               b = sin(i)
               at[it] += b
           end
       end

julia> a = sum(at)
-0.1271710136604196

julia> sum(sin(i) for i in 1:100)
-0.12717101366041972
The key is to split `a` into independent accumulators, one per thread, each updated by its thread independently, and to reduce (sum) the results at the end.
But you can also use FLoops.jl, ThreadsX.jl, Tullio.jl, or other less “manual” strategies for multi-threading.
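For example (a sketch only, assuming the documented FLoops.jl `@reduce` syntax and the `Base`-like API of ThreadsX.jl; the function names here are made up):

using FLoops, ThreadsX

function floop_sum(n)
    @floop for i in 1:n
        @reduce(s += sin(i))  # FLoops manages per-task accumulators and the final reduction
    end
    return s
end

threadsx_sum(n) = ThreadsX.sum(sin, 1:n)  # parallel replacement for sum(sin, 1:n)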
It depends on what you do outside of the code you write. The following code is data-race free (given some other reasonable assumptions, e.g. that `fun(i)` does not itself introduce data races, that `ok` itself is not a closure, etc.):
function ok()
    lk = ReentrantLock()
    a = 0.0
    Threads.@threads for i in 1:10
        b = fun(i)    # `b` is local to each iteration
        lock(lk)      # protect the update of the shared `a`
        try
            a += b
        finally
            unlock(lk)
        end
    end
end
On the other hand, the following function has a data race, because `b` is now defined in the enclosing scope and is therefore shared by (and written from) all threads:
function bad()
    b = nothing # added: `b` now lives in the enclosing scope and is shared by all threads
    lk = ReentrantLock()
    a = 0.0
    Threads.@threads for i in 1:10
        b = fun(i)    # unprotected write to the shared `b`: data race
        lock(lk)
        try
            a += b    # `b` may already have been overwritten by another thread
        finally
            unlock(lk)
        end
    end
end
But the ok function is still bad: don’t use a lock for a reduction. Also, `a` is re-assigned inside the closure that `@threads` creates, so it gets boxed and access to it is not type stable. So `fun` has to be very slow for multi-threading to be beneficial for this code.
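If you want to avoid the lock, a more typical pattern is to let each task compute a partial sum and then combine the partial results at the end, e.g. with `Threads.@spawn` (a sketch only, assuming `fun(i)` returns a number and is itself thread-safe; `reduce_spawn` is a made-up name):

function reduce_spawn(fun, n; ntasks = Threads.nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))  # contiguous chunks of indices
    tasks = map(chunks) do chunk
        Threads.@spawn sum(fun, chunk)  # each task reduces its own chunk
    end
    return sum(fetch, tasks)  # combine the partial sums
end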
It is very important to note that updating a single shared accumulator like this requires a commutative (as well as associative) operator, because the updates are applied in whatever order the threads reach them. Parallel reduction in general only requires associativity, and there are many useful non-commutative reductions (e.g., concatenation).
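Concatenation is a convenient way to see the difference: the grouping of the operations does not matter, but the order of the arguments does.

julia> vcat(vcat([1], [2]), [3]) == vcat([1], vcat([2], [3]))  # associative
true

julia> vcat([1], [2]) == vcat([2], [1])  # not commutative
false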