function parallelkernel!(elca, idof, R, ...) # fast type-stable kernel
    @threads for iel ∈ eachindex(elca)
        r = incremental(elca[iel], ...)
        #lock(lk) do
        R[idof[iel]] .+= r # potential BUG: race condition
        #end
    end
end
I think the code is quite uncomplicated: within a threaded loop I compute vectors r, which I add into a larger vector R.
Several threads can thus access the same elements of R, and in my understanding this is the classic “race condition” which I need to avoid. I have run this code hundreds of times and always got the correct result, though (famous last words).
I believe I cannot use Atomic here, because r is not a primitive type (right?), but Julia’s manual mentions lock. However, if I uncomment the lock(lk) do block in the code above, I get a LoadError: TaskFailedException.
Possibly I have a subtle bug in my code that is not visible in the simplified code above. In that case, I am on my own. What I would like to know is: did I completely misunderstand how lock is supposed to be used? The usage example in the manual does not show the context of a threaded loop; was I wrong to assume this is the intended pattern?
In your piece of code, if you uncomment #lock(lk) do and #end, you will get
ERROR: TaskFailedException
....
nested task error: UndefVarError: lk not defined
You need to define lk = ReentrantLock() outside of the loop. An MWE:
using .Threads

function sum_race(n)
    arr = Int[]
    @threads for i in 1:n
        push!(arr, i)        # unsynchronized push! from several threads: data race
    end
    sum(arr)
end

function sum_lock(n)
    arr = Int[]
    lk = ReentrantLock()     # the lock must exist before the threaded loop starts
    @threads for i in 1:n
        lock(lk) do
            push!(arr, i)    # only one thread at a time executes this
        end
    end
    sum(arr)
end
The first gives you a race, the second doesn’t:
julia> sum_race(1000)
ERROR: TaskFailedException
....
nested task error: BoundsError: attempt to access 295-element Vector{Int64} at index [238]
....
julia> sum_lock(1000)
500500
Maybe you did define lk, but outside your function; in that case the error may be caused by something else.
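For reference, applying the fix to the kernel in the first post could look roughly like this (a sketch only: the elided ... arguments are dropped, and elca, idof, R and incremental are taken from your code):

using Base.Threads

function parallelkernel!(elca, idof, R)  # sketch: elided `...` arguments omitted
    lk = ReentrantLock()                 # one lock, created before the threaded loop
    @threads for iel ∈ eachindex(elca)
        r = incremental(elca[iel])       # expensive part, still runs in parallel
        lock(lk) do
            R[idof[iel]] .+= r           # only the shared update is serialized
        end
    end
    return R
end

The lock only protects the accumulation into the shared R; the calls to incremental run concurrently.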
FYI, reduction using a lock is often a bad idea. With the lock, R[idof[iel]] .+= r is not run in parallel. For the code in the OP to parallelize well, the time taken by R[idof[iel]] .+= r has to be very small relative to incremental(elca[iel],...). If you can keep nthreads copies of R in memory, it may make sense to use a solution without the lock. For more information, see:
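In case it helps, here is a minimal sketch of that nthreads-copies idea (the chunking, the function name and the omission of the elided ... arguments are my assumptions; elca, idof, R and incremental follow the OP's naming):

using Base.Threads

function parallelkernel_nolock!(elca, idof, R)
    # split the element range into (roughly) one chunk per thread
    chunklen = max(1, cld(length(elca), nthreads()))
    tasks = map(Iterators.partition(eachindex(elca), chunklen)) do chunk
        @spawn begin
            Rlocal = zero(R)               # private accumulator: no lock needed
            for iel in chunk
                r = incremental(elca[iel]) # elided `...` arguments omitted
                Rlocal[idof[iel]] .+= r
            end
            Rlocal
        end
    end
    for t in tasks
        R .+= fetch(t)                     # serial merge of the per-task copies
    end
    return R
end

The merge at the end is serial, but it is done once per task rather than once per element, so its cost is usually negligible.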
Yes, that makes sense. Most of the CPU time is spent in incremental, so it’s not a disaster, but indeed I can see that more time is spent on this addition than when taking the risk without a lock. I actually tried exactly what you suggest; I somehow botched it and got bad performance, but I’ll have another go.