Regardless of whether threadid() is deprecated or not, I believe this is the point… Why doesn’t the post recommend dropping threadid() whenever possible? Instead, it recommends dropping @threads in favour of partitioning data into tasks, using fetch, map-reduce, and so forth, which strikes me as a little odd and a departure from keeping things simple.
Coming mostly from an OpenMP/MPI background, I’m a little overwhelmed by the idea that we should migrate from simple parallel for loops to a combination of iterators, tasks, fetch, atomics, map-reduce-filter, and all that.
I’m beginning to miss the OpenMP philosophy that we shouldn’t need to modify our scalar code to make it parallel, and only add a simple #pragma. I do agree that @threads is a misnomer, but if anything, I believe we should try to replace it with (or add alongside it) something simple, inspired by this philosophy.
Correct me if I’m wrong, but I would say that, at least for scientific computing, combining @threads with perhaps something like Iterators.partition to create suitable indexings for shared buffers (and dropping threadid) is all that is needed.
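A minimal sketch of the pattern described above, assuming a hypothetical work function f (the names here are illustrative, not from the post): partition the index range into chunks and index the shared buffer by chunk number rather than by threadid().

```julia
using Base.Threads: @threads, nthreads

function partitioned_sum(f, N)
    chunk_len = cld(N, nthreads())
    chunks = collect(Iterators.partition(1:N, chunk_len))
    partials = zeros(length(chunks))   # one slot per chunk, not per thread
    @threads for c in eachindex(chunks)
        acc = 0.0
        for i in chunks[c]
            acc += f(i)
        end
        partials[c] = acc              # each iteration owns exactly one slot
    end
    return sum(partials)
end

partitioned_sum(identity, 100)  # == 5050.0
```

Because each loop iteration writes only to its own slot of partials, the result is correct regardless of which thread runs which chunk.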
I think adding a section on using @lmiq’s very nice ChunkSplitters.jl would be a nice PR to submit to the blogpost that gives people an easier transition path.
I guess the reason I didn’t really focus on solutions like that is that I really do just think that it’s a bad pattern to create state before a multithreaded loop and then mutate that state inside the loop. It’s just begging for trouble.
Personally, I think the best solution here is to have a better API than Threads.@threads.
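One shape such an API could take, sketched here with illustrative names: spawn one task per chunk and fetch the partial results, so no shared mutable buffer is needed at all.

```julia
using Base.Threads: @spawn, nthreads

function tasked_sum(f, N)
    chunk_len = cld(N, nthreads())
    tasks = map(Iterators.partition(1:N, chunk_len)) do inds
        @spawn sum(f, inds)    # each task computes its own partial sum
    end
    return sum(fetch, tasks)   # combine the results; nothing to race on
end

tasked_sum(identity, 100)  # == 5050
```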
That’s a nice discussion to have. There are situations, for instance when the variable to be reduced is a simple immutable, that are nicely handled by the higher level functions.
But I’m not comfortable letting an interface deal with the initialization and updating of a shared state that is a mutable, perhaps large, buffer. My issue has always been that for these cases the transparency of an explicit threaded loop with a manually created shared state was less prone to confusion than following the syntax dictated by a library.
Yes, that would be spot-on. It would be a simple solution to this possible bug.
Well, the pattern of having a mutable state and processing it with various functions already arises from the need to reuse the same chunk of memory to avoid allocations of temporary arrays (as that is currently easier than stack allocation). I agree that it is perhaps error-prone, but it is what I’m doing in performance-sensitive parts of my code anyway, so using the same pattern in multithreaded code makes sense to me.
I haven’t tried tasks and fetch, but the post says that this even introduces type instability, which makes me wonder about allocations and overall performance.
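For what it’s worth, the instability comes from fetch(::Task) being inferred as Any; a common workaround is a type assertion at the fetch site. A minimal sketch (the function name is illustrative):

```julia
using Base.Threads: @spawn

# fetch(::Task) infers as Any; asserting the known result type
# restores inference for downstream code.
function stable_fetch_demo()
    t = @spawn sum(sqrt, 1:100)
    return fetch(t)::Float64
end

stable_fetch_demo()
```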
Maybe this discussion is a little too abstract, and we should have some sort of multithreaded benchmarks with concrete examples of different kinds. Coding, say, an asynchronous web-server app is not the same as coding a PDE solver.
Great and informative article! I have two questions:
1. The updated example in the Julia manual under “Using @threads without data races” presents a solution using Threads.@spawn and atomics. Does this mean that Threads.@threads can and should only be used safely with atomics in the future? For example, why could code as simple as this not be safe?
s = zeros(100)
Threads.@threads for i in 1:100
    x = map(sqrt, 1:i)
    s[i] = sum(x)
end
s
I really enjoyed the simplicity of Threads.@threads but the article somehow implies that it should rather be banned from the toolkit.
2. The article recommends using the great ThreadsX.jl package, yet the last commit was 2 years ago and there are quite a few open issues. Is that package still maintained and safe to use?
Thanks for the quick reply! Does this generally mean that using these kinds of patterns without relying on Threads.threadid() is safe? I kind of miss simple examples in the manual of how to use Threads.@threads safely. The article and the manual seem to imply that Threads.@threads is a footgun that you had better stay away from.
Safety in parallel execution is very case-specific, so one cannot claim something like that in general. What should be avoided is using threadid() to index an output buffer. If anything, it is threadid() that probably shouldn’t be in the toolkit, except perhaps for internals.
Your code is safe because it uses the iterator of the loop, i, to index the output array s, meaning that s[i] is a well-defined position in the s vector, independent of which thread is working on it. That is the whole point of the safety concerns in that article.
But isn’t it safe to say then that if you have a loop that iterates over i, which also serves as the index to an output array s, and the loop does not rely on threadid(), such a loop is generally safe? I see this as a commonly used approach, which is also easy to understand.
I posted my little example because I was wondering what would have to happen before the line s[i] = sum(x) for the code not to be safe. Maybe this modified code is a bit more general:
s = Vector{Float64}(undef, 100)
@threads for i in 1:length(s)
    temp = compute_stuff()
    s[i] = temp
end
How could such code not be thread-safe, given that compute_stuff() is single-threaded and does not rely on threadid()?
If compute_stuff updates global variables or has other side effects it could be less than thread safe.
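To illustrate (this is a hypothetical version of compute_stuff, not from the discussion above): a function that mutates a global is itself a source of races, even when every write to the output array is to a distinct slot.

```julia
using Base.Threads: @threads

# Hypothetical: this compute_stuff has a side effect on a global counter.
const counter = Ref(0)

function compute_stuff(i)
    counter[] += 1        # unsynchronized read-modify-write: a data race
    return sqrt(i)
end

s = Vector{Float64}(undef, 100)
@threads for i in 1:length(s)
    s[i] = compute_stuff(i)   # the writes to s are safe; the counter is not
end
# With multiple threads, counter[] may end up less than 100.
```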
Generally speaking though, the article mostly focuses on reductions, which has been the most common use case for threadid() in conjunction with @threads, and then the macro isn’t particularly easy to use right and efficiently. If your code does a “for each” rather than a reduction, @threads is often fine.
“The general template that this incorrect code follows is something like the following:” (then the @threads-pattern code).
The following Julia code using the general template (as mentioned in the Blog) is repeatedly tested via julia -t 7 threads.jl in a bash for-loop. However, the output does not change in different runs (at least in my tests).
using Base.Threads: nthreads, @threads, threadid
f(i) = (sleep(0.001); i)
states = [ 0 for _ in 1:nthreads() ]
N = 100
@threads for x in 1:N
    tid = threadid()
    old_val = states[tid]
    new_val = old_val + f(x)
    states[tid] = new_val
end
@show sum(states)
For the @sync-@spawn-pattern (the 2nd example in the Blog): the following Julia code is repeatedly tested via julia -t 7 spawn.jl in a bash for-loop. The output does change very frequently in different runs.
using Base.Threads: nthreads, @spawn, threadid
f(i) = (sleep(0.001); i)
let states = [ 0 for _ in 1:nthreads() ], N = 100
    @sync for i in 1:N
        @spawn begin
            tid = threadid()
            old_val = states[tid]
            new_val = old_val + f(i)
            states[tid] = new_val
        end
    end
    @show sum(states)
end
Thus, the general template from the blog post does not reproduce the incorrect behaviour for me: the @threads-pattern code above always gives correct results in my tests. More info about my tests:
Even if your code appears to be correct in practice assuming a specific, pinned, version of the compiler and other dependencies (such as LLVM or Julia package code), the bug might manifest as a miscompilation otherwise. Related: Benign Data Races Considered Harmful | Bartosz Milewski's Programming Cafe
julia> function s()
           f(i) = (sleep(0.001); i)
           N = 100
           states = zeros(nthreads())
           @threads for x in 1:N
               tid = threadid()
               old_val = states[tid]
               new_val = old_val + f(x)
               states[tid] = new_val
           end
           return sum(states)
       end
s (generic function with 1 method)
julia> s()
5050.0
julia> s()
3314.0
julia> s()
3683.0
Which is avoided by chunking:
julia> using ChunkSplitters

julia> function s_chunks()
           f(i) = (sleep(0.001); i)
           N = 100
           states = zeros(nthreads())
           @threads for (ichunk, inds) in enumerate(chunks(1:N; n=nthreads()))
               for x in inds
                   old_val = states[ichunk]
                   new_val = old_val + f(x)
                   states[ichunk] = new_val
               end
           end
           return sum(states)
       end
s_chunks (generic function with 1 method)
julia> s_chunks()
5050.0
julia> s_chunks()
5050.0
julia> s_chunks()
5050.0
julia> s_chunks()
5050.0
or by locking (much worse for performance, of course):
julia> function s()
           f(i) = (sleep(0.001); i)
           N = 100
           states = zeros(nthreads())
           lk = ReentrantLock()
           @threads for x in 1:N
               @lock lk begin
                   tid = threadid()
                   old_val = states[tid]
                   new_val = old_val + f(x)
                   states[tid] = new_val
               end
           end
           return sum(states)
       end
s (generic function with 1 method)
julia> s()
5050.0
julia> s()
5050.0
julia> s()
5050.0
If it is invoked many times within the same Julia process, then the incorrect results are observed.
If it is invoked only once per Julia process, but the script is executed repeatedly by a bash for-loop, then we only see the first output, which always seems correct.
Anyway, I’m happy to see the incorrect results and thank you again.
Note that the blogpost explicitly warns against falling for these kinds of easily-mistaken impressions, with some very concrete examples of how this might happen:
Many existing uses of thread local state happen to be relatively robust and give correct answers only because the functions they are calling during execution do not yield. One may then think “well, I can just avoid this problem by making sure my code doesn’t yield”, but we think this is a bad and unsustainable idea, because whether or not a function call will yield is not stable, obvious, or easily inspectable.
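To make the quoted point concrete, here is a small sketch (the helper function is illustrative, not from the post): under the default dynamic scheduler, a task can migrate between threads at any yield point, so threadid() observed before and after a sleep may differ. The fact that migration often does not happen in practice is exactly why the bug hides so well.

```julia
using Base.Threads: @spawn, threadid

# A task records its threadid() before and after a yield point;
# with the dynamic scheduler the two values are not guaranteed to match.
function observe_migration()
    before = threadid()
    sleep(0.01)           # a yield point: the task may be rescheduled
    after = threadid()
    return before != after
end

tasks = [@spawn observe_migration() for _ in 1:100]
migrated = count(fetch, tasks)
# migrated is often 0 on a given run, but nothing guarantees it stays 0
```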