PSA: Thread-local state is no longer recommended (Common misconceptions about threadid() and nthreads())

I have two problems:

  1. How efficient is the “Better fix: Work directly with tasks” solution compared to the original buggy solution?
  2. How should I rewrite the @threads-style code in the @spawn style when pre-allocated states are used?

Here is my specific problem code; the variable queue is my pre-allocated state. Performance is critical in my application.

Thank you!

You can use something like this:
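A sketch of the task-based pattern, with a hypothetical stand-in function `f` in place of the real per-element computation: the indices are partitioned into one contiguous chunk per thread, each task accumulates into its own local variable, and the partial results are reduced with `fetch`, so no pre-allocated shared state (and no `threadid()`) is needed.

```julia
using Base.Threads: nthreads, @spawn

# Hypothetical stand-in for the real per-element computation.
f(i) = i^2

function sum_with_tasks(N)
    # One contiguous chunk of indices per thread.
    chunks = Iterators.partition(1:N, cld(N, nthreads()))
    tasks = map(chunks) do inds
        # Each task owns its own accumulator: no shared mutable
        # state, and no threadid() anywhere.
        @spawn sum(f, inds)
    end
    # Collect each task's partial result and reduce.
    return sum(fetch, tasks)
end
```

Calling `sum_with_tasks(100)` gives the same result as `sum(f, 1:100)` regardless of the thread count.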

2 Likes

This post got me thinking for a while:

Regardless of whether threadid() is deprecated or not, I believe this is the point… Why doesn’t the post recommend dropping threadid() wherever possible? Instead, it recommends dropping @threads in favour of partitioning data into tasks, using fetch, map-reduce, and so forth, which strikes me as a little odd and a departure from keeping things simple.

Coming mostly from an OpenMP/MPI background, I’m a little overwhelmed with the idea that we should migrate from simple parallel for loops, to a combination of iterators, tasks, fetch, atomics, map-reduce-filter, and all that.

I’m beginning to miss the OpenMP philosophy that we shouldn’t need to modify our scalar code to make it parallel, only add a simple #pragma. I do agree that @threads is a misnomer, but if anything, I believe we should try to replace it with (or complement it by) something simple, inspired by this philosophy.

Correct me if I’m wrong, but I would say that, at least for scientific computing, combining @threads with something like Iterators.partition to create suitable indexing into shared buffers (and dropping threadid()) is all that is needed.
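A minimal sketch of exactly that pattern, assuming a per-element function `f`: the shared buffer is indexed by the chunk number, which each outer iteration owns, so `threadid()` never appears.

```julia
using Base.Threads: @threads, nthreads

function partitioned_sum(f, N)
    chunks = collect(Iterators.partition(1:N, cld(N, nthreads())))
    partial = zeros(length(chunks))   # shared buffer: one slot per chunk
    @threads for c in eachindex(chunks)
        acc = 0.0
        for i in chunks[c]
            acc += f(i)
        end
        partial[c] = acc              # safe: slot c belongs to chunk c only
    end
    return sum(partial)
end

partitioned_sum(identity, 100)   # 5050.0
```

Because each slot of `partial` is written by exactly one iteration of the outer loop, the result is correct for any thread count, including a single thread.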

16 Likes

I think adding a section on using @lmiq’s very nice ChunkSplitters.jl would be a nice PR to submit to the blogpost that gives people an easier transition path.

I guess the reason I didn’t really focus on solutions like that is that I really do just think that it’s a bad pattern to create state before a multithreaded loop and then mutate that state inside the loop. It’s just begging for trouble.

Personally, I think the best solution here is to have a better API than Threads.@threads.

3 Likes

That’s a nice discussion to have. There are situations, for instance when the variable to be reduced is a simple immutable, that are nicely handled by the higher level functions.

But I’m not comfortable with letting an interface handle the initialization and updating of a shared state that is a mutable, perhaps large, buffer. My issue has always been that, for these cases, the transparency of an explicit threaded loop with a manually created shared state is less prone to confusion than following the syntax dictated by a library.

4 Likes

Yes, that would be spot-on. It would be a simple solution to this possible bug.

Well, the pattern of having a mutable state and processing it with various functions already arises from the need to reuse the same chunk of memory to avoid allocating temporary arrays (as that is currently easier than stack allocation). I agree that it is perhaps error-prone, but it is what I’m doing in performance-sensitive parts of my code anyway, so using the same pattern in multithreaded code makes sense to me.
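For concreteness, a sketch of that buffer-reuse pattern under chunking, with a hypothetical `kernel!` that needs a scratch array: one pre-allocated buffer per chunk, indexed by the chunk number, so each buffer is only ever touched by one iteration of the outer loop.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical kernel needing a temporary array; it reuses `buf`
# on every call instead of allocating.
function kernel!(buf, i)
    buf .= i
    return sum(buf)
end

function run_kernels(N, m)
    chunks = collect(Iterators.partition(1:N, cld(N, nthreads())))
    bufs = [zeros(m) for _ in chunks]   # one scratch buffer per chunk
    out = zeros(N)
    @threads for c in eachindex(chunks)
        buf = bufs[c]                   # owned by this chunk, never shared
        for i in chunks[c]
            out[i] = kernel!(buf, i)
        end
    end
    return out
end
```

Here `run_kernels(10, 3)` fills `out[i]` with `3i`, with only one scratch allocation per chunk instead of one per iteration.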

I haven’t tried tasks and fetch, but the post says that this even introduces type instability, which makes me wonder about allocations and overall performance.
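For what it’s worth, the instability comes from `fetch` being inferred as `Any`; a type assertion at the call site contains it. A sketch, assuming integer partial sums:

```julia
using Base.Threads: @spawn

function tasked_sum(f, N)
    half = N ÷ 2
    t1 = @spawn sum(f, 1:half)
    t2 = @spawn sum(f, half+1:N)
    # fetch(::Task) is inferred as Any; the ::Int assertions restore
    # type stability in the caller.
    return fetch(t1)::Int + fetch(t2)::Int
end

tasked_sum(identity, 100)   # 5050
```

The remaining allocations are one Task per `@spawn`, which is usually negligible next to the work each task performs.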

Maybe this discussion is a little too abstract, and we should have some sort of multithreaded benchmark suite with concrete examples of different kinds. Coding, say, an asynchronous web-server app is not the same as coding a PDE solver.

2 Likes

Great and informative article! I have two questions:

  1. The updated example in the Julia manual under “Using @threads without data races” presents a solution using Threads.@spawn and atomics. Does this mean that Threads.@threads can and should only be used safely with atomics in the future? For example, why would such simple code as this not be safe?
s = zeros(100)

Threads.@threads for i in 1:100
    x = map(sqrt, 1:i)
    s[i] = sum(x)
end
s

I really enjoyed the simplicity of Threads.@threads, but the article somehow implies that it should rather be banned from the toolkit.
  2. The article recommends the great ThreadsX.jl package, yet the last commit was two years ago and there are quite a few open issues. Is that package still maintained and safe to use?

1 Like

It is safe.

(Note, though, that this implementation will likely suffer from false sharing, but that’s just a performance concern.)

1 Like

Thanks for the quick reply! Does this generally mean that using these kinds of patterns without relying on Threads.threadid() is safe? I kind of miss simple examples in the manual of how to use Threads.@threads safely. The article and the manual seem to imply that Threads.@threads is a foot gun that you should better stay away from.

Safety in parallel execution is very case-specific, so one cannot claim something like that in general. What should be avoided is using threadid() to index an output buffer. It is threadid() that would probably better not be in the toolkit, except maybe for internals.

Your code is safe because it uses the iterator of the loop, i, to index the output array s, meaning that s[i] is a well-defined position in the s vector independently of which thread is working on it. That is the whole point of the safety concerns in that article.

4 Likes

But isn’t it safe to say, then, that if you have a loop that iterates over i, which also serves as the index into an output array s, and the loop does not rely on threadid(), such a loop is generally safe? I see this as a commonly used approach, which is also easy to understand.

I posted my little example because I was wondering what would have to happen before the line s[i] = sum(x) for the code not to be safe. Maybe this modified code is a bit more general:

s = Array{T}(undef, n)   # with some element type T and length n

@threads for i in 1:length(s)
    temp = compute_stuff()
    s[i] = temp
end

How could such code not be thread safe, given that compute_stuff() is single-threaded and does not rely on threadid()?

I think this was answered:

1 Like

If compute_stuff updates global variables or has other side effects, it could be less than thread safe.
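A minimal illustration of that failure mode, with a hypothetical `compute_stuff` that bumps a global counter:

```julia
using Base.Threads: @threads

const counter = Ref(0)    # global mutable state shared by all threads

function compute_stuff()
    # Unsynchronized read-modify-write: two threads can read the same
    # old value and both store old + 1, losing an increment.
    counter[] += 1
    return counter[]
end

s = Vector{Int}(undef, 1000)
@threads for i in 1:length(s)
    s[i] = compute_stuff()  # writing s[i] is race-free; counter is not
end
# With several threads, counter[] can end up below 1000 and s can
# contain duplicate values, even though the s[i] indexing is safe.
```

So the indexing pattern protects the output array, but not any shared state hidden inside the called function.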

Generally speaking though, the article mostly focuses on reductions, which has been the most common use case for threadid() in conjunction with @threads, and then the macro isn’t particularly easy to use right and efficiently. If your code does a “for each” rather than a reduction, @threads is often fine.

3 Likes

I do not understand the logic in this post (PSA: Thread-local state is no longer recommended).

  • in the first sample, you use the @threads pattern for the incorrect code, which hardly ever reproduces incorrect results;
  • in the second sample, you use the @sync/@spawn pattern to demonstrate the incorrectness, which reliably reproduces incorrect results.

Could you please give an easily reproducible sample of the @threads pattern producing incorrect results?
Thanks!


Here are some details about my tests of @threads-pattern and @sync-@spawn-pattern.

In the section “The buggy pattern” in this Blog (PSA: Thread-local state is no longer recommended) it reads

“The general template that this incorrect code follows is something like the following:” (then the @threads-pattern code).

The following Julia code, using the general template (as mentioned in the Blog), is run repeatedly via julia -t 7 threads.jl in a bash for-loop. However, the output does not change between runs (at least in my tests).

using Base.Threads: nthreads, @threads, threadid

f(i) = (sleep(0.001); i)

states = [ 0 for _ in 1:nthreads() ]
N = 100
@threads for x in 1:N
  tid = threadid()
  old_val = states[tid]
  new_val = old_val + f(x)
  states[tid] = new_val
end
@show sum(states)

For the @sync-@spawn-pattern (the 2nd example in the Blog): the following Julia code is run repeatedly via julia -t 7 spawn.jl in a bash for-loop. The output changes very frequently between runs.

using Base.Threads: nthreads, @spawn, threadid

f(i) = (sleep(0.001); i)

let states = [ 0 for _ in 1:nthreads() ], N = 100
  @sync for i in 1:N
    @spawn begin
      tid = threadid()
      old_val = states[tid]
      new_val = old_val + f(i)
      states[tid] = new_val
    end
  end
  @show sum(states)
end

Thus, I would say the general template in the Blog is not reproducible: the @threads-pattern code above always gave correct results in my tests. More info about my tests:

  • CPU: AMD EPYC 7763 64-Core
  • OS: Red Hat Enterprise Linux 8.8
  • Julia: version 1.10.2

I didn’t look at your code, however, …

I think you’re missing two things:

  1. Often-correct ≠ correct.

  2. Even if your code appears to be correct in practice assuming a specific, pinned version of the compiler and other dependencies (such as LLVM or Julia package code), the bug might manifest as a miscompilation otherwise. Related: Benign Data Races Considered Harmful | Bartosz Milewski’s Programming Cafe

I get inconsistent results with this:

julia> function s()
       f(i) = (sleep(0.001); i)
       N=100
       states = zeros(nthreads())
       @threads for x in 1:N
         tid = threadid()
         old_val = states[tid]
         new_val = old_val + f(x)
         states[tid] = new_val
       end
       return sum(states)
       end
s (generic function with 1 method)

julia> s()
5050.0

julia> s()
3314.0

julia> s()
3683.0

Which is avoided by chunking:

julia> using ChunkSplitters
       function s_chunks()
       f(i) = (sleep(0.001); i)
       N=100
       states = zeros(nthreads())
       @threads for (tid, inds) in enumerate(chunks(1:N; n=nthreads()))
         for x in inds
           old_val = states[tid]
           new_val = old_val + f(x)
           states[tid] = new_val
          end
       end
       return sum(states)
       end
s_chunks (generic function with 1 method)

julia> s_chunks()
5050.0

julia> s_chunks()
5050.0

julia> s_chunks()
5050.0

julia> s_chunks()
5050.0

or by locking (much worse for performance, of course):

julia> function s()
       f(i) = (sleep(0.001); i)
       N=100
       states = zeros(nthreads())
       lk = ReentrantLock()
       @threads for x in 1:N
         @lock lk begin
         tid = threadid()
         old_val = states[tid]
         new_val = old_val + f(x)
         states[tid] = new_val
         end
       end
       return sum(states)
       end
s (generic function with 1 method)

julia> s()
5050.0

julia> s()
5050.0

julia> s()
5050.0
1 Like

Thanks! For the function s():

  • if it is invoked many times within a single Julia session, the incorrect results are observed;
  • if it is invoked only once per run, and the runs are repeated by a bash for-loop, we only ever see the first output, which always seems correct.
    Anyway, I’m happy to see the incorrect results, and thank you again.

Relevant PR (will document data races as UB):

Note that the blogpost explicitly warns against falling for these kinds of easily-mistaken impressions, with some very concrete examples of how this might happen:

Don’t try to reason about yielding

Many existing uses of thread local state happen to be relatively robust and give correct answers only because the functions they are calling during execution do not yield. One may then think “well, I can just avoid this problem by making sure my code doesn’t yield”, but we think this is a bad and unsustainable idea, because whether or not a function call will yield is not stable, obvious, or easily inspectable.

7 Likes