PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads()

Mason · July 6, 2023, 6:12pm

Here’s a new blogpost we’ve been working on for some weeks now intended to explain and help fix a common and incorrect pattern that has been showing up in multithreaded julia code

The TL;DR is that if you’ve written code with things like state[Threads.threadid()] you’ve probably created a race condition and you should fix it immediately

Shout out to @jling for pointing out the scale of this problem, and for encouraging us to think hard about how it should be approached and fixed.

jling · July 6, 2023, 6:29pm

thanks everyone involved to help brainstorming recommendations and for bearing with me through out the process

pitsianis · July 6, 2023, 9:28pm

Thank you for this write-up.

Can you provide some comments for the following cases:

When the running time of each iteration is very unpredictable, how can one select a good value for the chunks_per_thread?
What can be done with nested loops that can be parallelized?

There is a very elegant solution to these problems employed by cilk_for of OpenCilk with work stealing and a loop grainsize hint. Can we utilize such techniques with Julia as well?

jgreener64 · July 6, 2023, 9:30pm

Is it still a problem if the number of things you loop over is equal to the number of threads? For example, is this general pattern safe?

n_threads = Threads.nthreads()
N = 100
out_threads = [0.0 for _ in 1:n_threads]

Threads.@threads for thread_id in 1:n_threads
    for i in thread_id:n_threads:N
        # Do something with i to get x, not involving thread_id
        out_threads[thread_id] += x
    end
end

# Some reduction of out_threads

This pattern has been passing tests for me on a number of Julia versions, but the blog post and particularly footnote 1 suggest it might not be the way to go.

Mason · July 6, 2023, 9:35pm

Yes, that is still a problem, and in some sense it’s made even worse by the fact that the problem is very unlikely to actually manifest, meaning that as you point out, your tests may convince you that your code is working fine when in reality there is a nasty bug hiding in it that would only be encountered under some somewhat strange and hard to reproduce circumstances.

Mason · July 6, 2023, 9:46pm

Generally speaking, unless the setup time is very costly relative to the already somewhat costly operation of @spawning a task, it’s best in this situation to simply spawn very many tasks and let things work themselves out.

For situations where the setup per-thread is so costly you don’t want to make chunks_per_thread relatively big, then yes a work stealing pipeline would be best. We currently don’t have workstealing in base julia, but FoldsThreads.jl does have a work-stealing executor you can use.

Another option would be something that’s apparently not technically ‘work-stealing’ but I don’t have a good name for it so I’ll just call it “a poor-man’s workstealing” which is to queue up chunks of work in a Channel and then have only nthreads() different tasks which each are take!-ing work from that channel.

Something like this:

chunks = Iterators.partition(data, basesize)

ch = Channel{eltype(chunks)}(Inf)
foreach(chunk -> put!(ch, chunk), chunks)
close(ch)

tasks = map(1:Threads.nthreads()) do _
    Threads.@spawn begin
        state = ...
        for chunk in ch
            for x in chunk
                state = op(state, x)
            end
        end
        return state
    end
end

states = fetch.(tasks)

or something equivalent. this could be written up as tchannel_mapreduce function or whatever.

In general, this is going to require some discretion on your part as the programmer, but you can usually get away with just parallelizing the outermost loop. Sometimes you’ll need a more custom approach though.

Yeah, from what I hear, Clik has some very nice APIs. Julia’s Base.Threads has some good bones and good infrastructure, but some really horrible end-user APIs which is the whole source of this debacle and the reason this blogpost had to be written.

I think the Transducers.jl ecosystem has a pretty compelling story about parallelism and I hope more of its ideas and interfaces end up in julia itself, but I guess time will tell.

jgreener64 · July 6, 2023, 10:05pm

Thanks, in that case this PSA was definitely useful and I need to change some code.

I’m now struggling to see any non-trivial use case for @threads that doesn’t involve :static, @spawn or atomics. Is that correct? The only example I can see in the Julia docs is writing the thread ID to an array.

Perhaps some of the loud warnings in the blog could be added to the docs, maybe that’s already occurring somewhere on GitHub.

lmiq · July 6, 2023, 10:30pm

I think this is perfectly fine, because you are not using the function threadid(). You can (should) call that ichunk to avoid confusion and there will be no concurrency issues there.

(You there have just used a simple way to divide the workload in chunks).

Note that you can use the same pattern using different variable names, an neither the task IDs or the number of chunks need to be the same as the number of threads:

n_threads = Threads.nthreads()
N = 100
nchunks = 20
out_chunks = [0.0 for _ in 1:nchunks]

Threads.@threads for ichunk in 1:nchunks
    for i in ichunk:nchunks:N
        # Do something with i to get x, not involving thread_id
        out_chunks[ichunk] += x
    end
end

This is what ChunkSplitters.jl does when using the :scatter chunk strategy. In fact, increasing the number of chunks can improve load balancing.

Ps. Your question is somehow ambiguous, because the answer to the introduction is yes, there’s a problem, but not with the actual code snippet you posted.

Mason · July 6, 2023, 10:49pm

ugh yeah I did not read that code carefully enough, yeah it looks like it’s fine, sorry about the mixup @jgreener64.

Maybe I should take this opportunity to point out that reasoning about race conditions and multi-threading is really hard! That’s why we should be aiming to write code that doesn’t depend on the low-level details of multithreading.

I think the whole idea of setting up state before a loop that a threaded loop then mutates is just generally a bad idea and we should have abstractions that avoid it. Humans are not particularly good at eyeballing some code and saying “yeah that’s safe” or “nah that’s unsafe”

carstenbauer · July 7, 2023, 7:35am

@Mason, great to see that someone is trying to explain some of the common, but somewhat subtle, issues that commonly arise when using Base.Threads. I agree with your statement above that (at least as of now) the Base.Threads end-user API is “horrible” and full of footguns and misnomers (@threads in particular). So once again, great effort!

However, allow me to put on my HPC hat and add one complementary point: As much as I like task-based multithreading (for the reasons you’ve mentioned) we absolutely need (and will likely always need) tools that let us control the task-thread mapping more directly!

The abstraction “think about tasks not threads” is fine for many use-cases but also puts all the burden (or trust) for obtaining an efficient mapping on a given system on the dynamic scheduler. And our current scheduler, in particular if you only count Base, is currently pretty limited, e.g. it barely has any end-user options. Especially in HPC, where you want to “performance engineer”, I can often easily do a better job in establishing an efficient mapping between well-defined tasks and the hardware threads (i.e. CPU-cores) available in this machine (I can pin software threads to CPU-cores via ThreadPinning.jl and thus have full control where I place my sticky tasks). In contrast to the scheduler, I know everything about my tasks (my workload) and everything about the machine, which gives me a competitive advantage.
Of course, the scheduler will get smarter over time (I’m dreaming of a NUMA-aware scheduler ) and for many people/workloads/machines the pros of using the dynamic scheduler outweigh the cons. I just hope we won’t forget about the cons, especially for HPC scenarios.

(To give a specific example, hardware-performance monitoring with LIKWID.jl wouldn’t be possible at all if we couldn’t place tasks on specific threads. Of course, from a task-based multithreading POV, the solution would be to make LIKWID task-aware. But, realistically speaking, this is not going to happen any time soon (or ever).)

carstenbauer · July 7, 2023, 7:39am

Another orthogonal point: Users are often used to thinking about threads (e.g. coming from OpenMP) and much less so about task-based multithreading with a M:N mapping abstraction. If we want to make them think in terms of tasks, we need to give them excellent tools and documentation. Currently we clearly underdeliver here.

(Your post is a valuable step in the right direction!)

NiclasMattsson · July 7, 2023, 8:16am

From the blog post:

Quickfix: Replace @threads with @threads :static

However, this is not a good long term solution because

It’s relying on guard rails to prevent otherwise incorrect code to be correct

@threads :static is not cooperative or composable, that means that if code inside your @threads :static loop also does multithreading, your code could be reduced to worse than single-threaded speeds, or even deadlock.

I admit to being somewhat unconvinced by point 1 above. I happen to like the guardrails on my apartment staircase, even if they limit my freedom in plunging head-first into the gap five floors down to the ground.

So if I ensure that I avoid multithreading inside the inner-loops of my threaded code, could @threads :static actually be a proper long-term solution for easy dummy-proof multithreading?

And a clarification: I assume “multithreading in the inner-loop” includes BLAS operations like matrix multiplication. This may be harder to avoid, so maybe this is the main counterargument to @threads :static?

Sukera · July 7, 2023, 8:41am

The issue is that it’s extremely hard to avoid that, especially when considering version upgrades. You may be able to reasonably convince yourself that a given function call won’t spawn new tasks, but that’s not necessarily true of the next version of your dependency. So while :static can be ok for now, it’s not future proof and thus a much more brittle solution than designing a piece of code to be robust to those issues in the first place (there’s multiple names for this concept, but I like “correctness by design”).

On top of this, it’s quite likely that :static either isn’t a good fit for the problem in the first place, or leaves performance on the table (which is why things like :dynamic exist in the first place).

Yes, that’s the main way these issues are encountered today, but the general sentiment hold even for non-BLAS code. It’s really hard to consistently ensure that code you’re calling isn’t parallelizing internally (which again could leave performance on the table/become a bottleneck).

carstenbauer · July 7, 2023, 8:43am

BLAS multithreading is completely irrelevant here. It’s not composable anyway, whether you use @threads :dynamic or @threads :static.
@threads :static is a kind-of fine solution for “application code”. But if you use it in a library, it doesn’t compose.

lmiq · July 7, 2023, 9:01am

I admittedly had a hard time finding code that produces the wrong results, even with the dynamic scheduling, using threadids. My main issue with it is that one needs to be sure that the code one is running inside each task does not return the scheduler, launches new threads, or migrate internally between threads, and that’s not guaranteed by virtually any function, not mentioning things like map, broadcasts, etc, that could become automatically parallel in future Julia releases.

Mason · July 7, 2023, 3:31pm

Yeah, perhaps that wasn’t the best phrasing. What I really meant was that its relying on implicit and non-obvious guardrails to stop you from plunging to the ground. One problem I see with the whole @threads :static thing is that it’s not at all clear to someone reading it that the :static is required for the code to be correct.

If someone comes along and is trying to improve your code, or is copying and adapting your code to a new problem, they might think “I see no reason why we should use a static schedule here, the dynamic scheduler is great, lets just remove this pesky argument” and then subtly break the code.

You also need to make sure that nobody else is calling your code from a multithreaded context, since thats also a way you can accidentally end up gimping someone else’s performance.

All that said, so long as you understand the situation it’s not such a giant problem to use @threads :static. It exists for a reason and we discuss it in this blog post as an option for a reason. I want to steer people away from it because I think it’s not a great way to approach this issue, but it does have its place and it will work.

Mason · July 7, 2023, 4:41pm

Yes absolutely. When thinking about performance, the hardware and implementation details are very very important and cannot be ignored.

The thing we wanted to point out though was that when thinking about algorithmic correctness, you need to reason about the tasks first and not the threads, because you can have more tasks than threads and tasks can pause and switch.

tbeason · July 7, 2023, 4:41pm

From the blogpost:

using Threads: nthreads, @spawn
function tmapreduce(f, op, itr; chunks_per_thread::Int = 2, kwargs...)
    chunk_size = max(1, length(itr) ÷ chunks_per_thread)
    tasks = map(Iterators.partition(itr, chunk_size)) do chunk
        @spawn mapreduce(f, op, chunk; kwargs...)
    end
    mapreduce(fetch, op, tasks; kwargs...)
end

Pasting in my REPL:

julia> using Threads: nthreads, @spawn
ERROR: ArgumentError: Package Threads not found in current path, maybe you meant `import/using .Threads`.
- Otherwise, run `import Pkg; Pkg.add("Threads")` to install the Threads package.
Stacktrace:
 [1] macro expansion
   @ .\loading.jl:1630 [inlined]
 [2] macro expansion
   @ .\lock.jl:267 [inlined]
 [3] require(into::Module, mod::Symbol)
   @ Base .\loading.jl:1611

julia> function tmapreduce(f, op, itr; chunks_per_thread::Int = 2, kwargs...)
           chunk_size = max(1, length(itr) ÷ chunks_per_thread)
           tasks = map(Iterators.partition(itr, chunk_size)) do chunk
               @spawn mapreduce(f, op, chunk; kwargs...)
           end
           mapreduce(fetch, op, tasks; kwargs...)
       end
ERROR: LoadError: UndefVarError: `@spawn` not defined
in expression starting at REPL[2]:4

(@v1.9) pkg> add Threads
    Updating registry at `C:\Users\beasont\.julia\registries\General.toml`
ERROR: The following package names could not be resolved:
 * Threads (not found in project, manifest or registry)
   Suggestions: ThreadsX ThreadSafeDicts ThreadedScans ThreadedSparseCSR ThreadedSparseArrays SlackThreads CheapThreads FoldsThreads ThreadPools ThreadTools Xorg_libpthread_stubs_jll ThreadedIterables ThreadingUtilities MultiThreadedCaches ThreadLocalCounters

Mason · July 7, 2023, 4:42pm

Oops, good catch @tbeason, that should have been Base.Threads!

Should be fixed once this PR is merged: Some small fixes to PSA-dont-use-threadid.md by MasonProtter · Pull Request #1919 · JuliaLang/www.julialang.org · GitHub

There’s also a typo in the chunk_size line that’ll be addressed in that PR.

Edit: it’s now merged and live!

Palli · July 7, 2023, 8:52pm

Thanks for writing this up. I and many programmers know that (naked) threads are dangerous, so I was really hoping Julia did things right with @threads for loops. Is my understanding correct, you should always suspect code using it to be buggy (unless it’s using the newer @threads :static, and then always ok?), or in fact that such code is always known know to be buggy?

Another option: Use a package which handles this correctly
We encourage people to check out various multithreading libraries like Transducers.jl (or various frontends like ThreadsX.jl, FLoops.jl, and Folds.jl), and MultiThreadedCaches.jl.

@elrod Polyester.jl has been on my radar as a replacement for @threads. Since it’s not on the list of (apparently) approved packages listed there, is it then know to be buggy too? Or if it’s actually known to (always) produce correct code when it’s used, then you may want to add it to the list in the post. It would be good to have a full block-list of known threading packages to avoid and/or a full allow-list, e.g. are all the dependencies ok, such as ThreadingUtilities.jl?

That’s too bad, since the threading API is no longer marked experimental and we can’t then get rid of it; until 2.0. I’m not sure if we can fix already-written code in 2.0, by e.g. making :static implied then? What would your ideal API be like, is there one already available in some package, and what would be your proposal for Base, to mitigate things in 1.x?

I want to know how to best write (fast) threaded/parallel code, without relying on any package…, because in some situations, well mainly benchmarks, they may not be allowed, such as at the Debian benchmark game. I only scanned your article, I think using @threads is actually ok in some cases such as the benchmarks there, since nothing else is running (but all such code would potentially, or actually, be wrong, with other concurrent code running?):

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fannkuchredux-julia-2.html

PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads()

Quickfix: Replace @threads with @threads :static

Quickfix: Replace `@threads` with `@threads :static`