Sharp edge with `Threads.threadid()` and task migration

Your code might get rescheduled on a different thread between the check and the access.
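To make the hazard concrete, here is a minimal sketch of the anti-pattern (the buffer array here is illustrative scratch space, not from any particular package):

buffers = [zeros(Int, 16) for _ in 1:Threads.nthreads()]

Threads.@threads for i in 1:10_000
    buf = buffers[Threads.threadid()]  # this task currently runs on thread t
    yield()  # any yield point (I/O, locks, channel ops, ...) may migrate the task
    buf[1] += 1  # another task may now be running on thread t, racing on the same buffer
end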

4 Likes

Edit: note that this post was moved from another thread, which hopefully explains why it feels a bit out of place as the first reply here :sweat:


Wait, what!? That’s unexpected. Why doesn’t Threads.threadid() just abstract over this and return a task ID instead? It feels strange that a broadly-understood primitive like “threadid” would expose arbitrary internal scheduling details, especially when tasks are Julia’s main abstraction for parallelism.

It makes me wonder what the intended use case for threadid() actually is—because in practice, this seems more likely to cause confusion than provide value. (Sorry for the rant—just trying to work through my thoughts here!)

This pattern has even been erroneously recommended in previous official Julia blog posts.

Ugh! Well, at least I know I’m not misremembering learning this (and then using that officially recommended pattern in all my code).

Why not just fix threadid() to align with how people are actually using it?

1 Like

Reading through a code search for language:julia /threadid/ is pretty terrifying. So much concurrency-based corruption, everywhere :see_no_evil: That is a really, really sharp edge. Is it possible for us to patch this in Julia so that threadid() matches how it’s actually used, i.e., returns a task ID? (I’m happy to write a PR for this)

1 Like

I don’t think this would fix the situation in the vast majority of cases — the most common flavor of that anti-pattern is to do:

tls = Array{Any}(undef, Threads.nthreads())
# ...
tls[Threads.threadid()] = # ...

The number of threads used to be bounded (and sometimes still is), but the number of tasks is definitely not. Perhaps in some ways two bugs would be better than one (especially given that the subsequent bounds error would be much noisier), but it would also be lying about its own name.

1 Like

Yes, I think having threadid() match the ID of the task would be infinitely better. Better a BoundsError you can see than cache corruption you cannot.

1 Like

Perhaps an even better solution would be to remove threadid() altogether, with an error message that points users to that blog post. I see few cases where someone uses it in a way that actually matches what they mean. Removing it with an informative error message would be even better than a BoundsError (which itself would still be infinitely better than silent cache corruption). If even I, a highly active community member, didn’t realise this behavior of threadid() (which in most other languages is constant within a scope), I feel like 99% of the general userbase wouldn’t know.

This also looks like the solution chosen by Go, whose concurrency model is very similar to Julia’s; see their FAQ.

I think it would be really smart to formally deprecate Threads.threadid(). It just seems super easy to abuse and to generate difficult-to-debug caching issues. Especially now that threadid() is no longer constant over a task’s lifetime, I don’t see a reason to keep it around other than to print a message teaching the user how to upgrade.

I’m sure many multithreaded libraries written before Julia 1.8 have some sort of silent concurrency issue due to abuse of threadid. Better to make those errors loud so they can get fixed!

2 Likes

The blog post does not address the situation of wrapping an arithmetic library where scratch space is needed for operations. Acquiring a lock can incur significant overhead for fast operations like comparing whether two numbers are equal.

For example, I don’t see how I could have made OpenSSLGroup.jl multithreaded without relying on a get_ctx function that provides a context object for each thread. It is, for instance, even needed for comparing two points, as can be seen here. Any task switching or locking would add overhead.

I know that what I currently have is not very safe, but I don’t see how I could have done it without relying on Threads.threadid() while maintaining performance. I wish there were a macro that could be placed in front of a function body to assert that the function does not yield, erroring if that effect could not be proven.

1 Like

If you’re interested in some archeology: an early version of the patch that allows adding threads initially removed nthreads, and a warning was also tried in julia#48589. These functions — both nthreads and threadid — do still do what they say on the tin, and there are programs that use them correctly.

Change is hard and there are many tradeoffs. I think the fastest and easiest way to help socialize this is to add a lint warning akin to 1-based indexing. It’s a similar sort of problem — this code might be just fine, but folks are often using it in ways that don’t take into account the entire universe of possible behaviors.

2 Likes

I appreciate the suggestion about linting, but I’m concerned about error visibility in legacy code. A linter warning won’t catch runtime issues in existing codebases that are silently corrupting their cache. At least with 1:length(A), you get a clear BoundsError that points you to the problem.

Plus, while 1:length(x) has legitimate uses in code that knows its array types, silent task migration under threadid() is almost never what you want. If there is a need to support these specific deprecated patterns, we should create a new, clearly-named function for it - not silently change the behavior of a previously-recommended API. That’s just asking for untraceable bugs in production.

And unlike array indexing where we have Base.require_one_based_indexing(), there’s no way to verify that none of my dependencies are using threadid() incorrectly and potentially generating corrupt results based on code written before 1.8.

1 Like

Could you store the context in task-local storage instead? Tasks · The Julia Language

It could look something like this:

function get_ctx()
    return get!(OpenSSLContext, task_local_storage(), :ctx)::OpenSSLContext
end

This would initialize a new context for every new task rather than once per thread for the lifetime of the program, so there’s a little extra overhead, similar to calling your __init__ function right before every @threads loop. Is that prohibitive for you or not?

Another alternative would be to initialize nthreads() contexts like you currently do, but store them in a Channel that tasks can take from and put back to. This only has to happen once for the lifetime of each task, i.e., you don’t have to lock for every operation. Perhaps something like this would work:

function get_ctx()
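    # Assumes CONTEXT_CHANNEL holds pre-allocated contexts (see the setup sketch below).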
    return get!(task_local_storage(), :ctx) do
        t = current_task()
        ctx = take!(CONTEXT_CHANNEL)
        Threads.@spawn begin
            # put ctx back once the task using it is finished
            wait(t)
            put!(CONTEXT_CHANNEL, ctx)
        end
        ctx
    end::OpenSSLContext
end

Which of these has least overhead depends on how costly it is to allocate a new context object compared to interacting with a Channel.
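
For completeness, here is a sketch of the one-time setup the channel version assumes (OpenSSLContext() stands in for whatever allocates a real context; in a package this would live in __init__):

const CONTEXT_CHANNEL = Channel{OpenSSLContext}(Threads.nthreads())
for _ in 1:Threads.nthreads()
    put!(CONTEXT_CHANNEL, OpenSSLContext())  # pre-allocate one context per thread
end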

1 Like

The blog post is rather focused on reduction operations. For scratch space I have used the following pattern:

Threads.@threads for ...
    scratch_data = get!(() -> Vector{Int}(undef, scratch_size),
                        task_local_storage(),
                        :data)::Vector{Int}
    ...
end

3 Likes

Task-local storage looks interesting. I did not know about it, as I had used that blog post as a reference for alternatives to Threads.threadid(). I will try the task_local_storage approach someday with more time to spare and report whether it introduces performance regressions. It would also allow me to eliminate the __init__ block and the compile-time hacks I needed to do.

The CONTEXT_CHANNEL approach seems like it would add significant overhead: getting the context from the channel and scheduling a short-lived task to put it back.

1 Like

Note that a Threads.@threads loop only spawns approximately nthreads() tasks and partitions the work among them. The tasks live as long as the loop and perform approximately n / nthreads() iterations each. I agree that the channel-based pattern would be prohibitive if each task only did a single iteration, like in the naive for i in 1:n; Threads.@spawn begin ... pattern, but when using @threads on a loop that’s big enough to warrant multithreading it seems less clear. Would be interesting to see the benchmark.

1 Like

I think the simplest alternative is to divide the workload into ntasks chunks and index the chunks. This can be implemented by hand rather easily, but it is what ChunkSplitters.jl implements.
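
A hand-rolled sketch of that idea (illustrative code, not the ChunkSplitters.jl API):

function chunked_sum(xs; ntasks = Threads.nthreads())
    # Split the index range into at most ntasks contiguous chunks.
    chunks = Iterators.partition(eachindex(xs), cld(length(xs), ntasks))
    tasks = map(chunks) do idxs
        Threads.@spawn sum(@view xs[idxs])  # one task per chunk; no threadid() needed
    end
    return sum(fetch, tasks)  # combine the per-chunk partial results
end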

The higher-level alternative, OhMyThreads.jl, also uses that approach and addresses these issues.

(decoupling the number of tasks from nthreads() has the additional advantage of controlling the number of tasks independently of how Julia was initialized. That can be used to reduce the parallelization (very useful to benchmark scaling) or to improve load balancing when the tasks are heterogeneous, using ntasks >> nthreads.)

2 Likes

Splitting the workload into chunks would only be possible if I exported an array API. That would be quite limiting and also hard to maintain.

I think task_local_storage is the solution for my case. I just need to check that the overhead of retrieving the context does not exceed the time it takes to do the operation itself.
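
One rough way to measure that retrieval cost in isolation (assumes BenchmarkTools.jl is available; the zeros call stands in for the real context constructor):

using BenchmarkTools

# After the first evaluation this is just an IdDict lookup plus the type
# assertion, i.e., the per-call retrieval overhead in question.
@btime get!(() -> zeros(UInt8, 64), task_local_storage(), :ctx)::Vector{UInt8}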

1 Like

For performance, make sure to type-assert as soon as you load the value, since task_local_storage is backed by an untyped IdDict and is therefore not type stable. That is,

task_local_storage(:ctx)::OpenSSLContext
# or
task_local_storage()[:ctx]::OpenSSLContext
# or
get!(OpenSSLContext, task_local_storage(), :ctx)::OpenSSLContext

(I edited my post above to use the more elegant get!(f, collection, key) form after seeing @GunnarFarneback’s post.)

1 Like

Notice that I use get! rather than get so the freshly generated scratch spaces are stored in task_local_storage and reused.
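
In case the distinction is unfamiliar, a tiny illustration:

d = Dict{Symbol, Vector{Int}}()
get(d, :buf, Int[])   # returns the default but does NOT store it; d stays empty
get!(d, :buf, Int[])  # returns the default AND stores d[:buf] for reuse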

3 Likes

Oh, I thought that’s what get did. Read the docstring too quickly and didn’t notice that distinction. Fixing the above posts. Thanks!

One annoyance I have with task_local_storage is that it starts fresh for every single task.

So you can’t use it the same way you could use cache[Threads.threadid()] unless you have a long-running task… in which case you’d be better off passing a local variable instead. If you are rapidly Threads.@spawning many workloads, it would just get reset each time.

Like if you try to store a hash ID in task_local_storage():

julia> let n=1000
           [Threads.@spawn(
                   get!(() -> rand(UInt64), task_local_storage(), :ctx)
               ) for _ in 1:n] |>
             Base.Fix1(map, fetch) |>
             unique |>
             length
       end

1000

Each new Threads.@spawn gets its own ID. So in other words I would be recreating the cache every single time I spawn a new task.

Like, it just seems as though task_local_storage() is a way to do quantum tunnelling of variables around your codebase (bad). But for the one legitimate use case, caching, I can’t see an effective use.

(Though for some reason Threads.@threads is able to create fewer tasks than the number of iterations, so there it seems like you actually can share the cache across loop iterations. Weird.)
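
For comparison, repeating the experiment with Threads.@threads shows the sharing; the count equals however many tasks @threads spawned (output here is from a hypothetical session started with 8 threads):

julia> let n = 1000
           ids = Vector{UInt64}(undef, n)
           Threads.@threads for i in 1:n
               ids[i] = get!(() -> rand(UInt64), task_local_storage(), :ctx)
           end
           length(unique(ids))
       end
8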