Sharp edge with `Threads.threadid()` and task migration

ScopedValues don’t have this limitation; they’re inherited by spawned tasks.
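For reference, a minimal sketch of that inheritance, assuming Julia 1.11’s Base.ScopedValues (the ScopedValues.jl package provides the same API on older versions):

using Base.ScopedValues

const ctx = ScopedValue(0)

with(ctx => 42) do
    t = Threads.@spawn ctx[]   # the spawned task inherits the value from the enclosing scope
    @show fetch(t)             # 42
end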


That’s totally normal and good though. You should basically never create n separate tasks for a threaded loop over n items unless you statically know n is small.
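For illustration, a sketch of the usual alternative: spawn one task per chunk of work rather than one per item (the chunking strategy here, Iterators.partition with roughly nthreads() chunks, is just one possible choice):

using Base.Threads: @spawn, nthreads

function chunked_sum(f, xs)
    chunks = Iterators.partition(xs, cld(length(xs), nthreads()))
    tasks = map(chunks) do chunk
        @spawn sum(f, chunk)       # one task per chunk, not per element
    end
    return sum(fetch, tasks)
end

chunked_sum(x -> x^2, 1:1_000_000)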

I feel like at some point I read that Threads.@threads was discouraged in favor of working directly with the task interface. Maybe it was for a different reason, though.

Also,

Why?

Oversubscription: unless n is known to be small, it’s very unlikely that you’re going to have that many CPU cores available for doing the actual work in parallel. The default scheduling behavior of @threads slices the iteration space evenly into chunks, up to the number of threads Julia was started with. If you want some form of automatic load balancing for an uneven workload, it’s better to use the (new-ish) @threads :greedy, which spawns only as many tasks as there are threads; each task takes the next item from the iteration space as soon as it finishes its current one.
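For illustration, a minimal sketch of that option (the :greedy schedule needs Julia 1.11 or newer; the workload is just an arbitrary stand-in for uneven per-iteration cost):

using Base.Threads

results = Vector{Float64}(undef, 100)
@threads :greedy for i in 1:100
    results[i] = sum(abs2, rand(1000i))   # later iterations are much more expensive
end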


Yes, Threads.@threads is quite bad. Working with tasks is nice if you know what you’re doing, are avoiding things like insane levels of oversubscription, and feel like rewriting a lot of code.

Otherwise, I strongly recommend using something like OhMyThreads.jl.
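For example, a rough sketch with OhMyThreads.jl (expensive and xs are placeholders; see the package docs for the full set of schedulers and options):

using OhMyThreads: tmap, tmapreduce, GreedyScheduler

xs = 1:1_000
expensive(x) = sum(abs2, rand(100x))      # stand-in for real per-item work

ys = tmap(expensive, xs)                  # threaded map with automatic chunking

s = tmapreduce(expensive, +, xs; scheduler = GreedyScheduler())   # load balancing for uneven work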

It’s very very very inefficient.

It’s not OS-level oversubscription, right? Since Julia has a scheduler. I thought it was more that if the time it takes to run a task is small relative to the cost of creating the task, it’s quite inefficient; but what matters is the size of the work each task does, not the number of tasks. So spawning N tasks for N pieces of work is fine if each of those pieces of work is big enough.


Is there any imaginable responsible use of Threads.threadid() if current_task().sticky == false?

As far as I understand, we only migrate tasks at yield points, not on GC stop-the-world, not on random blocking syscalls, and especially never preemptively, right?

Is this a guarantee we’re willing to stick to?

If that is not supposed to be a guarantee, i.e. if we want to long-term permit migration on preemption, then a valid implementation would be Threads.threadid() = current_task().sticky ? Threads.the_real_threadid() : rand(1:Threads.nthreads()) ?

That would be a reasonable migration strategy to find all the misuses (have a command-line switch to either return what’s currently returned, or throw an error if the task is not sticky, or try to make the task sticky which should heal the bug).


Maybe the blogpost announcing composable multithreading should also put this in a warning because it currently gives the example of

import Base.Threads.@spawn

function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end

with the only caveat being that the inefficiency in this calculation comes from the non-memoized recursive algorithm rather than from the @spawn, which it seems to imply is fine. I think that blog post (still highly ranked on Google) is also what caused me to pick up that apparently bad habit of using lots of @spawn. :sleepy:

Another quote from that blog:

This is a good compromise! Even less breaking, while still allowing people to use threadid() if they know what they are doing. Completely safe with the previously officially-recommended use of it.

e.g.,

os_threadid() = #= current behavior of threadid() =#

function threadid()
    if !current_task().sticky
        error(
            "You used `threadid()` in a non-sticky task. " *
            "This is no longer supported as its abuse frequently results in silent cache corruption. " *
            "If you meant to use this, please update to `Threads.os_threadid()`. " *
            "Alternatively, use sticky threads: `Threads.@threads :sticky`."
        )
    end
    return os_threadid()  # Explicit call preferred, but allowed if sticky
end

Is there a possibility this won’t immediately get shot down if I open a PR for this? Happy to do so and have triage consider it. @Oscar_Smith @mbauman


That’s right.

I’d still argue it’s bad. If you’re synchronizing these tasks, you’re basically creating a gigantic amount of heap-allocated garbage that can’t be reclaimed until the whole calculation is over. This is because each task will get put into a Channel or a Vector, and each Task in Julia is a pretty heavy beast. It’s carrying around a stack, task_local_storage, an RNG, and a dozen other random things.

Furthermore, as observed above, task local storage is useless if you’re spawning n tasks. Sure, the impact of this might not be so bad if each task performs a lot of work, but there’s really no advantage to doing it this way. You’d be better off using GreedyScheduler or @threads :greedy in that case.

None of that sounds that heavy to me :slight_smile:. It sounds like microseconds of startup and maybe a few hundred bytes of data. That might be a lot for a few arithmetic operations, and that might be basically nothing for some code that does network IO or some matrix operations or such.
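If you want to put a number on it, here is a rough sketch of how one might measure the per-task overhead on a given machine (not a rigorous benchmark, just an order-of-magnitude check):

using Base.Threads: @spawn

function spawn_overhead(n)
    tasks = Vector{Task}(undef, n)
    @time begin
        for i in 1:n
            tasks[i] = @spawn nothing   # trivial task: measures pure spawn cost
        end
        foreach(wait, tasks)
    end
end

spawn_overhead(100_000)   # divide the reported time and allocations by n for a per-task estimate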

My point is people use threading and tasks for a lot of different things and I don’t think general advice should always be geared towards writing a BLAS or something like that.


In the limit it might be fine, but even then it’s just going to approach the performance of a GreedyScheduler.

There’s also no guarantee that Julia’s scheduler will handle it intelligently either. I’d want to test lots of things to make sure you’re not getting e.g. constant task interrupts, task migration, and just inefficient planning.

I’m not saying it’s never fine, but it is almost never the best way to do it, and it’s certainly not something we should endorse for people who don’t really know what they’re doing or how to balance and measure the various tradeoffs.


I’ve had such issues. I’ve solved it with a global cache:

const cachesize = 10   # or nthreads() 
const cache = [Int[] for _ in 1:cachesize]

Then I use an atomic to access them:

using Base.Threads: Atomic, atomic_sub!, @spawn

index = Atomic{Int}(length(cache))
@sync for _ in 1:length(cache)
    @spawn begin
        tmpvec = cache[atomic_sub!(index, 1)]   # atomic_sub! returns the old value
        # ... use tmpvec ...
    end
end

This is quite fast.

If the number of tasks is larger than cachesize, it’s possible to protect the cache access with a Base.Semaphore (with acquire/release) and some atomic counter (or pop!/push! on cache) hidden away in a getcache function. Or, more simply, as suggested above, put the buffers into a Channel, take! a cache entry when needed, and put! it back afterwards.
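For concreteness, a sketch of that Channel variant, reusing the cache and cachesize definitions from above (ntasks is a made-up stand-in for however many work items there are):

ntasks = 100                       # hypothetical: more tasks than buffers

pool = Channel{Vector{Int}}(cachesize)
for buf in cache
    put!(pool, buf)                # seed the pool with the pre-allocated buffers
end

@sync for _ in 1:ntasks
    Threads.@spawn begin
        tmpvec = take!(pool)       # blocks until a buffer is free
        try
            # ... use tmpvec ...
        finally
            put!(pool, tmpvec)     # always return the buffer to the pool
        end
    end
end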


I don’t think I would support a deprecation of threadid, even just a soft and partial deprecation like this with a warning.

It does just what it says it does; removing it would remove legitimate and documented functionality from the language.

Yes, it can be misused, but so can pretty much any other tool.


Following this discussion, I am also having second thoughts on whether task_local_storage() would be the best solution for OpenSSLGroups.jl. I would rather have some limited form of documentation on how to use threadid properly. In particular, the blog post claims (and I guess it is also somewhere in the docs) that one should not rely on functions not yielding. But is that true in general?

For instance if I use a pattern:

n = threadid()
ctx = GLOBAL_CACHE[n]
ret = @ccall libcrypto.EC_POINT_cmp(
        group::Ptr{Cvoid}, 
        pointer(x)::Ptr{Cvoid}, 
        pointer(y)::Ptr{Cvoid}, 
        ctx::Ptr{Cvoid}
    )::Cint

Is there any potential that in the future there would be a yield point between these three lines of code? Wouldn’t it be possible to give some guarantees that one could follow?

The issue is that the current behavior is dangerously divergent from the original documented behavior. It’s the trolley problem of breaking changes. Do we change to a single annoying error, or do we let the trolley cause massive amounts of silent memory corruption throughout the ecosystem?

Currently we are letting the trolley go straight. By pulling the lever, sure we might cause some loud errors, but we’ll save more chaos overall.

Are you referring to the recommended use patterns or to the docstring attached to the function? The docstring itself seems accurate as far back as I can find any docs.

The buffer[tid] situation is very unfortunate, but I think the right thing there is to work on publicizing the issue, making remedies extremely easy to follow, and proactively fixing the problem throughout the ecosystem (and steps have already been taken in this direction with the PSA last year and the publication of OhMyThreads.jl). I do not think that changing threadid to lie to its users is the right solution.

I kind of feel like the current behavior is lying to the users. I think many, many people (including me, a very active developer here) did not know that threadid can change within a task. This is why Go, with a similar concurrency model, deliberately doesn’t expose a threadid function at all, because doing so results in abuse. (And that’s without putting up an official blogpost encouraging such behavior.)

Many people might just see this pattern in other people’s codebases, and pick it up from there. APIs have inertia to them. If you ask ChatGPT how to solve this today, it will immediately give you the old API (which again, was officially recommended on Julia blogs) –

LLMs are statistical models of text; this just shows you that this pattern is the one most correlated with a multithreaded cache when aggregated over the internet. And more directly, a lot of people actually use LLMs to learn about languages. And this code will run without error!! The pattern is already out there and heavily in use. Passive solutions won’t work.

Making API changes that drastically change behavior in silent ways like this is just a colossal mistake. If this corruption resulted in a BoundsError, it wouldn’t be a big deal: just add a linter warning and be done. But generating silent memory issues like this from officially-sourced material is *terrible*.


It seems the LLM is aware of the concurrency since it uses an atomic?

It’s not so much about atomics; it’s that Threads.threadid() can change within a single scope. So if I write something like

Base.@lock counter_lock begin
    i = counter[Threads.threadid()]
    i_new = do_work(i)
    counter[Threads.threadid()] = i_new
end

Even though I locked access to counter, I can corrupt it because Threads.threadid() can change between getting i and setting i_new!

Now a user could prevent this with id = Threads.threadid() and then carry id around, but that’s an extremely sharp edge to balance on. Many people (including myself) just assume threadid() is constant within a task (because it used to be!), so it doesn’t occur to them to store it in a variable first.
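For completeness, the hoisted variant described here would look like this (same hypothetical counter, counter_lock, and do_work as in the snippet above):

Base.@lock counter_lock begin
    tid = Threads.threadid()      # read the id once...
    i = counter[tid]
    i_new = do_work(i)
    counter[tid] = i_new          # ...and index with the same value both times
end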