No more threadid indexing? [thread-local storage]

Thanks for the explanation. But the documentation does not mention either of these two options. Perhaps it needs an update?

I guess it could, maybe even should. These packages are much younger than this part of the docs.

Yes, it works fine if you use Threads.nthreads(:default) + Threads.nthreads(:interactive) buffers in Julia 1.12-rc1. But it should not be mentioned in the docs, because the pattern is discouraged.

Yeah, I think the current state sucks a lot: Convenience functions for “simply thread this damn loop” should be part of Base.

That being said, the old model via Threads.threadid() really sucks as well: the @threads macro does both the splitting of the input into chunks and all the scheduling. However, one often wants temporary variables that live for an entire chunk rather than for a single item; that’s what Threads.threadid() was used for.

Suppose you have code like

ran = 1:100
Threads.@threads for i in ran
    tid = Threads.threadid()
    # do something with i and tid
end

Then you can instead use

ran = 1:100
import ChunkSplitters
@sync for (tid, chunk) in ChunkSplitters.index_chunks(ran; n=Threads.nthreads())
    Threads.@spawn for i in chunk
        # do something with i and tid
    end
end

That’s not complicated at all! Except that you must adopt a dependency, and remember the ChunkSplitters API.

And, truth be told, ChunkSplitters as a dependency sucks. It is well maintained, small, and has no further dependencies; but the code sits right on the border between “too simple/small to pay the software-supply-chain hassle → do it yourself / vendor it” and “so annoying to get right by hand that bugs will be prevalent and one is sorely tempted to make do with Threads.@threads”.

Since Threads.@threads is, for legacy reasons, part of Base/stdlib, we should definitely subsume some of ChunkSplitters’ functionality into Base/stdlib. This does not even introduce genuinely new functionality (Threads.@threads already contains chunk-splitting logic; it just needs to be exposed separately). Otherwise people will continue to use the terrible legacy Threads.@threads.

I definitely think we can and should upstream a lot of ChunkSplitters.jl and OhMyThreads.jl to Base (preferably into a Base.Tasks submodule which is separate from Base.Threads; then we can work on deprecating Base.Threads, or at least pointing people away from it).

The main challenges are just the usual bikeshedding that comes with making a major contribution to Base. We’ll need to simplify a lot of these APIs and figure out which experiments were successful and which were unnecessary. There’ll probably also be a lot of naming concerns and the usual stuff like that.

The problem is just finding someone with the time and will to do it.

Ping @ufechner7: it’d be really helpful to know what you tried. Or, ideally, maybe you could show me some relevant benchmark inputs to your getMeasurementsP function; then I can try out some different options and suggest them.

If you don’t want to say what you tried, and don’t want to show an MWE benchmark example, I can give at least one more suggestion that I think has not yet been mentioned (and needs to be added to the OhMyThreads docs): v1.12 has a new OncePerThread (and OncePerTask) object that should be fast and simple to use.

Here’s an example of what the OhMyThreads.jl documentation matmulsums example would look like using OncePerThread:

# don't use: OncePerThread turns out to be the wrong primitive here (see below)
# using LinearAlgebra: mul!
# function matmulsums_opt(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
#     N = size(first(As), 1)
#     sums = zeros(T, length(As))
#     thread_state = OncePerThread(() -> Matrix{T}(undef, N, N))
#
#     Threads.@threads for i ∈ eachindex(As, Bs)
#         C = thread_state()   # per-thread buffer; races if the task yields mid-iteration
#         mul!(C, As[i], Bs[i])
#         sums[i] = sum(C)
#     end
#     sums
# end

Copy the code, keep the reference (and the Git commit hash); it’s MIT-licensed and you have no dependencies. That’s why we have MIT :wink: I am also getting old and tired of my software packages breaking from one day to the next because dependencies get updated and break my code (you cannot rely on SemVer), so I am more and more the copy-guy, and I get rid of anything that can potentially make me do work just so that my (finished) software keeps doing the same thing as before :laughing:

This works until your mul! runs on a GPU / asynchronously, or tries to acquire a lock (for example because of an innocuous-looking debug print that acquires an IO lock), causing the task to yield and be re-scheduled onto a different thread. Then you have a nice race condition: your task suspends, the thread can be re-used by a different task while the matmul is still running, and that task gets the same C.

OncePerThread, OncePerTask, task-local storage, scoped variables, etc. are very suspicious; don’t use them if you can avoid it. There are cases where passing context via ordinary function parameters sucks, like logging; then sure, use these tools. But that should be rare.

Nothing wrong with the one simple pattern that always works:

import ChunkSplitters
using LinearAlgebra: mul!

function matmulsums_opt(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    @sync for (tid, chunk) in ChunkSplitters.index_chunks(1:length(As); n=Threads.nthreads())
        Threads.@spawn begin
            C = Matrix{T}(undef, N, N)   # one fresh buffer per task/chunk
            for i in chunk
                mul!(C, As[i], Bs[i])
                sums[i] = sum(C)
            end
        end
    end
    sums
end

Yeah, vendoring is an alternative to accepting the dep. But this specific functionality is so basic and ubiquitously useful that it should ship with the language.

PS. To clarify my point,

julia> function foo()
           t = Threads.threadid()
           print("aah")
           tt = Threads.threadid()
           @assert t == tt
       end

julia> rs = []; for i = 1:1000 push!(rs, Threads.@spawn foo()) end;
aahaaha....
julia> rs
1000-element Vector{Any}:
 Task (failed) @0x00007f7b119b13c0
 Task (failed) @0x00007f7b119b14b0
 ⋮
 Task (failed) @0x00007f7b11bdfd00
 Task (done) @0x00007f7b11bdfdf0
 Task (failed) @0x00007f7b11bdfee0
 Task (failed) @0x00007f7b10f48010

You’re right, I should have thought more about that before posting. OncePerThread is the wrong primitive to use here.

I do agree about OncePerThread, but I find it a lot harder to agree with your skepticism around OncePerTask / task-local.
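
To make that concrete, here is a sketch of the matmulsums example with OncePerTask instead, assuming the Julia 1.12 API where the OncePerTask object is callable and initializes one value per calling task. Since a task keeps its identity even when it migrates across threads, the yield-inside-mul! scenario above can no longer alias buffers:

using LinearAlgebra: mul!

function matmulsums_pertask(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    buffer = OncePerTask(() -> Matrix{T}(undef, N, N))  # assumed 1.12 API: one value per task
    Threads.@threads for i in eachindex(As, Bs)         # one task per chunk, one buffer per task
        C = buffer()
        mul!(C, As[i], Bs[i])
        sums[i] = sum(C)
    end
    sums
end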

Ultimately, I just think it really is important to provide useful abstractions over these sorts of loops, but I also do agree that it’s really important to get the abstractions right.

There’s definitely a lot of room for just using ChunkSplitters directly (or its shittier cousin, Iterators.partition), and that is the best option for having more transparent code with maximal control.

I think though that there’s a very strong desire from users for an abstraction that says “parallelize this loop, and do the right thing to give me proper caching of a re-usable buffer”.
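
For what it’s worth, OhMyThreads’ @tasks macro already offers roughly that abstraction. Here is a sketch from memory of its @local task-local-value syntax; treat the exact spelling as an assumption and check the OhMyThreads docs:

using OhMyThreads: @tasks
using LinearAlgebra: mul!

function matmulsums_tasks(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    @tasks for i in eachindex(As, Bs)
        @local C = Matrix{T}(undef, N, N)  # assumed OhMyThreads syntax: one buffer per task
        mul!(C, As[i], Bs[i])
        sums[i] = sum(C)
    end
    sums
end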

My experience has been that if we don’t give users good options, and instead tell them to write big, verbose, multi-tiered iteration strategies involving index_chunks, @spawn and fetch, we’re just going to keep getting pouty threads like this one, where people are looking for some trick to get the state[threadid()] hack working again.

Why is OhMyThreads.jl slow here? I would expect it not to add much overhead.

It really just depends on what the OP was doing. It should always be possible to configure OhMyThreads to be as fast as the pattern he was using originally, but it’s certainly also possible to pick some heavyweight option that ends up causing a lot of overhead.

For instance, if you use task-local storage but then spawn way more tasks than you actually need, that’ll cause a lot of unnecessary overhead.
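
A hedged sketch of what I mean, assuming OhMyThreads’ @set ntasks setting (the keyword name may differ across versions): capping the task count means the task-local buffer is only set up a handful of times instead of once per spawned task.

using OhMyThreads: @tasks

results = zeros(10^4)
@tasks for i in 1:10^4
    @set ntasks = Threads.nthreads()   # assumed setting: cap the number of spawned tasks
    @local buf = zeros(1000)           # allocated once per task, not once per i
    buf .= i
    results[i] = sum(buf)
end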

Yes, we are very much in agreement.

I just think that a Base variant of ChunkSplitters.index_chunks, after some bike-shedding and with tutorials, would solve the issue for most people. Most use cases really don’t need anything except a chunkForThreads(ran::UnitRange), maybe enumerate, and @sync + @spawn.

We simply need to provide the tools in Base and teach people that threaded loops must become double loops: an outer loop over chunks and an inner loop over the elements of each chunk; initialization of re-usable state obviously goes inside the outer loop (inside the spawn) and outside the inner loop.

Heck, we could probably even cook up something that looks like

@sync for chunk in adaptive_chunks(worklist)
    @spawn for item in chunk
        # ... do something
    end
end

that implements work-stealing (i.e. if the time taken per item is very heterogeneous, then chunks can be split again later; for that, the iterator over the items of a chunk would be stateful and use atomic updates to indicate its progress).
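
The crudest version of that idea skips re-splitting entirely and just lets every task pull the next item from a shared atomic counter. A minimal sketch (atomic_foreach is a made-up name, and this is plain dynamic load balancing rather than true work-stealing; per-item atomics cost more than chunked iteration):

function atomic_foreach(f, ran::AbstractUnitRange; ntasks=Threads.nthreads())
    next = Threads.Atomic{Int}(first(ran))
    @sync for _ in 1:ntasks
        Threads.@spawn while true
            i = Threads.atomic_add!(next, 1)   # fetch-and-add; returns the old value
            i > last(ran) && break
            f(i)   # a slow item delays only this task, not a whole fixed chunk
        end
    end
end

atomic_foreach(i -> (sleep(0.001 * isodd(i)); print(i, ' ')), 1:20)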

(But doing the chunkForThreads by hand, with all the divrem logic, will produce off-by-one errors for normal people. I know that I would introduce multiple different off-by-one errors until I exhaustively tested the code, and then I’d never touch it again.)
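
For the record, here is one correct version of that divrem logic (chunk_for_threads being the hypothetical Base function discussed above); it reproduces the smoother chunk lengths that the index_chunks comparison below shows:

# Split `ran` into `n` contiguous chunks whose lengths differ by at most one.
function chunk_for_threads(ran::AbstractUnitRange, n::Integer)
    q, r = divrem(length(ran), n)   # the first r chunks get q + 1 elements
    chunks = Vector{UnitRange{Int}}(undef, n)
    lo = first(ran)
    for k in 1:n
        len = q + (k <= r)
        chunks[k] = lo:(lo + len - 1)
        lo += len
    end
    chunks
end

length.(chunk_for_threads(1:100, 7))   # [15, 15, 14, 14, 14, 14, 14]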

What’s the advantage of ChunkSplitters.index_chunks(x; n) over Iterators.partition(eachindex(x), cld(length(x), n))? Could we just add a new length=n keyword argument to Iterators.partition so that it does the division for you? And make sure @threads supports PartitionIterator over indexable collections (e.g. by adding a getindex method for PartitionIterator)?

It just seems like beefing up Iterators.partition as needed would be easier than bike-shedding a whole new API.
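
For illustration, a sketch of that getindex for vectors; this reaches into PartitionIterator’s internals (the fields c and n), so it’s an assumption about the current layout rather than public API:

# Hypothetical: random access to the k-th block of Iterators.partition over a
# vector, so that e.g. Threads.@threads could index into it. Bounds checks omitted.
function Base.getindex(p::Base.Iterators.PartitionIterator{<:AbstractVector}, k::Integer)
    lo = (k - 1) * p.n + 1
    hi = min(k * p.n, length(p.c))
    view(p.c, lo:hi)   # matches the views the iterator itself yields
end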

I think a different name, rather than keyword arguments, would be easier to remember. These are really two different jobs!

It gives a smoother partition. That’s good for threading and bad for SIMD / branch prediction. Different job, different tool.

julia> x=1:100;n=7; length.(collect(Iterators.partition(eachindex(x), cld(length(x), n))))
7-element Vector{Int64}:
 15
 15
 15
 15
 15
 15
 10
julia> length.(collect(ChunkSplitters.index_chunks(x; n)))
7-element Vector{Int64}:
 15
 15
 14
 14
 14
 14
 14

Perhaps people would get less confused if Threads.@threads were renamed to Threads.@tasks.

Even if the old name has to be maintained for backwards compatibility, adding @tasks as an alias for @threads, and then recommending @tasks as the preferred usage in the @threads docstring, would go a good way towards making @tasks the more prevalent option in the ecosystem.

Indeed. However, renaming alone won’t cut it. It needs to provide more options, like @tasks from OhMyThreads.jl. (That’s why we created it.)
