No more threadid indexing? [thread-local storage]

Thanks for the explanation. But the documentation does not mention either of these two options. Perhaps it needs an update?

I guess it could, maybe even should. These packages are much younger than this part of the docs.

Yes, it works fine if you use Threads.nthreads(:default) + Threads.nthreads(:interactive) buffers in Julia 1.12-rc1. But it should not be mentioned in the docs, because the pattern is discouraged.

Yeah, I think the current state sucks a lot: Convenience functions for “simply thread this damn loop” should be part of Base.

That being said, the old model via Threads.threadid() really sucks as well: the @threads macro does both the splitting of the input into chunks and all the scheduling. However, one often wants temporary variables that live for an entire chunk rather than for a single item; that’s what Threads.threadid() was used for.

Suppose you have code like

ran = 1:100
Threads.@threads for i in ran
    tid = Threads.threadid()
    # do something with i and tid
end

Then you can instead use

ran = 1:100
import ChunkSplitters
@sync for (tid, chunk) in ChunkSplitters.index_chunks(ran; n=Threads.nthreads())
    Threads.@spawn for i in chunk
        # do something with i and tid
    end
end

That’s not complicated at all! Except that you must adopt a dependency, and remember the ChunkSplitters API.

And, truth be told, ChunkSplitters as a dependency sucks. It is well maintained, small, and has no further dependencies; but the code sits right on the border between “too simple/small to pay the software-supply-chain hassle → do it yourself / vendor it” and “so annoying to get right by hand that bugs will be prevalent and one is sorely tempted to make do with Threads.@threads”.

Since Threads.@threads is, for legacy reasons, part of Base/stdlib, we should definitely subsume some of ChunkSplitters’ functionality into Base/stdlib. This does not even introduce genuinely new functionality (Threads.@threads already contains chunk-splitting logic; it just needs to be exposed separately). Otherwise people will continue to use the terrible legacy Threads.@threads.

I definitely think we can and should upstream a lot of ChunkSplitters.jl and OhMyThreads.jl to Base (preferably into a Base.Tasks submodule which is separate from Base.Threads; then we can work on deprecating Base.Threads, or at least pointing people away from it).

The main challenges are just the usual bikeshedding that comes with making a major contribution to Base. We’ll need to simplify a lot of these APIs and figure out which experiments were successful and which were unnecessary. There’ll probably also be a lot of naming concerns and the usual stuff like that.

The problem is just finding someone with the time and will to do it.

Ping @ufechner7: it’d be really helpful to know what you tried. Or, ideally, maybe you could show me some relevant benchmark inputs to your getMeasurementsP function; then I can try out some different options and suggest them.

If you don’t want to say what you tried, and don’t want to show an MWE benchmark example, I can give at least one more suggestion that I think has not yet been mentioned (and needs to be added to the OhMyThreads docs): v1.12 has a new OncePerThread (and OncePerTask) object that should be fast and simple to use.

Here’s an example of what the OhMyThreads.jl documentation matmulsums example would look like using OncePerThread:

# don't use: OncePerThread turns out to be the wrong primitive here (see below)
# using LinearAlgebra: mul!
# function matmulsums_opt(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
#     N = size(first(As), 1)
#     sums = zeros(T, length(As))
#     thread_state = OncePerThread(() -> Matrix{T}(undef, N, N))
#
#     Threads.@threads for i ∈ eachindex(As, Bs)
#         C = thread_state()   # per-thread buffer; races if the task yields mid-iteration
#         mul!(C, As[i], Bs[i])
#         sums[i] = sum(C)
#     end
#     sums
# end

Copy the code, keep the reference (and the Git commit hash); it’s MIT-licensed and you have no dependencies. That’s why we have MIT :wink: I am also getting old and tired of my software packages breaking from one day to the next because dependencies get updated and break my code (you cannot rely on SemVer), so I am more and more the copy-guy, and I get rid of anything that can potentially make me do work just so that my (finished) software keeps doing the same thing as before :laughing:

This works until your mul! runs on a GPU / asynchronously, or tries to acquire a lock (for example because of an innocuous-looking debug print that acquires an IO lock), causing the task to yield and be re-scheduled onto a different thread. Then you have a nice race condition: your task suspends, the thread can be re-used by a different task while the matmul is still running, and that task gets the same C.

OncePerThread, OncePerTask, task-local storage, scoped variables, etc. are very suspicious; don’t use them if you can avoid it. There are cases where passing context via ordinary function parameters sucks, like logging; then sure, use these tools. But that should be rare.

Nothing wrong with the one simple pattern that always works:

import ChunkSplitters
using LinearAlgebra: mul!

function matmulsums_opt(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    @sync for (tid, chunk) in ChunkSplitters.index_chunks(1:length(As); n=Threads.nthreads())
        Threads.@spawn begin
            C = Matrix{T}(undef, N, N)   # one fresh buffer per task/chunk
            for i in chunk
                mul!(C, As[i], Bs[i])
                sums[i] = sum(C)
            end
        end
    end
    sums
end

Yeah, vendoring is an alternative to accepting the dep. But this specific functionality is so basic and ubiquitously useful that it should ship with the language.

PS. To clarify my point,

julia> function foo()
           t = Threads.threadid()
           print("aah")
           tt = Threads.threadid()
           @assert t == tt
       end

julia> rs = []; for i = 1:1000 push!(rs, Threads.@spawn foo()) end;
aahaaha....
julia> rs
1000-element Vector{Any}:
 Task (failed) @0x00007f7b119b13c0
 Task (failed) @0x00007f7b119b14b0
 ⋮
 Task (failed) @0x00007f7b11bdfd00
 Task (done) @0x00007f7b11bdfdf0
 Task (failed) @0x00007f7b11bdfee0
 Task (failed) @0x00007f7b10f48010

You’re right, I should have thought more about that before posting. OncePerThread is the wrong primitive to use here.

I do agree about OncePerThread, but I find it a lot harder to agree with your skepticism around OncePerTask / task-local.
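
To make that concrete, here is a sketch of the matmulsums example with OncePerTask instead, assuming the Julia 1.12 API where the OncePerTask object is callable and initializes one value per calling task. Since a task keeps its identity even when it migrates across threads, the yield-inside-mul! scenario above can no longer alias buffers:

using LinearAlgebra: mul!

function matmulsums_pertask(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    buffer = OncePerTask(() -> Matrix{T}(undef, N, N))  # assumed 1.12 API: one value per task
    Threads.@threads for i in eachindex(As, Bs)         # one task per chunk, one buffer per task
        C = buffer()
        mul!(C, As[i], Bs[i])
        sums[i] = sum(C)
    end
    sums
end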

Ultimately, I just think it really is important to provide useful abstractions over these sorts of loops, but I also do agree that it’s really important to get the abstractions right.

There’s definitely a lot of room for just using ChunkSplitters directly (or its shittier cousin, Iterators.partition), and that is the best option for having more transparent code with maximal control.

I think though that there’s a very strong desire from users for an abstraction that says “parallelize this loop, and do the right thing to give me proper caching of a re-usable buffer”.
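
For what it’s worth, OhMyThreads’ @tasks macro already offers roughly that abstraction. Here is a sketch from memory of its @local task-local-value syntax; treat the exact spelling as an assumption and check the OhMyThreads docs:

using OhMyThreads: @tasks
using LinearAlgebra: mul!

function matmulsums_tasks(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    @tasks for i in eachindex(As, Bs)
        @local C = Matrix{T}(undef, N, N)  # assumed OhMyThreads syntax: one buffer per task
        mul!(C, As[i], Bs[i])
        sums[i] = sum(C)
    end
    sums
end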

My experience has been that if we don’t give users good options, and instead tell them to write big, verbose, multi-tiered iteration strategies involving index_chunks, @spawn and fetch, we’re just going to keep getting pouty threads like this one, where people are looking for some trick to get the state[threadid()] hack working again.

Why is OhMyThreads.jl slow here? I would expect it not to add much overhead.

It really just depends on what the OP was doing. It should always be possible to configure OhMyThreads to be as fast as the pattern he was using originally, but it’s certainly also possible to pick some heavyweight option that ends up causing a lot of overhead.

For instance, if you use task-local storage but then spawn way more tasks than you actually need, that’ll cause a lot of unnecessary overhead.
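
A hedged sketch of what I mean, assuming OhMyThreads’ @set ntasks setting (the keyword name may differ across versions): capping the task count means the task-local buffer is only set up a handful of times instead of once per spawned task.

using OhMyThreads: @tasks

results = zeros(10^4)
@tasks for i in 1:10^4
    @set ntasks = Threads.nthreads()   # assumed setting: cap the number of spawned tasks
    @local buf = zeros(1000)           # allocated once per task, not once per i
    buf .= i
    results[i] = sum(buf)
end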

Yes, we are very much in agreement.

I just think that a Base variant of ChunkSplitters.index_chunks, after some bike-shedding and with tutorials, would solve the issue for most people. Most use cases really don’t need anything except a chunkForThreads(ran::UnitRange), maybe enumerate, and @sync + @spawn.

We simply need to provide the tools in Base and teach people that threaded loops must become double loops: an outer loop over chunks and an inner loop over the elements of each chunk; initialization of re-usable state obviously goes inside the outer loop (inside the spawn) and outside the inner loop.

Heck, we could probably even cook up something that looks like

@sync for chunk in adaptive_chunks(worklist)
    @spawn for item in chunk
        # ... do something
    end
end

that implements work-stealing (i.e. if the time taken per item is very heterogeneous, then chunks can be split again later; for that, the iterator over the items of a chunk would be stateful and use atomic updates to indicate its progress).
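
The crudest version of that idea skips re-splitting entirely and just lets every task pull the next item from a shared atomic counter. A minimal sketch (atomic_foreach is a made-up name, and this is plain dynamic load balancing rather than true work-stealing; per-item atomics cost more than chunked iteration):

function atomic_foreach(f, ran::AbstractUnitRange; ntasks=Threads.nthreads())
    next = Threads.Atomic{Int}(first(ran))
    @sync for _ in 1:ntasks
        Threads.@spawn while true
            i = Threads.atomic_add!(next, 1)   # fetch-and-add; returns the old value
            i > last(ran) && break
            f(i)   # a slow item delays only this task, not a whole fixed chunk
        end
    end
end

atomic_foreach(i -> (sleep(0.001 * isodd(i)); print(i, ' ')), 1:20)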

(But doing the chunkForThreads by hand, with all the divrem logic, will produce off-by-one errors for normal people. I know that I would introduce multiple different off-by-one errors until I exhaustively tested the code, and then I’d never touch it again.)
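
For the record, here is one correct version of that divrem logic (chunk_for_threads being the hypothetical Base function discussed above); it reproduces the smoother chunk lengths that the index_chunks comparison below shows:

# Split `ran` into `n` contiguous chunks whose lengths differ by at most one.
function chunk_for_threads(ran::AbstractUnitRange, n::Integer)
    q, r = divrem(length(ran), n)   # the first r chunks get q + 1 elements
    chunks = Vector{UnitRange{Int}}(undef, n)
    lo = first(ran)
    for k in 1:n
        len = q + (k <= r)
        chunks[k] = lo:(lo + len - 1)
        lo += len
    end
    chunks
end

length.(chunk_for_threads(1:100, 7))   # [15, 15, 14, 14, 14, 14, 14]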

What’s the advantage of ChunkSplitters.index_chunks(x; n) over Iterators.partition(eachindex(x), cld(length(x), n))? Could we just add a new length=n keyword argument to Iterators.partition so that it does the division for you? And make sure @threads supports PartitionIterator over indexable collections (e.g. by adding a getindex method for PartitionIterator)?

It just seems like beefing up Iterators.partition as needed would be easier than bike-shedding a whole new API.
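
For illustration, a sketch of that getindex for vectors; this reaches into PartitionIterator’s internals (the fields c and n), so it’s an assumption about the current layout rather than public API:

# Hypothetical: random access to the k-th block of Iterators.partition over a
# vector, so that e.g. Threads.@threads could index into it. Bounds checks omitted.
function Base.getindex(p::Base.Iterators.PartitionIterator{<:AbstractVector}, k::Integer)
    lo = (k - 1) * p.n + 1
    hi = min(k * p.n, length(p.c))
    view(p.c, lo:hi)   # matches the views the iterator itself yields
end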

I think a different name, rather than keyword arguments, would be easier to remember. These are really two different jobs!

It gives a smoother partition. That’s good for threading and bad for SIMD / branch prediction. Different job, different tool.

julia> x=1:100;n=7; length.(collect(Iterators.partition(eachindex(x), cld(length(x), n))))
7-element Vector{Int64}:
 15
 15
 15
 15
 15
 15
 10
julia> length.(collect(ChunkSplitters.index_chunks(x; n)))
7-element Vector{Int64}:
 15
 15
 14
 14
 14
 14
 14

Perhaps people would get less confused if Threads.@threads were renamed to Threads.@tasks.

Even if the old name has to be maintained for backwards compatibility, adding @tasks as an alias for @threads, and then recommending @tasks as the preferred usage in the @threads docstring, would go a good way towards making @tasks the more prevalent option in the ecosystem.

Indeed. However, renaming alone won’t cut it. It needs to provide more options, like @tasks from OhMyThreads.jl. (That’s why we created it.)
