Thanks for the explanation. But the documentation doesn't mention either of these two options. Perhaps it needs an update?
I guess it could, maybe even should. These packages are much younger than this part of the docs.
Yes, it works fine in Julia 1.12-rc1 if you use Threads.nthreads(:default) + Threads.nthreads(:interactive) buffers. But it should not be mentioned in the docs, because the pattern is discouraged.
Yeah, I think the current state sucks a lot: convenience functions for "simply thread this damn loop" should be part of Base.
That being said, the old model via Threads.threadid() really sucks as well: the @threads macro does both the splitting of input into chunks and all the scheduling. However, one often wants temp variables that live over an entire chunk instead of a single item; that's what Threads.threadid() was used for.
Suppose you have code like
ran = 1:100
Threads.@threads for i in ran
    tid = Threads.threadid()
    # do something with i, tid
end
Then you can instead use
ran = 1:100
import ChunkSplitters
@sync for (tid, chunk) in enumerate(ChunkSplitters.index_chunks(ran; n=Threads.nthreads()))
    Threads.@spawn for i in chunk
        # do something with i, tid
    end
end
That's not complicated at all! Except that you must adopt a dependency, and remember the ChunkSplitters API.
And, truth be told, ChunkSplitters as a dependency sucks. It is well-maintained, small and has no further dependencies; but the code is just on the border of "too simple/small to pay for software supply chain hassle: do it yourself / vendor" and "so annoying to get right by hand that bugs will be prevalent and one is sorely tempted to make do with Threads.@threads".
Since Threads.@threads is, for legacy reasons, part of Base/stdlib, we should definitely subsume some of ChunkSplitters' functionality into Base/stdlib. This does not even introduce real new functionality (Threads.@threads already has chunk-splitting logic; that just needs to be exported separately). Otherwise people will continue to use the terrible legacy Threads.@threads.
I definitely think we can and should upstream a lot of ChunkSplitters.jl and OhMyThreads.jl to Base (preferably a Base.Tasks submodule which is separate from Base.Threads, and then we can work on deprecating Base.Threads, or at least pointing people away from it).
The main challenge is just the usual bikeshedding that comes with making a major contribution to Base. We'll need to simplify a lot of these APIs and figure out which experiments were successful and which were unnecessary. There'll probably also be a lot of naming concerns and the usual stuff like that.
The problem is just finding someone with the time and will to do it.
ping @ufechner7, it'd be really helpful to me to know what you tried, or ideally, maybe you could show me some relevant benchmark inputs to your getMeasurementsP function; then I can try out some different options and suggest them.
If you don't want to say what you tried, and don't want to show an MWE benchmark example, I can give at least one more suggestion that I think has not yet been mentioned (and needs to be added to the OhMyThreads docs): v1.12 has new OncePerThread (and OncePerTask) objects that should be fast and simple to use.
Here's an example of what the OhMyThreads.jl documentation matmulsums example would look like using OncePerThread:
# don't use
# function matmulsums_opt(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
#     N = size(first(As), 1)
#     sums = zeros(T, length(As))
#     thread_state = OncePerThread(() -> Matrix{T}(undef, N, N))
#
#     Threads.@threads for i in eachindex(As, Bs)
#         C = thread_state()
#         mul!(C, As[i], Bs[i])
#         sums[i] = sum(C)
#     end
#     sums
# end

And, truth be told, ChunkSplitters as a dependency sucks. It is well-maintained, small and has no further dependencies; but the code is just on the border of "too simple/small to pay for software supply chain hassle: do it yourself / vendor" and "so annoying to get right by hand that bugs will be prevalent and one is sorely tempted to make do with Threads.@threads".
Copy the code, keep the reference (and also the Git commit hash); it's MIT, and you have no dependencies. That's why we have MIT. I am also getting old and tired of my software packages breaking from one day to the next because dependencies are updated and break my code (you cannot rely on SemVer), so I am more and more the copy guy, and I get rid of anything that can potentially make me do work, so that my (finished) software still does the same thing as before.

Here's an example of what the OhMyThreads.jl documentation matmulsums example would look like using OncePerThread:
This works until your mul! runs on a GPU / async, or tries to acquire a lock (for example because of an innocuous-looking debug println that acquires an IO lock, causes the task to yield, and then gets re-scheduled with a different thread id); then you have a nice race condition: your task will suspend, the thread can be re-used for a different task while the matmul is still running, and that task then gets the same C.
OncePerThread, OncePerTask, task-local, scoped variables, etc. are very suspicious: don't use them if you can avoid it. There are cases where passing context via ordinary method parameters sucks, like logging; then sure, use these tools. But that should be rare.
Nothing wrong with the one simple pattern that always works:
import ChunkSplitters
using LinearAlgebra: mul!

function matmulsums_opt(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    @sync for chunk in ChunkSplitters.index_chunks(1:length(As); n=Threads.nthreads())
        Threads.@spawn begin
            C = Matrix{T}(undef, N, N)  # one buffer per task/chunk
            for i in chunk
                mul!(C, As[i], Bs[i])
                sums[i] = sum(C)
            end
        end
    end
    sums
end

Copy the code, keep the reference (and also the Git commit hash); it's MIT and you have no dependencies.
Yeah, vendoring is an alternative to accepting the dep. But this specific functionality is so basic and ubiquitously useful that it should be shipped by the language.
PS. To clarify my point,
julia> function foo()
           t = Threads.threadid()
           print("aah")
           tt = Threads.threadid()
           @assert t == tt
       end

julia> rs = []; for i = 1:1000 push!(rs, Threads.@spawn foo()) end;
aahaaha....

julia> rs
1000-element Vector{Any}:
 Task (failed) @0x00007f7b119b13c0
 Task (failed) @0x00007f7b119b14b0
 ⋮
 Task (failed) @0x00007f7b11bdfd00
 Task (done) @0x00007f7b11bdfdf0
 Task (failed) @0x00007f7b11bdfee0
 Task (failed) @0x00007f7b10f48010

This works until your mul! runs on a GPU / async, or tries to acquire a lock (for example because of an innocuous-looking debug println that acquires an IO lock, causes the task to yield, and then gets re-scheduled with a different thread id); then you have a nice race condition: your task will suspend, the thread can be re-used for a different task while the matmul is still running, and that task then gets the same C.
You're right, I should have thought more about that before posting. OncePerThread is the wrong primitive to use here.

OncePerThread, OncePerTask, task-local, scoped variables, etc. are very suspicious: don't use them if you can avoid it. There are cases where passing context via ordinary method parameters sucks, like logging; then sure, use these tools. But that should be rare.
I do agree about OncePerThread, but I find it a lot harder to agree with your skepticism around OncePerTask / task-local.
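For instance, something like this should be safe; a minimal sketch, assuming the v1.12 OncePerTask API (matmulsums_pertask is a made-up name). Because the buffer is keyed to the task rather than the thread, a yield inside mul! cannot hand the buffer to another task:
using LinearAlgebra: mul!

function matmulsums_pertask(As::Vector{Matrix{T}}, Bs::Vector{Matrix{T}}) where {T}
    N = size(first(As), 1)
    sums = zeros(T, length(As))
    # one buffer per *task*: if a task yields mid-mul!, no other task can see its C
    buffer = OncePerTask{Matrix{T}}(() -> Matrix{T}(undef, N, N))
    Threads.@threads for i in eachindex(As, Bs)
        C = buffer()
        mul!(C, As[i], Bs[i])
        sums[i] = sum(C)
    end
    sums
end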
Ultimately, I just think it really is important to provide useful abstractions over these sorts of loops, but I also do agree that it's really important to get the abstractions right.
There's definitely a lot of room for just using ChunkSplitters directly (or its shittier cousin, Iterators.partition), and that is the best option for having more transparent code with maximal control.
I think, though, that there's a very strong desire from users for an abstraction that says "parallelize this loop, and do the right thing to give me proper caching of a re-usable buffer".
My experience has been that if we don't give them good options, and tell them they need to write big verbose multi-tiered iteration strategies involving index_chunks, @spawn and fetch, they're just going to keep getting pouty threads like this one where people are just looking for some trick to get the state[threadid()] hack working again.
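For illustration, this is roughly what that abstraction looks like in OhMyThreads.jl today; a sketch based on its @tasks and @local macros (see the OhMyThreads docs for the exact semantics; matmulsums_omt is a made-up name):
using OhMyThreads: @tasks
using LinearAlgebra: mul!

function matmulsums_omt(As, Bs)
    N = size(first(As), 1)
    sums = zeros(length(As))
    @tasks for i in eachindex(As, Bs)
        # @local gives a task-local buffer, re-used across all iterations a task handles
        @local C = Matrix{Float64}(undef, N, N)
        mul!(C, As[i], Bs[i])
        sums[i] = sum(C)
    end
    sums
end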
Why is OhMyThreads.jl slow here? I would expect it doesn't add much latency.
It really just depends on what the OP was doing. It should always be possible to configure OhMyThreads to be as fast as the pattern he was using originally, but it's certainly also possible to use some heavy option that ends up causing a lot of overhead.
For instance, if you use task-local storage, but then spawn way more tasks than you actually need, that'll cause a lot of unnecessary overhead.
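As a sketch of that failure mode (assuming OhMyThreads' tforeach and its ntasks keyword; work and xs are placeholders):
using OhMyThreads: tforeach

# reasonable: a few long-lived tasks, each re-using its task-local state
tforeach(eachindex(xs); ntasks = Threads.nthreads()) do i
    work(xs[i])
end

# heavy: one task per item, so any task-local state is set up length(xs) times
tforeach(eachindex(xs); ntasks = length(xs)) do i
    work(xs[i])
end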

My experience has been that if we don't give them good options,
Yes, we are very much in agreement.
I just think that a Base variant of ChunkSplitters.index_chunks with some bike-shedding and tutorials would solve the issue for most people. Most use cases really don't need anything except for chunkForThreads(ran::UnitRange), maybe enumerate, and @sync + @spawn.
We simply need to provide the tools in Base, and teach people that threaded loops must become double loops: one outer loop over chunks and an inner loop over the elements of the chunk; initialization of re-usable state obviously goes inside the outer loop, inside the spawn, and outside the inner loop.
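A minimal sketch of that shape in plain Base (chunkForThreads is the hypothetical name from above, here naively defined via Iterators.partition; this gives the less smooth split discussed further down):
# naive Base-only sketch of the hypothetical chunkForThreads
chunkForThreads(ran::AbstractUnitRange) =
    Iterators.partition(ran, cld(length(ran), Threads.nthreads()))

@sync for (tid, chunk) in enumerate(chunkForThreads(1:100))
    Threads.@spawn begin
        # re-usable state lives here: inside the spawn, outside the inner loop
        for i in chunk
            # ... do something with i and tid
        end
    end
end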
Heck, we could probably even cook up something that looks like
@sync for chunk in adaptive_chunks(worklist)
    @spawn for item in chunk
        # ... do something
    end
end
that implements work-stealing (i.e., if the time taken per item is very heterogeneous, then chunks can be split again later; for that, the iterator over items in the chunk would be stateful and use atomic updates to indicate its progress).
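A minimal sketch of the degenerate version of that idea, with one shared atomic cursor instead of real range re-splitting (atomic_foreach is a made-up name; proper work-stealing would hand out sub-ranges, not single items):
function atomic_foreach(f, worklist; ntasks = Threads.nthreads())
    # shared cursor: each task atomically claims the next unprocessed index
    cursor = Threads.Atomic{Int}(firstindex(worklist))
    @sync for _ in 1:ntasks
        Threads.@spawn while true
            i = Threads.atomic_add!(cursor, 1)  # returns the old value
            i > lastindex(worklist) && break
            f(worklist[i])
        end
    end
    nothing
end
Per-item atomics are too much overhead when items are cheap, which is why a real implementation would claim chunks and split them lazily.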
(But doing the chunkForThreads by hand, with all the divrem, will produce off-by-one errors for normal people. I know that I would introduce multiple different off-by-one errors until I exhaustively tested the code, and then I'd never touch it again.)

I just think that a Base variant of ChunkSplitters.index_chunks with some bike-shedding and tutorials would solve the issue for most people.
What's the advantage of ChunkSplitters.index_chunks(x; n) over Iterators.partition(eachindex(x), cld(length(x), n))? Could we just add a new length=n keyword argument to Iterators.partition so that it does the division for you? And make sure @threads supports PartitionIterator over indexable collections (e.g. by adding a getindex method for PartitionIterator)?
It just seems like beefing up Iterators.partition as needed would be easier than bike-shedding a whole new API.

It just seems like beefing up Iterators.partition as needed would be easier than bike-shedding a whole new API.
I think a different name, instead of keyword arguments, would be easier to remember. These are really two different jobs!

What's the advantage of ChunkSplitters.index_chunks(x; n) over Iterators.partition(eachindex(x), cld(length(x), n))?
It gives a smoother partition. That's good for threading, and bad for SIMD / branch prediction. Different job, different tool.
julia> x = 1:100; n = 7; length.(collect(Iterators.partition(eachindex(x), cld(length(x), n))))
7-element Vector{Int64}:
 15
 15
 15
 15
 15
 15
 10

julia> length.(collect(ChunkSplitters.index_chunks(x; n)))
7-element Vector{Int64}:
 15
 15
 14
 14
 14
 14
 14
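The smoother split is just divrem bookkeeping; a minimal sketch (smooth_chunks is a made-up name, and this is exactly the kind of by-hand code the earlier post warns tends to grow off-by-one errors):
function smooth_chunks(r::AbstractUnitRange, n::Integer)
    q, rem = divrem(length(r), n)
    chunks = Vector{UnitRange{Int}}(undef, n)
    lo = first(r)
    for k in 1:n
        len = q + (k <= rem)  # the first `rem` chunks get one extra element
        chunks[k] = lo:(lo + len - 1)
        lo += len
    end
    chunks
end
For r = 1:100 and n = 7 this reproduces the 15/15/14/14/14/14/14 split shown above.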
Perhaps people would get less confused if Threads.@threads were renamed to Threads.@tasks.
Even if the old name has to be maintained for backwards compatibility, adding @tasks as an alias for @threads, and then recommending @tasks as the preferred usage in @threads's docstring, would go a good way towards making @tasks the more prevalent option in the ecosystem.
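Mechanically, the alias itself would be cheap; a sketch of what a forwarding macro could look like (illustrative only, not actual Base code):
# hypothetical alias: @tasks just forwards to Threads.@threads
macro tasks(ex)
    esc(:(Base.Threads.@threads $ex))
end

@tasks for i in 1:8
    println((i, Threads.threadid()))
end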

Perhaps people would get less confused if Threads.@threads were renamed to Threads.@tasks.
Indeed. However, renaming alone won't cut it. It needs to provide more options, like @tasks from OhMyThreads.jl. (That's why we created it.)