After looking at https://github.com/JuliaLang/julia/pull/55793 and the Threads.threadid() mess, it looks to me like PerThread / PerTask don’t really cut it.
PerTask was possible before, but has the disadvantage that you get one value per task, which is bad if the values are heavy or you have many tasks.
PerThread was possible before as well, via Threads.threadid(), with the well-known issue around task migration between threads. PerThread does nothing to alleviate that – basically because it uses Threads.threadid() under the hood.
Imagine code like
get!(() -> (yield(); #= some calculation =#; nothing), caches[Threads.threadid()]::Dict{String, Any}, "someKey")
Oops: the thread id changed during the get!-call, and now we have a data race on the dictionary (probably possible to corrupt memory / pop a shell).
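To see the hazard concretely, here is a small sketch (function name is mine, nothing from the PR) showing that a task’s Threads.threadid() need not be stable across a yield point:

```julia
# Sketch: a task's thread id can change across a yield, so any index
# computed from Threads.threadid() before the yield may be stale after it.
function threadid_stable_across_yield()
    before = Threads.threadid()
    yield()                      # the scheduler may move this task to another thread
    after = Threads.threadid()
    return before == after       # not guaranteed to be true!
end
```

With more than one thread in the pool, this can return false – and that is exactly the situation in which caches[Threads.threadid()] ends up indexing the wrong slot mid-operation.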
I think there is a justified desire for the following behavior (call it reusable-task-local):
- If you grab your reusable-task-local thing, then it’s yours. If you ask again, you will get the same object, and it will not be handed out to another task until your task is done.
- If a task is done, then the reusable-task-local thing will be reused for another task.
- When grabbing your task-local-thing, there should be a fast path (you have already grabbed your thing and have not been migrated since). That should be inlineable and involve no active synchronization (i.e. a load-acquire is fine, a mutex shared between tasks is not!).
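To make the desired behavior concrete, usage could look roughly like this (ReusableStableThreadLocal and its constructor are hypothetical names for illustration, not anything in the PR):

```julia
# Hypothetical usage sketch of the "reusable-task-local" behavior above.
cache = ReusableStableThreadLocal{Dict{String, Any}}(() -> Dict{String, Any}())

Threads.@spawn begin
    d = cache[]            # grab: this Dict is ours until the task is done
    d["someKey"] = 42
    @assert cache[] === d  # asking again returns the same object, even after migration
end
# Once that task finishes, its Dict may be handed out to a later task for reuse.
```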
What do you think about these desired behaviors?
I think that should be quite possible to build.
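For concreteness, the backing storage could be laid out roughly like this (a hypothetical sketch; the type and field names are mine):

```julia
# Hypothetical layout: one slot per thread, fixed at construction time.
mutable struct ReusableStableThreadLocal{T}
    holders::Vector{Task}  # holders[tid]: task that currently owns thread tid's slot
    items::Vector{T}       # items[tid]: the reusable item sitting in that slot
    lock::ReentrantLock    # guards the slow path
end
```

One trick to avoid a Union{Task, Nothing} element type: initialize holders with already-finished dummy tasks, so the very first acquisition on each thread goes through the common-case branch below.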
A fast read-path can be done by something like
```julia
function Base.getindex(storage::ReusableStableThreadLocal{T}) where {T}
    task = current_task()
    tid = Threads.threadid()
    holders = storage.holders
    items = storage.items
    holder = holders[tid]
    if task === holder
        # Very fast path: we already grabbed the item and have not been migrated!
        return items[tid]
    elseif Base.istaskdone(holder) && !haskey(task_local_storage(), storage)
        # Common case: a previous task finished on this thread, and we have not
        # already grabbed an item elsewhere, so we get to inherit its item.
        holders[tid] = task
        item = items[tid]
        task_local_storage(storage, item)  # remember it, so we return the same item after migration
        return item
    else
        # Slow path, may involve a mutex. Will set up the conditions such that
        # the next invocation can probably hit the fast path.
    end
end
```
In the above construct, probably no atomics are required on the fast path and the common-case path. If there is no task migration (e.g. because the tasks never yield), the slow path is never hit.
PS. I think in order to support resizing of the threadpool, we may need atomic loads for holders = storage.holders and items = storage.items, and maybe a lock on the “common case”. I’m not sure how well we currently support atomic loads of 2 x object-ref?
Theoretically that should cost the same as non-atomic loads: all modern-ish arm64 and amd64 chips have zero-cost 16-byte aligned atomic loads and stores. (Funnily enough, Intel/AMD got that feature retroactively – all existing chips ever made had these loads/stores atomic in practice, and years later the spec was updated to guarantee that behavior. So many cycles burned in the meantime…)
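If 2 x object-ref atomic loads turn out to be awkward, one way to sidestep the question entirely (my sketch, not from the PR) is to bundle both arrays behind a single @atomic reference, so resizing swaps one pointer with release/acquire ordering:

```julia
# Sketch: bundle both arrays into one immutable pair so that a single
# load-acquire of one object reference observes a consistent snapshot of both.
struct Slots{T}
    holders::Vector{Task}
    items::Vector{T}
end

mutable struct ReusableStableThreadLocal2{T}
    @atomic slots::Slots{T}   # resized by replacing the whole pair under a lock
end

function snapshot(storage::ReusableStableThreadLocal2{T}) where {T}
    s = @atomic :acquire storage.slots  # one acquire-load covers both arrays
    return s.holders, s.items
end
```

Resizing then becomes: take the lock, copy into larger arrays, and store a fresh Slots pair with @atomic :release – readers never see mismatched holders/items.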