PerThread / Threads.threadid() desired API

After looking at https://github.com/JuliaLang/julia/pull/55793 and the Threads.threadid() mess, it looks to me like PerThread / PerTask don't really cut it.

PerTask was possible before, but has the disadvantage that you have one value per task, which sucks if the values are heavy / you have many tasks.
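
For reference, a minimal sketch of that pre-existing per-task pattern (the :my_cache key and the Dict type are placeholders):

# "PerTask" via task_local_storage(): every task that calls this gets, and
# keeps, its own Dict, however heavy that is.
function get_my_cache()
    tls = task_local_storage()                       # this task's private IdDict
    get!(() -> Dict{String,Any}(), tls, :my_cache)::Dict{String,Any}
end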

PerThread was possible before as well, using Threads.threadid(), with the issue around thread migration. PerThread does nothing to alleviate that – basically because it uses Threads.threadid() under the hood.

Imagine code like

get!(()->(yield(); #= some calculation =#; nothing), caches[Threads.threadid()]::Dict{String, Any}, "someKey")

Oh, the thread id changed during the get! call, which means some other task on the original thread can now grab the very same Dict, and we have a data race in the dictionary (probably possible to corrupt memory / pop a shell).
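
To see the underlying issue in isolation, a minimal sketch (needs Julia started with more than one thread, e.g. julia -t 4; whether migration actually happens is up to the scheduler):

# Threads.threadid() is only a snapshot: after any yield point the scheduler
# may have moved the task to another thread.
t = Threads.@spawn begin
    before = Threads.threadid()
    yield()                        # yield point: migration can happen here
    after = Threads.threadid()
    (before, after)                # not guaranteed to be equal
end
@show fetch(t)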


I think there is a justified desire for the following behavior (call it reusable-task-local):

  1. If you grab your reusable-task-local thing, then it’s yours. If you ask again, you will get the same object, and it will not be handed out to another task until your task is done.
  2. If a task is done, then the reusable-task-local thing will be reused for another task.
  3. When grabbing your reusable-task-local thing, there should be a fast path (you have already grabbed your thing and have not been migrated in between). That should be inlineable and involve no active synchronization (i.e. a load-acquire is fine, a mutex shared between tasks is not!).

What do you think about these desired behaviors?
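
For concreteness, here is roughly what using such a container could look like from the caller's side. The name ReusableStableThreadLocal, its constructor, and the getindex method are made up for this sketch, not an existing API:

# Hypothetical usage of a reusable-task-local container.
const caches = ReusableStableThreadLocal{Dict{String,Any}}(() -> Dict{String,Any}())

function cached_compute(key::String)
    cache = caches[]             # behavior 1: this Dict is ours until the task ends
    get!(cache, key) do
        yield()                  # even if the task migrates here, cache stays valid
        # some calculation
        nothing
    end
end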


I think that should be quite possible to build.
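
To make the read path below concrete, here is one hypothetical layout it could sit on. The field names follow the getindex code; everything else (the constructor, the init function, the placeholder-task trick) is an assumption of this sketch:

mutable struct ReusableStableThreadLocal{T}
    holders::Vector{Task}    # holders[tid] === the task that currently owns items[tid]
    items::Vector{T}         # one reusable item slot per thread
    init::Function           # creates a fresh item when needed
    lock::ReentrantLock      # guards the slow path (and resizing, see the PS below)
end

function ReusableStableThreadLocal{T}(init) where {T}
    n = Threads.maxthreadid()    # assumes Julia >= 1.9
    # Seed each slot with an already-finished placeholder task, so the first
    # real task on a thread can take the "common case" branch of getindex.
    placeholders = [schedule(Task(() -> nothing)) for _ in 1:n]
    foreach(wait, placeholders)
    return ReusableStableThreadLocal{T}(placeholders, T[init() for _ in 1:n], init, ReentrantLock())
end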

A fast read-path can be done by something like

function Base.getindex(storage::ReusableStableThreadLocal{T}) where {T}
    task = current_task()
    tid = Threads.threadid()
    holders = storage.holders
    items = storage.items
    holder = holders[tid]
    if task === holder
        # very fast path: we already grabbed the item and have not been migrated!
        return items[tid]
    elseif Base.istaskdone(holder) && !haskey(task_local_storage(), storage)
        # common case: a previous task finished on this thread, and now we get
        # to inherit its item. Claim the slot so the next call hits the fast path.
        holders[tid] = task
        item = items[tid]
        task_local_storage(storage, item)
        return item
    else
        # slow path, may involve a mutex. Will set up the conditions such that
        # the next invocation can probably hit the fast path.
        return slow_getindex!(storage, task, tid)   # hypothetical helper, sketched below
    end
end

In the above construct, probably no atomics are required on the fast path or the common-case path. If there is no task migration (e.g. because the tasks don't yield), then the slow path is never hit.
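
And a hedged sketch of what the slow path could look like, on top of the hypothetical struct above. The helper name and the fresh-item fallback are just one possible policy, not a finished design:

function slow_getindex!(storage::ReusableStableThreadLocal{T}, task::Task, tid::Int) where {T}
    tls = task_local_storage()
    # If this task already grabbed an item earlier (possibly on another thread),
    # behavior 1 says we must hand back that same object.
    haskey(tls, storage) && return tls[storage]::T
    lock(storage.lock) do
        holder = storage.holders[tid]
        if Base.istaskdone(holder)
            # Claim this thread's slot so the next invocation can hit the fast path.
            storage.holders[tid] = task
            item = storage.items[tid]
            task_local_storage(storage, item)
            return item
        end
        # The slot is owned by a live task that migrated away; hand out a fresh
        # item instead of blocking (one possible policy among several).
        item = storage.init()::T
        task_local_storage(storage, item)
        return item
    end
end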

PS. I think in order to support resizing of the threadpool, we may need an atomic load on holders = storage.holders and items = storage.items, and maybe a lock on the “common case”.

I’m not sure how well we currently support atomic loads of 2 x object-ref?

Theoretically that should be the same price as non-atomic loads: all modern-ish arm64 and amd64 chips have zero-cost 16-byte-aligned atomic loads and stores. (Funnily enough, Intel/AMD got that feature retroactively: the existing chips already had these loads/stores atomic in practice, and years later the spec was updated to guarantee that behavior. So many cycles burned in the meantime…)
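
If the 2 x object-ref atomic turns out to be painful, one way to sidestep it entirely (a sketch, not part of the proposal above; const fields need Julia >= 1.8) is to keep both arrays behind a single atomically swapped reference, so readers need one acquire load of one pointer, and resizing builds a new pair and swaps it in under the lock:

mutable struct Slots{T}          # mutable, so a Slots is a single heap reference
    const holders::Vector{Task}
    const items::Vector{T}
end

mutable struct ReusableStableThreadLocal2{T}
    @atomic slots::Slots{T}      # readers: slots = @atomic :acquire storage.slots
    lock::ReentrantLock          # resizers build a new Slots and swap it in here
end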


Ok, it appears that we’ll need to wait for some more teething pains around @atomic to pass:

julia> mutable struct A
       @atomic x::UInt128
       end

julia> f(a)=@atomic :monotonic a.x

julia> @code_llvm f(A(1))
; Function Signature: f(Main.A)
;  @ REPL[2]:1 within `f`
define void @julia_f_1242(ptr noalias nocapture noundef nonnull sret(i128) align 16 dereferenceable(16) %sret_return, ptr noundef nonnull align 16 dereferenceable(16) %"a::A") #0 {
top:
; ┌ @ Base_compiler.jl:78 within `getproperty`
   %"a::A.x" = load atomic i128, ptr %"a::A" monotonic, align 16
   store i128 %"a::A.x", ptr %sret_return, align 16
   ret void
; └
}

julia> @code_native f(A(1))
[...]
	lock		cmpxchg16b	xmmword ptr [rsi]
[...]

This is really bad code for a machine with AVX. The code_llvm is fine, but the machine code sucks – a normal aligned 16 byte load is atomic on these machines, and llvm knows about the alignment.

Per godbolt, clang 18 emits the same (bad) code, and clang 19 does the right thing (vmovdqa instead of lock cmpxchg16b) for C++ with std::atomic.

This doesn’t really matter for the C/C++ people because they can just use immintrin.h – but it matters a lot for us, since we can’t really do that with non-isbitstype.

PS. Also per godbolt: For “generic” arm64 targets, we get the bad ldxp/stxp sequence (which is probably correct?); with -mcpu=apple-m1 we get the good ldp starting with clang 15 (with a dmb barrier if we want stronger memory ordering than monotonic). So the 16 byte atomic pain appears to be mostly x86_64, and appears to be in the process of going away.
