Lock free sorted linked list

Hi!

I’m trying to implement a lock free sorted linked list as shown in 9.8 Nonblocking synchronization of the book titled The art of multiprocessor programming.

I need something like AtomicMarkableReference or maybe atomically update both the reference and the mark flag.

Is there a way to achieve this without using a spinlock?

3 Likes

This is a very good question.

The canonical way of doing something like this is to steal bits from the pointer, and then use 64 bit CAS. (32 bit systems need different considerations, I’m only thinking about 64 bit now)

There are two issues with that:

  1. That requires runtime support: The garbage collector needs to be taught about that for the mark phase. There was a project to bring GC plugins for custom types, but afaiu that never got in.

  2. That may require codegen support, depending on your performance requirements: Basically a better variant of unsafe_pointer_to_objref (i.e. one that can be taught to trust the user-provided type and that can be annotated with some info about GC-rooting).

What I would consider doing is something like the following:

mutable struct Container_for_cmpxchg16b{T} #T must be reference-type
  flag::UInt64
  actual_value::T 
  shadow_value::Any
end

Then you use the cmpxchg16b instruction to simultaneously CAS flag and actual_value, via pointer-slinging / pointer_from_objref(container). Cf Threads.atomic_cas! for the code you need in the llvmcall.

There is a second issue: There is a write-barrier for generational garbage collection. You can see the write barrier by inspecting code for

julia> mutable struct C{T}
       x::T
       end

julia> g(c, x) = begin c.x=x; 1 end
g (generic function with 3 methods)

julia> @code_native g(C("a"), "b")
...

What this boils down to is: If your container is old-generation and you use CAS to plug in a young-generation value, then some GC rooting is required. I’m sure there are better ways using the C-api, but my boring naive approach would be to follow a successful CAS-write up with a non-atomic write to shadow_value.

Now shadow_value and actual_value may diverge, due to lack of atomicity. The worst-case effect of that should be that the object pointed at by shadow_value has artificially extended lifespan, i.e. a temporary memory leak.

Sorry that this is so ugly :frowning:

Maybe somebody else has better ideas?

PS. You can of course do as java7 claims to have been doing, i.e.

Implementation note: This implementation maintains markable references by creating internal objects representing “boxed” [reference, boolean] pairs.

I haven’t looked into current jvm implementations. Eating an entire additional allocation + memory indirection seems very expensive to me, and completely defies the point of the data type. (but yes, without dedicated runtime/compiler support, this is what they must do)

PPS. Yep, java still does it the boxed way. You can relatively directly port jdk/src/java.base/share/classes/java/util/concurrent/atomic/AtomicMarkableReference.java at 64bbae75121ccf80c02a0960e2db62eb558052e6 · openjdk/jdk · GitHub to julia.

Have you looked into using @atomic to build an object like AtomicMarkableReference? You’ll probably have to roll your own functions like they do instead of direct-object access through the usual getproperty/setproperty! to ensure correctness, but that should be in the realm of possibility.

1 Like

I would second this suggestion.

mutable struct AtomicPair{T}
    @atomic v::Pair{T, Bool}
end

This will do the right thing without resorting to pointer-tagging. Outside of doing some unsafe shenanigans this is likely the best you can get without adding a special type to the language.

On Julia 1.12+ you will get 16-byte atomics:

julia> code_llvm((AtomicPair{Int},)) do x
          x.v
       end
; Function Signature: var"#11"(Main.AtomicPair{Int64})
;  @ REPL[9]:2 within `#11`
define void @"julia_#11_695"(ptr noalias nocapture noundef nonnull sret({ i64, i8 }) align 8 dereferenceable(16) %sret_return, ptr noundef nonnull align 16 dereferenceable(16) %"x::AtomicPair") #0 {
top:
; ┌ @ Base.jl:49 within `getproperty`
   %0 = load atomic i128, ptr %"x::AtomicPair" unordered, align 16
   store i128 %0, ptr %sret_return, align 8
   ret void
; └
}

On prior versions there will be a lock involved.

1 Like

Thanks! This looks promising I’ll give it a try.

I did look at @atomic but I had no idea how to write the reference and the marked bit a single atomic operation.

Thank you for your answer! This seems to be a bit complicated to me. I think I’ll try the other approach @atomic Pair{T, Bool} and see if that works.

:t_rex: quote=“melevy, post:7, topic:115157”]
I think I’ll try the other approach @atomic Pair{T, Bool} and see if that works.
[/quote]

I think this will end up using locks in many cases.

Can you @vchuravy give more detailed advice?

Especially comparing boxed reference-CAS to your proposal? The difference is that I use a mutable struct for the pair, in order to ensure that it is not inlined into the AtomicPair, and the compiler doesn’t get any ideas about introducing a lock.

The boxed version would be

julia> mutable struct CPair{T,S}
       const x::T 
       const y::S
       end

julia> mutable struct AtomicPairBoxed{T,S}
       @atomic current:: CPair{T,S}
       end

julia> (::Type{AtomicPairBoxed})(x,y) = AtomicPairBoxed(CPair(x,y))
julia> function atomicCas!(ap::AtomicPairBoxed,  xexpected, yexpected, xnew, ynew)
       current = @atomic ap.current
       (current.x == xexpected && current.y == yexpected) || return (current.x, current.y, false)
       replacement = CPair(xnew, ynew)
       while true
       (newcurrent , success) = replaceproperty!(ap, :current, current, replacement, :sequentially_consistent, :sequentially_consistent)
       success && return (current.x, current.y, true)
       #ok, we need to loop -- somebody else replaced the value under us. Because of ABA we can't just fail, need to retry.
       current = newcurrent
       (current.x == xexpected && current.y == yexpected) || return (current.x, current.y, false)
       end
       end

Does the generated code use atomic load even if the type parameter is a struct instead of Int?

BTW, I only see Upcoming release: v1.11.0-beta2 (May 29, 2024). Where’s this new 1.12+ version?

Also, I checked and on my version 1.10.4 the AtomicPair{T} uses locks.

So to clarify why locks are often really bad:

If there is a lock, then read access will need to take the lock, which is a memory write. Therefore, different read-only threads will have contention.

How catastrophic that is will depend on your workload.

The above boxed variant uses the typical pattern for performance: Copy-on-write. This always requires allocation for writes, and always requires garbage collection, irregardless of the language (e.g. the linux kernel has GC for such uses!).

The exception is if your items are small enough that your architecture supports atomics on it. That means 16 bytes on all reasonable systems (I think we have dropped support for other archs?). In fact, atomicity of read/writes is free on modern x86 (aligned sse2 16 byte load/store afaiu was always atomic in practice, and intel+amd recently updated the spec to make this guaranteed).

Since you want objref + bool, you are in the very special situation where cow / indirection is not needed, but is potentially a pain for the compiler/language (important difference to @vchuravy 's example: It’s a tuple that is mixed between bitstype and objref).

I am very unhappy with the julia behavior of @atomic in structs. In my opinion this should never emit a lock – either outline / copy-on-write, or throw an error if we don’t want to outline and cannot support inline atomics (e.g. because it’s more than 16 bytes, or it’s more than 8 bytes on an armv7 variant that lacks support, or …).

The julia behavior is similar to C++ behavior, which I think sucks bigtime. Most of the time, julia will throw up on bullshit (stuff that cannot be done on the hardware) and require the user to handle it. It should do the same here.

I agree with you regarding silently using locks for @atomic. My code shows no performance gain when using a lock free linked list using @atomic instead of using a fine grained locked linked list. I would have never thought that the reason is that both use locks.

There is no release yet, but if you use juliaup it should be available as the nightly version.

Thanks! I switched to the nightly build and it looks like @atomic uses ‘load atomic’ at llvm level.

Unfortunately my code crashes with the nightly version:

A reproducible example for the crash would be appreciated!

Here is the code that crashes. You may need to give it a few tries though. It should crash in a few seconds, but sometimes it just hangs. I have a 16 core machine, so that may also affect the outcome.

crash.jl (21.2 KB)

Please open an issue on the bug tracker.

Good find on this. It reproduces for me.

@melevy are you sure the segfault isn’t caused by push!ing to a vector from multiple threads?

1 Like

I’m pretty sure that there’s no concurrent push!ing to a vector. There are a few push! calls on a vector in the code but all of them are done from a single thread. The concurrent push! calls are on the LockFreeSortedLinkedList.

Accidentally I pasted some parts twice into the file when created the narrowed down code… I clean this up before submitting an issue.

5 Likes