Hazard pointers - in C++26 vs Julia - concurrency/atomics e.g. for UInt128

Palli · January 16, 2024, 12:48pm

I hadn’t seen this concept, and wanted to let others know, and ask how it contrasts to Julia, but to answer my own question:

In a multithreaded computing environment, hazard pointers are one approach to solving the problems posed by dynamic memory management of the nodes in a lock-free data structure. These problems generally arise only in environments that don’t have automatic garbage collection.

So Julia doesn’t need it, nor likely much of the new support in C++20, 23 and coming in 26. But you CAN program in Julia at a lower level, avoiding the GC, and then you might…

One way to deal with concurrency is to only use immutables, e.g. ruling out the default Dict. but recent HAMT implementations helps with that. With no mutables (that also includes the default mutable arrays/vectors, but not default strings) there are NO bug or security issues regarding concurrency.

Running Julia as default, i.e. not with -t auto, or some number other than 1, avoids some bug issues. But concurrency is frequently wanted for speed.

Then you need to lock and/or use atomic access. Types up to U/Int64 allow atomic access with the default alignment, but it’s not clear to me with e.g. UInt128. @Elrod On all modern cached CPUs, I believe no problem, assuming the type doesn’t split across cache lines, but it doesn’t have 128-bit alignment, only 64-bit. I think that decision for Julia was done to not waste space, emulating other languages such as C++. It’s been a little hard for me to dig into if it’s atomic in Julia or other languages, as it depends on the CPU more than the lang spec. ChatGPT has npot been very helpful, basically “it depends” on the CPU. But I want a guarantee (for all (modern) CPU, and not just x86. Though even just knowing for most common, e.g. x86 and ARM would be nice.

Is the rule that you are unsafe for larger types/structs? Without explicit locking? I want to know for without, or “lock-free”. Note, I think BigFloat/BigInt are actually safe, since hidden behind a (no large-than) 64-bit pointer.

E.g. a rational of Int64, the default Rational is likely not safe. I’m asking because I’m implementing a new one, SIMD rational…

To motivate the use of hazard pointers, let us consider a problem: we want to implement a concurrent key-value map that satisfy the Write-Rarely-Read-Many (WRRM) property. WRRM maps are commonly used as caches, and in use cases such as a lookup table for currency to exchange rates where the map is looked up many times but only updated at a comparatively much slower rate.

In other words, we want to implement a multithreaded-access map interface, using a single threaded map object such as std::map.

1.4 Rationale for inclusion in C++26
This proposal is based on the experience gained from hazard pointer implementation in the Facebook Folly open-source library (since 2016) and its heavy use in production (since 2017)

Why in std:
https://www.reddit.com/r/cpp/comments/sywjf3/p2530_hazard_pointers_belong_in_the_standard/

I’m nominally a coauthor, although just because of my work on the original hazard pointer paper. The paper gives a pitch for “sooner rather than later”, but doesn’t really emphasize the answer to “why include it” […]

It was a bit too much for me to start a 5 hour video “Implementing Hazard Pointers in Rust” on a language I don’t (currently) use:

“Brief” introduction:

Elrod · January 16, 2024, 2:16pm

Threads.Atomic{UInt128} has 16 byte alignment:

julia> x = Threads.Atomic{UInt128}(4);

julia> @cl x[]
julia> @code_llvm debuginfo = :none x[]
define void @julia_getindex_306(i128* noalias nocapture noundef nonnull sret(i128) align 8 dereferenceable(16) %0, {}* noundef nonnull align 8 dereferenceable(16) %1) #0 {
top:
  %2 = bitcast {}* %1 to i128*
  %rv.i = load atomic i128, i128* %2 acquire, align 16
  store i128 %rv.i, i128* %0, align 8
  ret void
}

so our load here is align 16.

If you mark a field @atomic, the mutable struct will be padded appropriately:

julia> mutable struct Foo
           x::Int
           @atomic y::UInt128
           z::Int
       end

julia> mutable struct Bar
           x::Int
           y::UInt128
           z::Int
       end

julia> sizeof(Foo), sizeof(Bar)
(48, 32)

but the Foo struct itself still has 8-byte alignment instead of 16, defeating the benefit of padding it.
Thus, loading requires a lock/unlock:

julia> gety(x::Foo) = x.y
gety (generic function with 1 method)

julia> @cl gety(Foo(1,1,1))
julia> @code_llvm debuginfo = :none gety(Foo(1, 1, 1))
define void @julia_gety_393(i128* noalias nocapture noundef nonnull sret(i128) align 8 dereferenceable(16) %0, {}* noundef nonnull align 8 dereferenceable(48) %1) #0 {
top:
  %2 = bitcast {}* %1 to i8*
  %3 = getelementptr inbounds i8, i8* %2, i64 24
  call void @jl_lock_value({}* %1)
  %4 = bitcast i8* %3 to i128*
  %5 = load i128, i128* %4, align 8
  call void @jl_unlock_value({}* %1)
  store i128 %5, i128* %0, align 8
  ret void
}

instead of a single atomic operation.

I would suggest sticking with the primitives, and checking generated code as I did above to get an idea whether codegen was good.
You could file an issue over the alignment of Foo being too small to allow efficient atomic access (or over any other cases with suboptimal codegen you encounter).

Palli · January 16, 2024, 3:52pm

I usually trust you since you’re the expert, but I saw:

So it only seems to be the “minimum” guaranteed limit. Maybe 16 IS guaranteed on x86, however, but I want that guarantee always(?). Do you think I always have it despite that, or maybe that don’t need it?

It should be 16-byte aligned (by default), but I’m not sure it is/guaranteed on all platforms:

One reasons is that most SSE2 instructions on X86 require the data to be 128 bit aligned. This design decision would have been made for performance reasons and to avoid overly complex (and hence slow and big) hardware.

FYI, existing on x86:
https://www.felixcloutier.com/x86/lddqu

LDDQU — Load Unaligned Integer 128 Bits

I do see in Julia.h (implying it might not always be):

// int128 is 16 bytes aligned on aarch64

In jl_compute_field_offsets (MAX_ALIGN is only 8):

github.com

JuliaLang/julia/blob/b058146cafec8405aa07f9647642fd485ea3b5d7/src/datatype.c#L694


      
                      // and contains non-pointer fields of non-zero sizes.
                      haspadding = 1;
                  }
                  if (!zeroinit)
                      zeroinit = ((jl_datatype_t*)fld)->zeroinit;
                  npointers += fld_npointers;
              }
          }
          else {
              fsz = sizeof(void*);
              if (fsz > MAX_ALIGN)
                  fsz = MAX_ALIGN;
              al = fsz;
              desc[i].isptr = 1;
              zeroinit = 1;
              npointers++;
              if (!jl_pointer_egal(fld)) {
                  // this somewhat poorly named flag says whether some of the bits can be non-unique
                  haspadding = 1;
              }
          }

It seems to be fixed in Zig, but not in Rust (any why not for a long time? Seems like a simple fix):

github.com/rust-lang/rust

i128 / u128 are not compatible with C's definition.

opened 01:42AM - 19 Sep 18 UTC

emilio

A-LLVM A-ffi T-compiler C-bug A-abi

While fixing various bindgen bugs related to `long double`, `int128`, (https://g…ithub.com/rust-lang-nursery/rust-bindgen/issues/1370, etc). I realized that the following Rust program, in my x86_64 Linux machine: ```rust #[repr(C)] struct Foo { f: u128, } fn main() { println!("Align: {}", ::std::mem::align_of::<Foo>()); } ``` Prints `8`. While the following C program: ```c #include <stdio.h> struct foo { unsigned __int128 t; }; int main() { printf("Align: %ld\n", _Alignof(struct foo)); } ``` Prints `16` on the same system. This is pretty unexpected, and means that i128 / u128 are not really usable for FFI / alignment purposes.

u128/i128 are not FFI-safe when they should be. That’s a bug. That’s why this issue exists and is still open. We also don’t claim that they are FFI safe, so it’s not a soundness bug.

github.com/ziglang/zig

make the alignment of >= 128 bit integer types 16 bytes to prevent failure to properly align u128 when using 16-bit cmpchxg

opened 06:46AM - 01 Aug 19 UTC

closed 01:13AM - 21 Aug 19 UTC

andrewrk

enhancement stage1 arch-x86_64

`@cmpxchg` on x86_64 supports u128 integers. However `@alignOf(u128)` reports 8,… and without `align(16)` on the `u128`, the code segfaults at runtime. This alignment is reported by LLVM as the "ABI alignment". Is this alignment incorrect? Is the atomic alignment requirement higher? Investigation is needed. Here's the test case: ```zig test "128-bit cmpxchg" { if (builtin.arch != .x86_64) { return error.SkipZigTest; } var x: u128 = 1234; // segfault at runtime can be fixed by adding align(16) here if (@cmpxchgWeak(u128, &x, 99, 5678, .SeqCst, .SeqCst)) |x1| { expect(x1 == 1234); } else { @panic("cmpxchg should have failed"); } while (@cmpxchgWeak(u128, &x, 1234, 5678, .SeqCst, .SeqCst)) |x1| { expect(x1 == 1234); } expect(x == 5678); expect(@cmpxchgStrong(u128, &x, 5678, 42, .SeqCst, .SeqCst) == null); expect(x == 42); } ```

@cmpxchg on x86_64 supports u128 integers. However @alignOf(u128) reports 8, and without align(16) on the u128, the code segfaults at runtime.

[Now fixed, but if not in Julia, then would also segfault, if using that instruction, so since it’s not likely actually is 16-byte aligned. I suppose that instruction IS used.]

Would you say all structs should be 16 bytes (no larger), 8, 4 etc. for concurrent programs (or use locking)?

Arrays are 64-byte aligned, if large, otherwise only 16-byte aligned, so larger 17, or 32byte has the possibility of straddling cache-line boundary.

mkitti · January 17, 2024, 5:41pm

If you can obtain a pointer from somewhere, you can use unsafe_wrap to create a Julia array. I created ArrayAllocators.jl to play with the concept.

https://mkitti.github.io/ArrayAllocators.jl/stable/api/#ArrayAllocators.MemAlign

julia> using ArrayAllocators

julia> A = Array{Int}(MemAlign(2^16), 16);

julia> pointer(A)
Ptr{Int64} @0x00000000032e0000

julia> B = Array{Int}(MemAlign(2^8), 16);

julia> pointer(B)
Ptr{Int64} @0x00000000025c1100

julia> C = Array{Int}(MemAlign(2^4), 16);

julia> pointer(C)
Ptr{Int64} @0x0000000000728fd0

jameson · January 24, 2024, 5:30am

FWIW, the failure here was on the C side, not on Julia’s side: we could not get the supported C compilers to correctly emit this 16 byte load across all of our supported targets. We are fairly close now though and could probably “flip the switch” on these

Topic		Replies	Views
Lock free sorted linked list General Usage atomic , concurrency	18	529	June 6, 2024
Notes on lock-free programming Teaching & Outreach	6	1173	May 27, 2021
Question regarding Julia's planned approach to composable, safe, and easy multithreading Internals & Design multithreading	14	3745	December 23, 2019
Using Tasks 101 Teaching & Outreach tutorials , task	15	900	July 30, 2024
Using atomic operations in Julia 1.6 General Usage question , threads	10	1281	September 6, 2022

Related topics