Union structs type tag memory layout

Structs whose fields are Unions of isbits types use the isbits Union struct optimization. I was surprised to learn that

The type tag byte for a struct’s Union field is stored directly after the field’s computed Union memory.

Take the following example (which is a simplification of my real problem with different types, but same memory layout):

struct MyStruct
    a::Union{Int32, UInt32}
    b::Union{Int32, UInt32}
    c::Union{Int32, UInt32}
    d::Union{Int32, UInt32}
end

where the Unions effectively double the needed space, because the struct effectively gets transformed into

struct MyStruct
    a::NTuple{4, UInt8}
    a_type_tag::UInt8
    b::NTuple{4, UInt8}
    b_type_tag::UInt8
    c::NTuple{4, UInt8}
    c_type_tag::UInt8
    d::NTuple{4, UInt8}
    d_type_tag::UInt8
end

adding three bytes of padding to each (Union) field, resulting in 32 bytes.
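This can be checked from the REPL (the numbers below are what I see on current Julia; the layout isn’t guaranteed across versions or platforms):

```julia
struct MyStruct
    a::Union{Int32, UInt32}
    b::Union{Int32, UInt32}
    c::Union{Int32, UInt32}
    d::Union{Int32, UInt32}
end

# 4 data bytes + 1 tag byte + 3 padding bytes per field
sizeof(MyStruct)                    # 32

# each field occupies an 8-byte slot
Int.(fieldoffset.(MyStruct, 1:4))   # [0, 8, 16, 24]
```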

Instead, I would have expected that all type tags are appended at the end of the struct resulting in something like

struct MyStruct
    a::NTuple{4, UInt8}
    b::NTuple{4, UInt8}
    c::NTuple{4, UInt8}
    d::NTuple{4, UInt8}
    a_type_tag::UInt8
    b_type_tag::UInt8
    c_type_tag::UInt8
    d_type_tag::UInt8
end

with 20 bytes. And this ratio of needing 8/5 of the memory isn’t even the worst case, as the Union fields could also have more demanding alignment constraints when moving towards larger regular types or even SIMD types (I think a memory efficiency of 65/128 is the limit with AVX-512, but in any case it can get close to 50%).

To achieve optimum packing, fields can be ordered by their alignment constraints. This is impossible when the type tags immediately follow the Union fields, as can be seen above.

I think that the “always at the end of the struct” proposal would lose no generality. To achieve the current behavior, you could move each Union field into its own struct and use that struct as the field’s type instead, so

struct EitherInt32
    val::Union{Int32, UInt32}
end
struct MyStruct
    a::EitherInt32
    b::EitherInt32
end

would effectively result in

struct MyStruct
    a::NTuple{4, UInt8}
    a_type_tag::UInt8
    b::NTuple{4, UInt8}
    b_type_tag::UInt8
end
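The wrapper’s effect on the layout can be verified directly (hedged, on current Julia; `MyStruct2` is renamed here only to avoid clashing with the earlier definition):

```julia
struct EitherInt32
    val::Union{Int32, UInt32}
end
struct MyStruct2
    a::EitherInt32
    b::EitherInt32
end

sizeof(EitherInt32)  # 8: 4 data bytes + 1 tag byte, padded to 4-byte alignment
sizeof(MyStruct2)    # 16: each wrapped field keeps its adjacent tag
```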

In general we can’t reorder fields or influence padding because we want to be compatible with C, but in this case I think we could, as C doesn’t have this kind of union.

So why do the type tags immediately follow the union fields? Is it to avoid looking up an offset? Is it for better data locality and therefore higher cache efficiency? If it is the latter, I would assume (but measurements would be needed) that the smaller memory footprint is, on average, better for caching efficiency than the slightly higher locality.

Is there a possibility to change the DataType’s layout at runtime with some API to define the proposed ordering manually from Julia? This would probably be internal and would need tweaking on updates which would be fine in my case.

Assume I did everything manually, e.g. used a non-Union placeholder type instead of the Union and did the type cast on every access (the data structure will be read-only when performance matters). Would this have any performance disadvantages compared to the built-in union optimization, e.g. any optimizations it would miss out on?
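To make concrete what I mean by “doing it manually”, here is a minimal sketch (the names and the tag encoding are my own, hypothetical convention):

```julia
# Store the raw bits plus a hand-rolled tag byte; reinterpret on access.
struct ManualUnion
    bits::UInt32
    tag::UInt8   # my convention: 0 => Int32, 1 => UInt32
end

getval(m::ManualUnion)::Union{Int32, UInt32} =
    m.tag == 0 ? reinterpret(Int32, m.bits) : m.bits

getval(ManualUnion(0xffffffff, 0x00))  # Int32(-1)
getval(ManualUnion(0xffffffff, 0x01))  # 0xffffffff
```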


Cool idea! It seems to me like it might also be possible to pack multiple unions into a single discriminator tag (e.g. if a struct has a 3-union field then a 2-union field, this could be contained in 2+1=3 bits, so a single 8-bit extra field instead of two).
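A quick back-of-the-envelope for that (`bits_needed` is a hypothetical helper, not an existing API):

```julia
# Bits needed to discriminate among n union members
bits_needed(n) = max(1, ceil(Int, log2(n)))

bits_needed(3) + bits_needed(2)  # 2 + 1 = 3 bits, fits in a single UInt8
```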

Further to that, would it be possible to look for existing padding in the struct layout, and “squeeze” the union tags into the existing padding (where possible)?

This is a bit of a tangent, but I’d be interested to hear if it’s possible to use reflection/Core.Compiler to inspect what is currently done: it would be cool to include this information in About.jl.


That is a cool idea, but beware of aliasing / race conditions. Today it is safe to concurrently mutate different union-typed fields of mutable structs; if you bit-packed the discriminators then you’d either make this slow or unsafe. Remember the @simd-BitArray disaster. (@simd used to imply :ivdep which doesn’t work for writes to BitArray, because writes to different indices interfere)

I think no. Doing this manually gives you more power to do other optimizations, like e.g.

  1. More aggressive bitpacking. Say you want to have x::Union{A,B} with struct A v::Int32 end and struct B v::Int32 end. Often you don’t need the full 32 bits; 31 bits suffice. So you could make x::C with struct C v::Int32 end, and use unpack(c::C)::Union{A,B} = c.v < 0 ? A((c.v<<1) >>>1) : B(c.v), possibly even using that in getproperty(c::C, s::Symbol).
  2. If you write the branching code yourself, you have much better control over branches (if / c ? a : b) vs. conditional moves (ifelse). You can also figure out good vectorized code.
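To illustrate item 2 with a toy example (not tied to the union layout): the ternary may compile to a branch, while ifelse evaluates both arms and typically lowers to a conditional move.

```julia
# Select one of two values based on a hand-rolled tag byte.
sel_branch(tag::UInt8, a::Int32, b::Int32) = tag == 0 ? a : b

# ifelse evaluates both a and b unconditionally, which often
# enables a branchless conditional move and easier vectorization.
sel_cmov(tag::UInt8, a::Int32, b::Int32) = ifelse(tag == 0, a, b)

# Inspect the generated code to compare:
# code_native(sel_branch, (UInt8, Int32, Int32))
# code_native(sel_cmov,   (UInt8, Int32, Int32))
```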

That being said, such micro-optimizations risk being a waste of time until you benchmark (and then, consider whether to get rid of the union on a higher algorithmic / datastructure level).


This is a fascinating idea. Besides the problems mentioned by @foobar_lv2, working on bit level could also slightly slow down accesses due to the bit operations. I think as a first step, we should focus on optimizing on byte level.

That’s a great idea! If we modify the Union optimization anyway, we can just use the first unused byte (either from the beginning of the struct or after the Union field). The implicit fallback would be the previously suggested appending after the last field. This way we would either produce the same layout or improve both memory efficiency and cache locality in non-dense structs.

Oh indeed. I was mainly thinking of immutable structs.

Potentially, but I get the impression that with CPUs these days it’s frequently getting the data into registers to operate on that’s the bottleneck, not doing the compute.

I would think that iszero(tag & mask) would end up taking less than a cycle, and so if you have say a list of elements, it seems entirely likely to me that the win from having more in L1 etc. outweighs the cost of the extra & mask.

I agree that this can be faster. However, I meant that as long as we have free bytes anyway (due to padding), it can be faster to just use them instead of cramming everything into one byte and e.g. leaving three bytes empty. The unused bytes will be loaded into the cache line anyway in this case.

But after optimizing on the byte level, I totally agree that it makes sense to optimize on the bit level for the cases where no unused bytes are available.

1 Like

While it’s true it does give you some new optimization opportunities, be aware that unsafe_load / unsafe_store! does actually disable some optimizations, in particular type based alias analysis (TBAA). Maybe you could get it back by using an @assume_effects declaration though, I’m not sure at which level this is modelled.
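For reference, a minimal example of the kind of raw access being discussed (a hedged sketch; the point above is that such pointer loads can defeat type-based alias analysis):

```julia
# Read a value through a raw pointer, sidestepping the union machinery.
v = Ref(Int32(42))
x = GC.@preserve v begin
    p = Base.unsafe_convert(Ptr{Int32}, v)  # valid only while v is preserved
    unsafe_load(p)
end
x  # 42
```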


Wouldn’t the flip side of this be that, for data types smaller than the word size, an adjacent byte tag would be accessed along with its field’s raw bytes in one go, whereas trailing byte tags strictly require a second access, like in your example where the raw bytes occupy a full word?

I know we occasionally want C layout interoperability, which is why we don’t do any layout optimizations (or allow the compiler to do them). But this is not a super common case. I think that there’s room for a decorator like @c_layout struct ... end to opt in to C interop. Or, for something more backwards-compatible (but more annoying otherwise), @free_layout struct ... end to opt out of interop.

In either case, we could add a is_c_interoperable(::Type) function that indicates whether a type is interoperable. We could then use it in ccall and friends to throw rather than permit incompatible types to be passed (maybe with a keyword to disable this restriction).
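A rough sketch of what such a check could look like (is_c_interoperable does not exist today; this is a hypothetical, simplified definition that only flags Union fields):

```julia
# Hypothetical: true if T is an immutable concrete type whose fields
# (recursively) contain no Union fields. Real C interop has more
# requirements (e.g. no ghost types, matching alignment rules).
is_c_interoperable(::Type{T}) where {T} =
    isconcretetype(T) && !ismutabletype(T) &&
    all(ft -> !(ft isa Union) && is_c_interoperable(ft), fieldtypes(T))
```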

Although without a strictly-defined layout, it might change across versions or architectures which might cause headaches (e.g., with serialized data). I guess the serializer could at least warn if writing a re-layout-able type.

Going further, for non-interop types we could auto-define an interop variant and a converter between the two. It could be used at ccall to translate to/from the suitable layout (or throw if not possible, for some reason). But I expect this could have complicated edge cases, so maybe better to shelve this idea for later.


IIUC the proposals here wouldn’t affect C interop.

Yes, because these are already not interoperable (as discussed). But once that ship has sailed (or on request or except on opt-out), why not optimize more thoroughly? As others discussed, there can be situations where it makes sense to place the tags adjacent to fields or at the end (or sometimes somewhere in-between) so one-size might not fit all.

It will be two accesses either way, as the data and the type tag need to be used separately. Even if the data is a UInt8 and the type tag is a UInt8, the CPU needs to address both separately with 8-bit instructions, so they will probably end up in two different registers and the CPU can’t make use of a 16-bit instruction.
Or if it did initially, masking and shifting would be needed to access the data (the tag could be addressed directly with an 8-bit instruction within the 16-bit value in the register, but not the value), again resulting in the use of a second register.

You can follow the access path in jl_get_nth_field for a simple non-atomic, non-locked, non-singleton, non-pointer access:

  • both the type ty and the value read from memory address v + offs are needed to create the new bits value in line 1849
  • the type ty is coming from the nth entry of the union type (line 1830) after reading the UInt8 type tag (line 1829). The -1 is because the field size fsz already accounts for the tag byte (line 738)
  • v and offs are read separately via lines 1815 and 1820

This is similar to the isbits Union Memory in jl_memoryrefget where the corresponding value and type tag already are non-adjacent (only the corresponding arrays are):

  • both eltype and data are needed to create the new bits value in line 381
  • the eltype is coming from the nth entry of the union type (line 358) after reading the UInt8 type tag (line 357)
  • the data is read separately in line 359