Union structs type tag memory layout

Structs whose fields are Unions of isbits types use the isbits Union struct optimization. I was surprised to learn that

The type tag byte for a struct’s Union field is stored directly after the field’s computed Union memory.

Take the following example (which is a simplification of my real problem with different types, but same memory layout):

struct MyStruct
    a::Union{Int32, UInt32}
    b::Union{Int32, UInt32}
    c::Union{Int32, UInt32}
    d::Union{Int32, UInt32}
end

where the Unions effectively double the needed space, because the struct effectively gets transformed into

struct MyStruct
    a::NTuple{4, UInt8}
    a_type_tag::UInt8
    b::NTuple{4, UInt8}
    b_type_tag::UInt8
    c::NTuple{4, UInt8}
    c_type_tag::UInt8
    d::NTuple{4, UInt8}
    d_type_tag::UInt8
end

adding three bytes of padding to each (Union) field, resulting in 32 bytes.
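This can be checked from the REPL (the numbers below are what I see on current Julia; the layout isn’t guaranteed across versions or platforms):

```julia
struct MyStruct
    a::Union{Int32, UInt32}
    b::Union{Int32, UInt32}
    c::Union{Int32, UInt32}
    d::Union{Int32, UInt32}
end

# 4 data bytes + 1 tag byte + 3 padding bytes per field
sizeof(MyStruct)                    # 32

# each field occupies an 8-byte slot
Int.(fieldoffset.(MyStruct, 1:4))   # [0, 8, 16, 24]
```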

Instead, I would have expected that all type tags are appended at the end of the struct resulting in something like

struct MyStruct
    a::NTuple{4, UInt8}
    b::NTuple{4, UInt8}
    c::NTuple{4, UInt8}
    d::NTuple{4, UInt8}
    a_type_tag::UInt8
    b_type_tag::UInt8
    c_type_tag::UInt8
    d_type_tag::UInt8
end

with 20 bytes. And this ratio of needing 8/5 of the memory isn’t even the worst case, as the Union fields could also have more demanding alignment constraints when moving towards larger regular types or even SIMD types (I think a memory efficiency of 65/128 is the limit with AVX-512, but in any case it can get close to 50%).

To achieve optimum packing, fields can be ordered by their alignment constraints. This is impossible when the type tags immediately follow the Union fields, as can be seen above.

I think that the “always at the end of the struct” proposal would lose no generality. To achieve the current behavior, you could move each Union field into its own struct and use that struct as the field’s type instead, so

struct EitherInt32
    val::Union{Int32, UInt32}
end
struct MyStruct
    a::EitherInt32
    b::EitherInt32
end

would effectively result in

struct MyStruct
    a::NTuple{4, UInt8}
    a_type_tag::UInt8
    b::NTuple{4, UInt8}
    b_type_tag::UInt8
end
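The wrapper’s effect on the layout can be verified directly (hedged, on current Julia; `MyStruct2` is renamed here only to avoid clashing with the earlier definition):

```julia
struct EitherInt32
    val::Union{Int32, UInt32}
end
struct MyStruct2
    a::EitherInt32
    b::EitherInt32
end

sizeof(EitherInt32)  # 8: 4 data bytes + 1 tag byte, padded to 4-byte alignment
sizeof(MyStruct2)    # 16: each wrapped field keeps its adjacent tag
```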

In general we can’t reorder fields or influence padding because we want to be compatible with C, but in this case I think we could, as C doesn’t have this kind of union.

So why do the type tags immediately follow the union fields? Is it to avoid looking up an offset? Is it for better data locality and therefore higher cache efficiency? If it is the latter, I would assume (but measurements would be needed) that the smaller memory footprint is, on average, better for caching efficiency than the slightly higher locality.

Is there a possibility to change the DataType’s layout at runtime with some API to define the proposed ordering manually from Julia? This would probably be internal and would need tweaking on updates which would be fine in my case.

Assume I did everything manually, e.g. used a non-Union placeholder type instead of the Union and did the type cast on every access (the data structure will be read-only when performance matters). Would this have any performance disadvantages compared to the built-in union optimization, e.g. any optimizations it would miss out on?
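To make concrete what I mean by “doing it manually”, here is a minimal sketch (the names and the tag encoding are my own, hypothetical convention):

```julia
# Store the raw bits plus a hand-rolled tag byte; reinterpret on access.
struct ManualUnion
    bits::UInt32
    tag::UInt8   # my convention: 0 => Int32, 1 => UInt32
end

getval(m::ManualUnion)::Union{Int32, UInt32} =
    m.tag == 0 ? reinterpret(Int32, m.bits) : m.bits

getval(ManualUnion(0xffffffff, 0x00))  # Int32(-1)
getval(ManualUnion(0xffffffff, 0x01))  # 0xffffffff
```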


Cool idea! It seems to me like it might also be possible to pack multiple unions into a single discriminator tag (e.g. if a struct has a 3-union field then a 2-union field, this could be contained in 2+1=3 bits, so a single 8-bit extra field instead of two).
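A quick back-of-the-envelope for that (`bits_needed` is a hypothetical helper, not an existing API):

```julia
# Bits needed to discriminate among n union members
bits_needed(n) = max(1, ceil(Int, log2(n)))

bits_needed(3) + bits_needed(2)  # 2 + 1 = 3 bits, fits in a single UInt8
```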

Further to that, would it be possible to look for existing padding in the struct layout, and “squeeze” the union tags into the existing padding (where possible)?

This is a bit of a tangent, but I’d be interested to hear if it’s possible to use reflection/Core.Compiler to inspect what is currently done: it would be cool to include this information in About.jl.


That is a cool idea, but beware of aliasing / race conditions. Today it is safe to concurrently mutate different union-typed fields of mutable structs; if you bit-packed the discriminators then you’d either make this slow or unsafe. Remember the @simd-BitArray disaster. (@simd used to imply :ivdep which doesn’t work for writes to BitArray, because writes to different indices interfere)

I think no. Doing this manually gives you more power to do other optimizations, like e.g.

  1. More aggressive bitpacking. Say you want to have x::Union{A,B} with struct A v::Int32 end and struct B v::Int32 end. Often you don’t need the full 32 bits; 31 bits suffice. So you could make x::C with struct C v::Int32 end, and use unpack(c::C)::Union{A,B} = c.v < 0 ? A((c.v<<1) >>>1) : B(c.v), possibly even using that in getproperty(c::C, s::Symbol).
  2. If you write the branching code yourself, you have much better control over branches (if / c ? a : b) vs. conditional moves (ifelse). You can also figure out good vectorized code.
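To illustrate item 2 with a toy example (not tied to the union layout): the ternary may compile to a branch, while ifelse evaluates both arms and typically lowers to a conditional move.

```julia
# Select one of two values based on a hand-rolled tag byte.
sel_branch(tag::UInt8, a::Int32, b::Int32) = tag == 0 ? a : b

# ifelse evaluates both a and b unconditionally, which often
# enables a branchless conditional move and easier vectorization.
sel_cmov(tag::UInt8, a::Int32, b::Int32) = ifelse(tag == 0, a, b)

# Inspect the generated code to compare:
# code_native(sel_branch, (UInt8, Int32, Int32))
# code_native(sel_cmov,   (UInt8, Int32, Int32))
```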

That being said, such micro-optimizations risk being a waste of time until you benchmark (and then, consider whether to get rid of the union on a higher algorithmic / datastructure level).


This is a fascinating idea. Besides the problems mentioned by @foobar_lv2, working on bit level could also slightly slow down accesses due to the bit operations. I think as a first step, we should focus on optimizing on byte level.

That’s a great idea! If we modify the Union optimization anyway, we can just use the first unused byte (either from the beginning of the struct or after the Union field). The implicit fallback would be the previously suggested appending after the last field. This way we would either produce the same layout or improve both memory efficiency and cache locality in non-dense structs.

Oh indeed. I was mainly thinking of immutable structs.

Potentially, but I get the impression that with CPUs these days it’s frequently getting the data into registers to operate on that’s the bottleneck, not doing the compute.

I would think that iszero(tag & mask) would end up taking less than a cycle, and so if you have say a list of elements, it seems entirely likely to me that the win from having more in L1 etc. outweighs the cost of the extra & mask.

I agree that this can be faster. However, I meant that as long as we have free bytes anyway (due to padding), it can be faster to just use them instead of cramming everything into one byte and e.g. leaving three bytes empty. The unused bytes will be loaded into the cache line anyway in this case.

But after optimizing on the byte level, I totally agree that it makes sense to optimize on the bit level for the cases where no unused bytes are available.

1 Like

While it’s true it does give you some new optimization opportunities, be aware that unsafe_load / unsafe_store! does actually disable some optimizations, in particular type based alias analysis (TBAA). Maybe you could get it back by using an @assume_effects declaration though, I’m not sure at which level this is modelled.
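For reference, a minimal example of the kind of raw access being discussed (a hedged sketch; the point above is that such pointer loads can defeat type-based alias analysis):

```julia
# Read a value through a raw pointer, sidestepping the union machinery.
v = Ref(Int32(42))
x = GC.@preserve v begin
    p = Base.unsafe_convert(Ptr{Int32}, v)  # valid only while v is preserved
    unsafe_load(p)
end
x  # 42
```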


Wouldn’t the flip side of this be that, for data types smaller than the word size, an adjacent byte tag would be accessed along with its field’s raw bytes in one go, whereas trailing byte tags strictly require a second access, like in your example where the raw bytes occupy a full word?

I know we occasionally want C layout interoperability, which is why we don’t do any layout optimizations (or allow the compiler to do them). But this is not a super common case. I think that there’s room for a decorator like @c_layout struct ... end to opt in to C interop. Or, for something more backwards-compatible (but more annoying otherwise), @free_layout struct ... end to opt out of interop.

In either case, we could add a is_c_interoperable(::Type) function that indicates whether a type is interoperable. We could then use it in ccall and friends to throw rather than permit incompatible types to be passed (maybe with a keyword to disable this restriction).
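A rough sketch of what such a check could look like (is_c_interoperable does not exist today; this is a hypothetical, simplified definition that only flags Union fields):

```julia
# Hypothetical: true if T is an immutable concrete type whose fields
# (recursively) contain no Union fields. Real C interop has more
# requirements (e.g. no ghost types, matching alignment rules).
is_c_interoperable(::Type{T}) where {T} =
    isconcretetype(T) && !ismutabletype(T) &&
    all(ft -> !(ft isa Union) && is_c_interoperable(ft), fieldtypes(T))
```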

Although without a strictly-defined layout, it might change across versions or architectures which might cause headaches (e.g., with serialized data). I guess the serializer could at least warn if writing a re-layout-able type.

Going further, for non-interop types we could auto-define an interop variant and a converter between the two. It could be used at ccall to translate to/from the suitable layout (or throw if not possible, for some reason). But I expect this could have complicated edge cases, so maybe better to shelve this idea for later.


IIUC the proposals here wouldn’t affect C interop.

Yes, because these are already not interoperable (as discussed). But once that ship has sailed (or on request or except on opt-out), why not optimize more thoroughly? As others discussed, there can be situations where it makes sense to place the tags adjacent to fields or at the end (or sometimes somewhere in-between) so one-size might not fit all.

It will be two accesses either way, as the data and the type tag need to be used separately. Even if the data is a UInt8 and the type tag is a UInt8, the CPU needs to address both separately with 8-bit instructions, so they will probably end up in two different registers and the CPU can’t make use of a 16-bit instruction.
Or if it did initially, masking and shifting would be needed to access the data (the tag could be addressed directly with an 8-bit instruction within the 16-bit value in the register, but not the value), again resulting in the use of a second register.

You can follow the access path in jl_get_nth_field for a simple non-atomic, non-locked, non-singleton, non-pointer access:

  • both the type ty and the value read from memory address v + offs are needed to create the new bits value in line 1849
  • the type ty is coming from the nth entry of the union type (line 1830) after reading the UInt8 type tag (line 1829). The -1 is because the field size fsz already accounts for the tag byte (line 738)
  • v and offs are read separately via lines 1815 and 1820

This is similar to the isbits Union Memory in jl_memoryrefget where the corresponding value and type tag already are non-adjacent (only the corresponding arrays are):

  • both eltype and data are needed to create the new bits value in line 381
  • the eltype is coming from the nth entry of the union type (line 358) after reading the UInt8 type tag (line 357)
  • the data is read separately in line 359