Approaches to sentinel values in immutables

I don’t know, I would like to know the answer, too! Otherwise this seems to work fine. EDIT: simplified code

using SmallCollections

mutable struct Node{T}
    x::T
    children::SmallVector{3,Node{T}}
    Node(x::T, children::Node{T}...) where T = new{T}(x, children)
    # Generated function as a cache: `new{T}()` runs once at generation
    # time, so the same incomplete Node is reused as the default for each T.
    @generated SmallCollections.default(::Type{Node{T}}) where T = new{T}()
end
julia> n1, n2, n3 = Node(1), Node(2), Node(3);

julia> n4 = Node(4, n1, n2);

julia> n5 = Node(5, n3);

julia> n6 = Node(6, n4, n5);

julia> n6.children[2].children[1]
Node{Int64}(3, Node{Int64}[])

I think this is equivalent to just

@generated function SmallCollections.default(::Type{Node{T}}) where T
    Node{T}()
end

Regarding world ages, I’d only expect a problem if you redefine the Node type later. Generated functions are stuck in the world age they were defined in, or the world age the module was precompiled in if they’re in a package (not the world age they were first invoked/compiled in, as one might expect). So this method signature will only match the Node type of that world age, and the object created the first time for a given T and reused later is of that same Node type.


We were not, thanks for clarifying.

I think so: the #undef elements are identified by null pointers (someone smarter could verify what isassigned does), so there’s no need for an extra array of type tags for small Unions. If we’re only dealing with floating-point types, an even faster and more compact alternative to everything mentioned so far is to leverage their built-in NaN values (Base has various aliases for the supported types, and isnan for checks); NaNStatistics.jl uses them as hardware-supported missings.
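A minimal sketch of the NaN-as-sentinel idea (these names are mine, not from NaNStatistics.jl); the obvious caveat is that it only works when genuine NaNs can’t occur in the data:

```julia
# NaN as a hardware-supported sentinel for Float64.
# Caveat: only safe if real computations can never produce NaN.
const SENTINEL = NaN
is_missing_value(x::Float64) = isnan(x)

v = [1.0, SENTINEL, 3.0]
s = sum(x for x in v if !is_missing_value(x))  # sentinel is skipped; s == 4.0
```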

I haven’t checked (and I won’t; Nullables.jl exists for anyone who wants to), but my guess is Union{T, Nothing} would sometimes be better than a nullable type in the context of arrays. Nullable always stores its tag in each element, so the tags are interspersed with the data in an array. The isbits optimization for small Unions arranges the data and tags in separate sections, similar to a structure of arrays. In cases where that optimization can’t apply, you’re dealing with pointers and a type identifier either way; I’m just not sure whether Nullable’s Bool tag has any savings over a Union’s or a fully abstract type’s tag, because I don’t know how to even access those.

You were trying to do immutables with 0 allocations, so pointers are ideally avoided. A struct of 3 Union{T, Nothing} fields would have the same layout as an NTuple{3, Nullable{T}} and probably perform the same. The missing link is the inability of a concrete Tuple, or any immutable inlineable array, to have small-Union elements: each concrete tuple type is strictly a structure of concrete types, so a tuple type of unions is actually an abstract supertype of concrete tuple types. StaticArrays are currently built on tuples, but ideally they wouldn’t be, so they could leverage the same isbits optimization as Arrays.
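The claim about tuples can be checked directly in the REPL: a tuple type is only concrete when all its parameters are concrete, so a Union parameter makes it abstract.

```julia
# A tuple type with a Union parameter is abstract, not concrete:
isconcretetype(Tuple{Union{Float64, Nothing}})  # false
# whereas each actual tuple has a fully concrete type:
isconcretetype(typeof((1.0,)))                  # true
# and is an instance of the abstract union-tuple supertype:
(1.0,) isa Tuple{Union{Float64, Nothing}}       # true
```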

Yes, here’s how you can check this (it’s been true every time I checked, but obviously you’d have to go to the implementation to confirm that this is actually an invariant):

julia> v = Vector(undef, 1)
1-element Vector{Any}:
 #undef

julia> unsafe_load(Ptr{UInt64}(pointer(v)))
0x0000000000000000

Needless to say, this is an implementation detail and shouldn’t be relied on by users, but in principle, it should eliminate the need for separate tags for containers with eltype Union{NonIsbits,Singleton}, though I have no idea whether the implementation exploits this.
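For the safe, public-facing view of the same fact, isassigned distinguishes #undef slots without any unsafe loads:

```julia
v = Vector{Any}(undef, 2)
isassigned(v, 1)  # false: slot is still #undef
v[1] = "hello"
isassigned(v, 1)  # true
```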


This got me curious how Julia is doing this optimization for certain instances of Union but not others. It turns out it literally only does this if you do Vector{Union{T,Nothing}}. If you so much as have a single abstraction in the way, like Vector{Wrapper{Union{T,Nothing}}}, this optimization goes away.

So it seems like it was patched in to the compiler to force Union{T,Nothing} to be fast in very limited circumstances.

Check it out:


julia> using BenchmarkTools, Random

julia> struct Optional{T}
           hasvalue::Bool
           value::T
           Optional{T}() where {T}=new{T}(false)
           Optional{T}(v::T) where {T}=new{T}(true, v)
       end

julia> struct Wrapper{T}
           x::T
       end

julia> vu  = Vector{Union{Float64,Nothing}}(undef, 1_000_000);
       vo  = Vector{Optional{Float64}}(undef, 1_000_000);
       vwu = Vector{Wrapper{Union{Float64,Nothing}}}(undef, 1_000_000);
       vwo = Vector{Wrapper{Optional{Float64}}}(undef, 1_000_000);
       for i in 1:1_000_000
           if rand() < 0.5
               x = randn()
               vu[i]  = x
               vo[i]  = Optional(x)
               vwu[i] = Wrapper{Union{Float64,Nothing}}(x)
               vwo[i] = Wrapper{Optional{Float64}}(Optional(x))
           else
               vu[i]  = nothing
               vo[i]  = Optional{Float64}()
               vwu[i] = Wrapper{Union{Float64,Nothing}}(nothing)
               vwo[i] = Wrapper{Optional{Float64}}(Optional{Float64}())
           end
       end

julia> Base.summarysize(vu)
9000040

julia> Base.summarysize(vo)
16000040

julia> Base.summarysize(vwu)
16000040

julia> Base.summarysize(vwo)
16000040

julia> sum_union(v)      = sum(x -> isnothing(x) ? 0.0 : x, v);

julia> sum_optional(v)   = sum(o -> o.hasvalue   ? o.value : 0.0, v);

julia> sum_wunion(v)     = sum(w -> isnothing(w.x) ? 0.0 : w.x, v);

julia> sum_woptional(v)  = sum(w -> w.x.hasvalue ? w.x.value : 0.0, v);

julia> @btime sum_union($vu)
  404.186 μs (0 allocations: 0 bytes)

julia> @btime sum_optional($vo)
  870.628 μs (0 allocations: 0 bytes)

julia> @btime sum_wunion($vwu)
  6.838 ms (0 allocations: 0 bytes)

julia> @btime sum_woptional($vwo)
  902.897 μs (0 allocations: 0 bytes)

Ok. Let me then ask: how would one do this manually? Like if I want the sentinel value to be as efficient as the trick that Julia spits out for Vector{Union{T,Float64}}. Because ultimately I’m not working with this precise type.

StructArrays.jl should work.

I did try but it didn’t get the same speed. It sounds like Vector{Union{Float64,Nothing}} is literally a single Memory block with a 1 byte tag attached to each element. Whereas StructArrays would still be allocating two separate Memory blocks, so they couldn’t be SIMDified as neatly.

I suppose that’s the goal with packages like Moshi.jl (there are several other such algebraic data type packages). Personally I’ve never used it, but it could be worth benchmarking.

Would be curious to see an example.

Thanks. I just tried. It seems Moshi.jl doesn’t use the same low-level trick as Vector{Union{T,Nothing}}: it has the same memory footprint as the explicit Optional{T} struct (though slightly worse speed).

I guess to match Vector{Union{T,Nothing}} we’d need some Ptr hacking.


As far as the Array is concerned, it doesn’t have a small Union element type, just a concrete one, so it’s storing a simple element1-element2-element3-.... The isbits optimization is still happening there, but in a field per concrete struct instance, so elementN has the structure of dataN-tagN. That’s what I meant by likening it to Vector{Nullable}.
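The per-element data-tag layout shows up in the element size: the union field is inlined as 8 data bytes plus a tag byte, padded out to 16 (matching the 16000040-byte summarysize above). I’m redefining the wrapper under a different name here to keep the snippet self-contained:

```julia
struct Wrapper2{T}  # same shape as Wrapper above
    x::T
end

sizeof(Wrapper2{Union{Float64, Nothing}})  # 16: 8 data bytes + tag + padding
sizeof(Wrapper2{Float64})                  # 8: concrete field, no tag needed
```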

Seems so; I misspoke when I said separate arrays. It’s really 2 contiguous sections of 1 Memory instance.

julia> x = Union{Float64, Nothing}[1, nothing, 3]; x.ref.mem
3-element Memory{Union{Nothing, Float64}}:
 1.0
  nothing
 3.0

julia> unsafe_load.(Ptr{Float64}(x.ref.mem.ptr), 1:3) # data section
3-element Vector{Float64}:
 1.0
 0.0
 3.0

julia> unsafe_load.(Ptr{UInt8}(x.ref.mem.ptr), length(x)*sizeof(eltype(x)) .+ (1:3)) # tag section
3-element Vector{UInt8}:
 0x01
 0x00
 0x01

I’d expect SIMD to work given the data and tags are separate contiguous sections for either an isbits-Union Memory or StructArrays. Are you sure it’s not better caching for 1 pointer to 1 small-enough Memory versus 2 pointers to 2 smaller Memory instances? If it is, a large enough Memory that doesn’t fit in cache would make the difference disappear.

Ok I got the manual low-level equivalent working using some Ptr hacking:

julia> using BenchmarkTools, Random

julia> buf = Vector{UInt8}(undef, 1_000_000 * 9);

julia> vref = Vector{Union{Float64,Nothing}}(undef, 1_000_000);

julia> GC.@preserve buf begin
           vptr = Ptr{Float64}(pointer(buf))
           tptr = Ptr{UInt8}(pointer(buf) + 8*1_000_000)
           for i in 1:1_000_000
               if rand() < 0.5
                   x = randn()
                   unsafe_store!(vptr, x, i)
                   unsafe_store!(tptr, 0x01, i)
                   vref[i] = x
               else
                   unsafe_store!(vptr, 0.0, i)
                   unsafe_store!(tptr, 0x00, i)
                   vref[i] = nothing
               end
           end
       end

julia> sum_union(v) = sum(x -> isnothing(x) ? 0.0 : x, v);

julia> function sum_packed_simd(buf)
           n = length(buf) ÷ 9
           s = 0.0
           GC.@preserve buf begin
               vptr = Ptr{Float64}(pointer(buf))
               tptr = Ptr{UInt8}(pointer(buf) + 8n)
               @inbounds @simd for i in 1:n
                   s += ifelse(unsafe_load(tptr,i) == 0x01,
                               unsafe_load(vptr,i), 0.0)
               end
           end
           s
       end;

Gives similar numbers between the two:

julia> Base.summarysize(vref)
9000040

julia> Base.summarysize(buf)
9000040

julia> @btime sum_union($vref)
  214.917 μs (0 allocations: 0 bytes)
719.9927313976573

julia> @btime sum_packed_simd($buf)
  204.708 μs (0 allocations: 0 bytes)
719.992731397663

(And just to clarify, I’m not necessarily trying to optimize sums over masked arrays. I was mostly curious about Vector{Union{T,Nothing}} as an example of fast sentinel values, and why Union{T,Nothing} as a struct field is so bad comparatively.)

This discussion reminds me of a recent talk I’ve seen, which may hold important insights into such data-structure optimization problems.
The TL;DR is that the struct-of-arrays vs. array-of-structs distinction is important, and that allocation should be avoided entirely by managing a chunk of memory directly. This sounds along the lines of aplavin’s suggestions. Now here is the link:

@MilesCranmer, I’m getting basically the same performance with a plain structarray,

sa = StructArray(val=rand(n), flag=rand(Bool, n))
sum_sa(sa) = sum(x -> x.flag ? x.val : 0.0, sa)

It has a bit more space overhead – size 9000096 instead of 9000040 – but should be negligible for any nontrivial array.
The memory storage for val and flag can be put into a single memory block like

sa_opt = let
	buf = Memory{UInt8}(undef, n*9)
	sa = StructArray(
		val=reinterpret(Float64, view(buf, 1:8n)),
		flag=reinterpret(Bool, view(buf, (8n+1):9n)),
	)
end

keeping the same interface (the same sum_sa function) and performance.
This can be useful to save on the number of allocations, but doesn’t help with reducing summarysize below 9000096.

Thanks, I see. Not sure what went wrong with my test.

Just to be clear I am not working with Vectors though. My interest in Vector{Union{T,Nothing}} was to see how it was boosting performance for sentinel values, relative to the poor performance of Union{T,Nothing} in structs. Though I’m not sure the lessons apply to general structs like I’m dealing with.

This doesn’t seem to work in packages? I do in fact get some world age errors when trying to do the @generated-as-cache trick:

@generated sentinel_node(::N) where {N} = N()

Logs:

Precompiling DynamicExpressions...
Info Given DynamicExpressions was explicitly requested, output will be shown live 
ERROR: LoadError: MethodError: no method matching DynamicExpressions.NodeUtilsModule.NodeIndex{UInt16, 2}()
The applicable method may be too new: running in world age 26796, while current world is 28754.

Closest candidates are:
  DynamicExpressions.NodeUtilsModule.NodeIndex{_T, _D}() where {_T, _D} (method too new to be called from this world context.)
   @ DynamicExpressions ~/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/NodeUtils.jl:152

Could be because this NodeIndex type was defined in another submodule. Or maybe because the @generated-as-cache is not compatible with precompilation? (The sentinel_node does get hit by the precompilation btw).

Which leaves the cycle method or Nullable{T}() as the possibilities for zero-allocation type-stable approaches to sentinel values in immutables.
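For reference, a sketch of the cycle method (all names here are mine): a mutable node created via incomplete initialization whose child field points to itself, so checking for the sentinel is a simple === comparison and no allocation happens after the one-time sentinel construction. This only works for mutable structs, which is exactly the limitation being discussed.

```julia
mutable struct CNode{T}
    x::T
    child::CNode{T}
    # Sentinel constructor: leave x uninitialized, tie the cycle.
    function CNode{T}() where T
        n = new{T}()
        n.child = n
        n
    end
    CNode(x::T, child::CNode{T}) where T = new{T}(x, child)
end

is_sentinel(n::CNode) = n.child === n

const INT_SENTINEL = CNode{Int}()  # created once, reused everywhere
leaf = CNode(1, INT_SENTINEL)
is_sentinel(leaf)        # false
is_sentinel(leaf.child)  # true
```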

Enzyme does it, for one: Enzyme.jl/src/analyses/activity.jl at 4508f6ce441bad05d9362fd1b13aabeb3cffcc52 · EnzymeAD/Enzyme.jl · GitHub

Not sure why it doesn’t work in your case, but I can imagine things get tricky if the generated function is used in precompilation workflows.

What if you try something like this:

sentinel_node_nongen(::Type{N}) where {N} = N()
@generated sentinel_node(N) = sentinel_node_nongen(N)

That way, you’re not explicitly referencing the type in the generated function.