I don’t know, I would like to know the answer, too! Otherwise this seems to work fine. EDIT: simplified code
using SmallCollections

mutable struct Node{T}
    x::T
    children::SmallVector{3,Node{T}}
    Node(x::T, children::Node{T}...) where T = new{T}(x, children)
    # generated-as-cache: the generator runs once per T and splices the
    # incompletely initialized sentinel in as a constant
    @generated SmallCollections.default(::Type{Node{T}}) where T = new{T}()
end

or, equivalently, the long form (this also has to live inside the struct body, since new is only available there):

@generated function SmallCollections.default(::Type{Node{T}}) where T
    new{T}()
end
Regarding world ages, I’d only expect it to be a problem if you redefine the Node type later. Generated functions are stuck in the world age they were defined in, or the world age the module was precompiled in if they’re in a package (and not in the world age they were first invoked/compiled in, as one might expect), so I think this method signature will only match the Node type of that world age, and the object that’s created the first time for a given T and reused later is of that same Node type.
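A minimal sketch of that caching behavior (hypothetical names; the generator body returns a plain object rather than an expression, so the compiler splices that object in as a constant):

```julia
struct S end

# The generator body runs at code-generation time, in practice once per T;
# returning a non-Expr value makes that value the constant result of every call.
@generated cached(::Type{T}) where {T} = Ref(T)

cached(S) === cached(S)  # true: the same Ref object is returned every call
```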
I think so, I think the #undef elements are identified by null pointers (someone smarter could verify what isassigned does) so there’s no need for an extra array of type tags for small Unions. If we’re only dealing with floating point types, then an even faster and compact alternative to everything mentioned so far is to leverage their built-in NaN values (Base has various aliases for values of supported types, isnan for checks); NaNStatistics.jl uses them as hardware-supported missings.
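For example, a sketch of the NaN-as-sentinel idea for Float64, using plain Base rather than NaNStatistics.jl:

```julia
# NaN marks "missing" in-band: no extra tag byte and no Union element type,
# so the vector stays a plain Vector{Float64}
v = [1.0, NaN, 2.0, NaN, 3.0]
total = sum(x -> isnan(x) ? 0.0 : x, v)   # skip sentinels
count_present = count(!isnan, v)
(total, count_present)  # (6.0, 3)
```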
I haven’t checked (and I won’t, so Nullables.jl exists for anyone who wants to), but my guess is Union{T, Nothing} would sometimes be better than a nullable type in the context of arrays. Nullable strictly puts its tag in each element, so the tags are interspersed with the data in an array. The isbits optimization for small Unions arranges the data and tags in separate arrays, similar to structures of arrays. In cases where that optimization can’t apply, you’re dealing with pointers and a type identifier either way, just not sure if Nullable’s Bool tag has any savings over a Union’s or a fully abstract type’s tag because I don’t know how to even access those.
You were trying to do immutables with 0 allocations, so pointers are ideally avoided. A struct of 3 Union{T, Nothing} fields would have the same structure as a NTuple{3, Nullable{T}} and probably perform the same, the missing link is actually the inability of a concrete Tuple or an immutable, inlineable array to have small Union elements; each concrete tuple type is strictly a structure of concrete types, so a tuple of unions is actually an abstract supertype of them. StaticArrays are currently built on them, but ideally they’re not so they can leverage the same isbits optimization as Arrays.
Yes, here’s how you can check this (it’s been true every time I checked, but obviously you’d have to go to the implementation to confirm that this is actually an invariant):
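The snippet itself isn't shown above, but a minimal sketch of such a check might look like this (assuming the invariant is that #undef slots of a pointer-element array are stored as null machine words):

```julia
v = Vector{String}(undef, 2)
v[1] = "hi"
GC.@preserve v begin
    # read the raw machine word backing the second (unassigned) slot
    raw = unsafe_load(Ptr{Ptr{Cvoid}}(pointer(v)), 2)
    (isassigned(v, 2), raw == C_NULL)  # expect (false, true)
end
```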
Needless to say, this is an implementation detail and shouldn’t be relied on by users, but in principle, it should eliminate the need for separate tags for containers with eltype Union{NonIsbits,Singleton}, though I have no idea whether the implementation exploits this.
This got me curious about how Julia applies this optimization to certain instances of Union but not others. It turns out the array-level layout trick only kicks in when the element type is itself a small isbits Union like Union{T,Nothing}. If you so much as have a single abstraction in the way, like Vector{Wrapper{Union{T,Nothing}}}, the optimization goes away.
So it seems like it was patched into the compiler to make small Unions fast in very limited circumstances.
Check it out:
julia> using BenchmarkTools, Random

julia> struct Optional{T}
           hasvalue::Bool
           value::T
           Optional{T}() where {T} = new{T}(false)
           Optional{T}(v::T) where {T} = new{T}(true, v)
       end

julia> Optional(v::T) where {T} = Optional{T}(v);  # outer constructor; the inner ones suppress the default

julia> struct Wrapper{T}
           x::T
       end

julia> vu = Vector{Union{Float64,Nothing}}(undef, 1_000_000);

julia> vo = Vector{Optional{Float64}}(undef, 1_000_000);

julia> vwu = Vector{Wrapper{Union{Float64,Nothing}}}(undef, 1_000_000);

julia> vwo = Vector{Wrapper{Optional{Float64}}}(undef, 1_000_000);

julia> for i in 1:1_000_000
           if rand() < 0.5
               x = randn()
               vu[i] = x
               vo[i] = Optional(x)
               vwu[i] = Wrapper{Union{Float64,Nothing}}(x)
               vwo[i] = Wrapper{Optional{Float64}}(Optional(x))
           else
               vu[i] = nothing
               vo[i] = Optional{Float64}()
               vwu[i] = Wrapper{Union{Float64,Nothing}}(nothing)
               vwo[i] = Wrapper{Optional{Float64}}(Optional{Float64}())
           end
       end
julia> Base.summarysize(vu)
9000040
julia> Base.summarysize(vo)
16000040
julia> Base.summarysize(vwu)
16000040
julia> Base.summarysize(vwo)
16000040
julia> sum_union(v) = sum(x -> isnothing(x) ? 0.0 : x, v);
julia> sum_optional(v) = sum(o -> o.hasvalue ? o.value : 0.0, v);
julia> sum_wunion(v) = sum(w -> isnothing(w.x) ? 0.0 : w.x, v);
julia> sum_woptional(v) = sum(w -> w.x.hasvalue ? w.x.value : 0.0, v);
julia> @btime sum_union($vu)
404.186 μs (0 allocations: 0 bytes)
julia> @btime sum_optional($vo)
870.628 μs (0 allocations: 0 bytes)
julia> @btime sum_wunion($vwu)
6.838 ms (0 allocations: 0 bytes)
julia> @btime sum_woptional($vwo)
902.897 μs (0 allocations: 0 bytes)
Ok. Let me then ask: how would one do this manually? Like, if I want a sentinel value to be as efficient as the trick that Julia spits out for Vector{Union{T,Nothing}}. Because ultimately I’m not working with this precise type.
I did try but it didn’t get the same speed. It sounds like Vector{Union{Float64,Nothing}} is literally a single Memory block with a 1 byte tag attached to each element. Whereas StructArrays would still be allocating two separate Memory blocks, so they couldn’t be SIMDified as neatly.
I suppose that’s the goal of packages like Moshi.jl (there are several other such algebraic data type packages). Personally I’ve never used it, but it may be worth benchmarking.
Thanks. I just tried. It seems Moshi.jl doesn’t use the same low-level trick as Vector{Union{T,Nothing}}: it has the same memory footprint as the explicit Optional{T} struct (though slightly worse speed).
I guess to match Vector{Union{T,Nothing}} we’d need some Ptr hacking.
As far as the Array is concerned, it doesn’t have a small Union element type, just a concrete one, so it’s storing a simple element1-element2-element3-.... The isbits optimization is still happening there, but in a field per concrete struct instance, so elementN has the structure of dataN-tagN. That’s what I meant by likening it to Vector{Nullable}.
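That per-instance data-plus-tag layout can be seen from the struct sizes (a quick sketch; exact padding may vary by platform and Julia version):

```julia
struct Wrapper{T}
    x::T
end

# The Union field is stored inline together with its own tag byte, so each
# Wrapper instance is wider than a bare Float64:
sizeof(Wrapper{Float64}), sizeof(Wrapper{Union{Float64,Nothing}})
```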
Seems so; I misspoke when I said separate arrays, it’s really 2 contiguous sections of 1 Memory instance.
I’d expect SIMD to work given that the data and tags are separate contiguous sections for either the isbits-Union Memory or StructArrays. Are you sure it’s not better caching for 1 pointer to 1 small-enough Memory versus 2 pointers to 2 smaller Memorys? If it is, a Memory large enough not to fit in cache would make the difference disappear.
Ok I got the manual low-level equivalent working using some Ptr hacking:
julia> using BenchmarkTools, Random

julia> buf = Vector{UInt8}(undef, 1_000_000 * 9);

julia> vref = Vector{Union{Float64,Nothing}}(undef, 1_000_000);

julia> GC.@preserve buf begin
           vptr = Ptr{Float64}(pointer(buf))
           tptr = Ptr{UInt8}(pointer(buf) + 8 * 1_000_000)
           for i in 1:1_000_000
               if rand() < 0.5
                   x = randn()
                   unsafe_store!(vptr, x, i)
                   unsafe_store!(tptr, 0x01, i)
                   vref[i] = x
               else
                   unsafe_store!(vptr, 0.0, i)
                   unsafe_store!(tptr, 0x00, i)
                   vref[i] = nothing
               end
           end
       end

julia> sum_union(v) = sum(x -> isnothing(x) ? 0.0 : x, v);

julia> function sum_packed_simd(buf)
           n = length(buf) ÷ 9
           s = 0.0
           GC.@preserve buf begin
               vptr = Ptr{Float64}(pointer(buf))
               tptr = Ptr{UInt8}(pointer(buf) + 8n)
               @inbounds @simd for i in 1:n
                   s += ifelse(unsafe_load(tptr, i) == 0x01,
                               unsafe_load(vptr, i), 0.0)
               end
           end
           s
       end;
(And just to clarify, I’m not necessarily trying to optimize sums over masked arrays. I was mostly curious about Vector{Union{T,Nothing}} as an example of fast sentinel values, and about why Union{T,Nothing} as a struct field is comparatively so slow.)
This discussion reminds me of a recent talk I saw which may hold important insights into these kinds of data-structure optimization problems.
The TL;DR is that the choice between struct-of-arrays and array-of-structs layouts matters, and that allocation should be avoided entirely by managing a chunk of memory directly. This sounds along the lines of aplavin’s suggestions. Now here is the link:
It has a bit more space overhead – size 9000096 instead of 9000040 – but should be negligible for any nontrivial array.
The memory storage for val and flag can be put into a single memory block like
using StructArrays

n = 1_000_000
sa_opt = let
    buf = Memory{UInt8}(undef, 9n)
    StructArray(
        val = reinterpret(Float64, view(buf, 1:8n)),
        flag = reinterpret(Bool, view(buf, (8n+1):9n)),
    )
end
keeping the same interface (the same sum_sa function) and performance.
This can be useful to save on the number of allocations, but doesn’t help with reducing summarysize below 9000096.
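The sum_sa function from earlier in the thread isn't reproduced above; over this single-buffer layout it would presumably be something like the following (hypothetical reconstruction, shown on a small self-contained example):

```julia
using StructArrays

# Single Memory block split into a Float64 section and a Bool-flag section
n = 8
buf = Memory{UInt8}(undef, 9n)
sa = StructArray(
    val = reinterpret(Float64, view(buf, 1:8n)),
    flag = reinterpret(Bool, view(buf, (8n+1):9n)),
)
for i in 1:n
    sa.val[i] = Float64(i)
    sa.flag[i] = isodd(i)   # odd indices "present", even ones "missing"
end

# hypothetical sum_sa: the flag column is the validity tag
sum_sa(sa) = sum(i -> ifelse(sa.flag[i], sa.val[i], 0.0), eachindex(sa.val))

sum_sa(sa)  # 1 + 3 + 5 + 7 = 16.0
```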
Thanks, I see. Not sure what went wrong with my test.
Just to be clear, I am not working with Vectors though. My interest in Vector{Union{T,Nothing}} was to see how it boosts performance for sentinel values, relative to the poor performance of Union{T,Nothing} in structs. Though I’m not sure the lessons apply to general structs like the ones I’m dealing with.
This doesn’t seem to work in packages? I do in fact get some world age errors when trying to do the @generated-as-cache trick:
@generated sentinel_node(::N) where {N} = N()
Logs:
Precompiling DynamicExpressions...
Info Given DynamicExpressions was explicitly requested, output will be shown live
ERROR: LoadError: MethodError: no method matching DynamicExpressions.NodeUtilsModule.NodeIndex{UInt16, 2}()
The applicable method may be too new: running in world age 26796, while current world is 28754.
Closest candidates are:
DynamicExpressions.NodeUtilsModule.NodeIndex{_T, _D}() where {_T, _D} (method too new to be called from this world context.)
@ DynamicExpressions ~/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/NodeUtils.jl:152
Could be because this NodeIndex type was defined in another submodule. Or maybe because the @generated-as-cache is not compatible with precompilation? (The sentinel_node does get hit by the precompilation btw).
Which leaves the cycle method or Nullable{T}() as the possibilities for zero-allocation type-stable approaches to sentinel values in immutables.
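For reference, a sketch of what the cycle method could look like (hypothetical names; a self-referential child field marks the sentinel, so no Union or Nullable is needed and every field stays concretely typed):

```julia
mutable struct CNode{T}
    x::T
    child::CNode{T}
    CNode{T}(x) where {T} = new{T}(x)   # regular node: child left #undef for now
    function CNode{T}() where {T}       # sentinel constructor
        s = new{T}()
        s.child = s                     # the cycle: a node is a sentinel iff child === node
        s
    end
end

issentinel(n::CNode) = n.child === n

s = CNode{Int}()
issentinel(s)  # true
```

In practice one would cache a single sentinel per T (via a const or the generated trick above) so that use sites incur no allocation.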