I’m not sure I see this is as equivalent. The Enzyme one just returns a DataType (e.g.), so is totally static and immutable. Whereas the one under consideration here is a fully instantiated and mutable live object.
These return values are not types, but enum values. But your point stands: they are immutable primitive values, not objects on the heap.
Found this surprising (and unfortunate) result. Even this tiny Optional{T}
struct actually has quite a massive performance hit too when put in context of the library
Keep in mind how tiny this is though, relative to a typical application. I think it is likely great approach for most usecases. It’s just because we are in the nanosecond regime here (for copying a tree) we have to take some uglier but slightly faster approaches.
It seems the absolute fastest option here is to use cyclic references as a trick to induce user-side stack overflows upon improper access, rather than trying to create null pointers (the Optional{T}
method) or Union{T,Nothing}
.
If there’s a way to “compile-in” a null pointer I think that would work too. Basically a safer version of the @generated
technique discussed above, if such a thing is allowed.
What about my/your proposed approach? The one where there’s simply one additional Bool
field per optional value. I’m aware that it’s a hack compared to a proper ADT solution, like Rust has, however it would avoid the allocations, which seem to be the bottleneck in your CPU profile.
That’s what I tried here. I actually even did less than the Bool and just used
struct Optional{T}
x::T
Optional(x::T) where {T} = new{T}(x)
Optional{T}() where {T} = new{T}()
end
With this change:
mutable struct Node{T,D} <: AbstractExpressionNode{T,D}
degree::UInt8 # 0 for constant/variable, 1 for cos/sin, 2 for +/* etc.
constant::Bool # false if variable
val::T # If is a constant, this stores the actual value
feature::UInt16 # (Possibly undefined) If is a variable (e.g., x in cos(x)), this stores the feature index.
op::UInt8 # (Possibly undefined) If operator, this is the index of the operator in the degree-specific operator enum
- children::NTuple{D,Node{T,D}}
+ children::NTuple{D,Optional{Node{T,D}}}
end
But surprisingly, the change greatly worsened performance of copies (seen in the flamegraph above) relative to just having a child be set to point back to the parent.
It’s possible I made a silly mistake somewhere, but there didn’t appear to be anything obvious or any type instabilities. The flamegraph shows a GC warning next to all the Optional{T}
calls which I guess makes sense given we are allocating quite a few of them for every node. I’m not sure why, as I thought it would be “free.” Is it something to do with an immutable pointing to a mutable still results in allocations even when left undefined?
I’m imaging this could be the Julia compiler performing poorly due the recursive nature of your data structure?
In any case, can you show node_factory
, set_children!
, _ensure_optional
? Or maybe just push the branch (this is for DynamicExpressions.jl, right)? I don’t promise to be able to help, but I’m curious.
Quick check, something’s off, Optional{T}
is doing more work than the identically structured Some{T}
instantiation is:
julia> struct Optional{T}
x::T
Optional(x::T) where {T} = new{T}(x)
Optional{T}() where {T} = new{T}()
end
julia> using BenchmarkTools; @btime Optional($(zeros(4)))
5.100 ns (1 allocation: 16 bytes)
Optional{Vector{Float64}}([0.0, 0.0, 0.0, 0.0])
julia> @btime Some($(zeros(4)))
1.800 ns (0 allocations: 0 bytes)
Some([0.0, 0.0, 0.0, 0.0])
julia> @code_llvm Optional(zeros(4))
; Function Signature: (::Type{Main.Optional{T} where T})(Array{Float64, 1})
; @ REPL[1]:3 within `Optional`
; Function Attrs: uwtable
define nonnull ptr @julia_Optional_833(ptr noundef nonnull readonly %"#ctor-self#::Type", ptr noundef nonnull align 8 dereferenceable(24) %"x::Array") #0 {
top:
%pgcstack = call ptr inttoptr (i64 140737208355296 to ptr)() #7
%ptls_field = getelementptr inbounds ptr, ptr %pgcstack, i64 2
%ptls_load = load ptr, ptr %ptls_field, align 8
%"new::Optional" = call noalias nonnull align 8 dereferenceable(16) ptr @ijl_gc_pool_alloc_instrumented(ptr %ptls_load, i32 800, i32 16, i64 2730838875792) #6
%"new::Optional.tag_addr" = getelementptr inbounds i64, ptr %"new::Optional", i64 -1
store atomic i64 2730838875792, ptr %"new::Optional.tag_addr" unordered, align 8
store ptr null, ptr %"new::Optional", align 8
store atomic ptr %"x::Array", ptr %"new::Optional" release, align 8
ret ptr %"new::Optional"
}
And that extra work vanishes if we exclude the undef
constructor:
julia> struct Optional{T}
x::T
Optional(x::T) where {T} = new{T}(x)
#Optional{T}() where {T} = new{T}()
end
julia> using BenchmarkTools; @btime Optional($(zeros(4)))
1.800 ns (0 allocations: 0 bytes)
Optional{Vector{Float64}}([0.0, 0.0, 0.0, 0.0])
julia> @code_llvm Optional(zeros(4)) # same as `Some` call
; Function Signature: (::Type{Main.Optional{T} where T})(Array{Float64, 1})
; @ REPL[1]:3 within `Optional`
; Function Attrs: uwtable
define [1 x ptr] @julia_Optional_2817(ptr noundef nonnull readonly %"#ctor-self#::Type", ptr noundef nonnull align 8 dereferenceable(24) %"x::Array") #0 {
top:
%0 = insertvalue [1 x ptr] zeroinitializer, ptr %"x::Array", 0
ret [1 x ptr] %0
}
No earthly idea how a seemingly unrelated method would affect another.
Oh, I’ve seen this before. When there exists a constructor leaving a field undef
, the struct can no longer be allocated inline even though it’s immutable.
julia> struct Optional{T}
x::T
Optional(x::T) where {T} = new{T}(x)
Optional{T}() where {T} = new{T}()
end
julia> Base.allocatedinline(Optional{Any})
false
julia> Base.allocatedinline(Some{Any})
true
I don’t think I ever understood why, but that’s how it is. So the performance you get is as if Optional
were defined as mutable struct
.
Is there an issue for that, kinda feels like there should.
Thanks. Very odd. Correct me if I’m wrong but this seems to suggest having undef
constructors for an immutable makes the whole struct sit on the heap?
With the following it seems to be fast again:
struct Nullable{T}
null::Bool
x::T
end
@inline function Base.getindex(x::Nullable)
x.null && throw(UndefRefError())
return x.x
end
So, you still have to figure out something to put in x
, though it would have a guard against access. So in a recursive struct this isn’t so bad as you can just use “self” as x
.
Here’s the diff applying it: Permit n-argument operators by MilesCranmer · Pull Request #127 · SymbolicML/DynamicExpressions.jl · GitHub
So we still use “self” to initialise the child as before, but we also tag it with the Bool for the more helpful errors
julia> x1 = Node{Float64,2}(feature=1)
x1
julia> expr = Node{Float64,2}(op=1, children=(x1,))
unary_operator[1](x1)
julia> get_child(expr, 2)
ERROR: UndefRefError: access to undefined reference
Stacktrace:
[1] getindex
@ ~/PermaDocuments/SymbolicRegressionMonorepo/DynamicExpressions.jl/src/Utils.jl:66 [inlined]
#=...=#
julia> dump(expr)
Node{Float64, 2}
degree: UInt8 0x01
constant: Bool false
val: Float64 2.234256312e-314
feature: UInt16 0x0000
op: UInt8 0x01
children: Tuple{DynamicExpressions.UtilsModule.Nullable{Node{Float64, 2}}, DynamicExpressions.UtilsModule.Nullable{Node{Float64, 2}}}
1: DynamicExpressions.UtilsModule.Nullable{Node{Float64, 2}}
null: Bool false
x: Node{Float64, 2}
degree: UInt8 0x00
constant: Bool false
val: Float64 2.2195848863e-314
feature: UInt16 0x0001
op: UInt8 0x82
children: #undef
2: DynamicExpressions.UtilsModule.Nullable{Node{Float64, 2}}
null: Bool true
x: Node{Float64, 2}#= circular reference @-3 =#
Edit: still losing a tiny bit of a performance relative to before
master | cdc980d3c9aa3f… | master / cdc980d3c9aa3f… | |
---|---|---|---|
utils/copy/break_sharing | 28.3 ± 1.8 μs | 30.8 ± 7.2 μs | 0.918 ± 0.22 |
utils/copy/preserve_sharing | 0.102 ± 0.0084 ms | 0.108 ± 0.012 ms | 0.944 ± 0.13 |
presumably due to allocating extra Bool
s.
So other alternatives are still very much welcome. Ideally one wouldn’t have to choose between ergonomics and efficiency.
Seems so from the @code_llvm
and when I checked the linked issue’s A
type being thrown into RefValue
. This reminded me that before v1.5, any mutable fields forced immutable wrappers to be heap-allocated because multiple GC roots couldn’t be tracked for a field then (which makes me wonder, is there a finite limit to the number of tracked roots now?). The PR mentions that uninitializable fields (evidently even if they would always be isbits) disqualify inlining for tracking, and the alternative would be to add an extra byte for tracking (again, would that be a finite limit?). Obviously I don’t really know how that works, but at least it affirms our intuition that storing a null pointer inline wasn’t the hurdle.
It’s probably considered an edge case because of the very different behavior of uninitialized fields between isbits and non-isbits types, an awkward distinction to rely on, and because new
can only uninitialize trailing fields. Optional
only lucked out at having only 1 field to not initialize.
An exception that wasn’t mentioned seems to be isbits types still being inlinable even if uninitializable:
julia> struct UndefComplex{T} # 32 bytes distinguishable from pointer's 8
c::Complex{T}
d::Complex{T}
UndefComplex(c::Complex{T}, d::Complex{T}) where T = new{T}(c, d)
UndefComplex{T}() where T = new{T}() # uninitialized
end
julia> Base.allocatedinline(UndefComplex{Int}) # isbits
true
julia> Base.allocatedinline(UndefComplex{BigInt}) # not isbits
false
julia> sizeof(Ref(UndefComplex(big(1)*im, big(0)*im))) # just a pointer
8
julia> struct DoubleComplex{T}
c::Complex{T}
d::Complex{T}
DoubleComplex(c::Complex{T}, d::Complex{T}) where T = new{T}(c, d)
end
julia> Base.allocatedinline(DoubleComplex{BigInt}) # not isbits, but fully initialized
true
julia> sizeof(Ref(DoubleComplex(big(1)*im, big(0)*im)))
32