Default constructor for any type?

I am aware of two constructor invariants that the runtime/compiler actually tracks:

  1. The upper 7 bits of a Bool are zero.
  2. For each reference field, we remember whether the field can be undefined. If it can, each access is gated by a null-pointer check that potentially throws.

Are there any others?

The consequences of breaking these invariants are:

  1. Mild UB for an invalid Bool.
  2. A segfault instead of an exception when accessing the surprisingly-null reference.
julia> @generated foo(::Type{T}) where T = Expr(:new, T)

julia> mutable struct A
           x::String
       end

julia> a = foo(A)
A(#undef)

julia> f(a) = a.x

julia> f(a)

[240721] signal (11.1): Segmentation fault

Stuff like the bad BigInt is not due to Julia language / compiler / runtime assumptions - that's a stdlib / MPZ assumption - so I'd exclude it from this rather technical question.

It's bad practice to rely on compiler internals to stay the same over various releases. Doing this in YOLO mode is not going to result in a tool people should use.

Sure - the raison d'être for this package is to facilitate reading & writing raw registers on microcontrollers (though it's written in a way that you can use it outside of that as well). So when you initialize such a bitfield with all zeros, you're potentially overwriting reserved values from the manufacturer, which can lead to undefined behavior of the chip. That's super no-bueno!

Without reaching for the esoteric usecase of microcontrollers - if the bitfield has defined padding, observing that padding is of course also UB. I might change the implementation in the future to go away from primitive type and use a proper LLVM struct (with internal padding!) after all; guaranteeing some particular form of padding would prevent me from doing that (which would be a shame - I still hope some day to get SROA for this thing).

Exactly, but the entire point of having a default value is that there isn't an error when you try to construct it! The idea is after all to have a "safe" value to always fall back to when you just need to initialize a field of that type; my argument is just that there can be types where no such default value exists, and hence no such general constructor can exist.

For example, take this one:

struct TimeOfCreation
    time::Float64
    TimeOfCreation() = new(time())
end

This doesn't have a valid default value; there is no "default current time". Almost any particular instance is valid at some point in time, but not all at once (and certainly not through Expr(:new), which can give times far into the future too, or Inf or NaN).
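To make the hazard concrete, here is a minimal sketch (hedged: evaluating a raw :new expression is an internal mechanism, not supported API) showing that Expr(:new, ...) bypasses the inner constructor entirely:

```julia
struct TimeOfCreation
    time::Float64
    TimeOfCreation() = new(time())
end

# :new skips the inner constructor, so we can forge a "creation time"
# that no real clock would ever produce:
forged = eval(Expr(:new, TimeOfCreation, Inf))
isinf(forged.time)  # true
```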

Hereā€™s another:

struct Impossible
    a::Int
    Impossible() = throw(ArgumentError("You can't construct an `Impossible`!"))
end

There are lots of instances you could create with Expr(:new) or even reinterpret, but none of them are valid - the only constructor always throws an error, and the type can't have a default value (because it doesn't have any value at all!).

2 Likes

The runtime/compiler is not the only one trying to enforce invariants in a constructor - any time someone writes some form of x < 5.0 || throw(ArgumentError("bar")) in a constructor, they're trying to communicate a requirement that instances of the type must obey. It's not inconceivable that the compiler could get smart enough to figure out that in all valid instances of such a type the field x must always be < 5.0 (else construction would have errored), and omit some other check elsewhere based on that.
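A hypothetical example of such a type (the name `Small` is made up for illustration); the constructor guarantees the invariant, and Expr(:new, ...) forges an instance that violates it:

```julia
struct Small
    x::Float64
    Small(x) = x < 5.0 ? new(x) : throw(ArgumentError("bar"))
end

# Every instance produced through the constructor satisfies x < 5.0,
# so a sufficiently smart compiler could assume it...
s = Small(3.0)
s.x < 5.0  # true, by construction

# ...but :new skips the check, producing an "impossible" instance:
bad = eval(Expr(:new, Small, 10.0))
bad.x  # 10.0
```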

Sure, that doesn't happen today, but it could in principle - and that's why it's dangerous to circumvent it.

1 Like

That's definitely an issue with using a default in a certain way. Good argument for not providing a method for zero as well. The topic here was about constructing a default; no one could expect that default to be semantically useful, but trying to read one of its fields shouldn't crash Julia, or segfault like an untracked #undef might.

Granted, although the example isn't great: if there's a zero-argument constructor, by all means, the default-function builder should employ it. I actually meant to add "look at the methods and try the one with the fewest arguments" to the sketch, but it didn't make it in. But if one field has to be divisible by seven and greater than zero, there's no way to determine that without inspecting the source code, which isn't tractable for a function to do.

I would say that a function default(::Type{T}) which returns a sensible instance of T, or throws an error explaining as best it can why it can't and suggesting that a custom method needs to be written, is in the spirit of the idea here. It would work on the vast majority of types, and could be extended to the rest with a method, meaning it works on all of them.
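A rough sketch of what such a function could look like (the name `default` and the fallback rules here are my own invention, not an existing API):

```julia
# Hypothetical `default`: best-effort defaults, informative error otherwise.
default(::Type{T}) where {T<:Number} = zero(T)
default(::Type{String}) = ""
default(::Type{Nothing}) = nothing
function default(::Type{T}) where {T}
    isconcretetype(T) ||
        throw(ArgumentError("no default for non-concrete type $T"))
    # Recurse into the fields and try the implicit constructor.
    args = map(default, fieldtypes(T))
    try
        return T(args...)
    catch
        throw(ArgumentError("please define a custom `default` method for $T"))
    end
end

struct Point
    x::Float64
    y::Float64
end

default(Point)  # Point(0.0, 0.0)
```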

Oh god, that's where the %new(T) comes from in the lowered code

Agreed, I'd just demand the user compute a valid input before instantiation if a context-specific value or nothing cannot serve as a default. Union-splitting routinely handles isolated Union{T, Nothing}s without runtime dispatch, but I'm assuming there's some sort of downstream combinatorial explosion exceeding the splitting limit. Those can be helped by type assertions like node.value::T.

Using Union{T,Nothing} can definitely cause performance issues in complex applications unless you want to manually trace type instabilities and add type assertions everywhere. Keep in mind this code is recursion-based, so a single missed assertion will bottleneck performance; I don't think this should be considered a practical solution. I implemented all of those ::T in my code after many painful rounds of profiling and Cthulhu descents. There were some cases where a return value of Union{T,Nothing} in one function would cause a return type of Any in the caller due to unimplemented methods for ::Nothing (even though that branch is never executed at runtime).

It sounds like Expr(:new) might be too dangerous, so I guess I'll move val to the end of the struct, or experiment to see whether the const fields are superior.


I guess another alternative is to stay with Union{T,Nothing} but write a custom getproperty for val that always asserts ::T, so I don't need to worry about it. But I'd prefer to just be able to leave that field undefined somehow rather than do something custom.
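A minimal sketch of that idea (the struct here is a stand-in I made up, not the real Node type):

```julia
mutable struct MyNode{T}
    val::Union{T,Nothing}
end

# Assert the concrete type on every access, so downstream inference
# sees `T` instead of `Union{T,Nothing}`:
function Base.getproperty(n::MyNode{T}, name::Symbol) where {T}
    name === :val && return getfield(n, :val)::T
    return getfield(n, name)
end

n = MyNode{Float64}(1.5)
n.val + 1.0  # 2.5, inferred as Float64

empty = MyNode{Float64}(nothing)
# empty.val  # would throw a TypeError instead of silently returning nothing
```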

FWIW ArrowTypes.jl defines a generic function default and adds methods as needed to handle this problem. It's a game of whack-a-mole, but it seems to work OK.

2 Likes

I just looked up how the stdlib does it. They use ccall(:jl_new_struct_uninit, Any, (Any,), typ), which is marginally safer - it at least zeros the memory, and it is stable and pretty official API.

Whether that is too dangerous depends on your use case - do you know what kinds of types end up in T? Are your users understanding of "weirdish" limitations? Can you ensure that nobody will ever look at the bad instance? Not even for printing? The following gives a genuine crash out of the REPL:

julia> bad = ccall(:jl_new_struct_uninit, Any, (Any,), Tuple{Int,String})
Unhandled Task ERROR: UndefRefError: access to undefined reference
ERROR: TaskFailedException
nested task error: UndefRefError: access to undefined reference

I'd also note that this entire thing only works with concrete types. You know that nobody will need to instantiate with T == Number, right?

This is not a bad solution.

This is also recommended. As noted above, your struct layout is bad for efficiency, due to alignment / structure padding.

2 Likes

I think this discussion has strayed a bit into the weeds, since we don't have an MWE with a clear statement of your requirements. The manual says

Although it is generally a good idea to return a fully initialized object from an inner constructor, it is possible to return incompletely initialized objects.

I don't think your situation is exotic enough to warrant a departure from the recommendation of returning fully initialized objects from your inner constructors.

Based on my partial understanding of the high level problem you are addressing, I think there are a few possible approaches:

1.

Youā€™ve probably already considered this, but one way to represent a tree with immutable structs is to use a few parametric types, like this:

struct Constant{T}
    x::T
end

struct Feature
    x::UInt16
end

struct UnivariateOp{A}
    op::UInt8
    arg::A
end

struct BivariateOp{L, R}
    op::UInt8
    l::L
    r::R
end

julia> BivariateOp(0x04,
           UnivariateOp(0x02,
               Constant(3.14)
           ),
           BivariateOp(0x09,
               Feature(1),
               Feature(2)
           )
       )
BivariateOp{UnivariateOp{Constant{Float64}}, BivariateOp{Feature, Feature}}(0x04, UnivariateOp{Constant{Float64}}(0x02, Constant{Float64}(3.14)), BivariateOp{Feature, Feature}(0x09, Feature(0x0001), Feature(0x0002)))

I think you could write functions that take partial trees and return a new tree with a new node attached. (You might need some types that represent incomplete nodes?) One possible downside of this approach is that if you have large trees you end up with complex, nested parametric types, which might cause poor compile times. (I haven't really tested that.)
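For illustration, generic recursion over such a tree compiles to fully concrete code for each tree shape; a small sketch (the structs are repeated here for self-containment, and `count_nodes` is a made-up helper):

```julia
struct Constant{T}
    x::T
end

struct Feature
    x::UInt16
end

struct UnivariateOp{A}
    op::UInt8
    arg::A
end

struct BivariateOp{L,R}
    op::UInt8
    l::L
    r::R
end

# One method per node kind; dispatch resolves statically for a
# concretely typed tree.
count_nodes(::Constant) = 1
count_nodes(::Feature) = 1
count_nodes(n::UnivariateOp) = 1 + count_nodes(n.arg)
count_nodes(n::BivariateOp) = 1 + count_nodes(n.l) + count_nodes(n.r)

tree = BivariateOp(0x04, UnivariateOp(0x02, Constant(3.14)),
                   BivariateOp(0x09, Feature(1), Feature(2)))
count_nodes(tree)  # 6
```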

2.

This is pretty speculative, but I wonder if you could use an alternative approach to storing your data. You could use a vector to represent the expression tree in terms of node IDs (integers), and then you could use node IDs to look up related data like the operation, feature, or constant value.

To represent the tree

1
├── 2
│   ├── 3
│   └── 4
└── 5
    └── 6

you could use a vector like this:

[2, 5, 3, 4, 0, 0, 0, 0, 6, 0, 0, 0]

The first number is the node ID for the left child of node 1, the second number is the node ID for the right child of node 1, the third number is the node ID for the left child of node 2, the fourth number is the node ID for the right child of node 2, etc. Zeros indicate that there is no child node in that location.

When you look at elements (3-1)*2 + 1 and (3-1)*2 + 2 in the vector, you see that they are both 0, and thus node 3 is either a constant or a feature. Perhaps you could fill those elements with -1 if node 3 is a feature, and fill with 0 if node 3 is a constant, so that you can easily distinguish between feature nodes and constant nodes.
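In code, the index arithmetic might look like this (the helper name is mine, not part of any package):

```julia
# Children of node i live at positions 2i-1 and 2i in the flat vector.
children(adj, i) = (adj[2i - 1], adj[2i])

adj = [2, 5, 3, 4, 0, 0, 0, 0, 6, 0, 0, 0]

children(adj, 1)  # (2, 5): node 1's left and right children
children(adj, 3)  # (0, 0): node 3 is a leaf (constant or feature)
```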

I think this representation is also capable of representing a forest of trees, like this:

1
├── 2
│   ├── 3
│   └── 4
└── 5
    └── 6

7
├── 8
└── 9

Then your vector would be

[
    2, 5,
    3, 4,
    0, 0,
    0, 0,
    6, 0,
    0, 0,
    8, 9,
    0, 0,
    0, 0
]

And then you could swap out node 6 for node 7 by mutating your vector like this:

[
    2, 5,
    3, 4,
    0, 0,
    0, 0,
    7, 0,
    0, 0,
    8, 9,
    0, 0,
    0, 0
]

(Note that node 6 is now an isolated node - it has no parents or children.)

3.

If you want to stick with sum types, you could experiment with the SumTypes.jl package.

1 Like

Iterating on the vector storage idea: For constant and feature nodes you could use negative numbers in the child elements to provide dense, 1-based IDs for the constants and features.

For example, if you have this tree,

1
├── 2
│   ├── 3
│   └── 4
└── 5
    ├── 6
    └── 7

where nodes 3 and 6 are features and nodes 4 and 7 are constants, then you could label the features and constants like this:

adjacent = [
    2,  5,   # Node 1 children
    3,  4,   # Node 2 children
    0, -1,   # Node 3 children
   -1,  0,   # Node 4 children
    6,  7,   # Node 5 children
    0, -2,   # Node 6 children
   -2,  0    # Node 7 children
]

If you have a negative integer as the first child ID, then that number is the negative of the constant ID. If you have a negative integer as the second child ID, then that number is the negative of the feature ID.

So, you can have a dense vector lookup for your constant values, like this:

constants = [3.14, 42]

left(adj, i) = adj[2i - 1]
right(adj, i) = adj[2i]

left_child = left(adjacent, 4)  # -1, the negative of constant ID 1
constants[-left_child]          # 3.14

Of course you could experiment with the performance of different ways of representing your adjacency list. You could try Vector{Vector{Int}} (which is what Graphs.SimpleGraphs uses), or you could try Vector{Tuple{Int, Int}}, or even use one vector for left children and another vector for right children, like this:

left = [
    2,
    3,
    0,
   -1,
    6,
    0,
   -2,
]

right = [
    5,
    4,
   -1,
    0,
    7,
   -2,
    0
]
1 Like

Very nice, it's gratifying to see that an actual implementation closely follows the general pattern I sketched out. ArrowTypes.jl itself is fairly specialized; would you (assuming you're one of the maintainers) be interested in breaking it out into a package, say DefaultTypes.jl?

I see room to expand it to cover more types, while making few if any changes to the ones it currently covers. Giving the code a casual look-over, I think the answer is no changes, although the version I had in mind would throw an error if it failed to make a default method for a struct type, prompting the user to extend that special case. It may be more important for ArrowTypes to provide a fallback value rather than erroring out - I wouldn't know either way.

1 Like

Thanks for the ideas.

  1. The main point of DynamicExpressions is to have a single type for any expression (like how DynamicQuantities has a single type for any quantity). These packages are for model discovery where you need to try many many possible expressions, so type stability is nearly required.

  2. This is actually similar to how it works in the GPU implementation of DynamicExpressions: Native GPU support by MilesCranmer Ā· Pull Request #65 Ā· SymbolicML/DynamicExpressions.jl Ā· GitHub though there is a lot more sparsity to support preallocation of an array (again, to permit many different expressions - this time in the same array shape).

Hereā€™s the function to convert a tree to an array:
DynamicExpressions.jl/src/AsArray.jl at 6b0337cd9e18d7a2a852db06c4ff275f75ff534b Ā· SymbolicML/DynamicExpressions.jl Ā· GitHub

I haven't thought about doing it for regular CPU allocations, but it's certainly something to consider.

  3. I basically treat Node like a sum type already. All fields (except the children) are in every object regardless of whether they are defined or not, and I have different code branches depending on the .degree and .constant fields.
1 Like

ArrowTypes is already the low-dependency interface broken out from Arrow.jl, but I can see how having a more generic name and being even lighter is appealing. One thing is that in the context of serialization (e.g. via Arrow) one needs to be very careful about backwards compatibility, because one wants to continue to be able to load tables that were saved a long time ago on older versions of everything. Therefore it makes sense to me for ArrowTypes to prefer its own dedicated function that can basically just sit there forever, rather than pulling in a more generic package that aims to cater to a wider audience.

That being said, I'd encourage you to copy it and create this DefaultTypes.jl, and maybe at some point (if it's very stable and matches ArrowTypes semantics) it could make sense for ArrowTypes.jl to depend on it, but IMO that shouldn't necessarily be a goal, and I think that having some code duplication actually isn't a big deal.

2 Likes

That's clever; if you only access the field when it has a meaningful value, I'd try this to help type inference. That doesn't make things easy - you'd still have to write code that checks for meaning, and it's made harder when node.value cannot return a contained nothing to check. It's far preferable to accessing undefined fields, which either throws an UndefRefError or yields an isbits instance unsafely interpreted from indeterminate bits in memory. It's also preferable to a default sentinel value, which wastes as much time as the unsafe isbits instance when it gets involved in computation and is more expensive to check in general than nothing.

I don't have a MWE off the top of my head to see this in action, because branches that end in type-related errors are generally ignored in return type inference. But an optimized Union{T, Nothing} amounts to a cheap conditional check where the only non-erroring branch is a meaningful T instance, so if push comes to shove you can do the same with an extra defvalue::Bool field you manually check to evade an undefined value::T, again without computing meaningless values.

This is basically what I did originally. But it's a large codebase and these Union{T,Nothing} can start propagating type instabilities, and it takes a lot of effort to add assertions at the right places. Since I know that .val is always a T when it's actually used (n.degree == 0 && n.constant needs to be satisfied before accessing it), to me it seems like an undefined field would actually result in more manageable code.

But yeah I should also just try rewriting getproperty as a simpler solution. It still feels like a last resort approach but I guess there are no better options than the ones discussed here.

Are you familiar with the NullPointerException?

1 Like

Undefined fields seem safe enough in Julia, no? Is it a bad practice?

Maybe the proper thing to do is something like this?

mutable struct Node{T} <: AbstractExpressionNode{T}
    const degree::UInt8
    const constant::Union{Bool,Nothing}
    val::Union{T,Nothing}
    const feature::Union{UInt16,Nothing}
    const op::Union{UInt8,Nothing}
    l::Union{Node{T},Nothing}
    r::Union{Node{T},Nothing}
end

with getters defined by

@inline function Base.getproperty(node::AbstractExpressionNode{T}, name::Symbol) where {T}
    if name == :constant
        return getfield(node, :constant)::Bool
    elseif name == :val
        return getfield(node, :val)::T
    elseif name == :feature
        return getfield(node, :feature)::UInt16
    elseif name == :op
        return getfield(node, :op)::UInt8
    elseif name in (:l, :r)
        return getfield(node, name)::typeof(node)
    else
        return getfield(node, name)
    end
end

Then I get:

  1. No type instabilities (hopefully)
  2. Safety when accessing unset fields (a TypeError from the assertion instead of a segfault)
  3. Constant propagation for fields which are actually constant

(I left .val non-const since I need to quickly change it when optimizing an expression. Same for .l and .r, because I don't want to re-create entire expression trees for every manipulation of a tree.)

Julia has a similar problem. It's just called UndefRefError. Fortunately, Julia defaults to immutable structures that make bitstypes more likely, so perhaps the issue is partially mitigated.

The takeaway lesson is that if you are playing with undefined references then you should make sure that all dereferencing operations check for undefined references.
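In Julia that check is isdefined; a small sketch (the struct name is made up):

```julia
mutable struct Lazy
    x::String
    Lazy() = new()  # leaves `x` undefined
end

l = Lazy()
isdefined(l, :x)  # false: reading l.x now would throw an UndefRefError
l.x = "ready"
isdefined(l, :x)  # true: now safe to read
```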