Incomplete initialization (control which fields are initialized in `new`)

Unfortunately I believe that the optimizations are only possible when T is of reference type and not plain data, because in the former case one could use the NULL pointer to represent the nothing. If T is plain data then I believe that Union{T,Nothing} requires a hidden flag behind the scene to identify which variant is currently in use. This penalizes the plain data case (Int, Float64, …), which is admittedly the most interesting. I might be completely wrong about this; this is just my current understanding. (A bit like rust can optimize Option<&'a T> because references are never null.)

I will try to run some benchmarks as soon as I have implemented all 43 versions of the same code :joy:


It removes it here, but it introduces it as soon as you want heterogeneous collections

Vector{P{T,X,Y} where {X<:Union{T,Nothing}, Y<:Union{T,Nothing}}}

which is something that I want. This approach wouldn’t work for me I think.


That forces you to collate two a priori independent variables (think of a name and an age) for no good reason. Imagine a situation where you always know either the name or the age:

struct S
    name_or_age::Union{String, Int}
    ...
end 

This doesn’t look very good.

In the most general situation, again, it might not be straightforward to take into account all possible combinations of “definedness” of the fields, and arbitrarily grouping them under the same name seems a path to insanity.

If I weren’t super concerned with performance I would definitely go with the Union{T,Nothing} approach, which at least expresses the semantic adequately. I’ll come back with some benchmarks if I manage to get some meaningful ones.


:+1: :rocket: :100:

Ok, but the struct Either that triggered this suggestion is literally a immutable type with a Bool to specify which field is defined (or, at least, is the one that can be looked at). For that case, the suggestion seem appropriate, no?

@FredericoStra Another possibility for you is to define an assessor function and treat the order of fields in your struct as an implementation detail. E.g.

struct Interval{T}
    _left_kind::Symbol
    _right_kind::Symbol
    _one_end::T
    _other_end::T
    Interval{T}(ls::Symbol, rs::Symbol, val) where T = new{T}(ls, rs, val)
    Interval{T}(ls::Symbol, rs::Symbol, vall, valr) where T = new{T}(ls, rs, vall, valr)
end
function left_end(int::Interval)
    int._left_kind == :inf && throw("interval is left unbounded")
    return int._one_end
end
function right_end(int::Interval)
    int._right_kind == :inf && throw("interval is right unbounded")
    int._left_kind == :inf && return int._one_end
    return int._other_end
end

therefore in memory you represent

  • (-∞, a]) as (:inf, :closed, a),
  • (a, ∞) as (:open, :inf, a) and
  • [a, b) as (:closed, :open, a, b)

As sugar on top let’s define

function Base.getproperty(int::Interval, s::Symbol)
    s == :left && return left_end(int)
    s == :right && return right_end(int)
    return getfield(int, s)
end

so that you can use int.left and int.right in your code :slight_smile: would that work for you?

If you really care about performance I guess it’s better to encode possible combinations bit-wise in an UInt (even UInt8 is plenty of space for this purpose), instead of storing symbols in the struct.

2 Likes

Yes, but that was just an oversimplified example to illustrate a point.

I’m interested in the general situation where there are more complex invariants that determine which fields are to access or not.

In C you would not introduce unions or stuff like that; you would simply skip the initialization of the unneeded fields (and check very very very very carefully that you don’t accidentally access them, otherwise… Demons!).


This is pretty much what I wrote at the end of this comment. I have to praise the clarity of your implementation, though, especially regarding the use of getproperty. The idea of storing data in the incorrect field just to work around a limitation of new doesn’t look very satisfying. Now you have made unnecessarily complicated the access to .right. In larger situations this is extremely error prone, if not totally confusing at all. Moreover, you need to performs the additional checks every time to try to access the field.


I agree. The usage of an explicit Symbol flag was just to exemplify something that keeps track of the internal state of the struct. In general it might not even be a single flag, but a combination of few of them.


I feel this thread has become very interesting to read. A lot of good ideas, thank you all! :slight_smile:

I still want keyword arguments for new, though! :smiley:

1 Like

All you need are properties: left and right for an interval, so the concept of “incorrect field” is ill defined. Where and how the data is stored is an implementation detail. Would names _don't_access_me_field and _don't_access_me_other_filed better convey the message? :wink:

As far as I’m concerned it’s clean and rather safe: one pays the price of one additional if in the accesor for one (possibly unnecessary) allocation. I’d accept such code in my repository :smiley:

But the ugliness/complication is in the eye of the beholder, if you come to a solution more to your liking, please don’t forget to share it here!

2 Likes

I think that this solution is even better than using incomplete instances of the Interval object, because if both left and right fields refer to exactly the same object, you can be absolutely sure that the undefined (and unnecessary) bound won’t allocate any memory.

An alternative to achieve this is making Interval an abstract type, and define special subtypes depending on their bounded ends.

abstract type Interval{T} end

struct RightBoundInterval{T} <: Interval{T}
    right_kind::Symbol
    right::T
end

struct LeftBoundInterval{T} <: Interval{T}
    left_kind::Symbol
    left::T
end

struct BoundedInterval{T} <: Interval{T}
    left_kind::Symbol
    right_kind::Symbol
    left::T
    right::T
end

Interval(leftkind, rightkind, left, right) = BoundedInterval(leftkind, rightkind, left, right)

Then you can make other constructors for Intervals unbounded on either side. An advantage is that you may know whether an interval is right- or left-unbounded by just checking its type, without looking into its fields. But if you still want to do it that way, you can also overload getproperty:

function Base.getproperty(i::RightBoundInterval, field::Symbol)
    (field == :left_kind) ? :unbounded : getfield(i, field)
end

function Base.getproperty(i::LeftBoundInterval, field::Symbol)
    (field == :right_kind) ? :unbounded : getfield(i, field)
end

Thus:

julia> x = LeftBoundInterval(:open, 0.0)
LeftBoundInterval{Float64}(:open, 0.0)

julia> x.left_kind
:open

julia> x.left
0.0

julia> x.right_kind
:unbounded

julia> x.right
ERROR: type LeftBoundInterval has no field right

Yet another advantage of using subtypes for left- or right-bounded intervals: that way it’s pretty straightforward to give such “unbounded values” for specific types (if wanted):

function Base.getproperty(x::LeftBoundInterval{F}, field::Symbol) where {F <:AbstractFloat}
    if field == :right_kind
        return :unbounded
    elseif field == :right
        return F(Inf)
    else
        return getfield(x, field)
    end
end
julia> x = LeftBoundInterval(:open, 0.0)
LeftBoundInterval{Float64}(:open, 0.0)

julia> x.right
Inf

I proposed that above, but @FedericoStra prefers that everything is the same type.

1 Like

Oh, you’re right. Sorry for overlooking!

Then, the idea of storing the same value in both ends looks a good workaround for me. Maybe even both ideas could be combined, to ensure that you can make homogeneous arrays with “fake” bounded intervals:

function Base.convert(::Type{BoundedInterval{T}},  x::LeftBoundInterval) where {T}
    BoundedInterval{T}(x.left_kind, :unbounded, x.left, x.left)
end

function Base.convert(::Type{BoundedInterval{T}},  x::RightBoundInterval) where {T}
    BoundedInterval{T}(:unbounded, x.right_kind, x.right, x.right)
end

function Base.convert(::Type{BoundedInterval{T}},  x::BoundedInterval{S}) where {T, S}
    BoundedInterval{T}(:left_kind, x.right_kind, x.left, x.right)
end

function Base.convert(::Type{BoundedInterval}, x::S) where {S<:Interval{T} where T}
    convert(BoundedInterval{T}, x)
end
julia> bounded = Interval(:open, :closed, 0.0, 1.0)
BoundedInterval{Float64}(:open, :closed, 0.0, 1.0)

julia> unbounded = LeftBoundInterval(:open, 0)
LeftBoundInterval{Int64}(:open, 0)

julia> [bounded, unbounded]   # heterogeneous
2-element Array{Interval,1}:
 BoundedInterval{Float64}(:open, :closed, 0.0, 1.0)
 LeftBoundInterval{Int64}(:open, 0)

julia> intervals = Array{BoundedInterval}(undef, 2)
2-element Array{BoundedInterval,1}:
 #undef
 #undef

julia> intervals[1] = bounded; intervals[2] = unbounded;

julia> intervals   # still heterogeneous (different parameter type)
2-element Array{BoundedInterval,1}:
 BoundedInterval{Float64}(:left_kind, :closed, 0.0, 1.0)
 BoundedInterval{Int64}(:open, :unbounded, 0, 0)

julia> intervals = Array{BoundedInterval{Float64}}(undef, 2)
2-element Array{BoundedInterval{Float64},1}:
 #undef
 #undef

julia> intervals[1] = bounded; intervals[2] = unbounded;

julia> intervals   # homogeneous!
2-element Array{BoundedInterval{Float64},1}:
 BoundedInterval{Float64}(:left_kind, :closed, 0.0, 1.0)
 BoundedInterval{Float64}(:open, :unbounded, 0.0, 0.0)

That doesn’t make much sense. If you skip the initialization of a field then for sure you are not allocating memory unnecessarily, and in addition you skip also the unneeded assignment. So skipping the initialization has to be always more efficient than assigning an arbitrary value to a field, which you cannot even to in general

struct P{S,T}
    s::S
    t::T
end

because you don’t have another value to reuse. This idea of duplicating a field is a hack that might work in specific cases, but is definitely not a versatile solution that applies to all circumstances.


That’s completely unrelated to the question. This introduces dynamic dispatch when you want to work with Vector{Interval{T}}, which comes with a performance penalty.

As I said, this is also a disadvantage when you work with collections of intervals of heterogeneous types, because it’s more expensive than checking some fields.

This architecture of course is suitable for some cases, but doesn’t solve the incomplete initialization issue at all. My need of incompletely initialization is precisely to avoid this dynamic approach.


That’s very nice, but again very dynamic in nature when you work with heterogeneous collections.

1 Like

I’m not totally sure about this. From the example in the documentation:

julia> struct HasPlain
           n::Int
           HasPlain() = new()
       end

julia> HasPlain()
HasPlain(438103441441)

But ok, if it’s only for primitive types that are not heavy, this is no big burden.

And I agree with the rest: being able to leave arbitrary fields uninitialized might the be simplest solution to your case, and also useful in many other cases. I was just commenting on workarounds since that does not seem possible right now.

Regarding your example, since Int is “plain data”, the field n just retains whatever value was already in that memory location. It doesn’t cause additional memory allocation. If it were BigInt or String, it would just put an invalid reference, without actually allocating the internal buffers of the (unexistent) BigInt or String. (I hope!)

struct HasRef
    n::BigInt
    HasRef() = new()
end

HasRef() # => HasRef(#undef)

I feel the same. It just seems like a generally useful feature that currently is only halfway implemented.

Your suggestion is indeed really nice, and is already what I’m doing elsewhere. Basically, depending on the application, I want to code both versions for maximum flexibility.

3 Likes

Maybe you already know that intervals are the bottleneck in your code, but in case you are yet to profile, benchmark, and optimize this, I would just define an API and leave the implementation flexible, going with whatever your consider the simplest to start with.

That said, I regularly go down rabbit holes of “doing it right” like this only to discover that they are responsible for 1% of runtime, so who I am to talk :wink:

2 Likes

I know my use case, but I don’t know my library users usecases. That’s why I would like to provide both alternatives. The typed endpoints version is good if you know a priori that you need to work only with intervals of the form [x,y), the other version is better if you need to work with intervals of different kinds and you want them to be of the same type to avoid runtime type-checking.


Apart from this, while all this brainstorming about this particular example (intervals) is really interesting (there are a loooooot of good ideas to learn from you all!), I don’t think these workarounds really address the question of better controlling an incomplete initialization. Hence…

… I decided to open this issue, with the hope that one day we will have keyword arguments for new(...)! :slight_smile:

Of course feel free to comment there if you think that it is a good idea or a stupid idea, and if you have the expertise maybe give some hints about how to tackle a possible implementation.

4 Likes