Representing Nullable Values

From talking with various parties about the Union{T, Null} approach, the overall plan is the following:

  1. Represent nullable data with T? defined as Union{T, Null}.
  2. The representation of small unions is optimized to keep data inline like C unions with extra type indicator bits.
  3. Represent arrays of nullable data simply as Array{T?}.
  4. The representation of arrays of unions is optimized to keep data contiguous inline with extra type indicator bits after the main array data.
  5. For nullable values where null is included in the domain of valid values, use Value{T}? where Value is a simple wrapper type like Ref but immutable.
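The pieces of this plan can be sketched in plain Julia. Note that `Null`, `null`, and `Value` are the proposed names from this thread, not existing Base definitions, so the stand-in definitions below are illustrative only:

```julia
# Stand-ins for the proposed types (illustrative, not Base definitions):
struct Null end          # singleton type of the proposed null value
const null = Null()

# T? is proposed sugar for Union{T, Null}:
const IntOrNull = Union{Int, Null}

# Arrays of small unions keep the data contiguous and inline,
# with type-indicator bytes stored after the main array data:
a = IntOrNull[1, null, 2]

# A simple immutable wrapper, like Ref but immutable, for domains
# where null itself is a valid value:
struct Value{T}
    x::T
end

v = Value(null)   # "there is a value, and it is null"
```

The key point is that no explicit wrapping or unwrapping is needed for the common case; `Value` only enters the picture when null is itself a valid domain value.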

This is a simple, composable approach and addresses all concerns as far as I can tell. It’s not annoying to use since there’s generally no explicit wrapping or unwrapping or “lifting” required. Functions that don’t have methods for null arguments simply raise method errors. Functions that need to handle nulls simply add methods that do the right thing. Recent compiler optimizations make all of this as efficient as the current Nullable approach, if not more so. Moreover, if someone wants to have seven different kinds of nullability for their data, they can do so just by implementing their own null-like types and using unions of them – the compiler optimizations are generic and decoupled from the specifics of the null type. All the scalar optimizations are already implemented and merged on Julia master. I’m not sure about the status of the array representation, but perhaps @quinnj or @jameson can fill us in on the current status.

The primary technical concern @jameson has with this whole approach is that if we represent rows of tables with many nullable columns using covariant named tuples, then there is a potentially exponential number of concrete types of rows in terms of the number of nullable columns, which could – absent compiler changes – lead to an exponential number of code specializations and bad performance due to excessive amounts of dynamic dispatch. However, @jeff.bezanson is confident that we can handle this with better specialization heuristics (we deal with potentially exponential code specializations all the time), and has pointed out that the existing NamedTuples package has the same problem and yet already works on Julia as is. Of course, the existing NamedTuples package isn’t as widely used as built-in named tuples would be.


+1 to all that.
One thing we do lose is that Nullable forced client code to acknowledge that a value could be null, since it would have to unwrap it. E.g.:

y = ...
x = f(y)
return x+1

If f returns a Nullable{Int}, then this code will fail when tested with any value of y and the author will be reminded they have to account for the fact that f can fail. If f returns a Union{Int, Null} instead, the code will only fail if tested on a y that causes f(y) to return Null, and so could be missed.
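A self-contained sketch of that failure mode (using a stand-in `Null` type, since these names are proposals, not Base definitions):

```julia
# Stand-in definitions for illustration:
struct Null end
const null = Null()

# A hypothetical f whose return type is Union{Int, Null}:
f(y) = y > 0 ? y : null

g(y) = f(y) + 1

g(1)    # works fine: returns 2
g(-1)   # MethodError: there is no + method taking a Null
```

Tests that never exercise a null-producing `y` would pass, so the missing null handling in `g` could go unnoticed, which is exactly the forced-acknowledgement that `Nullable` unwrapping used to provide.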

But a package can just provide a Maybe wrapper type and get back that functionality.


That would be addressed by the Value{T}? type where Value wraps an inner value and thereby distinguishes null meaning “there is no value” from Value(null) meaning “there is a value and it is null”. This single type addresses the nullable value pattern as well as the result type pattern since a function can return Union{Value{T}, <:Exception}.
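A sketch of how the two levels of null stay distinguishable (again with stand-in definitions, since `Null` and `Value` are proposed names):

```julia
# Stand-in definitions for illustration:
struct Null end
const null = Null()
struct Value{T}
    x::T
end

# Union{Value{T}, Null} distinguishes "no value" from "a null value":
lookup(found::Bool) = found ? Value(null) : null

lookup(true)    # Value(null): a value exists, and it happens to be null
lookup(false)   # null: there is no value at all

# The same wrapper supports the result-type pattern, returning
# Union{Value{T}, <:Exception} (find_user is hypothetical):
find_user(id) = id == 1 ? Value("alice") : KeyError(id)
```

Because `Value` is immutable and wraps a single field, it plays the same role as `Ref` without allowing mutation.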


Ah I see, great. Maybe having a github tracking issue that lists all the issues and PRs associated with making the plan you discussed a reality would be useful.

Sure. Just pulling that summary together has taken me a week :joy: :sob:, but yes, it should be collected on GitHub with PR references somewhere. As far as I know the Value idea is not posted anywhere; I got that from a direct conversation with @jameson. If he’s posted that on GitHub somewhere, hopefully he can provide a link here.


Jameson posted some comments about that here

I assume I was one of those “parties”…

I would like to clarify that I think this is the correct plan for return values only, but that it would potentially be a mistake to think that it should be generalized to all usages. I don’t think the use of Array{T?} should be generalized to appearing as any type parameter. I think that typically arbitrary types should be parameterized by the leaf type of their contents (and not be “tricked” into being made nullable by using a parameter type of T?). Only the field types should be declared as nullable (since that determines the return type of getfield), although I don’t think this usage should actually come up very often in practice (but for example, it occurs in linked list / tree data structures where the parent / child pointers could be nullable). Along those lines, I think that using NullableArray{T} may indeed be better than Array{T?} (but with a much simplified implementation, from being able to use Array{T?} internally for storage layout and optimization). For all other cases (e.g. other than as a return value), I think the nullable should have to be wrapped into a struct with a nullable field (essentially, DataValues.jl) in order to be lifted. This ensures the user takes responsibility for the handling of the null value (which I think should also address malmaud’s concern).

Additionally, I think with this approach we might even choose not to change eltype, but instead unify it with nullable-get semantics for dictionaries, such as a get? or []? function. I think that this approach may also let us make null-handling very explicit (with the extra ? in the function name), gaining back even more of malmaud’s forced-acknowledgement (and at the syntax level, for reviewer convenience), without needing the additional overhead of nullable unwrapping.

The primary technical concern…

It’s not primarily just a technical concern of mine. Attempting to treat a NamedTuple field as nullable also violates all of my assertions in the previous paragraph. Since I think null data should only occur as a function return value and not as a function argument, a nullable-unaware construct such as NamedTuples will require the use of an additional wrapper (e.g. DataValue) that is null-aware.

As a concrete example, for a mapping of (id, name) => (gender, age), a JuliaDB table (a sparse table) might be created to have a type of Table{Columns{IdType, NameType, DataValue{GenderType}, DataValue{AgeType}}}. Alternatively, the equivalent dataframe (a null-aware table) might have the type Table{IdType, NameType, Columns{GenderType, AgeType}}. Note that in both of these cases, the nullability is explicitly indicated and handled by a type. I don’t see any reason to make NamedTuple an exception here. Since constructing it requires calling a function and parameterizing a type, either of those would require that any known-nullable value get wrapped first. Ideally, I think I would like for a null-aware type like DataFrames to be able to return a null-aware type like NullableTuple{T...} or NullableNamedTuple{T...} (with an eltype of T?) such that the type unwrapping must be acknowledged by the user. If NamedTuple were implemented to act exactly like a Dict does now (e.g. be indexed like a regular tuple with [:name], and iterate as name=>value pairs), I think using alternative implementations would be as easy as making alternative Dicts. Currently, the proposed API is struct-like and not Dict-like, so the proposed usage examples appear to be biased towards exploiting implementation details, rather than encouraging the experimentation, duck typing, and type-based dispatch for which Julia is justifiably famous :slightly_smiling_face:. So, no, this is mostly not a technical concern, because we can certainly define a mapping of corresponding APIs, such as having get-nullable-field-of-namedtuple for get?, and getfield for getindex, and nonnull-keys for keys. But, um, that sounds very unpleasant to me.
By the way, have I mentioned yet that I’m in favor of just defining getfield(dict::Associative, s::Symbol) = dict[s] (aka dict.s = dict[:s], aka implement getfield-overloading) so that we only have one API for this indexing-type operation, but so that the syntax-demanding folks will be satisfied :slight_smile:?
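In post-1.0 Julia this idea corresponds to overloading `getproperty` rather than `getfield` directly. A minimal sketch of the dict.s = dict[:s] idea, using a hypothetical wrapper type:

```julia
# SymDict is a hypothetical wrapper, for illustration only:
struct SymDict
    data::Dict{Symbol, Any}
end

# Route dot syntax to dictionary lookup; getfield is used internally
# to reach the wrapped Dict without recursing through getproperty:
Base.getproperty(d::SymDict, s::Symbol) = getfield(d, :data)[s]

d = SymDict(Dict(:a => 1, :b => 2))
d.a   # looks up :a in the wrapped Dict, yielding 1
```

This gives one syntax for the indexing-type operation while leaving the choice of storage representation to the type.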

Despite the fear, uncertainty and doubt this reply likely evokes, none of this negates the fundamental plan – everything I wrote is still exactly what we plan to do. There are technical debates to be had (e.g. I vehemently disagree with the need for having a NullableArray{T} type separate from Array{T?} – I just don’t see the point), but they are very much in the weeds and not something that needs to be brought up here.

The notion of having return values that can’t be arguments is nonsensical – either null is a first-class value or it isn’t. Our current #undef is not a real value – but you also can’t return it from a function. We might want to replace #undef with null and require fields that can be left uninitialized to be supertypes of Null. The “billion dollar mistake” of C/C++/Java is avoided even though null is a real value because the real problem in those languages is that there’s no way to express in the type system that an object (or pointer) cannot be null. We wouldn’t have that issue since MyConcreteType would definitely be non-null since Null is not a subtype of it. Only if something is MyConcreteType? would you have to worry about null.
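A sketch of why making null a first-class value still avoids the billion-dollar mistake (stand-in definitions; the point is that concrete types exclude Null by construction):

```julia
# Stand-in definitions for illustration:
struct Null end
const null = Null()

struct Point
    x::Float64
end

# An argument annotated ::Point can never be null, so no check is needed:
norm1(p::Point) = abs(p.x)

# Only a Point? (i.e. Union{Point, Null}) location forces a null check:
handle(p::Union{Point, Null}) = p isa Null ? 0.0 : norm1(p)

handle(Point(3.0))   # 3.0
handle(null)         # 0.0
```

Unlike C/C++/Java references, non-nullability is expressed in the type system: `Null` is simply not a subtype of `Point`.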



Someone asked me offline whether my comments above were intended to refute the original post. So to clarify, that was not my intent. I think DataFrames.jl is making the right choice in taking advantage of the compiler optimizations to make representing nullable data unwrapped as simply Union{T, Null} so the callee doesn’t need to unwrap a Nullable{T} (but does need to check the return type). But I also think DataValue{T} has a place too (where the caller doesn’t need to unwrap the value either, but where the next callee needs to be a lifting function which does handle and propagate the null). And that DataValue can also benefit from the compiler improvements to Union, such that both approaches should actually be complementary to each other. I think the only real loser in this space is the existing Nullable type, which has the downside of needing unwrapping, but without the advantage of having some lifted methods defined.

Finally, I think DataValues / Nullable’s greatest strength is also its greatest weakness: the existence and requirement of a typed null. For return values from computations (such as tryparse or even just lifted +), it’s cumbersome to produce this type, and helps neither the caller nor the callee to have to express it (it just made the compiler happy due to some artificial “type stability” limitation which I think should now be fixed). However, I think focusing on this issue too closely misses the beneficial side: that it can work great for interfaces like Query.jl, where that type is already computed and stashed away in a type parameter.

My intention was to observe the connection between nullable data and nullable fields, as providing both and considering them independently seems redundant. UndefArray may have been a more appropriate name, given that NullableArray already exists with certain semantics. For instance, by using an underlying array type of Array{T?}, the interface for UndefArray{T} could emulate the current UndefRefError, and we could then even consider removing the ability for array fields to contain #undef.

This property is probably usually more adequately phrased as an “engineer’s null” or “non-propagating null” by others in this discussion thread.

But, I had considered stating this sentence in terms of return types (which I think would have been more accurate), but wasn’t certain the nuance would be appreciated. However, I will do so now: the set of possible argument types is a subset of the set of possible return types. This is due to the fact that Julia dispatches on the runtime (concrete, leaf) types of the arguments, but the return type is inferred over multiple possible dataflow paths (and could be a union or abstract type, in addition to being the exact runtime type). Thus while a return type can be Union{T, Null}, the argument type is either T or Null, but it – unlike the return type – cannot be simultaneously both.
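The asymmetry between inferred return types and runtime argument types can be seen in a small example (stand-in `Null` definitions again):

```julia
# Stand-in definitions for illustration:
struct Null end
const null = Null()

# The *inferred* return type of f is Union{Int, Null}:
f(y::Int) = y > 0 ? y : null

# But at any call site, the *runtime* argument type is concrete,
# so dispatch picks exactly one of these methods:
g(x::Int)  = "got an Int"
g(x::Null) = "got null"

g(f(1))    # dispatches to the Int method
g(f(-1))   # dispatches to the Null method
```

No runtime value ever has the type `Union{Int, Null}` itself; the union exists only as a static description of the possible dataflow paths.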

I completely understand the concern that as long as a.b is not overloadable, it’s not possible to define other types that act like a named tuple. Hence I’m in favor of making dot overloadable, and you say you are as well, so… no problem?

I don’t think iterating name=>value pairs has anything to do with it. That’s just a design decision that can be made either way, for Dicts as well. However the main purpose of named tuples is to represent names at the type level, in order to remove them from the data path as much as possible, so you can efficiently have a huge number of tuples with the same names.


If we get rid of #undef fields, then we should get rid of #undef in arrays as well. They cause all kinds of problems, including making it impossible to write generic code to copy data structures – since trying to access an #undef is an immediate error. Sure, you can work around it, but it’s an annoying corner case.

My current preferred approach is to only allow fields and arrays that are explicitly nullable – i.e. have an (el)type that includes null – to be uninitialized, and eliminate #undef entirely. No one really wants the current #undef behavior, it was just a way to avoid the “billion dollar mistake”; I believe that only allowing explicitly nullable locations to be null prevents that just as effectively as #undef does without introducing an awkwardly non-first-class value.
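A sketch of what "only explicitly nullable locations may be uninitialized" would look like for the linked-structure case mentioned earlier (stand-in `Null` definitions; the constructor initializes the nullable field to null instead of leaving it #undef):

```julia
# Stand-in definitions for illustration:
struct Null end
const null = Null()

mutable struct Node
    value::Int
    next::Union{Node, Null}   # explicitly nullable, so it may start as null
    Node(v) = new(v, null)    # no #undef involved anywhere
end

n = Node(1)
n.next = Node(2)   # fill in the link later
n.next isa Node    # true
```

Generic code copying such a structure never hits an access error: every field always holds a real, first-class value.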

I still don’t think your point about return types and argument types makes sense: the value returned from a function is either of type T or Null in exactly the same way that an argument must be of one type or the other.


HL7 defines a number of so called Null Flavors (NullFlavor - FHIR v4.0.1)
For example NA=not applicable, UNK=unknown, etc.
Would it be possible to incorporate such Null flavors into the approach described in this thread?

For example
struct Null
    flavor::Int
end
const NA = Null(1)
const UNK = Null(2)
. . .

abstract type Null end
struct NA <: Null end
struct UNK <: Null end
. . .

Yes, one of the advantages of the Union type approach is that you can define your own null types like the ones in your second set of definitions, and use e.g. Union{Int, NA, UNK}.
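A sketch of how distinct null flavors compose with the union approach (the type names here are illustrative; `AbstractNull` is used as the abstract supertype to avoid clashing with the proposed `Null`):

```julia
# A user-defined null hierarchy (names illustrative):
abstract type AbstractNull end
struct NA  <: AbstractNull end   # not applicable
struct UNK <: AbstractNull end   # unknown

# Each flavor propagates itself through lifted operations:
Base.:+(x::AbstractNull, ::Number) = x
Base.:+(::Number, x::AbstractNull) = x

v = Union{Int, NA, UNK}[1, NA(), UNK()]
v[2] + 1   # NA(): the "not applicable" flavor survives the computation
```

Because the compiler optimizations are generic over small unions, this gets the same inline storage and dispatch treatment as the built-in null type would.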


So it means that to initialize an array progressively, people would create Array{T?} objects, fill them, and only at the end convert them to Array{T} (e.g. when assigning them to an object’s field inside a constructor)? That’s an interesting solution, especially if that can be made fast because the Array{T} object would simply reuse the value part of the existing Array{T?} array, discarding the type tag part.
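A sketch of that incremental-initialization pattern (stand-in `Null` definitions; whether the final conversion can actually alias the value storage and just drop the tag bytes is exactly the open question in this post):

```julia
# Stand-in definitions for illustration:
struct Null end
const null = Null()

# Build up an Array{Int?} progressively:
buf = Vector{Union{Int, Null}}(undef, 3)
fill!(buf, null)
for i in eachindex(buf)
    buf[i] = i^2
end

# Convert once every slot is filled; throws if any null remains:
a = convert(Vector{Int}, buf)   # [1, 4, 9]
```

Since the union-array layout keeps the values contiguous with the type tags stored separately, an aliasing conversion is at least plausible at the representation level.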


There’s also fill with a dummy ref or the ref to the first value or something.

I really love the idea, @StefanKarpinski, but how will similar(a, T) work for all T?

(Interestingly, this brings some of the difficulties I see in incrementally constructing StaticArrays to Base.Array. Perhaps both can share a design where you build a partially initialised array and then call a function to declare the construction to be “complete” - here this would discard T? for T, and for static arrays it could unwrap a Ref of a static array or something).

Yes, that’s an issue – it may not be possible with all element types in this proposed scheme, which is certainly a problem we’d need to figure out a solution to.

They are actually very different. The return type is both T and Null (not either). In a statically dispatched language, the argument type is also both (either with some form of widening or pre-computed dispatch match table). But in Julia, runtime dispatch means that the argument type must be one or the other, and cannot be both (except insofar as users can try to emulate it with Core.Inference.return_type, but that’s very difficult to get right and fairly unpredictable, so we try to use it only when it is unavoidable - such as the empty map case).

it’s not possible to define other types that act like a named tuple

To repeat myself, “no, this is mostly not a technical concern, because we can certainly define a mapping of corresponding APIs”. However, I don’t want to add another API to make existing Associative types act like named tuples. I want named tuples to (as much as is logical) act like existing Associative types. Why is this so unreasonable a request?

However the main purpose of named tuples is to represent names at the type level, in order to remove them from the data path as much as possible, so you can efficiently have a huge number of tuples with the same names.

Again, to repeat myself, “the proposed usage examples appear to be biased towards exploiting implementation details”. There’s nothing fundamental about Dict which prevents the names from being represented at the type level and removed from the data path as much as possible. Indeed, there’s even a short gist on one of your PRs showing how to do it in a handful of lines of code. Efficiently having huge numbers of tuples with the same names also doesn’t logically follow from this statement. As a counter-example, just for the sake of demonstration, it could be structured as a Tuple{svec(:names...), values...}. This incurs the cost of knowing about one extra pointer, but that’s a fairly inconsequential cost overall (which could likely even be elided in various circumstances). There are lots of other reasons this representation would not be ideal, but “inefficient use of memory” should not be one of them, since the optimal memory layout is usually something with better memory packing anyways – such as IndexedTables or a Tuple of Arrays (DataFrame of Columns).


I want named tuples to (as much as is logical) act like existing Associative types.

Yes, they can and will implement the majority of the Associative interface. See the NamedTuples.jl package, which has haskey, keys, getindex, etc. The only difference is key=>value pair iteration, because it’s very annoying to have the interface throwing the names at you when you almost never want to consume them that way.

There’s nothing fundamental about Dict which prevents the names from being represented at the type level

That’s true, you can have a FixedKeyDict. I’ll try to exhaustively list the differences between named tuples and dicts as I see them:

  1. Iterating k=>v pairs. I think if we could do it over again, I wouldn’t make dicts work this way. Pairs are awkward because they’re not the data you care about, they simply reflect the fact that you chose to put your data in a dict.
  2. Special syntax for symbol keys: a.b, (a=1, b=2), {x, y}. Restricting to symbol keys makes that syntax correspond to the type’s supported operations; otherwise you’d have to switch syntax for different types of keys. As is often the case, adding the right restriction can improve usability.
  3. No single value type. Like tuples, named tuples track the type of each element separately, so there is no good V to put in Associative{K,V}.
  4. Covariance. We need named tuples to be covariant, and supporting that is much easier than adding general covariance.
  5. Simpler reflection. Named tuples have fields the same way that tuples and struct types do, so nfields, fieldnames, etc. work as expected, and the layout is the same. This doesn’t usually matter, but it’s just nice to have this kind of parity at all levels.

This seems like sufficient justification for a separate data structure to me.
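For reference, points 2, 3, and 5 look like this with named tuples as eventually implemented (shown as a sketch; behavior as in Julia 1.x):

```julia
nt = (a = 1, b = "two")    # the special construction syntax from point 2

nt.a                       # field-access syntax: 1
nt[2]                      # tuple-like indexing: "two"

# Point 3: each element keeps its own type; there is no single V
# for an Associative{K,V}-style signature:
typeof(nt) <: NamedTuple{(:a, :b)}   # true

# Point 5: reflection parity with tuples and structs:
fieldnames(typeof(nt))     # (:a, :b)
nfields(nt)                # 2
```

The names live entirely in the type parameters, so many tuples with the same names share one concrete type and carry no per-instance name data.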


I can think of many languages where iteration works like Julia (Java, Python, C++, Rust). The only language that comes to mind that does something else is Javascript.

Yes, but the alternative is that you have to switch syntax (for example, from getfield to getindex) for different types of Associative-like types. This seems like a particularly weak argument to use when proposing adding a new object with special syntax that can only be used by that type. That said, I’m not arguing against adding some new sort of type to represent (a=1, b=2) / keyword arguments in a more useful way, just questioning how that new type should behave.

I don’t know that this is actually a benefit. If we want users to be able to have other types that wrap or act like NamedTuples, then encouraging using builtin functions for their primary API would be a distinct disadvantage over using the existing generic names for these properties (length, keys, etc.), since it conflates representation with behavior (violates the “separation of mechanism and policy” principle).