Representing Nullable Values

This roadmap should be characterized as the DataFrames.jl/DataTables.jl/DataStreams.jl roadmap, but not the roadmap for the data ecosystem in general. There might be other packages that are on board, but at least some folks (e.g. me) are not yet sold on the Union{T,Null} approach and have plans/roadmaps that differ from what has been outlined in the original post here when it comes to handling missing values.

I want to stress that I’m strongly in favor of experimenting with Union{T,Null}. But in my opinion there are too many open technical questions and too little understanding of all the ramifications of this plan to commit at this point to this approach (see some of the issues in Nulls.jl). I also think there are other approaches that have not been fully explored yet that might give us the same kind of usability that DataFrames has right now with the speed of DataTables, but that require a lot less (if any) changes in julia base.

With that preamble, here is my current roadmap for the family of packages that I’ve created in this space over the past few years (DataValues.jl, IterableTables.jl, Query.jl, CSVFiles.jl, ExcelReader.jl, ExcelFiles.jl, FeatherFiles.jl and StatFiles.jl; jointly loadable as Dataverse.jl):

  1. If all the Union{T,Null} issues are sorted out by julia 1.0 and that approach has emerged as the best approach to missing data, I’ll try to port my packages over once julia 1.0 is out (or maybe during the RC phase or something like that). I should stress that currently both of these conditions carry huge question marks in my mind.
  2. I’ll try to push the approach I took in DataValues.jl during the julia 0.6 cycle as far as I can. DataValue has handled missing values for Query.jl and IterableTables.jl very successfully for a couple of months now. I’m currently working on a port of DataTables.jl that uses DataValue for missing data. I’m not yet done, but I’m optimistic that I can create something that has the ease of use that DataFrame has, without the performance problems (i.e. performance would be in line with DataTable using Nullable). I’d say I’m halfway done with this work, so I don’t really know whether this will work out, but we should know fairly soon. I’m also not sure where this code will be hosted eventually. It might be in DataTables.jl if the folks maintaining that agree with it, or it might be a new package. I’ll write a much more detailed outline of my strategy for all of this once I’m done with the coding and once I have a sense whether it can actually work.

In my mind the high level strategy for the data ecosystem broadly should be that we try to push both the Union{T,Null} and the DataValue approach as far as we can, and once we have a better understanding of the trade-offs decide between those two approaches.

8 Likes

Care to elaborate? I share your skepticism about union types; it sounds a little too good to be true. However, I’m not aware of any other approach that would solve the problem anywhere near as elegantly.

The only alternative I’ve come up with that I haven’t seen mentioned is a DataFrame implementation that only allows Vector{<:AbstractFloat} and Vector{<:Integer} but presents these to humans in easily readable formats, i.e. as category strings or DateTime when necessary. At first I thought this idea was crazy, but, as I thought about it more, I realized that this would present few obstacles to performing most computations. In my 9 years of doing physics I had exactly 0 problems dealing with missing data (I just used NaN) and spent almost 0 time thinking about it. Suddenly when I started doing data science the topic became a huge clusterfuck (and this is hardly unique to Julia). My conclusion: strings and dates are evil and need to be destroyed.
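For what it’s worth, the NaN-based workflow for purely numeric columns is about as simple as it gets (a minimal illustration, not a proposal for any particular package):

x = [1.0, NaN, 3.0]               # NaN plays the role of "missing"
obs = filter(v -> !isnan(v), x)   # drop the missing entries
sum(obs) / length(obs)            # mean over the observed values only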

I might try this at some point.

The title of the post is “An Update on DataFrames Future Plans”, and the mentioned approach of using the Nulls.jl package does indeed affect the future of a number of other related packages, so I’d say it’s fair to characterize this as a roadmap for data-related ecosystem packages; though I’d also point out it’s at least as over-generalized as naming a package DataVerse.jl :wink:

It’s unfortunate that it seems you’re backtracking from our discussions at JuliaCon where, at least IMO, we were able to all come together with the right core devs to discuss potential blocking issues and come up with plans to resolve them. I understand there is still work to do, but at least in my mind, we had a consensus about the advantages of Union{T, Null} and agreed to move forward with it.

It’s also worth re-iterating that this isn’t as much an “experimentation” as much as an optimization exercise. Indeed, DataFrames/DataArrays have already been using the Union{T, NAtype} approach for what, 4 or 5 years now? Indeed, creating the Nulls.jl package basically consisted of splitting the NAtype out of the DataArrays.jl package. The “experimentation” of porting packages over (which in most cases is really porting the packages back to a union approach) has also advanced quite far, including corresponding branches across a number of repos that now use Union{T, Null}, without running into any blocking issues. I understand that you voiced concerns that there will be issues with Query.jl using this approach, but I also don’t think I’ve seen any attempt to port code over, which would help pinpoint exact issues and which I’ve mentioned I’m happy to help with. My point is that while the specific implementation of Query.jl may indeed need additional language improvements, there are plenty of other packages/workflows/implementations of similar data processing type code that have been or will be fully functional without any additional language improvements needed (functionally at least; of course any package desires performance improvements).

That’s not even mentioning the numerous open issues about Nullables/DataValues type approach (which I also discussed in detail during my JuliaCon talk). Indeed, I’d argue there are as many open technical questions/issues with using Nullables/DataValues for data analysis as there are for Union{T, Null}. There are certainly reasons that a number of data-related packages ported completely over to using the Nullables-based approach and have now decided to move to Union{T, Null}.

At the end of the day, perhaps the implementation that Query.jl has taken is better suited to Nullables/DataValues, and the approaches taken by other data-related packages are better suited to Union{T, Null}, so perhaps it’d be more useful to explore ways to ensure all these packages continue to interop using either approach.

3 Likes

From talking with various parties about the Union{T, Null} approach, the overall plan is the following:

  1. Represent nullable data with T? defined as Union{T, Null}.
  2. The representation of small unions is optimized to keep data inline like C unions with extra type indicator bits.
  3. Represent arrays of nullable data simply as Array{T?}.
  4. The representation of arrays of unions is optimized to keep data contiguous inline with extra type indicator bits after the main array data.
  5. For nullable values where null is included in the domain of valid values, use Value{T}? where Value is a simple wrapper type like Ref but immutable.

This is a simple, composable approach and addresses all concerns as far as I can tell. It’s not annoying to use since there’s generally no explicit wrapping or unwrapping or “lifting” required. Functions that don’t have methods for null arguments simply raise method errors. Functions that need to handle nulls simply add methods that do the right thing. Recent compiler optimizations make all of this as efficient as the current Nullable approach, if not more so. Moreover, if someone wants to have seven different kinds of nullability for their data, they can do so just by implementing their own null-like types and using unions of them – the compiler optimizations are generic and decoupled from the specifics of the null type. All the scalar optimizations are already implemented and merged on Julia master. I’m not sure about the status of the array representation, but perhaps @quinnj or @jameson can fill us in on the current status.
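To make the dispatch-based handling concrete, here is a minimal sketch (assuming the Null type and null value provided by Nulls.jl):

using Nulls                            # provides the Null type and the null value

square(x::Int) = x * x                 # without a Null method, square(null) raises a MethodError
square(::Null) = null                  # adding a Null method makes the function "do the right thing"

xs = Union{Int, Null}[1, 2, null, 4]   # an array of "Int?" values
map(square, xs)                        # -> [1, 4, null, 16]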

The primary technical concern @jameson has with this whole approach is that if we represent rows of tables with many nullable columns using covariant named tuples, then there is a potentially exponential number of concrete types of rows in terms of the number of nullable columns, which could – absent compiler changes – lead to an exponential number of code specializations and bad performance due to excessive amounts of dynamic dispatch. However, @jeff.bezanson is confident that we can handle this with better specialization heuristics (we deal with potentially exponential code specializations all the time), and has pointed out that the existing NamedTuples package has the same problem and yet already works on Julia as is. Of course, the existing NamedTuples package isn’t as widely used as built-in named tuples would be.

13 Likes

+1 to all that.
One thing we do lose is that Nullable forced client code to acknowledge that a value could be null, since it would have to unwrap it, e.g.

y = ...
x = f(y)
return x+1

If f returns a Nullable{Int}, then this code will fail when tested with any value of y, and the author will be reminded they have to account for the fact that f can fail. If f returns a Union{Int, Null} instead, the code will only fail if tested on a y that causes f(y) to return Null, and so could be missed.

But a package can just provide a Maybe wrapper type and get back that functionality.
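For illustration, a minimal sketch of such a wrapper (the names Maybe and unwrap are hypothetical, not an existing package API):

struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new(false)      # empty: no value present
    Maybe{T}(x) where {T} = new(true, x)   # wraps a value
end
Maybe(x::T) where {T} = Maybe{T}(x)

unwrap(m::Maybe) = m.hasvalue ? m.value : error("tried to unwrap an empty Maybe")

If f returned a Maybe{Int}, the x + 1 above would have to be written as unwrap(f(y)) + 1, so the author is again forced to acknowledge that f can fail.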

2 Likes

That would be addressed by the Value{T}? type where Value wraps an inner value and thereby distinguishes null meaning “there is no value” from Value(null) meaning “there is a value and it is null”. This single type addresses the nullable value pattern as well as the result type pattern since a function can return Union{Value{T}, <:Exception}.
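For concreteness, one way such a wrapper might be defined (illustrative only, assuming null from Nulls.jl; as noted further down, the actual definition hasn’t been posted anywhere):

struct Value{T}
    x::T       # like Ref, but immutable: simply wraps an inner value
end

null           # "there is no value"
Value(null)    # "there is a value, and that value happens to be null"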

4 Likes

Ah I see, great. Maybe having a github tracking issue that lists all the issues and PRs associated with making the plan you discussed a reality would be useful.

Sure. Just pulling that summary together has taken me a week :joy: :sob:, but yes, it should be collected on GitHub with PR references somewhere. As far as I know the Value idea is not posted anywhere, I got that from a direct conversation with @jameson. If he’s posted that on GitHub somewhere, hopefully he can provide a link here.

2 Likes

Jameson posted some comments about that here

I assume I was one of those “parties”…

I would like to clarify that I think this is the correct plan for return values only, but that it would potentially be a mistake to think that it should be generalized to all usages. I don’t think the use of Array{T?} should be generalized to appearing as any type parameter. I think that typically arbitrary types should be parameterized by the leaf type of their contents (and not be “tricked” into being made nullable by using a parameter type of T?). Only the field types should be declared as nullable (since that determines the return type of getfield), although I don’t think this usage should actually come up very often in practice (but for example, it occurs in linked list / tree data structures where the parent / child pointers could be nullable). Along those lines, I think that using NullableArray{T} may indeed be better than Array{T?} (but with a much simplified implementation, from being able to use Array{T?} internally for storage layout and optimization). For all other cases (e.g. other than as a return value), I think the nullable should have to be wrapped into a struct with a nullable field (essentially, DataValues.jl) in order to be lifted. This ensures the user takes responsibility for the handling of the null value (which I think should also address malmaud’s concern).
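As a small, hedged illustration of “only the field types are nullable” (assuming Null/null from Nulls.jl; the node type is made up):

using Nulls

struct TreeNode{T}                       # parameterized by the leaf type T, not by T?
    value::T
    parent::Union{TreeNode{T}, Null}     # only the field is nullable; getfield returns a "TreeNode{T}?"
end

root  = TreeNode{Int}(1, null)           # a root node has no parent
child = TreeNode{Int}(2, root)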

Additionally, I think with this approach we might even choose not to change eltype, but instead unify it with nullable-get semantics for dictionaries, such as a get? or []? function. I think that this approach may also let us make null-handling very explicit (with the extra ? in the function name), gaining back even more of malmaud’s forced-acknowledgement (and at the syntax level, for reviewer convenience), without needing the additional overhead of nullable unwrapping.

The primary technical concern…

It’s not primarily just a technical concern of mine. Attempting to treat a NamedTuple field as nullable also violates all of my assertions in the previous paragraph. Since I think null data should only occur as a function return value and not as a function argument, a nullable-unaware construct such as NamedTuples will require the use of an additional wrapper (e.g. DataValue) that is null-aware.

As a concrete example, for a mapping of (id, name) => (gender, age) a JuliaDB (a sparse table) might be created to have a type of Table{Columns{IdType, NameType, DataValue{GenderType}, DataValue{AgeType}}}. Alternatively, the equivalent dataframe (a null-aware table) might have the type Table{IdType, NameType, Columns{GenderType, AgeType}}. Note that in both of these cases, the nullability is explicitly indicated and handled by a type. I don’t see any reason to make NamedTuple an exception here. Since constructing it requires calling a function and parameterizing a type, either of those would require that any known-nullable value get wrapped first.

Ideally, I think I would like for a null-aware type like DataFrames to be able to return a null-aware type like NullableTuple{T...} or NullableNamedTuple{T...} (with an eltype of T?) such that the type unwrapping must be acknowledged by the user. If NamedTuple were implemented to act exactly like a Dict does now (e.g. be indexed like a regular tuple with [:name], and iterate as name=>value pairs), I think using alternative implementations would be as easy as making an alternative Dict. Currently, the proposed API is struct-like and not Dict-like, so the proposed usage examples appear to be biased towards exploiting implementation details rather than encouraging the experimentation, duck typing, and type-based dispatch for which Julia is justifiably famous :slightly_smiling_face:.

So, no, this is mostly not a technical concern, because we can certainly define a mapping of corresponding APIs, such as having get-nullable-field-of-namedtuple for get?, and getfield for getindex, and nonnull-keys for keys. But, um, that sounds very unpleasant to me. By the way, have I mentioned yet that I’m in favor of just defining getfield(dict<:Associative, s::Symbol) = dict[s] (aka dict.s = dict[:s], aka implement getfield-overloading) so that we only have one API for this indexing-type operation, but so that the syntax-demanding folks will be satisfied :slight_smile:?

Despite the fear, uncertainty and doubt this reply likely evokes, none of this negates the fundamental plan – everything I wrote is still exactly what we plan to do. There are technical debates to be had (e.g. I vehemently disagree with the need for having a NullableArray{T} type separate from Array{T?} – I just don’t see the point), but they are very much in the weeds and not something that needs to be brought up here.

The notion of having return values that can’t be arguments is nonsensical – either null is a first-class value or it isn’t. Our current #undef is not a real value – but you also can’t return it from a function. We might want to replace #undef with null and require fields that can be left uninitialized to be supertypes of Null. The “billion dollar mistake” of C/C++/Java is avoided even though null is a real value because the real problem in those languages is that there’s no way to express in the type system that an object (or pointer) cannot be null. We wouldn’t have that issue: MyConcreteType would definitely be non-null since Null is not a subtype of it. Only if something is MyConcreteType? would you have to worry about null.

5 Likes

+1

Someone asked me offline whether my comments above were intended to refute the original post. So to clarify, that was not my intent. I think DataFrames.jl is making the right choice in taking advantage of the compiler optimizations to make representing nullable data unwrapped as simply Union{T, Null} so the callee doesn’t need to unwrap a Nullable{T} (but does need to check the return type). But I also think DataValue{T} has a place too (where the caller doesn’t need to unwrap the value either, but where the next callee needs to be a lifting function which does handle and propagate the null). And that DataValue can also benefit from the compiler improvements to Union, such that both approaches should actually be complementary to each other. I think the only real loser in this space is the existing Nullable type, which has the downside of needing unwrapping, but without the advantage of having some lifted methods defined.

Finally, I think DataValues / Nullable’s greatest strength is also its greatest weakness: the existence and requirement of a typed null. For return values from computations (such as tryparse or even just lifted +), it’s cumbersome to produce this type, and helps neither the caller nor the callee to have to express it (it just made the compiler happy due to some artificial “type stability” limitation which I think should now be fixed). However, I think focusing on this issue too closely misses the beneficial side: that it can work great for interfaces like Query.jl, where that type is already computed and stashed away in a type parameter.

My intention was to observe the connection between nullable data and nullable fields, as providing both and considering them independently seems redundant. UndefArray may have been a more appropriate name, given that NullableArray already exists with certain semantics. For instance, by using an underlying array type of Array{T?}, the interface for UndefArray{T} could emulate the current UndefRefError, and we could then even consider removing the ability for array fields to contain #undef.

This property is probably usually more adequately phrased as an “engineer’s null” or “non-propagating null” by others in this discussion thread.

But, I had considered stating this sentence in terms of return types (which I think would have been more accurate), but wasn’t certain the nuance would be appreciated. However, I will do so now: the set of possible argument types is a subset of the set of possible return types. This is due to the fact that Julia dispatches on the runtime (concrete, leaf) types of the arguments, but the return type is inferred over multiple possible dataflow paths (and could be a union or abstract type, in addition to being the exact runtime type). Thus while a return type can be Union{T, Null}, the argument type is either T or Null, but it – unlike the return type – cannot be simultaneously both.
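To illustrate that distinction with a small sketch (assuming Null/null from Nulls.jl):

using Nulls

f(x::Int) = x > 0 ? x : null     # the inferred *return type* of f is Union{Int, Null}

g(x::Int) = x + 1                # methods dispatch on the concrete runtime type...
g(::Null) = null                 # ...so each call sees either an Int or a Null argument, never the union

g(f(2))     # dispatches to g(::Int)
g(f(-1))    # dispatches to g(::Null)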

I completely understand the concern that as long as a.b is not overloadable, it’s not possible to define other types that act like a named tuple. Hence I’m in favor of making dot overloadable, and you say you are as well, so… no problem?

I don’t think iterating name=>value pairs has anything to do with it. That’s just a design decision that can be made either way, for Dicts as well. However the main purpose of named tuples is to represent names at the type level, in order to remove them from the data path as much as possible, so you can efficiently have a huge number of tuples with the same names.

3 Likes

If we get rid of #undef fields, then we should get rid of #undef in arrays as well. They cause all kinds of problems, including making it impossible to write generic code to copy data structures – since trying to access an #undef is an immediate error. Sure, you can work around it, but it’s an annoying corner case.

My current preferred approach is to only allow fields and arrays that are explicitly nullable – i.e. have an (el)type that includes null – to be uninitialized, and eliminate #undef entirely. No one really wants the current #undef behavior, it was just a way to avoid the “billion dollar mistake”; I believe that only allowing explicitly nullable locations to be null prevents that just as effectively as #undef does without introducing an awkwardly non-first-class value.

I still don’t think your point about return types and argument types makes sense: the value returned from a function is either of type T or Null in exactly the same way that an argument must be of one type or the other.

1 Like

HL7 defines a number of so-called Null Flavors (NullFlavor - FHIR v4.0.1)
For example NA=not applicable, UNK=unknown, etc.
Would it be possible to incorporate such Null flavors into the approach described in this thread?

For example
struct Null
    flavor::Int
end
const NA = Null(1)
const UNK = Null(2)
. . .

Alternatively:
abstract type Null end
struct NA <: Null end
struct UNK <: Null end
. . .

Yes, one of the advantages of the Union type approach is that you can define your own null types like the ones in your second set of definitions, and use e.g. Union{Int, NA, UNK}.
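For instance, continuing the second set of definitions above (purely illustrative):

v = Union{Int, NA, UNK}[1, NA(), 2, UNK()]   # a column carrying data plus both null flavors
count(x -> x isa UNK, v)                     # how many entries are merely "unknown"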

5 Likes

So it means that to initialize an array progressively, people would create Array{T?} objects, fill them, and only at the end convert them to Array{T} (e.g. when assigning them to an object’s field inside a constructor)? That’s an interesting solution, especially if that can be made fast because the Array{T} object would simply reuse the value part of the existing Array{T?} array, discarding the type tag part.
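A rough sketch of that workflow as it might look (assuming Null/null from Nulls.jl; whether the final step can reuse the value storage rather than copy is exactly the open question here):

using Nulls

a = Union{Float64, Null}[null, null, null]   # an Array{Float64?} built up progressively
for i in eachindex(a)
    a[i] = i / 2
end
b = convert(Vector{Float64}, a)              # only valid once no nulls remain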

2 Likes

There’s also fill with a dummy ref or the ref to the first value or something.

I really love the idea, @StefanKarpinski, but how will similar(a, T) work for all T?

(Interestingly, this brings some of the difficulties I see in incrementally constructing StaticArrays to Base.Array. Perhaps both can share a design where you build a partially initialised array and then call a function to declare the construction to be “complete” - here this would discard T? for T, and for static arrays it could unwrap a Ref of a static array or something).

Yes, that’s an issue – it may not be possible with all element types in this proposed scheme, which is certainly a problem we’d need to figure out a solution to.