Representing Nullable Values

The history of Nullable is littered with PRs and ideas for making it work better in a data science context that were never merged or implemented. I think many of them were not taken up because folks felt those ideas were too magical for a software-engineering Nullable type (and I strongly agree with that). I don’t have that constraint with DataValue, i.e. I’m simply picking up many of those ideas. A second difference is that my array type DataValueArray does not use the higher-order approach to lifting in functions like map and broadcast that NullableArray pioneered. Instead, things like map and broadcast work exactly as they do for a normal array, and the lifting happens at the scalar level, i.e. via white-list-style methods defined on DataValue. Plus, I have a couple of other things that I believe will go a long way toward making things smooth. This is kind of a high-level overview; it’s probably best to wait until I have something ready to show before we start discussing the details :slight_smile:
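To make the scalar-level lifting concrete, here is a minimal, self-contained sketch. The MaybeValue type and its + method are purely illustrative stand-ins for DataValue, not DataValues.jl’s actual implementation:

```julia
# Purely illustrative stand-in for DataValue{T}; not DataValues.jl's actual code.
struct MaybeValue{T}
    hasvalue::Bool
    value::T
    MaybeValue{T}() where {T} = new(false)        # missing
    MaybeValue(x::T) where {T} = new{T}(true, x)  # present
end

# A white-listed lifted method at the scalar level: missingness propagates here,
# not inside map or broadcast.
function Base.:+(a::MaybeValue{T}, b::MaybeValue{S}) where {T,S}
    R = promote_type(T, S)
    if a.hasvalue && b.hasvalue
        MaybeValue(convert(R, a.value + b.value))
    else
        MaybeValue{R}()
    end
end

# Because lifting lives in the scalar methods, map and broadcast over an array
# of MaybeValue work exactly like they do for a normal array:
xs = [MaybeValue(1), MaybeValue{Int}(), MaybeValue(3)]
ys = [MaybeValue(10), MaybeValue(20), MaybeValue(30)]
xs .+ ys   # MaybeValue(11), missing, MaybeValue(33)
```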

That was the recommended approach in John Myles White’s julep on missing data, and it was analyzed extensively there.

And just to be entirely clear: you are proposing that the data stack uses Union{T,Null}, not Union{Value{T},Null}, right? It was not clear to me from this thread whether the folks who would like to see the extra safeguards one gets from Value would want that for data stack usage or not.
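For concreteness, the two options would look roughly like this; Null and Value here are just stand-in definitions for the sake of the example, not any package’s actual types:

```julia
# Stand-in definitions, purely illustrative.
struct Null end
const null = Null()
struct Value{T}
    x::T
end

# Option 1: the data stack stores plain values next to the missing sentinel.
col_plain = Union{Int, Null}[1, 2, null, 4]

# Option 2: present values are additionally wrapped, so they must be explicitly
# unwrapped before use (the "extra safeguard" mentioned above).
col_safe = Union{Value{Int}, Null}[Value(1), Value(2), null, Value(4)]
```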

Well, it exists, but is it really required? There is also an “untyped” missing value in DataValues.jl, namely const NA = DataValue{Union{}}(). If the optimizations for small union types in base also worked for inferred return types like Union{T,DataValue{Union{}}} and Union{DataValue{T},DataValue{Union{}}}, that would be great, because folks could then sidestep the requirement for a typed missing value in many cases.
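As a rough illustration of what such an untyped missing value and the resulting return-type unions look like (the DV type below is a stand-in, not DataValues.jl’s actual code):

```julia
# Illustrative stand-in mirroring DataValue and `const NA = DataValue{Union{}}()`.
struct DV{T}
    hasvalue::Bool
    value::T
    DV{T}() where {T} = new(false)
    DV{T}(x) where {T} = new(true, x)
end
const NA = DV{Union{}}()   # an "untyped" missing value

# The inferred return type here is Union{Int, DV{Union{}}} -- exactly the kind
# of small union the post hopes the base optimizations would cover:
unwrap_or_na(x::DV{Int}) = x.hasvalue ? x.value : NA
```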

There is one other issue I raised here somewhat, but that I also want to bring up in this thread: it seems genuinely unclear to me at this point what the “best” memory layout is for columns with missing data in a DataFrame-like structure. For example, the missingness mask could be expressed either as a tight bitmask or as a byte-sized type tag per element (which seems to be the current plan in base). I talked with @Jameson about this at juliacon, and he pointed out that which of those will be more efficient really depends on the type of algorithm one is running. A sketch of the two layouts follows below.

Other considerations in this area are emerging standards for in-memory table layouts like Apache Arrow. It might be really desirable to have a julia table type be highly compatible with that initiative. Or not. My point is that, at least to me, it is not clear at this point which of those considerations is more important or relevant. I also don’t think those things will be sorted out in the julia 1.0 time frame; some of them depend on larger industry trends etc.

I think having things like DataValueArray and NullableArray (in packages) is beneficial in such a situation: we don’t have to make decisions now about memory layout for the data stack that will be very difficult to change later on. Note that this does not imply that I think we shouldn’t potentially use something like Array{T?} internally in the implementation of DataValueArray, but I would much prefer a situation where we can use that and then experiment with something that uses BitArrays or yet another option.
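To make the two layouts under discussion concrete, here is a sketch of what each could look like; neither is the actual DataValueArray or base implementation:

```julia
# Two possible layouts for a column with missing values; both are sketches only.

# (a) Tight bitmask: one bit of missingness information per element.
struct BitmaskColumn{T}
    values::Vector{T}
    isna::BitVector           # length(isna) == length(values)
end

# (b) Byte-sized type tag per element (roughly what the planned base
# optimization for Vector{Union{T, Null}} would store).
struct ByteTagColumn{T}
    values::Vector{T}
    tag::Vector{UInt8}        # 0x00 = value present, 0x01 = missing
end

# As noted above, which layout wins depends on the algorithm: the bitmask is 8x
# more compact, while byte tags avoid bit manipulation on per-element access.
```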