The history of `Nullable` is littered with PRs and ideas on how to make it work better in a data science context that were never merged/implemented. I think many of them were not taken up because folks felt those ideas were too magical for a software-engineering `Nullable` type (and I strongly agree with that). I don't have that constraint with `DataValue`, i.e. I'm just picking up many of these ideas. A second difference is that my array type `DataValueArray` does not use the higher-order approach to lifting in functions like `map` and `broadcast` that `NullableArray` pioneered. Instead, things like `map` and `broadcast` work exactly as they do for a normal array (the lifting happens at the scalar level, i.e. via white-list-like methods defined on `DataValue`). Plus, I have a couple of other things that I believe will go a long way toward making things smooth. This is a high-level overview; it's probably best to wait until I have something ready to show before we start discussing the details.
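To illustrate what scalar-level lifting means, here is a minimal, self-contained sketch (a toy type for illustration, not the actual DataValues.jl implementation): once a white-listed `+` method is defined on the scalar wrapper, a plain `broadcast` over an array of them propagates missingness with no array-level lifting machinery at all:

```julia
# Toy wrapper: either holds a value of type T or is empty ("missing").
struct DataValue{T}
    hasvalue::Bool
    value::T
    DataValue{T}() where {T} = new{T}(false)                    # empty
    DataValue{T}(x) where {T} = new{T}(true, convert(T, x))     # present
end
DataValue(x::T) where {T} = DataValue{T}(x)

# White-listed lifted method: + defined directly on the scalar type,
# propagating emptiness. This is where the lifting lives.
function Base.:+(a::DataValue{T}, b::DataValue{T}) where {T}
    (a.hasvalue && b.hasvalue) ? DataValue{T}(a.value + b.value) :
                                 DataValue{T}()
end

xs = [DataValue(1), DataValue{Int}(), DataValue(3)]
ys = [DataValue(10), DataValue(20), DataValue(30)]

# Plain broadcast, exactly as for a normal array; no higher-order lifting.
zs = xs .+ ys
```

Here `zs[2]` comes out empty because `xs[2]` was empty, while the other elements are ordinary sums.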
That was the recommended approach in John Myles White's julep on missing data, and it was analyzed extensively there.
And just to be entirely clear: you are proposing that the data stack uses `Union{T,Null}`, not `Union{Value{T},Null}`, right? It was not clear to me from this thread whether the folks that would like to see the extra safeguards one gets from `Value` would want that for the data stack usage or not.
Well, it exists, but is it really required? There is also an "untyped" missing value in DataValues.jl, namely `const NA = DataValue{Union{}}()`. If the optimizations for small union types in Base would also work for inferred return types like `Union{T,DataValue{Union{}}}` and `Union{DataValue{T},DataValue{Union{}}}`, that would be great, because folks could circumvent the requirement for a typed missing value in many cases.
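As a hedged sketch of what the untyped missing value amounts to (a toy definition for illustration, not the actual DataValues.jl code): since `Union{}` has no instances, a `DataValue{Union{}}` can only ever be empty, and a single `convert` method turns it into the typed empty value of any `DataValue{T}`:

```julia
# Toy wrapper: either holds a value of type T or is empty.
struct DataValue{T}
    hasvalue::Bool
    value::T
    DataValue{T}() where {T} = new{T}(false)
    DataValue{T}(x) where {T} = new{T}(true, convert(T, x))
end

# The "untyped" missing value: Union{} has no instances, so this
# DataValue can never carry a payload -- it is always empty.
const NA = DataValue{Union{}}()

# Converting NA to any typed DataValue yields that type's empty value.
Base.convert(::Type{DataValue{T}}, ::DataValue{Union{}}) where {T} =
    DataValue{T}()

typed_na = convert(DataValue{Int}, NA)   # an empty DataValue{Int}
```

The point is that `NA` can flow through untyped contexts and be narrowed to the right `DataValue{T}` only where a concrete element type is known.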
There is one other issue I raised here briefly, but that I also want to bring up in this thread: it seems genuinely unclear to me at this point what the "best" memory layout for columns with missing data in a `DataFrame`-like structure is. For example, the missingness mask could be expressed either as a tight bitmask or as a byte-sized type tag per element (which seems to be the current plan in Base). I talked with @Jameson about this at JuliaCon and he pointed out that which of those will be more efficient really depends on the type of algorithm one is running. Other considerations in this area are emerging standards for in-memory table layouts like Apache Arrow. It might be really desirable to have a Julia table type be highly compatible with that initiative. Or not. My point is that, at least to me, it is not clear at this point which of those considerations is more important/relevant. I also don't think those things will be sorted out in the Julia 1.0 time frame; some of them depend on larger industry trends etc. I think having things like `DataValueArray` and `NullableArray` (in packages) is beneficial in such a situation: we don't have to make decisions now about memory layout for the data stack that would be very difficult to change later on. Note that this does not imply that I think we shouldn't potentially use something like `Array{T?}` internally in the implementation of `DataValueArray`; I would much prefer that we can use that first, and then experiment with alternatives that use `BitArray`s or even another option.
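To make the layout trade-off concrete, here is a hedged sketch of the two candidates (illustrative toy structs, not the actual internals of any package): a bitmask stores one bit of missingness per element, while a byte-sized tag stores a full byte per element, trading memory density for simpler per-element access:

```julia
# Layout A: tight bitmask -- one bit of missingness per element.
struct BitMaskColumn{T}
    values::Vector{T}
    missing_mask::BitVector   # true => element is missing
end

# Layout B: byte-sized type tag per element (similar in spirit to the
# small-union representation discussed for Base).
struct ByteTagColumn{T}
    values::Vector{T}
    tags::Vector{UInt8}       # 0x00 => value present, 0x01 => missing
end

ismissing_at(c::BitMaskColumn, i) = c.missing_mask[i]
ismissing_at(c::ByteTagColumn, i) = c.tags[i] == 0x01

a = BitMaskColumn([1, 2, 3], BitVector([false, true, false]))
b = ByteTagColumn([1, 2, 3], UInt8[0x00, 0x01, 0x00])
```

The bitmask is eight times denser (and closer to Arrow's validity bitmaps), while the byte tag avoids bit-twiddling on every access; which one wins depends on the access pattern of the algorithm, which is exactly why keeping the layout behind a package-level array type leaves room to switch.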