The history of `Nullable` is littered with PRs and ideas on how to make it work better in a data science context that were never merged/implemented. I think many of them were not taken up because folks felt those ideas were too magical for a software-engineering `Nullable` type (and I strongly agree with that). I don't have that constraint with `DataValue`, i.e. I'm just picking up many of these ideas. A second difference is that my array type `DataValueArray` does not use the higher-order approach to lifting in functions like `map` and `broadcast` that `NullableArray` pioneered. Instead, things like `map` and `broadcast` work exactly as they do for a normal array (the lifting happens at the scalar level, i.e. via white-list-like methods defined on `DataValue`). Plus, I have a couple of other things that I believe will go a long way toward making things smooth. This is a high-level overview; it's probably best to wait until I have something ready to show before we start discussing the details.
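To illustrate what scalar-level lifting means, here is a minimal, self-contained sketch (a toy type for illustration, not the actual DataValues.jl implementation): once a white-listed `+` method is defined on the scalar wrapper, a plain `broadcast` over an array of them propagates missingness with no array-level lifting machinery at all:

```julia
# Toy wrapper: either holds a value of type T or is empty ("missing").
struct DataValue{T}
    hasvalue::Bool
    value::T
    DataValue{T}() where {T} = new{T}(false)                    # empty
    DataValue{T}(x) where {T} = new{T}(true, convert(T, x))     # present
end
DataValue(x::T) where {T} = DataValue{T}(x)

# White-listed lifted method: + defined directly on the scalar type,
# propagating emptiness. This is where the lifting lives.
function Base.:+(a::DataValue{T}, b::DataValue{T}) where {T}
    (a.hasvalue && b.hasvalue) ? DataValue{T}(a.value + b.value) :
                                 DataValue{T}()
end

xs = [DataValue(1), DataValue{Int}(), DataValue(3)]
ys = [DataValue(10), DataValue(20), DataValue(30)]

# Plain broadcast, exactly as for a normal array; no higher-order lifting.
zs = xs .+ ys
```

Here `zs[2]` comes out empty because `xs[2]` was empty, while the other elements are ordinary sums.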
That was the recommended approach in John Myles White's julep on missing data, and it was analyzed extensively there.
And just to be entirely clear: you are proposing that the data stack uses `Union{T,Null}`, not `Union{Value{T},Null}`, right? It was not clear to me from this thread whether the folks that would like to see the extra safeguards one gets from `Value` would want that for the data stack usage or not.
Well, it exists, but is it really required? There is also an "untyped" missing value in DataValues.jl, namely `const NA = DataValue{Union{}}()`. If the optimizations for small union types in Base would also work for inferred return types like `Union{T,DataValue{Union{}}}` and `Union{DataValue{T},DataValue{Union{}}}`, that would be great, because folks could circumvent the requirement for a typed missing value in many cases.
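As a hedged sketch of what the untyped missing value amounts to (a toy definition for illustration, not the actual DataValues.jl code): since `Union{}` has no instances, a `DataValue{Union{}}` can only ever be empty, and a single `convert` method turns it into the typed empty value of any `DataValue{T}`:

```julia
# Toy wrapper: either holds a value of type T or is empty.
struct DataValue{T}
    hasvalue::Bool
    value::T
    DataValue{T}() where {T} = new{T}(false)
    DataValue{T}(x) where {T} = new{T}(true, convert(T, x))
end

# The "untyped" missing value: Union{} has no instances, so this
# DataValue can never carry a payload -- it is always empty.
const NA = DataValue{Union{}}()

# Converting NA to any typed DataValue yields that type's empty value.
Base.convert(::Type{DataValue{T}}, ::DataValue{Union{}}) where {T} =
    DataValue{T}()

typed_na = convert(DataValue{Int}, NA)   # an empty DataValue{Int}
```

The point is that `NA` can flow through untyped contexts and be narrowed to the right `DataValue{T}` only where a concrete element type is known.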
There is one other issue I raised here briefly, but that I also want to bring up in this thread: it seems genuinely unclear to me at this point what the "best" memory layout for columns with missing data in a `DataFrame`-like structure is. For example, the missingness mask could be expressed either as a tight bitmask or as a byte-sized type tag per element (which seems to be the current plan in Base). I talked with @Jameson about this at JuliaCon and he pointed out that which of those will be more efficient really depends on the type of algorithm one is running. Other considerations in this area are emerging standards for in-memory table layouts like Apache Arrow. It might be really desirable to have a Julia table type be highly compatible with that initiative. Or not. My point is that, at least to me, it is not clear at this point which of those considerations is more important/relevant. I also don't think those things will be sorted out in the Julia 1.0 time frame; some of them depend on larger industry trends etc. I think having things like `DataValueArray` and `NullableArray` (in packages) is beneficial in such a situation: we don't have to make decisions now about memory layout for the data stack that would be very difficult to change later on. Note that this does not imply that I think we shouldn't potentially use something like `Array{T?}` internally in the implementation of `DataValueArray`; I would much prefer that we can use that first, and then experiment with alternatives that use `BitArray`s or even another option.
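To make the layout trade-off concrete, here is a hedged sketch of the two candidates (illustrative toy structs, not the actual internals of any package): a bitmask stores one bit of missingness per element, while a byte-sized tag stores a full byte per element, trading memory density for simpler per-element access:

```julia
# Layout A: tight bitmask -- one bit of missingness per element.
struct BitMaskColumn{T}
    values::Vector{T}
    missing_mask::BitVector   # true => element is missing
end

# Layout B: byte-sized type tag per element (similar in spirit to the
# small-union representation discussed for Base).
struct ByteTagColumn{T}
    values::Vector{T}
    tags::Vector{UInt8}       # 0x00 => value present, 0x01 => missing
end

ismissing_at(c::BitMaskColumn, i) = c.missing_mask[i]
ismissing_at(c::ByteTagColumn, i) = c.tags[i] == 0x01

a = BitMaskColumn([1, 2, 3], BitVector([false, true, false]))
b = ByteTagColumn([1, 2, 3], UInt8[0x00, 0x01, 0x00])
```

The bitmask is eight times denser (and closer to Arrow's validity bitmaps), while the byte tag avoids bit-twiddling on every access; which one wins depends on the access pattern of the algorithm, which is exactly why keeping the layout behind a package-level array type leaves room to switch.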