Missing data and NamedTuple compatibility

I would prefer to keep this discussion focused on a concrete solution strategy, and I would encourage moving comments on the gravity (or lack thereof) of the situation to a separate thread: this discussion is already quite long but has been, at least in my view, quite productive.

On the substance of the comments, I mostly agree with @jlperla that we could use the extra time we have before 0.7 to focus on this situation. To summarize my understanding, there are, broadly speaking, three types of data manipulation:

  1. Column based (like DataFrames)
  2. Row based but with complete type information (DataStreams, some cases of JuliaDB)
  3. Row based but with incomplete type information (inference-free design of Query, some cases of JuliaDB)

Case 3 is particularly complex because the sink has to be created based on the first element returned and then widened as new element types are encountered. Cases 1 and 2 work extremely well with Union{T, Missing} in Julia 0.7, which is the officially recommended missing data implementation.
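To make the difficulty of case 3 concrete, here is a minimal sketch of a sink that is allocated from the first element and widened when an element of a new type shows up. The names collect_widening and grow! are hypothetical and do not come from JuliaDB or Query, and the widening rule (a plain Union of the old and new types) is an assumption chosen for simplicity; real implementations may use a different promotion rule.

```julia
# Hypothetical helper: collect an iterator without knowing its element type up front.
function collect_widening(itr)
    y = iterate(itr)
    y === nothing && return Union{}[]      # empty input: nothing to infer from
    first_val, state = y
    dest = [first_val]                     # sink typed after the first element only
    return grow!(dest, itr, state)
end

# Push elements while they fit in the sink; otherwise widen the sink and continue.
function grow!(dest::Vector{T}, itr, state) where {T}
    y = iterate(itr, state)
    while y !== nothing
        val, state = y
        if val isa T
            push!(dest, val)
        else
            # Assumed widening rule: the union of the old eltype and the new type.
            S = Union{T, typeof(val)}
            new_dest = Vector{S}(undef, length(dest))
            copyto!(new_dest, dest)
            push!(new_dest, val)
            # Recursing acts as a function barrier: the loop is recompiled for S.
            return grow!(new_dest, itr, state)
        end
        y = iterate(itr, state)
    end
    return dest
end

# The first element is an Int, so the sink starts as Vector{Int}
# and is widened when the `missing` arrives.
col = collect_widening(i == 2 ? missing : i for i in 1:3)
```

Every widening step copies the data collected so far, and the element type of the result is only known at run time, which is part of what makes this case harder than the other two.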

Case 3 has not been implemented with Union{T, Missing} yet (both JuliaDB and Query use DataValue), and there are strong reasons to believe such an implementation would be very challenging. This discussion tries to find a solution for case 3.

A consensus seems to be emerging that a completely unified missing data representation will not satisfy all needs and that we would need some sort of hybrid. Union{T, Missing} is very good for storing data, but it is not ideal for chaining lazy operations on iterables of rows (for a series of technical reasons). Based on the implementation of collect_columns (see here), I tend to believe that Union{NonMissing{T}, Missing} would be a good compromise that maintains performance without relying on inference. This wrapping and unwrapping could be mostly invisible to the user and be done by JuliaDB, but we need to understand what has to be implemented for this to happen, and that is one of the purposes of this discussion. Query is a very complex piece of software and I'm not sure whether Union{NonMissing{T}, Missing} would suffice there (I simply don't know that package well enough to have an informed opinion).
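As a rough illustration of the wrapping idea, here is one possible reading of it, not an existing API: NonMissing, unwrap, storage_eltype and collect_wrapped are all hypothetical names. The assumption is that a value coming from a field that may contain missings is wrapped, so that a single element already tells the sink to allocate storage for Union{T, Missing}.

```julia
# Hypothetical wrapper marking "a present value from a field that may be missing".
struct NonMissing{T}
    value::T
end

unwrap(x::NonMissing) = x.value
unwrap(::Missing) = missing
unwrap(x) = x                              # plain values pass through unchanged

# Storage element type decided from one element, without inference.
storage_eltype(::NonMissing{T}) where {T} = Union{T, Missing}
storage_eltype(x) = typeof(x)
# A leading `missing` carries no information about T; this sketch does not handle it.
storage_eltype(::Missing) =
    throw(ArgumentError("cannot choose a storage type from a leading `missing`"))

# Allocate the column from the first element, then store unwrapped values.
function collect_wrapped(values)
    dest = Vector{storage_eltype(first(values))}()
    for v in values
        push!(dest, unwrap(v))
    end
    return dest
end

# The first element is wrapped, so the column can hold the later `missing`
# without being reallocated.
col = collect_wrapped([NonMissing(1), NonMissing(2), missing])
```

The user would only ever see the unwrapped column; the wrapper exists just long enough to pick the storage type.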

Note that we are already in a hybrid situation (DataValue on one side and Union{T, Missing} on the other), and this is an effort to remedy that.
