Missing data and NamedTuple compatibility

piever · March 30, 2018, 6:30pm

I would prefer to keep this discussion focused on a concrete solution strategy, and I would encourage moving comments on the gravity (or lack of gravity) of the situation to a different thread. This discussion is already quite long but has been, at least in my view, quite productive.

Replying to the substance of the comments, I mostly agree with @jlperla that we could take advantage of the extra time we have before 0.7 to focus on this situation. To summarize my understanding, there are, broadly speaking, three types of data manipulation:

1. Column based (like DataFrames)
2. Row based but with complete type information (DataStreams, some cases of JuliaDB)
3. Row based but with incomplete type information (inference-free design of Query, some cases of JuliaDB)
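To make the three cases concrete, here is a minimal sketch in plain Julia (0.7 or later). The data and the names `cols`, `RowT`, `rows_typed` and `rows_lazy` are made up for illustration and are not taken from any of the packages mentioned.

```julia
# Case 1: column based. Each column is a vector and carries its own element type.
cols = (a = [1, missing, 3], b = ["x", "y", "z"])
# eltype(cols.a) is Union{Missing, Int}

# Case 2: row based, with the full row type declared up front.
RowT = NamedTuple{(:a, :b), Tuple{Union{Int, Missing}, String}}
rows_typed = RowT[RowT((1, "x")), RowT((missing, "y")), RowT((3, "z"))]

# Case 3: row based, where the type of each row is only known once it is produced.
rows_lazy = ((a = cols.a[i], b = cols.b[i]) for i in 1:3)

# The first row says nothing about missing values in column `a`:
first_row_type = typeof(first(rows_lazy))  # NamedTuple{(:a, :b), Tuple{Int, String}}
```

In the third case a sink that only looks at the first row would allocate an `Int` column for `a` and would have to be changed when the second row arrives.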

Case 3 is particularly complex because the sink has to be created based on the first element that is returned and then widened as new element types are encountered. Cases 1 and 2 work extremely well with Union{T, Missing} in Julia 0.7, which is the officially recommended missing data implementation.
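The "create the sink from the first element, then widen" strategy can be sketched as follows for a single column. `collect_widening` is a hypothetical helper written for this post, not the JuliaDB or Query implementation; it is similar in spirit to what Base's `collect` does for generators whose element type is not known in advance.

```julia
# Hypothetical helper: builds a vector without relying on inference.
function collect_widening(itr)
    y = iterate(itr)
    y === nothing && return Any[]          # nothing to infer a type from
    el, state = y
    dest = Vector{typeof(el)}()            # sink typed from the first element only
    push!(dest, el)
    y = iterate(itr, state)
    while y !== nothing
        el, state = y
        if !(el isa eltype(dest))
            # New element type encountered: copy the sink into a wider one.
            T = Union{eltype(dest), typeof(el)}
            dest = convert(Vector{T}, dest)
        end
        push!(dest, el)
        y = iterate(itr, state)
    end
    return dest
end

out = collect_widening(x for x in Any[1, 2, missing, 4])
# eltype(out) is Union{Missing, Int}
```

Note that `dest` changes type inside the loop, so this naive version is not type stable; a real implementation would need function barriers or a similar trick, and would have to do this for every field of every row.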

Case 3 has not been implemented with Union{T, Missing} yet (both JuliaDB and Query use DataValue) and there are strong reasons to believe such an implementation will be very challenging. This discussion tries to find a solution for case 3; cases 1 and 2 already work very well.

It seems that some consensus is emerging that a completely unified missing data representation will not satisfy all needs and that we would need some sort of hybrid. Union{T, Missing} is very good for storing data, but it is not ideal for chaining lazy operations on iterables of rows (for a series of technical reasons). Based on the implementation of collect_columns (see here), I tend to believe that Union{NonMissing{T}, Missing} would be a good compromise to maintain performance without relying on inference. Most of this wrapping and unwrapping could be invisible to the user and be done by JuliaDB, but we need to understand what needs to be implemented for this to happen, and that is one of the purposes of this discussion. Query is a very complex piece of software and I'm not sure whether Union{NonMissing{T}, Missing} would suffice there (I simply don't know that package well enough to have an informed opinion).
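To give an idea of what the wrapping and unwrapping could look like, here is a minimal sketch. Only the name `NonMissing` comes from the proposal; the struct definition and the helpers `wrap`, `unwrap`, `wraprow` and `unwraprow` are assumptions made up for illustration and do not exist in any package.

```julia
# Hypothetical wrapper: marks a value as known to be present.
struct NonMissing{T}
    value::T
end

# Wrap present values, leave `missing` untouched.
wrap(x) = NonMissing(x)
wrap(::Missing) = missing

# Inverse operation, applied when data leaves the row-based pipeline.
unwrap(x::NonMissing) = x.value
unwrap(::Missing) = missing

# Rows are NamedTuples; `map` over a NamedTuple keeps the field names.
wraprow(row::NamedTuple) = map(wrap, row)
unwraprow(row::NamedTuple) = map(unwrap, row)

row = (a = 1, b = missing)
wrapped = wraprow(row)        # field a is NonMissing{Int}, field b is Missing
restored = unwraprow(wrapped) # same as the original row
```

The idea would be that a package like JuliaDB calls something like `wraprow` when rows enter a chain of lazy operations and `unwraprow` when the result is stored in columns of type Union{T, Missing}, so the user never sees the wrapper.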

Note that we are already in a hybrid situation now (DataValues on one side and Union{T, Missing} on the other), and this is an effort to remedy that.
