Compatibility of Query and Union{T, Missing}

I just posted this on the slack channel re status of Query/IterableTables/Missings:

I don’t see a path for Query and iterable tables to use missing in the Julia 1.0 time frame. My read of the discussion so far is that there is one broad strategy on the table, and it would a) require much more work in Base on union return types (not just making small union return types fast, but also really large ones with 2^n elements, where n is the number of columns), and b) require pretty much every collection type that currently accepts an iterator for initialization to have materialization code that is a lot more complex than what it has now, essentially special-casing the results of queries. The latter, in my mind, really breaks the very nice composability properties Query has right now when used with DataValue for the missing value story (you can materialize queries efficiently into all sorts of data structures that I have never heard of).
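To make the 2^n figure concrete, here is a small sketch (not taken from Query itself; the rows and names are made up for illustration). With two columns that may each be missing, every missingness pattern is its own concrete NamedTuple type, so an iterator over such rows can return any of 2^2 = 4 types:

```julia
# Hypothetical two-column rows; each field is either a value or `missing`.
rows = Any[
    (a = 1,       b = 2.0),
    (a = missing, b = 2.0),
    (a = 1,       b = missing),
    (a = missing, b = missing),
]

# Each missingness pattern has its own concrete row type.
row_types = unique(map(typeof, rows))
n_patterns = length(row_types)      # 2^n patterns for n = 2 columns

# The return type of "give me the next row" would have to cover all of them.
RowUnion = Union{row_types...}
```

With n columns the same construction yields 2^n concrete row types, which is the size of the union that the compiler would have to handle well.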
So my take on this is that the Missings design at this point would still be a significant step back for Query. In my mind we are still at a point where Missings is a design that works great for some parts of the data ecosystem, but is not a good design for other parts.
Irrespective of what one thinks about the merits of this broad strategy, it seems extremely unlikely that a) and b) would be done by Julia 1.0 (as far as I can tell they are broad ideas at this point, with no concrete design and no one working on them). Who knows, maybe in Julia 1.1 Missings will be more usable for things like Query and I can revisit things… Having said that, the current design with DataValue seems to work great, i.e. it is not as if there are problems with that design that Missings would solve. I also try very hard not to break code that uses Query/IterableTables and friends, so I think any decision down the road in the Julia 1.1 time frame would have to take such constraints into account as well.
I have almost finished the interop story for IterableTables and the new DataFrames, which will allow you to use Query with the new DataFrame. The model will be the same as it is today: regardless of what missing story a source uses, in the query itself you’ll deal with DataValues, and when you materialize a query I’ll use the “native” missing story of the type that is the sink (so you’ll get DataFrames that have missing values).

Sorry for the duplicate text, I only saw this message here after I had posted on slack, but I don’t want to leave @ValdarT’s question unanswered.


I also wrote the following on Slack, and I figure I should repost it here (let’s have the rest of the conversation here rather than in two places; if it goes on for more than a few messages we should move it into its own topic):

My understanding is that Query uses type inference to determine result container types in a way that is broken by the union approach to representing missingness. However, that general approach [i.e. using inference results via `Base.return_type`] is not recommended and may break in the future as inference gets better or worse at inferring certain things. What Query should do is use the approach that map does and optimistically produce concretely typed collections, bailing out to a more abstract collection type as necessary. Currently this requires copying, which is unfortunate, but we should hopefully be able to eliminate that copying in the future in cases where the data representation is compatible.
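A minimal sketch of that "start concrete, widen on demand" idea, for illustration only. `widening_collect` and `fill_widening!` are hypothetical names, not functions from Base or Query, and the widening rule here is a plain `Union`, which is a simplification of whatever rule Base applies internally:

```julia
# Hypothetical helper: collect an iterator without asking inference for the
# element type. Start with the type of the first element and widen if needed.
function widening_collect(itr)
    y = iterate(itr)
    y === nothing && return Union{}[]      # empty input: no element type known
    v, st = y
    return fill_widening!([v], itr, st)    # start as Vector{typeof(v)}
end

function fill_widening!(dest::Vector{T}, itr, st) where {T}
    y = iterate(itr, st)
    while y !== nothing
        v, st = y
        if v isa T
            push!(dest, v)
        else
            # The element does not fit: copy into a wider vector and carry on.
            wider = Vector{Union{T, typeof(v)}}(dest)
            push!(wider, v)
            return fill_widening!(wider, itr, st)
        end
        y = iterate(itr, st)
    end
    return dest
end

all_ints  = widening_collect(x for x in [1, 2, 3])
with_miss = widening_collect(x == 2 ? missing : x for x in 1:3)
```

The copy in the `else` branch is the cost mentioned above: it happens once per widening step, not once per element.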

@jeff.bezanson or @quinnj may have more to say on the issue – we’ve discussed this a few times and this was my takeaway from those conversations.

I think using type inference to determine a container type is equally broken with or without the Union approach to missingness.

What the Union approach interferes with is the ability to pass somebody, say, an Int along with the information that it might have been missing, such that if they put it in a container they should make a Container{Union{Int,Missing}} instead of a Container{Int}.

This is somewhat similar to a map call that only produces Ints, when I was planning to mutate the resulting collection to also contain a String. For instance, we might produce a table with no missing values, but then append a new row that does have missing values. This is a real problem, but I feel it cannot be as fundamental as the problem with return_type. I hold out hope it can be fixed by a combination of (1) copying things to a new type when needed, as Stefan said, and (2) finding ways to pass the extra bit that says the value might be missing.
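The situation can be reproduced with plain vectors. The sketch below uses only Base; the variable names are made up for illustration:

```julia
ints = map(x -> x + 1, [1, 2, 3])    # Vector{Int}: no missing values were seen

# Appending `missing` later fails, because a Vector{Int} cannot hold it.
push_failed = try
    push!(ints, missing)
    false
catch
    true
end

# Remedy (1): copy into a wider container once the need arises.
widened = Vector{Union{Int, Missing}}(ints)
push!(widened, missing)
```

Remedy (2) would avoid the copy by telling the consumer up front that the element type should be `Union{Int, Missing}`, even though the first batch of values contains no `missing`.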

I dislike overly-complex code as much as anybody, but this sounds to me like there is some known work that could be done to address the problem, and yet the response is to reject it and continue to declare the problem unsolved. We might have to just hold our noses and write the ugly code.

If these iterator consumers are currently very simple, I’d guess they don’t support heterogeneous iterators generally — is that the case?

We might be able to help this situation with an iterator trait. In particular, even an iterator that doesn’t have a known eltype might have a good guess at which parts of its results are missing-able. The trait could be implemented for tables as well as Generators over tables. Example:

hint = wider_type_hint(iter)   # proposed trait; this might return e.g. Tuple{Union{}, Missing, Union{}}
elt = get_next_value(iter)     # placeholder for fetching the first element
T = promote_type(typeof(elt), hint)   # called T so it does not shadow Base.eltype
result = Container{T}(...)

The good part is that wider_type_hint can correctly default to Union{} for all iterators.
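A runnable sketch of how such a trait could look, under some loud assumptions: every name here (`wider_type_hint`, `MaybeMissingSource`, `collect_with_hint`) is hypothetical, the hint is a single type rather than a per-column tuple, and the element type is formed with a plain `Union` for simplicity:

```julia
# Hypothetical trait: by default an iterator gives no hint at all.
wider_type_hint(itr) = Union{}

# Hypothetical source that knows its values may be missing, even if the
# values seen so far are not.
struct MaybeMissingSource
    data::Vector{Int}
end
Base.iterate(s::MaybeMissingSource, st...) = iterate(s.data, st...)
Base.length(s::MaybeMissingSource) = length(s.data)
wider_type_hint(::MaybeMissingSource) = Missing

# Hypothetical consumer: widen the first element's type by the hint.
function collect_with_hint(itr)
    hint = wider_type_hint(itr)
    y = iterate(itr)
    y === nothing && return Vector{hint}()
    first_val, st = y
    T = Union{typeof(first_val), hint}
    result = T[first_val]
    y = iterate(itr, st)
    while y !== nothing
        v, st = y
        push!(result, v)    # assumes later elements fit in T
        y = iterate(itr, st)
    end
    return result
end

plain  = collect_with_hint([1, 2, 3])                      # no hint
hinted = collect_with_hint(MaybeMissingSource([1, 2, 3]))  # hint is Missing
push!(hinted, missing)    # works without copying to a wider container
```

Iterators that do not implement the trait behave exactly as before, since `Union{T, Union{}}` is just `T`.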