Missing data and NamedTuple compatibility

Say something like df |> @map(Person(_.lastname, _.firstname)) |> collect. So you start out with a DataFrame, and then you project the rows into a type Person that is a domain specific type in some package that you use.

I’ve also been working pretty heavily on this. Just updated ZippedArrays so that you can collect(ZippedArray, iterator_of_tuples). Basically Jameson convinced me the collection machinery in place cannot operate effectively on complicated types like rows, so instead, whenever you collect, each item needs to go into a separate column. Each column then is going to have an eltype like Union{Missing, Int} and collect can work much more effectively with these simple union types.

I see your point — the filter(i -> i<3, [1,2,4.,6.]) example is a good one.

What do you do if a user-supplied function for map consumes DataValues but doesn’t return DataValues?

It seems possible to add a rule that the element type of a mapped column containing missing should always contain Missing. I.e. if the map result has element type T, actually use Union{Missing,T}. True, that is an extra step and so might not be super convenient, but it might be good enough.

Thanks, that clarifies it. But what if the Person type has String fields? Then you won’t be able to construct it with DataValues without some extra work, either.

Well, then the column type won’t be DataValue, and in that case I think that would be the expected and correct result. The canonical example of course is get: if one calls get on a DataValue column in map, then that column is converted into a non-DataValue column.

So I think that heuristic also wouldn’t be right: if the function passed to map gets rid of the missingness in some way, then the result column shouldn’t be T?. To stick with the number analogy, surely we wouldn’t want map(i->Float64(i), [1,2,3]) to return Array{Int}.
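To make the number analogy concrete, here is a minimal sketch using only Base (no DataValue). It assumes plain `Union{Missing, Int}` vectors and uses `coalesce` as one possible way a user function can get rid of missingness.

```julia
# map takes its result element type from what the function actually returns.
ints = [1, 2, 3]
floats = map(i -> Float64(i), ints)
println(eltype(floats))

# Same idea with missing data: the function removes the missingness,
# so the resulting column no longer allows missing values.
with_missing = Union{Missing, Int}[1, missing, 3]
filled = map(x -> coalesce(x, 0), with_missing)
println(eltype(filled))
```

Forcing the result to `Union{Missing, T}` here would be the analogue of returning an integer array from the `Float64` example.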

Yep, indeed. I’m not saying that DataValue doesn’t have lots of problems. It certainly does.

My hope would be that eventually we will get multiple inheritance/interfaces, and then we might be able to make DataValue{String} <: AbstractString hold (which in general would solve a lot of problems).

But yes, I don’t see a good short-term and general solution to this with DataValue either.

Is that only in the context of Union{T,Missing}, or in general? It seems to work pretty flawlessly with named tuples as rows that have DataValue fields in the Query.jl world.

Well that’s good but unexpected. My understanding was that for a fast collect you need:

a) each item to be exactly the same type
or
b) item types to be something simple that can easily be handled by promote_typejoin

In Query.jl with DataValue you almost always are in case a) where each element has the same type, so that is why it works.
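As a rough illustration of case b), the sketch below shows what `Base.promote_typejoin` (an unexported Base function) returns for a simple and a not-so-simple pair of types, and how `collect` widens when the item types are simple. The input vectors are made up for the example.

```julia
# Joining a concrete type with Missing gives a small Union ...
println(Base.promote_typejoin(Int, Missing))
# ... while unrelated concrete types fall back to a common supertype.
println(Base.promote_typejoin(Int, String))

# collect starts from the type of the first item and widens as needed,
# so a source with a loose eltype still yields a tight column.
mixed = collect(x for x in Any[1, missing, 3])
println(eltype(mixed))
```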

To reply to a couple of points:

  • The key idea for JuliaDB is that with column-based storage there’s no need to copy everything if a column must be widened; one can just copy that column. I think in Query one would have to work out for how many sinks this applies. For sending the data to a remote database this may be very tricky (or maybe not: you may actually require backends to implement a widencolumn function, which most remote datasets support anyway).

  • With any implementation that’s not “bad” (where “bad” means the outcome relies on inference), if you filter away all missing values, the column stops accepting missings, so that’s what would happen with my implementation.

  • Given the previous point, to avoid copying one may always start by allowing missing in all fields and strictify at the end. I think that could be the easiest practical approach. If the user wants otherwise they may need to explicitly type some fields (annoying, it’s true, but DataValue also has its annoyances for the user, so this is a trade-off).

Scratch that, the first point (requiring the backends to implement a widencolumn function and using JuliaDB’s strategy) may be the way forward.
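A minimal sketch of what such a widencolumn could look like for an in-memory, column-based sink. The signature and the NamedTuple-of-vectors table are assumptions made for illustration; this is not the JuliaDB or IndexedTables API.

```julia
# Hypothetical helper: widen one column so it can also hold values of type T.
# Only the widened column is copied; all other columns are reused as-is.
function widencolumn(columns::NamedTuple, name::Symbol, ::Type{T}) where {T}
    old = getfield(columns, name)
    widened_eltype = Base.promote_typejoin(eltype(old), T)
    widened = convert(Vector{widened_eltype}, old)  # copies this column only
    merge(columns, NamedTuple{(name,)}((widened,)))
end

table = (a = [1, 2, 3], b = ["x", "y", "z"])
wider = widencolumn(table, :a, Missing)
push!(wider.a, missing)  # now allowed; column b was never touched
```

A remote or file-based sink would need its own way of doing the equivalent step, which is exactly the requirement being debated here.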

Concerning the array-of-structs sink, I’m sure one could write a generalization of Columns that would work with any struct while still giving optimized storage.

Is this in the manual somewhere? I could not find it. I found the section on output type computation, which is somewhat related, but explicitly suggests promote_op.


@piever knows the next point already because we discussed it in an issue; I just want to make sure the info is here in this thread. The situation for column-based storage is even better: in principle, widening a T column to a T? column doesn’t require any copying as long as you store the missing mask as a separate array, which both Array{Union{T,Missing}} and DataValueArray do. I think the medium-term issue is that Array{Union{T,Missing}} doesn’t expose the necessary APIs for that strategy (and I don’t see an issue for that on the 1.0 milestone, so I’m assuming this is a 1.x feature). With DataValue and DataValueArray this all works today on Julia 0.6 (and TextParse.jl actually uses that strategy for its handling of missing values).
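To illustrate the separate-mask idea, here is a small sketch. `MaskedColumn` and `allow_missing` are hypothetical names; this is not how DataValueArray or Base arrays are actually implemented, only the general layout being described.

```julia
# Hypothetical column type: the values and the missing mask live in
# separate arrays, so allowing missings does not require copying values.
struct MaskedColumn{T} <: AbstractVector{Union{T, Missing}}
    values::Vector{T}
    mask::BitVector  # true where the entry is missing
end

Base.size(c::MaskedColumn) = size(c.values)
Base.IndexStyle(::Type{<:MaskedColumn}) = IndexLinear()
Base.getindex(c::MaskedColumn, i::Int) = c.mask[i] ? missing : c.values[i]

function Base.setindex!(c::MaskedColumn, ::Missing, i::Int)
    c.mask[i] = true  # only the mask changes
    c
end

# "Widening" a strict column just attaches an all-false mask.
allow_missing(values::Vector) = MaskedColumn(values, falses(length(values)))

values = [10, 20, 30]
col = allow_missing(values)
col[2] = missing
```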

That is only a problem with Union{T,Missing}. Your implementation does exactly the right thing with DataValue.

It’s not just remote DBs, it is also files that one streams data into. I think once one starts with a design where a sink is required to do certain things, one essentially limits what kind of sinks one can target. There will simply be sinks out there that don’t support this, and I think it is on us to come up with a design that can interop with what is out there.

I’m not sure I understand this point. Could you elaborate?


Yes, I was referring to missing, with DataValue the eltype is constant so all these issues disappear.

My idea is that the trick of using a Columns{NamedTuples._NT_a_b{Int64,Int64}} object to iterate on named tuples can actually be generalized, and one could create a Column{Person} object that iterates Person structs and stores the fields of all these Persons as a list of Arrays. Maybe you could consider adding something like that to Query if you want to support struct iterators. In general I really think the Columns machinery from IndexedTables should be moved out into a small self-sufficient package so that more packages can benefit from it.

Ah, ok, got it. I don’t think that works in general. The constructor of Person might do fancy stuff (from computing additional fields to side effects, etc.), so storing arrays of structs as structs of arrays for the fields of Person in general seems too invasive to me. I think that works well for named tuples because we know exactly how they behave, but it seems to me that one can’t just use that technique for any struct.
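For reference, a minimal sketch of the struct-of-arrays idea under discussion. `StructColumns` is a hypothetical type, not the Columns type from IndexedTables, and the `Person` definition is assumed. Note that every `getindex` calls the `Person` constructor again, which is where the objection above bites: this is only safe when the constructor is a plain field-by-field one.

```julia
struct Person
    lastname::String
    firstname::String
end

# Hypothetical container: one array per field, rebuilt into T on access.
struct StructColumns{T, C<:NamedTuple} <: AbstractVector{T}
    columns::C
end
StructColumns{T}(columns::C) where {T, C<:NamedTuple} = StructColumns{T, C}(columns)

Base.size(s::StructColumns) = size(s.columns[1])
Base.IndexStyle(::Type{<:StructColumns}) = IndexLinear()

# Rebuilds the struct by calling its constructor with the stored fields.
Base.getindex(s::StructColumns{T}, i::Int) where {T} =
    T(map(col -> col[i], Tuple(s.columns))...)

people = StructColumns{Person}((lastname = ["Doe", "Roe"], firstname = ["Jane", "Rick"]))
second = people[2]
```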

Ah fair enough, hadn’t thought about that.

At the cost of repeating some things that were discussed in the original Julep proposal, I thought I’d summarize my thoughts after spending more time trying to transition some of the JuliaDB algorithms to an iterator-based, inference-free style (thus making it more “Query-like”).

I see three approaches, of which I’ll try to analyze advantages and disadvantages:

  1. DataValue{T}, as in the Query-verse.

Big advantage: a single element (be it missing or not) is enough to detect whether the column accepts missing data and what is the type of the data, which is very helpful in all these iterator based implementations.

Big disadvantage: hard to lift in the non-type-stable case. How could one lift, say, getindex(d::Dict, key::DataValue{Symbol})? If there is no input data, what is the type of the output? This problem comes from the fact that the missing data here is typed, so for this kind of operation the user is on their own and needs to explicitly unwrap (even in the case where no data is actually missing).

Equivalent (in my view) union based implementation would be something like Union{Some{T}, Missing{T}}, assuming that a “typed missing” existed. The Some{T} would mean “element type is T, the element is present” and Missing{T} would mean “element type is T, the element is missing”.

  2. Union{T, Missing} as in DataFrames

Big advantage: no unwrapping needed and easy to lift, even for type unstable functions

Big disadvantage: doesn’t work well with iterators. As soon as I collect a column that allows missings but doesn’t have missing data, I strictify it (especially if we don’t want result of a function to rely on inference). With the current inference-free implementation and Union{T, Missing}, something like map(identity, t) would strictify all columns of t.

  3. Union{Some{T}, Missing}

Big advantage: easy to lift, even for type-unstable functions. One non-missing element is enough to deduce that the column allows missing data (an element of type T means missing is not allowed; an element of type Some{T} means missing is allowed; in both cases the data type is T). If I collect an iterable, I know from the type Some{T} that it accepts missings.

Big disadvantage: the extra unwrapping (from Some{T} to T) can be annoying.
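The difference between approaches 2 and 3 can be seen with plain Base types. This is only a sketch of the convention described above (a `Some{T}` element signals that missing is allowed); nothing in Base or Query enforces that reading.

```julia
# Approach 2: a column that allows missing but happens to contain none.
plain = Union{Int, Missing}[1, 2, 3]
strict = map(identity, plain)
println(eltype(strict))   # the "missing allowed" information is gone

# Approach 3: the Some wrapper survives collection, so a single element
# is enough to tell that the column was meant to accept missing.
wrapped = Union{Some{Int}, Missing}[Some(1), Some(2), Some(3)]
kept = map(identity, wrapped)
println(eltype(kept))

# The cost: values must be unwrapped before use.
println(something(kept[1]) + 1)
```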

My overall intuition is that, even though many concerns raised by @davidanthoff about Union{T, Missing} are unavoidable (for example when collecting iterables that would allow missings but have no missing values), it would be feasible to port Query to Union{Some{T}, Missing}, where the Some would convey the information that missings are accepted, even if not present.

I would therefore suggest the following:

Allow both Union{T, Missing} and Union{Some{T}, Missing} as ways to represent missing data (just as we do with Union{T, Nothing} and Union{Some{T}, Nothing}). The first option, Union{T, Missing}, means: I don’t want to wrap/unwrap, but I accept that I could disallow missingness by collecting an iterable. The second option, Union{Some{T}, Missing}, means: I’m happy to wrap/unwrap, but I do not accept that I could disallow missingness by collecting an iterable. [Important detail: I’m not sure whether the Some used with Missing and the Some used with Nothing should be the same type; this can be discussed.]

Of course, in the same way as DataValue had automated lifting for most commonly used functions, so could Some.

In this scenario Query would still do a small translation step (from Union{T, Missing} to Union{Some{T}, Missing}) but I think that’s considerably less invasive. I also think that one could change most statistical packages to also work with Union{Some{T}, Missing}. For example skipmissing could also automatically unwrap the Some.
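As a sketch of what "skipmissing could also unwrap the Some" might mean in practice: `skipmissing_unwrap` below is a hypothetical helper, not something Base or Missings.jl provides. It relies on `something` returning plain values unchanged and unwrapping `Some` values.

```julia
# Hypothetical helper: skip missing entries and unwrap any Some wrappers.
skipmissing_unwrap(itr) = (something(x) for x in skipmissing(itr))

wrapped = Union{Some{Int}, Missing}[Some(1), missing, Some(3)]
total = sum(skipmissing_unwrap(wrapped))

# Plain Union{T, Missing} columns pass through the same helper unchanged.
plain = Union{Int, Missing}[1, missing, 2]
values = collect(skipmissing_unwrap(plain))
```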

DataFrames would I believe already accept Union{Some{T}, Missing} without much change.

Another possibility would be to assume that all Union{T, Missing} inputs will give a Union{T, Missing} output, except if the result is wrapped in a NotMissing object. That would work in most cases but still offer flexibility when needed. Always wrapping in a Some object takes the opposite approach, which is inconvenient in the most common case.

One of the concerns is that when one starts chaining all these iterators, what the input is quickly becomes unclear (say I filter, join and then map my data: it’s hard to keep track of what came from where, and ideally the result of collecting an iterator should only depend on what is being iterated, otherwise implementing the collect_columns function becomes very difficult). Some, in my mind, was a way to make this tracking feasible (one would just look in the final result at what was of type Some).

I’m not sure about the usability problem: if all functions that support Missing also support Some, this shouldn’t make any difference for the user.

Without this trick I’m not sure what’s a good way to keep track of which columns allow missing when composing a long sequence of iterators. Imagine that all sorts of functions are possible (for example a list comprehension over some of the fieldnames of the NamedTuple).

That’s the problem: Some isn’t supposed to implement any methods, as the point of this wrapper is precisely to force explicit unwrapping.

Of course a different wrapper type could be used instead (say NonMissing{T}). But there would still be the issue that with Union{T, Missing}, if there are no missing values, any function accepting T values will work, even if it does not accept missing. With Union{Some{T}, Missing}, only functions explicitly implemented would work. I guess it’s up to Query to decide whether it wants to be stricter (i.e. less dynamic) than plain Julia code.

At least I agree that using Union{NonMissing{T}, Missing} instead of DataValue{T} would make Query a bit closer to the rest of the ecosystem by using missing to represent a missing value.

Here’s something I’ve been working on.

export unzip
"""
    unzip(iterator)

Collect an iterator that returns tuples into a tuple of arrays.

\```jldoctest
julia> iterator = Tuple{Union{Missing, Int}, Union{Missing, String}}[(1, missing), (missing, "a")]

julia> unzip(iterator)
\```
"""
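function unzip(iterator)
    # `iterate` returns `nothing` for an empty iterator (Julia 0.7+ protocol).
    first_iteration = iterate(iterator)
    if first_iteration === nothing
        error("Cannot unzip empty iterators")
    end
    items, state = first_iteration
    # Seed one single-element vector per tuple position, then grow them.
    unzip_grow_to!(map(item -> [item], items), iterator, state)
end

# Push `item` onto `old_result`, widening the element type if necessary.
function push_widen!(old_result, item)
    old_eltype = eltype(old_result)
    new_eltype = typeof(item)
    if new_eltype <: old_eltype
        push!(old_result, item)
        old_result
    else
        # `promote_typejoin` is not exported from Base, so qualify it.
        new_result = Vector{Base.promote_typejoin(old_eltype, new_eltype)}(undef, 0)
        sizehint!(new_result, length(old_result) + 1)
        append!(new_result, old_result)
        # Also store the item that triggered the widening.
        push!(new_result, item)
        new_result
    end
end

function unzip_grow_to!(results::Tuple, iterator, state)
    next_iteration = iterate(iterator, state)
    if next_iteration === nothing
        results
    else
        items, state = next_iteration
        unzip_grow_to!(map(push_widen!, results, items), iterator, state)
    end
end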

This is about as close to type stability as I can get. The return type you’d be looking for is something like

Tuple{Union{Array{Int}, Array{Missing}, Array{Union{Int, Missing}}}, Union{Array{String}, Array{Missing}, Array{Union{String, Missing}}}}

And from what I can tell, while it would be possible for inference to infer this, it basically just bails.