Missing data and NamedTuple compatibility

jlapeyre · March 30, 2018, 7:01pm

Agreed. The meta questions don’t belong in this technical discussion. Someone could split this and take the posts. But, I don’t have anything more to add.

ChrisRackauckas · March 30, 2018, 7:12pm

piever:

It seems that some consensus is arising that a completely unified missing data representation will not satisfy all needs and we would need some sort of hybrid. Union{T, Missing} is very good for storing data, but to concatenate lazy operation on iterables of rows it is not ideal (for a series of technical reasons). Based on the implementation of collect_columns (see here) I tend to believe that Union{NonMissing{T}, Missing} would be a good compromise to maintain performance and not rely on inference. Mostly this wrapping and unwrapping could be invisible to the user and be done by JuliaDB, but we need to understand what things need to be implemented for this to happen and this is one of the purposes of this discussion. Query is a very complex piece of software and I’m not sure whether Union{NonMissing{T}, Missing} would suffice there (I simply don’t know that package well enough to have an informed opinion).

Note that now we are already in a hybrid situation (DataValues on one side and Union{T, Missing} on the other side) and this is an effort to remedy that.

This sounds nice. I tend to think that the data layout and optimization is a technical detail that doesn’t have to be represented in the API. Indeed, Julia is all about using generic functions via multiple dispatch and abstract typing to disconnect actions from data representation, and so it sounds like that will be what’s needed here as well. If Union{NonMissing{T}, Missing} gets overloads to work with the same lifting and skipmissing functions, then the end user likely wouldn’t even notice the difference, meaning packages can choose the representation that is best for the application.

So I think the thing to make note of is that the missing framework has a clear data handling API (well, it has a lot of nice primitives and packages, with other pieces evolving) and it does work for a lot of the cases as @piever explains. If alternative representations (Union{NonMissing{T}, Missing}, AbstractFloats with sentinel values, etc) can utilize that same basic API, then it’s really fine.

I think the way to summarize the technical discussions for @jlperla and similar audiences is that, there is something that works well in a large number of cases that can be a nice fast and stable API for things like DataFrames. The nature of being a Julia developer is that you always want to make basic constructs faster/better, and so some people are going to continually ask questions about how to make this better for lazy iterators on TPUs inside of a neural net represented by a JuMP model with differential equations… but don’t confuse that for it not working!

ExpandingMan · March 30, 2018, 7:39pm

Agreed. It would be nice if “values returned to users” (an admittedly difficult to define concept) resemble Union{T,Missing} as much as possible wherever possible. Hopefully these would literally be Union{T,Missing} in most cases. As a practical matter I have found dealing with Union{T,Missing} to be much more painless than the alternatives and frankly I usually don’t have a particularly difficult time getting from my containers into something useful and “fully efficient” (i.e. not suffering from inefficiencies due to missingness or type instability). As I have emphasized many times however, the concept of missingness tends to disappear altoghether once you actually do something with your data, so I can’t help but feel that this problem has implications more limited in scope than this thread would seem to imply, though certainly this quite an intimidating problem within databases and data storage frameworks themselves.

StefanKarpinski · March 30, 2018, 8:16pm

Agree 100%. This thread has an awful lot of FUD in it. I’ll risk being overly blunt in the hopes that it will provide some clarity and reduce the FUD level:

We are committed to representing missing data as small unions, i.e. Union{Missing,T}. This is not arbitrary: we’ve tried all the options and this is the one that works the best with Julia’s combination of dynamic typing and high-performance JIT.
Operations with “unpredictable” types should emulate Base’s map function and produce containers with an element type that is based on the values that are actually in them. They should not rely on type inference except for the case of empty containers.
Yes, this map-like behavior is currently a bit tricky to implement efficiently. There has been some discussion of providing easier ways to express this kind of pattern and there are plenty of possible directions for such supporting infrastructure.
The O(2^n) specializations issue has already been addressed by @jeff.bezanson: the Julia compiler deals with potentially exponential numbers of specializations all the time. This is no different. There may be some cases where the current compiler heuristics are off and bad behavior occurs, but those are dealt with the way optimizers have been improved since time immemorial: by collecting use cases and benchmarks and tweaking the heuristics until they handle more and more situations gracefully.
Please stop taking @jameson’s statements out of context. They almost certainly don’t mean what you think they mean. Honestly, I’m not entirely sure what they mean. What I am sure of is that they don’t mean “I implemented the small union stuff and we’re all screwed — this will never work.” which is how people are representing them here.

In short, we are quite clear on how missing values will work in Julia: missing data is represented with small unions. The performance is already good and it will improve over time. There are some issues to be worked out in terms of which programming patterns are best for dealing with this representation, but the uncertainty is quite overblown in this thread. These patterns will become clearer over time as more people have worked with the new representation. As indicated by @ExpandingMan and others, it’s already quite usable.

ExpandingMan · March 30, 2018, 8:33pm

To expand on what I was saying, and what I think @StefanKarpinski was echoing: I think it’s really easy to hypothesize the set of every possible use case of missingness from a purely abstract programming perspective, realize that this set is huge and variegated and then totally freak out. The thing is that most of those cases will probably never actually occur (to draw an analogy from quantum mechanics, they are a bit like all the “exotic” non-classical states that will never get visited by your Hamiltonian). At the end of the day algorithms use floats and ints. At the end of the day you have to somehow remove the missings. At the end of the day all the functions you wrote will only make sense on a tiny subset of every possible Julia type. This has been going through my mind all the time that I’ve been following this thread, and I have so far been meek about it because honestly I’m pretty ignorant when it comes to a lot of this data manipulation stuff, and certainly I don’t know the answer to most of the very specific sub-problems that have been raised here (so please don’t take this comment as denigrating the above discussion in any way). One thing I’m not ignorant about however is all sorts of things that might happen to data once it gets out of some dataframe or other storage format (e.g. optimization, differential equations, machine learning, monte carlo simulations, rendering): and in those cases I can tell you with some confidence that most of this stuff will just go away.

piever · March 30, 2018, 9:30pm

As mentioned above, I’ve implemented this map like behavior (relying on inference only in the empty case) for iterables of Tuples or NamedTuples in JuliaDB here. I think at least in terms of JuliaDB it gets us almost all the way there with one concern (in the JuliaDB case, the Query case is possibly different as I’m using the trick of only copying the columns where the data type actually gets wider which cannot be implemented for all Query’s sinks) for the case Union{T, Missing}:

The map like approach will forget that a column allows missing if no missings appear. Meaning a column Union{Int, Missing}[1,2,2] will always be transformed in Int[1,2,3]. There is no way around this if we don’t use inference (especially because one could - and it is often useful to do so - combine all sort of lazy operations on iterators). Even using inference it is a tricky problem.

This concern has two negative consequences :

Columns that allow missing data will stop doing so unless some data is actually missing.
In the case of distributed storage for JuliaDB, as some chunks may have missing data and other may not have it in the same columns, we would end up with a distributed table where the various chunks are of different type

I think we need to decide whether, given these issues, as well as the set of issues that have prevented Query from transitioning to Missing so far, we should:

Option 1) Acknowledge the limitations of the Union{T, Missing} scenario and consider that some packages may prefer the Union{NonMissing{T}, Missing} implementation (or maybe even the typed Missing version, we’d need to figure this out), potentially in a way that is invisible to the user: meaning the public API of skipmissing, a future lift and so on would take care of both cases and functions that have a method to lift Missing would also need to have a method to lift NonMissing

Option 2) Stick to the Union{T, Missing} as the only officially supported way. Either packages can be rewritten to use that representation or they have to use their own missing data, as has been the case for Query so far

In my view, there clearly is a trade-off between option 1) and option 2) (simplicity versus universality) and I am hopeful that in this thread we could find some sort of compromise and work together to implement whatever still needs to be done.

jameson · March 31, 2018, 5:15am

As this can happen independent of Missing, it seems like a robust implementation should either handle this intelligently or require the user to handle it explicitly. It doesn’t seem like this should be a driving concern for here.

nalimilan · March 31, 2018, 9:45am

Is 2. really a problem in practice? This shouldn’t make any difference in behavior, and if you need to combine the resulting columns again in a single storage then promote_type or promote_typejoin will give you the expected column type.

piever · March 31, 2018, 12:16pm

This is an interesting shift in perspective, in the same way as one could say that Union{T, Missing} being slow on Julia 0.6 is not a limitation of the approach, on the contrary it highlighted a limitation of Juila (about small unions being slow when they don’t need to) which was then fixed. In the same way, this is highlighting a limitation of JuliaDB (it only works well with concretely typed columns) because, unless one uses Union for missing data, it is really quite uncommon to have a union of two types in the same column.

Thinking more carefully about it, I think that problem 2. (different chunks have different types - which can happen both when splitting for out-of-core operations or for grouping according to some columns) has in turn two components:

Make sure that everything (like ungrouping for example) still works. In the case of grouping this will require some work on flatten which is the JuliaDB equivalent of combine but I’m confident it can be done. On the “out-of-core” side of things, I’m less sure as I’m not very familiar with the out-of-core machinery implementation details
Exacerbates the problem that columns may not allow missing all of a sudden (now things get tricky even if only one chunk doesn’t have missing data), but this really is the same as the first concern I was mentioning above.

piever · March 31, 2018, 12:34pm

To fix problem 1) instead, one could instead encourage a use of some API that works regardless of whether the columns allowed missing initially.

For example, when collecting an iterable of NamedTuples without eltype and length, we check whether we need to expand the type and then push!. When the iterable has a length we preallocate and do the same with setindex! (check whether one needs to widen type, then do setindex!).

We could simply create some API that simplifies doing these two things and show user that this is how one works with potentially missing data. I’m not sure how easy or hard it will be to create this API.

JuliaDBMeta could potentially help here and @byrow! would already use the smart potentially widening setindex!. We could also change collect_columns to work already from a given starting point (meaning it would take a table and an iterable and add the rows of the iterable at the end of the table) and we could recommend using that instead of push!.

EDIT: one crucial optimization here would be to allow converting from Array{T, N} to Array{Union{T, Missing}, N} without copying and then it really would not make much difference whether one was initially allowing missing data or not.

quinnj · March 31, 2018, 6:40pm

It could maybe work in some cases to convert from Array{T, N} to Array{Union{T, Missing}, N}, but basically only when realloc is able to use the same memory range. It’s not that terrible though to just malloc a new region big enough for Union{T, Missing} and then copy everything over. We probably need a little extra machinery in Base though to handle this though. Would you mind opening an issue about it? It should also be pretty straightforward to have the opposite conversion, Array{Union{T, Missing}, N} to Array{T, N} w/o copying.

jameson · March 31, 2018, 8:45pm

This operation is O(1) (amortized), the exact same as the push! operation you’re comparing it against (but also generally with a lower constant factor – although strictly speaking that’s an implementation detail). It’s thus not possible for this to be a crucial optimization, since it’s guaranteed to only spends a minority of the time here.

piever · April 2, 2018, 10:34am

That’s a good point, so this optimization is mainly relevant in the setindex! case. Luckily, in practice setting an element of an Array to missing should happen mostly when initializing. If there is a regular value already, it’s strange to set that value to missing.

@quinnj I’ve opened the issue here

Topic		Replies	Views
Type inference of tables /w missing cells Internals & Design inference , type , suggestions , tuple	4	810	February 17, 2019
Broader (non-concrete) types in NamedTuple Internals & Design question	5	620	October 12, 2018
Compatibility of Query and Union{T, Missing} Data	3	1737	November 28, 2017
Type stability problem with NamedTuple of Union{T,Missing} General Usage	1	393	November 14, 2018
Announcement: An Update on DataFrames Future Plans Data announcement	41	9247	December 27, 2017

Missing data and NamedTuple compatibility

Related topics