Missing data and NamedTuple compatibility

Yes, we may need a way for inference to compute the promoted type directly rather than first computing the full Union and then simplifying it with promote_strict_type.

Sure. In general that shouldn’t be too hard, since all it takes is a method in Base that types can overload to opt in to this behavior.

I see, so maybe one could implement something like Base._return_type, but have it use a trait like:

# Fallback: by default, keep the Union outside the container
_keep_union_within_container(::Type) = false
# Opt-in for Missing: produce a single Container{Union{T, Missing}}
_keep_union_within_container(::Type{Missing}) = true

internally (to decide whether it should return a Union{Container{T}, Container{S}} or Container{Union{T, S}}) and any type that wants the latter behavior would overload it. I wonder if one could even change Base._return_type directly, rather than having a separate function (of course only for Missing, or whichever type decides to overload _keep_union_within_container).
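
For concreteness, here’s a hypothetical sketch of how such a trait could be consulted when combining two candidate element types (combined_container_type is a made-up name, not an actual Base function):

# Decide where the Union goes when combining element types T and S
function combined_container_type(Container, T, S)
    if _keep_union_within_container(T) || _keep_union_within_container(S)
        return Container{Union{T, S}}             # e.g. Vector{Union{Int, Missing}}
    else
        return Union{Container{T}, Container{S}}  # keep the Union outside
    end
end

combined_container_type(Vector, Int, Missing)  # Vector{Union{Missing, Int64}}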

My PR has been merged, so now map, collect, broadcast and similar functions use Union{T, Missing} rather than Any, including with Tuple and NamedTuple. This should satisfy all our needs in terms of user-visible behavior, right @davidanthoff?

Now the next step is to find a way to handle missing values efficiently. I haven’t really looked at how hard implementing the various definitions we evoked would be.

Bravo! And in time for 1.0 too!
About the next steps, how tough do you think it will be to restore performance for this sort of union type inference?

julia> @btime $[1, 2] .+ $[1, missing]
  26.469 μs (40 allocations: 1.34 KiB)
2-element Array{Union{Missing, Int64},1}:
 2       
  missing

julia> @btime $[1, 2] .+ $[1, 2]
  30.755 ns (1 allocation: 96 bytes)
2-element Array{Int64,1}:
 2
 4

EDIT: A note: on a 31-day-old master without this PR, the broadcast to Any is actually faster:

julia> @btime $[1, 2] .+ $[1, missing]
  16.986 μs (45 allocations: 1.42 KiB)
2-element Array{Any,1}:
 2       
  missing

I have no idea how hard it will be to improve performance in general. There is probably some low-hanging fruit. In the broadcast example you give, it appears that broadcast_nonleaf is used, and it’s a lot slower than the method used for concrete types. The approach is similar to map: it starts with the type of the first elements, and widens it as needed. Apparently the type instability really kills performance. (The regression from an older master could be due to the fact that now the array needs to be copied when missing is encountered, since the original Int element type cannot hold it. It would be worth trying with larger arrays, both actually containing and not containing missing values.)
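
For reference, a rough sketch of that copy-and-widen strategy (assuming a non-empty 1-based vector; this is not the actual Base code):

function collect_widen(f, xs::AbstractVector)
    y = f(xs[1])
    out = Vector{typeof(y)}(undef, length(xs))
    out[1] = y
    for i in 2:length(xs)
        y = f(xs[i])
        if !isa(y, eltype(out))
            # widen: `out` changes type here, which is exactly the type
            # instability that hurts performance
            wider = Vector{Base.promote_typejoin(eltype(out), typeof(y))}(undef, length(xs))
            copyto!(wider, 1, out, 1, i - 1)
            out = wider
        end
        out[i] = y
    end
    return out
end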

As noted above, a possible solution would be to special-case Union{T, Missing} (or all small unions of concrete types) so that we would always use Array{Union{T, Missing}}, and if at the end we haven’t encountered both types, perform a conversion. Many more optimizations are likely possible, e.g. improving the efficiency of getindex/setindex on arrays with Union element types would also help in general.
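
A minimal sketch of that special-casing idea (collect_missable is a hypothetical name; not actual Base code):

# Always fill a Union{T, Missing} array, then narrow with a single
# conversion at the end if no missing value showed up.
function collect_missable(xs, ::Type{T}) where {T}
    out = Vector{Union{T, Missing}}(undef, length(xs))
    sawmissing = false
    for (i, x) in enumerate(xs)
        sawmissing |= x === missing
        out[i] = x
    end
    return sawmissing ? out : convert(Vector{T}, out)
end

collect_missable(Any[1, 2, 3], Int)        # Vector{Int64}
collect_missable(Any[1, missing, 3], Int)  # Vector{Union{Missing, Int64}}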

I’ve filed another issue to discuss a possible approach based on improving inference.

@davidanthoff When filing issue #25925, I realized that inference is currently quite good with tuples, but not with NamedTuples (details in the issue). Do you think Query would work well with Missing if it used tuples rather than NamedTuples? Of course I’m not suggesting to use tuples, I’d just like to find out whether improving inference so that type information about NamedTuples is as rich as that for tuples would be enough. That would help arguing for these changes.

As developed in the issue, the idea I have in mind is that return_type would be called, and if it gives a precise enough type (e.g. either a concrete type or a Union{T, Missing} type, or a tuple thereof), it would be used to choose the element type of the result in advance.
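
A hypothetical predicate for that “precise enough” test could look like this (isprecise and eltype_hint are made-up names):

isprecise(T) = isconcretetype(T)
isprecise(::Type{Union{T, Missing}}) where {T} = isconcretetype(T)
isprecise(T::Type{<:Tuple}) = T isa DataType && all(isprecise, T.parameters)

function eltype_hint(f, xs)
    T = Base._return_type(f, Tuple{eltype(xs)})
    return isprecise(T) ? T : nothing  # nothing => fall back to widening
end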

7 posts were split to a new topic: KeyedTuple and inference

Alright, I finally managed to catch up with this thread and some of the issues that it links to. Sorry for being slow.

My high level comment is that I think we need to get clear feedback from the core devs on whether a strategy that relies on inference is all of a sudden blessed. The message I got from pretty much all the core devs was really clear: don’t use inference to pick container types. If they have changed their mind about that, it seems fine to investigate that route, but if not, it seems like a dead end to me. And at least I have not gotten any indication that there is a change in guidance on that point.

So my current plan for Query.jl is to actually get rid of the reliance on inference. I’ve put a fair bit of time and effort into that already. If there was a change of opinion about inference on the part of the core devs, I’d love to know, because it would mean I don’t have to spend time on that effort. Having said that, the current version of inference clearly warrants a move away from it in the Query.jl world: there are lots and lots of situations where things fail because inference can’t resolve things. I’ve pretty much mapped out how I can do that at this point. Essentially I would use the current map strategy throughout Query.jl. That will work great and be fast with DataValue, but I don’t see how it would work with Union{T,Missing}, because of the issues I described above.

So I’m not sure we are any closer to a solution for this situation than we were when we started talking about this problem…

Alright, now some random reactions to various things from above:

@nalimilan I looked at https://github.com/JuliaLang/julia/pull/25553, and just want to make sure I understand it correctly. That is essentially an implementation of what I described as case “1. Plain array sink”, right? So it gets you an array of the right type, but it ends up copying things n times in the process, so it really is quite inefficient?

Also, does this really work on master? I just tried the following:

julia> [(3,missing),(missing,6.)]
2-element Array{Tuple{Any,Any},1}:
 (3, missing)
 (missing, 6.0)

julia> [(a=3,b=missing),(a=missing,b=5.)]
2-element Array{NamedTuple{(:a, :b),T} where T<:Tuple,1}:
 (a = 3, b = missing)
 (a = missing, b = 5.0)

so in neither case does it produce the type of array we would like to see, right? Or was that PR meant for some other scenario?

In general, the design with promote_typejoin seems nice in that it allows new types to plug into that mechanism. But on the other hand, it is also a pretty clear case of what I meant when I wrote in some other thread a while ago that the Union{T,Missing} approach really leads to a design that is not very composable: now types like NamedTuple have to explicitly opt in and provide code to handle these cases with missing values. In my mind that is the poster child of an uncomposable design: named tuples have nothing to do with missing data, and missing data has nothing to do with named tuples. In a composable system these two would just work together without extra integration code. This is a real issue, at least for Query.jl: one of the core design goals of Query.jl is to not just work for tabular data, but all sorts of data. So named tuples are just one case of a much larger class of structures that Query.jl needs to work with. But now, any other structure that one might want to use in the place of a named tuple needs to provide the same integration via promote_typejoin with the missing data story. I guess it can work, but it does not strike me as a really good design.

@piever made a point above that actually made me realize that there is another issue related to what we are discussing here that we haven’t even talked about, but that also seems tricky. So in an iterator pipeline with named tuples with missing data we really can have two different situations with the Union{T,Missing} story: early on in the pipeline, when we iterate directly from say a DataFrame, the stream of elements would all have the same element type, i.e. the named tuples’ fields would always be Union{T,Missing} for columns that can have missing values. But after a map (or similar operation, there are lots of those in Query.jl), we’ll have these streams of 2^n different named tuple types. So now, if we chain another iterator after such a map, and that iterator is a higher order function like say filter, then we have another, new problem: the anonymous function in the filter would now probably compile specialized methods for 2^n different type signatures, because it of course would be called with the 2^n different named tuple types. That can’t be good :) So this is actually quite a distinct problem from the ones we have discussed so far.

I also just saw @StefanKarpinski’s message re Julia 1.0 and that the new optimizer is meant to help with the woes in the data space. Does someone have more details about that? What I’ve seen so far seemed to address the issues from the iteration protocol, which seem quite distinct from the issues here, so it would be great to hear more about to what extent the new optimizer would actually help with the problems in the data stack. Am I wrong to assume that in principle a new optimizer could help with the performance problem of anonymous functions that return 2^n types, but not really address the other (probably more difficult) issues discussed here?

Once again, you need to distinguish between choosing the return type using inference and optimizing the implementation behind the scenes using inference. The former is indeed strongly discouraged, but I don’t think core devs are opposed to the latter.

That PR doesn’t in itself propose any particular solution. But the solution I propose in issue 25925 is completely different from the “plain array sink”, precisely because you’d use inference to detect fields that may be Union{T, Missing}, and allocate an array which allows for missing values. If in the end the data does not actually contain missing values, the resulting array would be converted, so that the return type does not depend on inference, but only on the data (like map currently).

That part of my PR was reverted by Jameson in the meantime. And since nobody supported my position…

My PR 25924 hasn’t been merged, so there are no special promote_typejoin methods for Tuple or NamedTuple. All promote_typejoin does is return Union{T, Missing} instead of Any for promote_typejoin(T, Missing).
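
For illustration (0.7-era behavior, as far as I can tell):

julia> Base.promote_typejoin(Int, Missing)
Union{Missing, Int64}

julia> Base.promote_typejoin(Int, Float64)  # non-Missing case falls back to typejoin
Real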

This sounds like a very similar problem to me. But why do you say map will create 2^n different types if the original stream provides named tuples with the same type?

I don’t really know, but I’m afraid it won’t help given that if we generate 2^n different types the compiler has no choice other than specializing on too many types, or falling back to not specializing anything.

EDIT: hit return too fast.

That’s a key point: IndexedTables is in the same situation. In a few cases the output type depends on the outcome of inference, which it shouldn’t, whereas I think it is fine to use a faster implementation if one can infer a concrete type. There were some discussions here and here (_promote_op is the function that determines the expected output type using inference). If a good implementation is found (of a map-like behavior, optimized when inference works), IndexedTables would be a good place to try it out (it is needed for quite a few functions: groupby, join, map, groupreduce).

EDIT: I’m sorry to hear @nalimilan’s PR was reverted. What’s the proposed workaround for calling map, collect and broadcast on arrays with missing data?

Interesting. Could you provide an example where this is the case? How does the current implementation manage to be efficient in these cases?

The revert only affects the element type of arrays of tuples whose fields can be missing. Array{Union{T,Missing}} still uses an efficient representation. It’s not clear to me what the best proposed solution is for working with arrays of tuples or named tuples with missing fields.

@shashi is for sure much more qualified than I am to comment on this. My understanding is that now there are two different strategies, which favor performance but are not 100% correct.

  • map (and I believe also groupby) tries to infer the output type and preallocates an adequate number of columns. If the output type can’t be inferred, it creates a vector of NamedTuples instead (which is indeed not ideal, see for example #101 and #110). What we could do instead would be to then convert this vector into separate columns: it’d be inefficient, but at least the result would not depend on the inferrability of the operation.

  • groupreduce is even more radical and chooses the output element type according to the result of the first group. This of course will need to change with Missing: if the first result is missing, one would have no information to work with (see the sketch below).
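
Here’s the groupreduce problem in miniature:

results = Any[missing, 2, 3]                 # imagine these arrive one group at a time
out = Vector{typeof(results[1])}(undef, 3)   # Vector{Missing}: no information about T
out[1] = missing                             # fine
out[2] = 2                                   # MethodError: Int cannot convert to Missing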

Do you think it’s possible to create a milestone issue about this set of problems? I believe they are quite urgent and I don’t have a strong intuition as to whether they can be solved in a non-breaking way.

Well, I’m a bit disappointed that IndexedTables.jl calls promote_op. It didn’t when I started working on it. But there’s a hierarchy of inference-dependence:

  1. Best: don’t ever call promote_op or return_type.
  2. Mostly OK: call return_type, but only use it as a performance hint, making sure to give the same result no matter what it returns (sketched below).
  3. Bad: program result depends on return_type.
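
A minimal sketch of style (2) for a map-like function (map_hinted is a made-up name, and I assume matching 1-based indices): inference is consulted only to pre-allocate, and anything imprecise falls back to a value-driven path that yields the identical result.

function map_hinted(f, xs::AbstractVector)
    T = Base._return_type(f, Tuple{eltype(xs)})
    isconcretetype(T) || return map(f, xs)  # value-driven fallback, same result
    out = Vector{T}(undef, length(xs))
    for i in eachindex(xs, out)
        out[i] = f(xs[i])
    end
    return out
end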

As for promote_typejoin, we’ll have to take it up with Jameson. If I recall, his concern is about the overhead of copying part of the result every time the element type changes. But I doubt we have sufficient data on this. We could also try style (2) above, using return_type to “guess” an element type.

As for the non-composability of needing promote_typejoin methods for new containers, I agree with David only to a certain point. First, how many row types do we really need? You can represent any kind of array or table data with just tuples, so that type is a rather low-level aspect of the implementation. It doesn’t seem so crazy that a new kind of row would need one extra method to fine-tune its type behavior (certainly, I hope, not affecting correctness!). Second, don’t we need these kinds of definitions anyway? Wouldn’t you similarly want to promote MyRow{DataValue{T}} and MyRow{DataValue{S}}?

Don’t worry about that — we already compile way too many specializations and it’s on us to do better. And at the risk of stating the obvious, all 2^n combinations won’t actually occur; if you have 50 columns we will not literally need 2^50 methods…

I’m working on an inference-free implementation of map in IndexedTables (in Julia 0.6) similar to Base.map (the Columns type actually helps quite a bit in the implementation). I’ll try to make a WIP PR soon so that we can discuss more concretely.

Update: pull request opened here. It uses some of IndexedTables.Columns machinery but the key parts should be clear even without knowing about Columns:

  • How to decide when to widen (I compare with <: field by field)
  • How to widen (I use “fieldwise” typejoin which I believe should become “fieldwise” Base.promote_typejoin on Julia 0.7)

I have the impression that doing the operations fieldwise solves the combinatorial explosion issue we were discussing above, but we certainly need quite a bit of testing to be sure.
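
A hypothetical sketch of the fieldwise widening for NamedTuple rows (widen_fieldwise is a made-up name; I use Base.promote_typejoin so that Missing fields widen to Union{T, Missing} rather than Any):

function widen_fieldwise(::Type{NamedTuple{names, S}},
                         ::Type{NamedTuple{names, T}}) where {names, S, T}
    # join the two tuple types of field types, one field at a time
    joined = Tuple{(Base.promote_typejoin(s, t) for (s, t) in zip(S.parameters, T.parameters))...}
    return NamedTuple{names, joined}
end

widen_fieldwise(typeof((a = 1, b = missing)), typeof((a = missing, b = 2.0)))
# NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}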

Here are some reactions to specific points from above. I’ll post another message right after that about another, broader question.

I just don’t have an idea how to do that. If I try to follow Jeff’s three inference-usage rules from earlier in the thread (which essentially is how I had understood the core devs’ position all along), then it seems to me that one can reasonably use inference results if they are concrete, but not for anything else. And with Missing one won’t ever get a concrete return type, so essentially that implies that one can’t use inference.

That still seems to imply a lot of copying in pretty common situations. Say you have a source DataFrame with a column of type Union{T,Missing} that actually doesn’t contain any missing values. A simple projection over that would always end up making a full copy at the end. That seems way too common a scenario to me to accept that copying overhead.

Say you start out with df = DataFrame(a=[1,2,missing,missing], b=[1,missing,2,missing]). When you iterate that, the element type will be NT{a::Int?, b::Int?} (again some type syntax I just made up). If you then create a generator over that à la (a=i.a, b=i.b) for i in df (assuming DataFrame were to implement iteration with named tuples for rows), that generator will generate elements with the following types: NT{a::Int,b::Int}, NT{a::Int,b::Missing}, NT{a::Missing,b::Int} and NT{a::Missing,b::Missing}. With n columns you can end up with 2^n types (to be clear, that is the worst case upper bound).
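
In real syntax, the blow-up looks like this:

rows = [(a = 1, b = 1), (a = 2, b = missing),
        (a = missing, b = 3), (a = missing, b = missing)]
unique(map(typeof, rows))
# 4 distinct concrete NamedTuple types, from Tuple{Int64,Int64} through
# Tuple{Missing,Missing}, i.e. 2^2 for two Union-typed columns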

Thanks, that is very clear.

Well, Query.jl is not just about rows, it really is meant to bridge between a tabular data world and everything else. So I use it all the time to for example project tables into arrays of custom types. Those aren’t rows, they are often types from some domain specific project that is completely unrelated to any data stack story.

That is a good point that I hadn’t considered. I think in practice this doesn’t happen very often in the typical Query.jl cases (at least in my use-cases): you typically start out with a collection of elements that all have the same type, and then in general with DataValue there is just hardly ever a situation where one ends up with a stream of heterogeneously typed elements, so the whole promote machinery just hardly ever kicks in. Less clear to me how often that assumption breaks down, though.

Well, I do worry about it, because I see nothing on the 1.0 milestone about this, so I’m operating under the assumption that this is a post-1.0 feature, which would mean Missings is off limits for quite a while for a fair number of packages. Also, could you elaborate on what the proposed solution for this would be? Just not specializing also doesn’t sound great; after all, we do want these small anonymous functions to be really fast (they are really in the innermost loop) and ideally inlined into the function that hosts the loop that iterates things. The way this plays out with DataValue right now is pretty much ideal: typically one version gets compiled for each table structure, and that one is then inlined into the collecting loop. So there is a small JIT overhead that is not an issue, but then the loop is super fast.

Well, it depends on your data. It is obviously an upper bound, but I think for a typical dataset one can easily end up with way too many specializations…

This doesn’t make much sense to me — whether an inference result is concrete is orthogonal to whether you use it in a way that causes your code to give an inference-dependent answer.

I really like @piever’s collect implementation, great to see that! I think that essentially qualifies as what I described above as the strategy for struct of array sinks, right? But of course that only works in the more narrow case of a struct of array sink (so for JuliaDB this is great), but the more general Query.jl requirements don’t really match with that (i.e. to also support array of struct sinks and persistent sinks, the two scenarios I described above).

There is one aspect to that strategy that will be weird with Missing, though. Say you have df = DataFrame(a=[1,2,3,4], b=[1,2,missing,missing]). So column b will be of type Int?. The question now is: what column types should one get for a query like df |> @filter(_.a<3) |> DataFrame? I think column b in this case should still be of type Int?. That to me seems to be how most systems like this would work. But with a Missing implementation that doesn’t rely on inference in a discouraged way, the column type would be Int. With DataValue that is not a problem: @piever’s implementation will just produce what to me seems the expected and correct result, namely an Int? column.

Maybe my intuition on this is off here, but the following seems like a useful analogy: imagine if filter(i -> i<3, [1,2,4.,6.]) returned an Array{Int}. That would surely be weird, but the missing case above seems very similar to me.
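
(For reference, filter does preserve the input element type:)

julia> filter(i -> i < 3, [1, 2, 4.0, 6.0])
2-element Array{Float64,1}:
 1.0
 2.0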

UPDATE: Actually, my example was incomplete. It also needs a map in there. So say something like df |> @filter(_.a<3) |> @map({_.a, _.b}) |> DataFrame

I might not understand you fully here, but in this case it sounds like you’re just dealing with values themselves, and there are no row containers, and hence no custom row containers to define promotion rules for.