Missing data and NamedTuple compatibility

piever · January 3, 2018, 5:52pm

By adding support for missing data and NamedTuples Julia is offering very solid foundations to build a data ecosystem upon. While I’m extremely happy that these two features have been added (especially given the extensive discussion and work that they required), I’d like to show that there are some scenarios where I’m afraid they do not play nicely with each other and see what are possible solutions (I’m not an expert on the technical side though, so my ideas in this respect may be flawed/unfeasible). This issue has been already mentioned elsewhere, but I hope it can be useful to have a writeup that’s understandable also for non-experts (such as myself).

Let’s start with a concrete example. When updating the package MySQL to the new version of DataFrames, it became clear that there was something tricky with the corresponding row iterator. To be more specific, let’s imagine that I’m streaming data from a remote dataset with two columns, :x and :y, which contain integers and can both have missing data. A first obvious attempt is to iterate each row as a tuple, with a value for :x and one for :y. However, this is type unstable, as every row could then have 4 different types:

Tuple{Int, Int}
Tuple{Int, Missing}
Tuple{Missing, Int}
Tuple{Missing, Missing}

This seems bad as the number of possible types increases exponentially with the number of columns of the dataset. Note that explicitly typing the returned Tuple doesn’t actually do anything, as Tuples are covariant, therefore:

julia> typeof(Tuple{Union{Missing, Int64}, Union{Missing, Int64}}((1, 2)))
Tuple{Int64,Int64}
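
To make the exponential growth concrete, here is a minimal sketch that enumerates the concrete row types for n missable Int columns. The helper name `row_types` is made up for illustration and is not part of any package.

```julia
# Hypothetical helper: every concrete Tuple type a row with `n` missable
# Int columns can take (each column is either Int or Missing).
row_types(n) = vec([Tuple{c...} for c in Iterators.product(ntuple(_ -> (Int, Missing), n)...)])

row_types(2)           # the four types listed above
length(row_types(10))  # 2^10 possible row types for ten columns
```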

The next attempt is to use NamedTuples instead as IIUC they are not covariant so the explicit typing is effective:

julia> typeof(NamedTuple{(:x, :y),Tuple{Union{Missing,Int}, Union{Missing, Int}}}((1, 2)))
NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}

This effectively solves the problem of what a data source should iterate: an explicitly typed NamedTuple where some fields explicitly accept missing data.
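
As a small check of this claim, the sketch below converts four raw tuples (one per missingness pattern) into the explicitly typed NamedTuple and compares how many distinct types show up. The names `Row`, `raw` and `rows` are only illustrative.

```julia
# Explicitly typed row: both fields accept missing.
Row = NamedTuple{(:x, :y), Tuple{Union{Missing, Int}, Union{Missing, Int}}}

raw  = Any[(1, 2), (1, missing), (missing, 2), (missing, missing)]
rows = [Row(t) for t in raw]

length(unique(typeof.(raw)))   # one type per missingness pattern
length(unique(typeof.(rows)))  # a single row type
eltype(rows) === Row
```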

However, as was mentioned here by @davidanthoff, applying a function elementwise to a vector of NamedTuples would quickly lead to type instabilities:

julia> v = NamedTuple{(:x, :y),Tuple{Union{Missing,Int}, Union{Missing, Int}}}.([(1,1),(1,missing),(missing,1),(missing,missing)])
4-element Array{NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}},1}:
 NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((1, 1))            
 NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((1, missing))      
 NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((missing, 1))      
 NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}((missing, missing))

julia> map(i -> (s = i.x, t = i.x+i.y), v)
4-element Array{NamedTuple{(:s, :t),T} where T<:Tuple,1}:
 (s = 1, t = 2)            
 (s = 1, t = missing)      
 (s = missing, t = missing)
 (s = missing, t = missing)

Note that here the resulting Vector has lost the typing of its NamedTuples.

This I believe would be a problem both for Query.jl and JuliaDB.jl, were they to transition to Missing, as they both have this map operator (in JuliaDB it actually is map, in Query it is @map or @select and corresponds to a LINQ select statement). The only way to rescue this, right now, would be to explicitly type the outcome of the anonymous function that map uses, which would greatly damage usability.
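
For reference, the workaround of explicitly typing the outcome would look roughly like the sketch below. The alias `Out` is made up; the point is only that the user has to spell out the output type by hand.

```julia
Row = NamedTuple{(:x, :y), Tuple{Union{Missing, Int}, Union{Missing, Int}}}
v = Row.([(1, 1), (1, missing), (missing, 1), (missing, missing)])

# The user has to write the output row type explicitly...
Out = NamedTuple{(:s, :t), Tuple{Union{Missing, Int}, Union{Missing, Int}}}

# ...and wrap the result of the anonymous function in it.
w = map(i -> Out((i.x, i.x + i.y)), v)
eltype(w) === Out
```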

Given that a Vector of NamedTuples has been proposed as a possible “general” table representation and that being unable to apply map to it in a type stable way is a clear drawback, I wondered whether we need a new type, something like a NullableNamedTuple, which by default allows missing data. For example NullableNamedTuple{(:x, :y),Tuple{Int64, Int64}} would be equivalent to NamedTuple{(:x, :y),Tuple{Union{Missing, Int64},Union{Missing, Int64}}}. Then, one could use map(f, v) where v is a vector of NullableNamedTuples and f takes as input a NullableNamedTuple and outputs a NullableNamedTuple. Note that this would be reasonably easy to incorporate in Query as it already uses a special syntax {...} for NamedTuples and it would maybe be possible to use that syntax to instead mean NullableNamedTuple.
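
A rough sketch of the equivalence I have in mind, written as a helper on ordinary NamedTuple types rather than as a new type. The name `missable` is hypothetical and nothing like it exists in Base as far as I know.

```julia
# Hypothetical helper: widen every field type of a NamedTuple type so that
# it also accepts `missing`.
function missable(::Type{NamedTuple{names, T}}) where {names, T<:Tuple}
    widened = map(S -> Union{Missing, S}, fieldtypes(T))
    return NamedTuple{names, Tuple{widened...}}
end

# Wrap a row: same values, missable field types.
missable(row::NamedTuple) = missable(typeof(row))(Tuple(row))

missable(NamedTuple{(:x, :y), Tuple{Int, Int}})
row = missable((x = 1, y = 2))
```

Note that this only works when the non-missing field types are known, which is exactly the information a NullableNamedTuple{(:x, :y),Tuple{Int64, Int64}} would carry in its parameters.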

Another alternative would be to have a simplified syntax to declare which fields would accept missing data. For example (the notation is made up) (x ?= 1, y = 2) would be of type NamedTuple{(:x, :y),Tuple{Union{Missing, Int64}, Int64}}. In this case, to apply map the user would have a slightly less convenient syntax and would have to type:

map(i -> (s ?= i.x, t ?= i.x+i.y), v)

The trade-off here would be whether it’s better to be explicit about which fields can have missing data, at the cost of a more verbose syntax.

davidanthoff · January 4, 2018, 12:46am

If {} in Query.jl mapped to NullableNamedTuple (or would it be MissableNamedTuple?) you would always get a result where every column supports missing values, even if you started with a source that didn’t have any support for missing values. That seems really less than ideal, compared to the current behavior, where missing columns flow through the query as missing columns, and normal columns (i.e. Vector) flow through as normal scalar values that end up as a normal Vector in the result.

I think adding specific syntax in the named tuple construction that indicates missable fields (what is the right term here? The equivalent to “nullable fields”?) is just too verbose for something like Query. By far the most common type of query projection is simply one that selects a subset of columns with say @map({_.colA, _.colB, _.colD}). With this proposal one would have to know whether one of these source columns supported missing data or not and then use different syntax depending on that for that specific column. I just don’t think that is user friendly. Just imagine you had to write a different SELECT statement in SQL depending on whether a source column supported NULL or not to do a simple projection…

I have to say it still baffles me why a solution with such major unresolved technical issues was put into Base… If it had been at least the stdlib, then maybe one could have iterated on the design before julia 2.0… But well, I clearly lost that argument. I think the reality of all of this is that we’ll simply have a data stack that uses two different missing data implementations (Query.jl and friends will just continue to use DataValue, which has none of these problems and has worked really well over the last year, i.e. there isn’t really any major pain point with that approach that Missings would solve).

kristoffer.carlsson · January 4, 2018, 12:57am

Have you measured the effect on performance of this type instability? Julia 0.7 has quite a lot of new Union optimizations.

piever · January 4, 2018, 3:34am

I didn’t do rigorous benchmarking. My understanding is that the new optimizations concern small unions, whereas here the number of possible types is 2^n where n is the number of “missable” columns. When this number gets too big, I’m afraid the type information gets lost completely and one is left with a Vector of untyped Tuples.

piever · January 4, 2018, 3:57am

My idea was to take care of that when collecting. When the iterable of NullableNamedTuples (a better name is needed…) is collected into say a DataFrame, the columns that do not have any missings are converted to regular Vector{T} whereas the others stay as Vector{Union{T, Missing}}. This conversion could maybe be possible without copying, though I’m not sure about that.
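
A minimal sketch of that narrowing step, assuming a copy is acceptable (so it does not answer the no-copy question). The helper name `narrow` is made up; it relies on broadcasting `identity`, which picks the element type from the values actually present.

```julia
# Hypothetical helper: tighten a column's element type to what it actually
# holds. This allocates a new vector.
narrow(col::AbstractVector) = identity.(col)

x = Union{Missing, Int}[1, 2, 3]        # allows missing, contains none
y = Union{Missing, Int}[1, missing, 3]  # allows missing, contains one

eltype(narrow(x))  # Int64
eltype(narrow(y))  # Union{Missing, Int64}
```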

Inside the query, it is true that all columns would accept Missings, but that shouldn’t be a concern: a key advantage of the Missing approach (over a container approach) is that the code doesn’t need to be changed if some column allows missing data if there actually is no missing data (whereas DataValue would require the occasional get as soon as it encounters a function that is not “whitelisted”).

One might argue that this solution is still not ideal as in the final output some columns that “should be nullable” in the sense that they are a function of nullable columns would not be nullable if there is no missing data in the input. I’m not sure whether this is a problem in practice (though I don’t think it should be). I also don’t think there is a way to avoid this behavior with a Union approach to missing data without relying on Base._return_type.

What drove my curiosity was this comment about JuliaDB. I was trying to understand what solution JuliaDB’s developers had in mind as it seems to me that the same issues that affect Query would also affect JuliaDB (JuliaDB’s map being very similar to Query’s @map).

nalimilan · January 4, 2018, 10:04am

That’s a serious problem, which has been discussed previously in this issue. Thanks for bringing it up again and proposing solutions! The NullableNamedTuples idea sounds clever. However, I would really like these to work by default with plain NamedTuple. Your post prompted me to start a discussion again with some of the core developers, and it looks like we agree on a solution which involves changing how collect and map compute the element type of the returned array.

The idea is that instead of using the “raw” element type Union{NamedTuple{(:x, :y), Tuple{Int64, Int64}}, NamedTuple{(:x, :y), Tuple{Missing, Int64}}, NamedTuple{(:x, :y), Tuple{Int64, Missing}}, NamedTuple{(:x, :y), Tuple{Missing, Missing}}}, these functions would detect that this type is a Union of NamedTuple types with different type parameters, and would move the Union to the type parameters themselves, giving NamedTuple{(:x, :y), Tuple{Union{Int64, Missing}, Union{Int64, Missing}}}. Like your NullableNamedTuples approach, a column would only allow for missing values if some missing values are actually present (this is how map works for all types so this cannot really be changed).
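
To illustrate the idea of moving the Union into the type parameters, here is a hand-written sketch. The function `join_fields` is hypothetical and is not how Base computes element types; it only merges two NamedTuple types with identical names field by field.

```julia
# Hypothetical helper: merge two NamedTuple types with the same field names
# by taking the Union of the corresponding field types.
function join_fields(::Type{NamedTuple{names, A}},
                     ::Type{NamedTuple{names, B}}) where {names, A<:Tuple, B<:Tuple}
    merged = map((a, b) -> Union{a, b}, fieldtypes(A), fieldtypes(B))
    return NamedTuple{names, Tuple{merged...}}
end

rows = Any[(x = 1, y = 1), (x = 1, y = missing), (x = missing, y = 1), (x = missing, y = missing)]

# Fold over the four distinct row types to get a single element type.
T = reduce(join_fields, typeof.(rows))

# Convert every row to that type; the resulting vector has one concrete eltype.
typed = [T(Tuple(r)) for r in rows]
```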

This is actually related to the currently open PR 24332, which is essential to get map to work with missing (even apart from issues related to tuples). But it appears it would make sense to go even further and use a mechanism almost identical to promote to choose the element type. The only change compared with promote is that the computed type must always be able to represent all the values exactly, which doesn’t work currently e.g. for promote(1.0, typemax(Int64)). So we agreed that we would need a separate mechanism, tentatively called promote_strict for that. promote could actually automatically fall back on that function, so that most types only need to implement the former.
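
The promote example can be checked directly. The snippet below only uses existing Base functions and shows why plain promote is not exact here, whereas a Union element type keeps both values as they are.

```julia
a, b = promote(1.0, typemax(Int64))

typeof(b)             # Float64
b == typemax(Int64)   # false: the Int64 value was rounded to 2^63
b == 2.0^63           # true

# A Union element type stores both values without converting them.
exact = Union{Float64, Int64}[1.0, typemax(Int64)]
exact[2] === typemax(Int64)
```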

Help would be welcome to experiment with this if somebody is interested.

piever · January 4, 2018, 3:48pm

I’d tend to agree that having map, collect, NamedTuples and Missing play nice with each other would be much better than an ad hoc NullableNamedTuple type. Good to know that if I propose some hack somebody will come along with a better solution.

It’d be ideal to have an implementation of map that works very well with Missing in Base so that JuliaDB and Query could fall back to that if inference fails (or it could even be the default): I’ve already had several issues trying to apply functions whose return type is not inferred correctly and I think it’s a usability concern as it may be difficult for the average user to figure out what is going on.

Could your proposed improvement be done in the 1.x timeframe?

bramtayl · January 4, 2018, 5:26pm

This seems like a useful strategy but it would be unfortunate to limit it to NamedTuples. Would it be dangerous to get this to work with any nested type?

davidanthoff · January 4, 2018, 5:31pm

There are really two issues here:

The first issue (a) is the performance of the type-unstable anonymous functions that can return 2^n different types, essentially what @piever pointed out above. My understanding is that @jeff.bezanson thinks one could add further compiler optimizations along the lines of the small union type optimization that make such type-unstable functions (that can return these “big, 2^n”-unions) also fast. That would obviously be great. I’ve operated under the assumption that this would come in julia 1.1 at the earliest, given the project schedule for 1.0, but I haven’t heard anything official about that. I’ve said a couple of times that I don’t see how Query.jl could use Missing in the julia 1.0 timeframe, this is the main reason for that. @piever I saw the JuliaDB announcement as well and was also wondering how they plan to solve this if they want to move to Missing on julia 1.0, I think they’ll run into exactly the same issue with the current API (which I really like, I have to say).

The second issue (b) is what do you do with a stream of named tuples that all have these slightly different types? I think there are different stories for different types of sinks:

Plain array sink: if you collect them into a plain array, things are tricky. Let’s say you receive 10k elements with types NamedTuple{a=Int,b=Missing} (I’m using made up type syntax here, this is a shortcut for NamedTuple{(:a,:b),Tuple{Int,Missing}}), so you create a Vector{NamedTuple{a=Int,b=Missing}} and happily push them into that. Now element 10001 you get from your iterator is of type NamedTuple{a=Missing,b=Missing}. So now you realize that you actually need a Vector{NamedTuple{a=Int?,b=Missing}}. At this point you’ll have to allocate a new vector, copy all 10k elements over and add the new element. The memory layout for each element will be entirely different in the new vector, so you will have to actually somehow copy/modify each element individually, no way to use something like just copying a memory region over. If you have n columns this need to reallocate everything can happen n times in the worst case, and if you are unlucky and the type variations only show up at the end of the stream you are going to copy a lot of stuff around.
Struct of array sink (like DataFrame): the situation here is much better and probably not a problem at all. You get your 10k first elements, you allocate one vector Vector{Int} for the first column and one Vector{Missing} for the second column. You happily push your 10k elements into those. Now element 10001 arrives, and you realize that column 1 actually needs to be Vector{Int?}, not Vector{Int}. Changing the type in that way should actually be easily possible without any memory copying, given the memory layout of arrays of small unions (I don’t know whether one can do that right now with the public API, but purely from how things are in memory, this should be simple).
More persistent sinks: examples are you want to stream a query result into a database (you get an iterator of named tuples, you create a table in a DB for that data and stream the data into the DB), a file on disc or transmit something over a network. I don’t see how these cases can work with these streams of changing named tuples. There are lots of external systems where you need to tell the system up-front whether a column should be nullable or not, so the strategy to “widen/adjust” the sink midway through the collecting process doesn’t work in these cases (unless you abort the current stream and start over, but that might also not work in all situations, plus it would have really terrible performance implications).
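
To make the struct-of-arrays case concrete, here is a rough sketch of a sink that keeps one vector per column and widens only the column that needs it. The names `push_widen` and `collect_columns` are made up, and this version copies the column when widening; whether that copy can be avoided is exactly the open question mentioned above.

```julia
# Hypothetical helper: push `x` into `col`, widening the element type of
# this one column if `x` does not fit. Widening copies the column here.
function push_widen(col::Vector{T}, x) where {T}
    x isa T && return push!(col, x)
    wide = Vector{Union{T, typeof(x)}}(col)
    return push!(wide, x)
end

# Hypothetical sink: collect a stream of named tuples into one vector per field.
function collect_columns(rows)
    names = keys(first(rows))
    cols = Any[typeof(x)[] for x in first(rows)]  # start from the first row's types
    for row in rows
        for (j, x) in enumerate(row)
            cols[j] = push_widen(cols[j], x)
        end
    end
    return NamedTuple{names}(Tuple(cols))
end

stream = Any[(a = 1, b = missing), (a = 2, b = missing), (a = missing, b = missing)]
cols = collect_columns(stream)
```

Only column a is widened when the third row arrives; column b stays a Vector{Missing}.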

Another approach to the second issue is to continue to rely on inference and make inference reliably detect these 2^n unions and return them as say NamedTuple{a=T1,b=T2,c=T3} where {T1<:Union{Int,Missing}, T2<:Union{String,Missing},T3<:Union{Float64,Missing}}. But the message I got from the core devs is very clear that we should not rely on inference to decide on container types, plus inference currently doesn’t give us that kind of result, so I’ve just assumed that this is not an option.

I think the solution that @nalimilan outlined above will run into issue (a) and (b.1) and (b.3) on julia 1.0 (unless there are plans to address (a) in the 1.0 timeframe). I can see that with compiler work in julia 1.1 issue (a) could be solved, but then we are still stuck with issue (b.1) and (b.3).

davidanthoff · January 4, 2018, 5:41pm

Couple of random other thoughts:

@bramtayl’s point is an important one, but I haven’t thought that one through. At least for Query.jl it is really key that all of this works not just for named tuples, but any composite type, including deeply nested composite types. I don’t know whether that makes things even more complicated or not, but this is important to keep in mind.
In some way it seems a bit messy that we would have two different ways to represent a row from a sink with columns that can support missing values. Sometimes a sink would get NamedTuple{a=Int?,b=String}, i.e. a named tuple where the field uses a small union, and in other cases it would have to deal with a different type per row, depending on whether a given column has a missing value in that row or not. If all these other things could be solved, this wouldn’t be the end of the world, but it strikes me as a classical leaky abstraction problem where a design mismatch somewhere deep then creeps through and ends up creating different special cases for situations that really shouldn’t be different.
The whole situation of mixing regular tuples and Union{T,Missing} also seems really unfortunate. The fact that one can’t have a tuple as a concrete type with a member of type Union{T,Missing} just seems really weird. I understand why that is the case (the co-variance of tuples), but at the end it just seems quite limiting to have a system where one can’t represent a row as a tuple in a similar way that one can represent a row as a named tuple. Again, probably not the end of the world, but in my mind it points to some real design mismatch.
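
The tuple versus named tuple mismatch can be seen with a few type queries. This only uses existing Base functions; the aliases `T` and `NT` are just shorthand.

```julia
T  = Tuple{Union{Missing, Int}, Union{Missing, Int}}
NT = NamedTuple{(:x, :y), T}

# Tuples are covariant: a concrete row type is a subtype of T,
# but T itself is not a concrete type, so no value has exactly this type.
Tuple{Int, Missing} <: T   # true
isconcretetype(T)          # false

# Named tuples are invariant in their field types: NT is concrete,
# and a narrower named tuple type is not a subtype of it.
isconcretetype(NT)                                # true
NamedTuple{(:x, :y), Tuple{Int, Missing}} <: NT   # false
```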

nalimilan · January 6, 2018, 7:03pm

I’ve made a first WIP PR to illustrate the proposed mechanism: WIP: Add promote_strict mechanism and use it instead of typejoin() by nalimilan · Pull Request #25423 · JuliaLang/julia · GitHub

Yes, it would make sense to extend this to any parametric type. Actually that shouldn’t be hard as the (complex) logic implemented by typejoin could be reused.

@davidanthoff I’m not the best person to comment on performance issues (a), but AFAIK that’s totally doable. Regarding issue (b), I don’t think it’s as problematic as you say. The naive strategy of starting with a narrow type and extending it as needed doesn’t correspond to what will happen in typical cases. We can use inference to guess the return type, so in type-stable situations we can directly allocate a vector of the right type. The slow approach is only needed as a fallback when inference fails. What core developers say is that the observable result should not depend on inference. But the performance can definitely depend on inference, that’s how map works currently in Base (else it would always be slow).

Regarding specifically your point 3 (how to tell databases upfront whether a column will be nullable), I don’t see how DataValue/Nullable allows handling this better than Union{T, Missing}. Indeed, even with DataValue, you cannot know in advance that any operation on it will return a DataValue: some custom functions could return a non-nullable value, just like isnull does. Of course, you can decide that it’s not allowed and always convert the result to a DataValue. That can be a reasonable approach when the database needs to know the type in advance. But if you do that, you could as well apply the same approach with Union{T, Missing}, and decide that all columns resulting from computations on >:Missing columns should also be >:Missing.

If you want to provide more flexibility and allow functions to return nullable values or not depending on their purpose, the only way to find out how an operation behaves is to rely on inference. Maybe it’s not so bad to make it user-visible in particular contexts, like when working with databases. It might be less of an issue than for Julia Base itself, since the context of databases is less dynamic (you always know the input column types, and you need to choose the output column types in advance). But I don’t think there’s any fundamental difference between Union{T, Missing} and DataValue/Nullable in that regard.

Don’t take it bad, but I wish you spent as much energy helping us to find solutions as you spend detailing the difficulties. We really need your experience with all these issues to find a user-friendly, generic and efficient solution.

ScottPJones · January 6, 2018, 7:14pm

Just to bring up a point, that’s not true for a lot of databases (such as the one I worked on for years, or most other “NoSQL” databases).

piever · January 7, 2018, 1:35am

That PR looks like a major step forward. I hadn’t noticed before but also broadcast has problems with missing on Julia 0.6 and it seems like that would also be addressed in your PR:

julia> [1, missing]
2-element Array{Union{Int64, Missings.Missing},1}:
 1       
  missing

julia> [1, 2] .+ [1, missing]
2-element Array{Any,1}:
 2       
  missing

davidanthoff · January 10, 2018, 12:07am

Yes, that is what Jeff said. But not for julia 1.0, I assume?

I don’t think that is how map works. My understanding is that map only ever uses inference if it gets an empty input sequence, but in all other cases allocates a container based on the first element in the sequence and then widens as necessary if a new element arrives that requires that.

Here is my understanding of the “use type-inference to allocate a container” story: again, we need to differentiate different cases. If type inference returns a concrete type, I see no reason why one couldn’t use it to allocate a vector of the right type. This is really the best case, my understanding is that when type inference returns a concrete type, one can rely on it being right. At the same time, this is also the case where relying on type inference doesn’t buy you much, though: by definition you don’t have a sequence of values with different types, but instead a sequence of values that all have exactly the same type, so you might as well just look at the type of the first element in the sequence, use that to decide on the container type, and be guaranteed to never have to widen the container type. Things get a lot more interesting when type inference returns a non-concrete type. My understanding is that in those cases it could happen that type-inference returns a “too wide” type, or it might return a different type in different revisions of julia etc. So in those cases it is unclear a priori whether it is a better strategy to pick the container type based on the inference result, or start with a container type for the first element in the sequence and widen as necessary. The latter strategy though is more stable, and I think that is probably the reason that in base it is used. Essentially, if inference returns a non-concrete type, and you allocate your storage based on that, you can get these changes in the observable results.

So how does one of the simplest use cases of named tuples with missing value fields fit into this logic? Take something like map(i->(a=i.a,b=i.b),source). And let’s assume the type of source is Vector{NamedTuple{a::Int?,b::Float64?}} (I’m again using made-up named tuple type syntax). If the missing value story is Union{T,Missing}, then inference will never return a concrete type for this map example (because that would of course just be incorrect). So with Missing you are always in this in-between world where inference returns a non-concrete type, and my understanding is that in base folks don’t want to rely on that. The situation is quite different with DataValue{T}. In that case inference in fact most often does return a concrete type, so we are in the simple world where we can rely on it without any issues. Note also that because of this, it should be fairly simple for me to drop the reliance on inference entirely and move to the strategy that map uses: because relying on inference doesn’t actually help much if inference returns a concrete type, one might as well not rely on it. Implementing that is currently my plan for Query.jl.

But in many ways this is a bit of a theoretical discussion, because as of right now, inference very reliably doesn’t provide any information that could be used to pick a container type for these named tuples with Union{T,Missing} fields. That was the original impetus for opening Union{T,Null} inference in structs · Issue #6 · JuliaData/Missings.jl · GitHub. Again, I have no idea how easy/difficult it would be to change inference to reliably return information like NamedTuple{a::T1,b::T2,c::T3} where {T1<:Union{Int,Missing},T2<:Union{Float64,Missing},T3<:Union{String,Missing}}. I think that is the correct pattern that inference would have to return. But again, this won’t happen for julia 1.0, right? As I wrote above, if we could lean reliably on inference in that way, the problem would go away, but I’ve interpreted the core devs as really not being keen on that kind of design.

Take the simple example of a map from above. With DataValue for missing values you get elements in your sequence that all have the same type, so you can just look at the first element in the sequence, create a table that has columns based on the types of the fields of the named tuple (and if a field is a DataValue, you make the column support NULL) and you are done. The vast, vast, vast majority of Query.jl streams end up having a concrete eltype for the iterators that they return if one uses DataValue. With Missing even the simplest queries wouldn’t have a concrete eltype, and so the whole strategy of looking at the type of the first element in the sequence doesn’t work.

Of course, if we could rely on inference for this, the problem would go away.

Trust me, I’ve spent more energy and time trying to find a solution for this problem than pretty much anything else over the last nine months or so. Don’t take the lack of a solution as an indicator that I haven’t tried.

So I’m a bit lost at this point. When I stumbled over this issue, I had initially thought that something along the lines of @nalimilan’s suggestion would be the solution, essentially improving inference to handle these cases and adding more performance optimizations for composite types of small unions. But at the same time I have interpreted the feedback from the core devs as very clearly discouraging this kind of reliance on inference, and at this point I kind of see why. The one thing that does seem clear is that with the current version of inference and optimizations for small unions, all of these ideas won’t work in any case. I don’t understand the julia 1.0 schedule well, but fixing these looks to me like a fairly big ticket item, so I assume that at best we’ll see improvements to these things in julia 1.1. So in terms of concrete things we can do now, we can just wait, right? And just live with the situation where Queryverse uses DataValue and DataFrames uses Missing. If someone has another idea, I would really welcome it, I think this kind of split is really not good.

piever · January 10, 2018, 1:25am

Let’s see if I understand. If we have an iterator where all elements have the same type and we apply a type-stable function (from named tuples to named tuples, like f = i -> (s = i.a+i.b, d = i.a-i.b)), to figure out the output we can simply look at the first element (or use inference, as it returns a concrete type). This is what happens without missing data (or when using DataValues). With Union{T, Missing} however the input iterator is never concretely typed and the main problem seems to be that there is ambiguity as to which fields of the output will be missing.

What I’m wondering is: to figure out which fields of the output can have missing values, could we use one of the following heuristics?

Let’s say the type of the elements of the input is NamedTuple{a::T1,b::T2,c::T3} where {T1<:Union{Int,Missing},T2<:Union{Float64},T3<:Union{String,Missing}}

Like map: take the first element, for example (a = 1, b = 2.3, c = missing), replace all the nullable fields with missing to get (a = missing, b = 2.3, c = missing), apply the function f and look at the type of the output. That should tell us which columns of the output should allow missing data (the ones that became missing when all the nullable input columns were missing).
With Inference. Infer the return type of the two extreme cases (either all nullable fields are missing, or none of them is): NamedTuple{a::Missing,b::Float64,c::Missing} and NamedTuple{a::Int,b::Float64,c::String}. The result of the inference should be the union of two types, the fields of the second type (without missings) should tell us the types of the different fields, whereas the fields of the first type (with missings) should tell us which fields are nullable.
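
A minimal sketch of the first heuristic, to make it concrete. The helper name `missable_outputs` is made up, and the caller is assumed to know which input fields are nullable.

```julia
# Hypothetical helper: replace the nullable input fields with `missing`,
# apply `f`, and report which output fields came back as `missing`.
function missable_outputs(f, row::NamedTuple, nullable::Tuple{Vararg{Symbol}})
    blanks = NamedTuple{nullable}(map(_ -> missing, nullable))
    probe = merge(row, blanks)   # same fields as `row`, nullable ones set to missing
    out = f(probe)
    return Tuple(k for (k, v) in pairs(out) if ismissing(v))
end

row = (a = 1, b = 2.3, c = missing)
f = i -> (s = i.a + i.b, d = i.b * 2)

missable_outputs(f, row, (:a, :c))  # (:s,): only s depends on a nullable field
```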

Tamas_Papp · January 10, 2018, 6:13am

The problem is that this does not handle functions that return missing for non-missing arguments, eg

"In this dataset, -1 indicates missing values."
recode_missing(x) = x == -1 ? missing : x
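
A short illustration of why probing a single element is not enough for such a function (the column `col` is made-up example data):

```julia
recode_missing(x) = x == -1 ? missing : x

col = [3, 7, -1]

# Looking only at the first element suggests the result never contains missing...
typeof(recode_missing(first(col)))  # Int64

# ...but the full result does, although the input column has no missing values.
out = map(recode_missing, col)
eltype(out)                         # Union{Missing, Int64}
```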

nalimilan · January 10, 2018, 10:40am

I really don’t know.

davidanthoff:

I don’t think that is how map works. My understanding is that map only ever uses inference if it gets an empty input sequence, but in all other cases allocates a container based on the first element in the sequence and then widens as necessary if a new element arrives that requires that.

Here is my understanding of the “use type-inference to allocate a container” story: again, we need to differentiate different cases. If type inference returns a concrete type, I see no reason why one couldn’t use it to allocate a vector of the right type. This is really the best case, my understanding is that when type inference returns a concrete type, one can rely on it being right. At the same time, this is also the case where relying on type inference doesn’t buy you much, though: by definition you don’t have a sequence of values with different types, but instead a sequence of values that all have exactly the same type, so you might as well just look at the type of the first element in the sequence, use that to decide on the container type, and be guaranteed to never have to widen the container type. Things get a lot more interesting when type inference returns a non-concrete type. My understanding is that in those cases it could happen that type-inference returns a “too wide” type, or it might return a different type in different revisions of julia etc. So in those cases it is unclear a priori whether it is a better strategy to pick the container type based on the inference result, or start with a container type for the first element in the sequence and widen as necessary. The latter strategy though is more stable, and I think that is probably the reason that in base it is used. Essentially, if inference returns a non-concrete type, and you allocate your storage based on that, you can get these changes in the observable results.

Yeah, right, currently map only uses inference for empty iterators, and the inferred type is indeed exposed to the user. But that system could easily be reused to choose the array type using inference, at least when inference returns a type which sounds useful.

As you note, it’s not easy to decide when the inferred type should be considered as more useful than just starting with the type of the first element and widening it as needed. But we should be able to identify some cases: for example, a small Union of concrete types (like Union{Int, Missing}) should definitely be used instead of the type of the first element, in particular since we have an efficient memory representation for them, which allows keeping the values array and discarding the type tag array if it turns out only one of the two types is actually present in the data.

davidanthoff:

So how does one of the simplest use cases of named tuples with missing value fields fit into this logic? Take something like map(i->(a=i.a,b=i.b),source). And let’s assume the type of source is Vector{NamedTuple{a::Int?,b::Float64?}} (I’m again using made-up named tuple type syntax). If the missing value story is Union{T,Missing}, then inference will never return a concrete type for this map example (because that would of course just be incorrect). So with Missing you are always in this in-between world where inference returns a non-concrete type, and my understanding is that folks in base don’t want to rely on that. The situation is quite different with DataValue{T}. In that case inference in fact most often does return a concrete type, so we are in the simple world where we can rely on it without any issues. Note also that because of this, it should be fairly simple for me to drop the reliance on inference entirely and move to the strategy that map uses: because relying on inference doesn’t actually help much if inference returns a concrete type, one might as well not rely on it. Implementing that is currently my plan for Query.jl.
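The map example can be written out with real syntax as a small sketch. The names `Row` and `project` are made up for the illustration. Each output row carries the concrete types of the values it holds, which is why no single concrete type can be inferred for the projection. The element type that `map` finally settles on is shown without an expected value, since it depends on the Julia version.

```julia
# Real syntax for the made-up `NamedTuple{a::Int?,b::Float64?}` notation.
Row = NamedTuple{(:a, :b), Tuple{Union{Int, Missing}, Union{Float64, Missing}}}

source = Row[Row((1, 2.0)), Row((missing, 3.0))]

# The same projection as in the text.
project(i) = (a = i.a, b = i.b)

# The output type depends on the data in each row.
typeof(project(source[1]))  # NamedTuple{(:a, :b), Tuple{Int, Float64}}
typeof(project(source[2]))  # NamedTuple{(:a, :b), Tuple{Missing, Float64}}

# Element type chosen by `map` (version dependent, so no expected value).
eltype(map(project, source))
```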

Inference wouldn’t return a concrete type, but we could use smarter heuristics to choose the initial array type. For example, with my PR 25423, the final element type is chosen by calling promote_strict_type on the concrete types of elements. We could apply the same approach to the type computed by inference when it returns a Union of concrete types, which makes sense since the goal is to guess in advance what the final element type will be.

davidanthoff:

But in many ways this is a bit of a theoretical discussion, because as of right now, inference very reliably doesn’t provide any information that could be used to pick a container type for these named tuples with Union{T,Missing} fields. That was the original impetus for opening Union{T,Null} inference in structs · Issue #6 · JuliaData/Missings.jl · GitHub. Again, I have no idea how easy/difficult it would be to change inference to reliably return information like NamedTuple{a::T1,b::T2,c::T3} where {T1<:Union{Int,Missing},T2<:Union{Float64,Missing},T3<:Union{String,Missing}}. I think that is the correct pattern that inference would have to return. But again, this won’t happen for julia 1.0, right? As I wrote above, if we could lean reliably on inference in that way, the problem would go away, but I’ve interpreted the core devs as really not being keen on that kind of design.

It’s not clear to me what inference could do. I guess it could happen for 1.0 if we agreed on a plan, but the hard part is to find out how it could/should work.

OK, I see what you mean: you choose the element type after iterating over the first value. What I had in mind was knowing the element type without even accessing the data. That’s a stricter requirement, maybe it’s not needed in practice.

davidanthoff:

Trust me, I’ve spent more energy and time trying to find a solution for this problem than pretty much anything else over the last nine months or so. Don’t take the lack of a solution as an indicator that I haven’t tried.

So I’m a bit lost at this point. When I stumbled over this issue, I had initially thought that something along the lines of @nalimilan’s suggestion would be the solution, essentially improving inference to handle these cases and adding more performance optimizations for composite types of small unions. But at the same time I have interpreted the feedback from the core devs as very clearly discouraging this kind of reliance on inference, and at this point I kind of see why. The one thing that does seem clear is that with the current version of inference and optimizations for small unions, none of these ideas will work in any case. I don’t understand the julia 1.0 schedule well, but fixing these looks to me like a fairly big-ticket item, so I assume that at best we’ll see improvements to these things in julia 1.1. So in terms of concrete things we can do now, we can just wait, right? And just live with the situation where Queryverse uses DataValue and DataFrames uses Missing. If someone has another idea, I would really welcome it; I think this kind of split is really not good.

I don’t think we just need to wait. If we do that, the API will be fixed in a state which doesn’t suit our needs until 2.0. We need the promote_strict_type PR as a first step to make the user-visible part suit our needs, disregarding performance, since that won’t be allowed to change after 1.0. Then, having settled the public API will help us get a clearer vision of what we need to make fast, which I suspect can help in finding a solution using inference. BTW, choosing an efficient element type should already improve performance a lot compared with Any, even if the current implementation of map will require copying the already collected data every time a new element type is encountered.

nalimilan · January 10, 2018, 10:53am

The difficulty is that in theory there’s no reason why a nullable field in the input should translate to a nullable field in the output. For example, you could use ismissing(x.a). In practice, though, it wouldn’t be unreasonable to assume this, and if that assumption turns out to be wrong we can always convert the resulting array to a non-nullable one.
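A minimal sketch of the ismissing case, with made-up names (`OneCol`, `was_missing`): the input field allows missing, yet the output field is always a plain Bool.

```julia
OneCol = NamedTuple{(:a,), Tuple{Union{Int, Missing}}}
rows = OneCol[OneCol((1,)), OneCol((missing,))]

# The input field allows `missing`, but the output field never does.
flags = map(x -> (was_missing = ismissing(x.a),), rows)

eltype(flags)   # NamedTuple{(:was_missing,), Tuple{Bool}}
```

Assuming a nullable output field here would pick a wider element type than needed, which is the case where converting the result afterwards would apply.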

So this approach looks like it could work. It’s not very satisfying since it’s very ad-hoc, but if it works it could be good enough at least for 1.0.

Yes, this is basically what I suggest in my last reply to David, i.e. call promote_strict_type on each type of the Union returned by inference. What we need for this to work is that inference does not decide that a type is too complex when the Union gets too large: return_type has to return the full Union so that map/collect can use it to compute an accurate simplified type using promote_strict_type. I have no idea what problems that poses.
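To illustrate what "simplifying the full Union" could look like for rows, here is a hedged sketch. `collapse_rows` is a hypothetical helper written for this example; it is not part of Base and it is not the promote_strict_type from the PR. `Base.uniontypes` is an internal function, and `U` is a hand-written stand-in for what inference might return.

```julia
# Hypothetical helper: collapse a Union of NamedTuple types that share the
# same field names into one NamedTuple type with a Union per field.
function collapse_rows(U)
    members = Base.uniontypes(U)   # internal helper: members of the Union
    names = fieldnames(first(members))
    fields = [Union{[fieldtype(M, n) for M in members]...} for n in names]
    return NamedTuple{names, Tuple{fields...}}
end

# Stand-in for an inferred result on a two-column projection.
U = Union{
    NamedTuple{(:a, :b), Tuple{Int, Float64}},
    NamedTuple{(:a, :b), Tuple{Missing, Float64}},
    NamedTuple{(:a, :b), Tuple{Int, Missing}},
    NamedTuple{(:a, :b), Tuple{Missing, Missing}}
}

collapse_rows(U)
# NamedTuple{(:a, :b), Tuple{Union{Missing, Int}, Union{Missing, Float64}}}
```

The input Union has one member per combination of missing and non-missing fields, so it grows exponentially with the number of columns, while the collapsed type grows only linearly.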

piever · January 10, 2018, 11:13am

My concern was that the full Union with an exponential number of elements could be too much for the compiler to handle without losing the type information completely, hence my heuristics, but I see @Tamas_Papp’s and @nalimilan’s point that it is not a general solution. On the other hand, maybe it’s fine to have a simple solution for simple cases, and in more complicated scenarios the user would have to type the output explicitly if he/she wants better performance.

Tamas_Papp · January 10, 2018, 11:17am

It would be great if Missing would not end up being “special”, so that users could define alternative types for finer granularity of missingness, relying on the same mechanisms of the language and the compiler.
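A small sketch of why this matters, assuming Julia 1.x behaviour: when `map` widens its result, Missing is handled specially, while a user-defined sentinel (the hypothetical `NotApplicable` below) falls back to Any.

```julia
# Hypothetical user-defined sentinel for a second kind of missingness.
struct NotApplicable end

with_missing = map(x -> x > 1 ? missing : x, [1, 2, 3])
with_custom  = map(x -> x > 1 ? NotApplicable() : x, [1, 2, 3])

eltype(with_missing)   # Union{Missing, Int}
eltype(with_custom)    # Any

# An explicitly typed container still works for the custom type.
typed = Union{Int, NotApplicable}[1, NotApplicable()]
```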
