Missing data and NamedTuple compatibility

I see, then it is an unfortunate choice and your NonMissing makes much more sense.

My impression is that there should be a way to reach a compromise that works for everybody:

  • on the side of Missings, add the NonMissing wrapper that lifts whenever Missing does
  • on the side of Query (and potentially JuliaDB, as I think similar problems will arise when switching to Missings), use Union{NonMissing{T}, Missing}.

I agree that Union{T, Missing} has the advantage that if there are no missings it just works for any function.

It is also true that the use of NonMissing doesn’t have to be ubiquitous, and could be limited to those cases when one wants to concatenate many iterators that do not necessarily have a well defined eltype (meaning, DataFrames stays as is and Query wraps in a NonMissing wrapper and unwraps before collecting). I’m also wondering whether, for those functions where this lifting method does not exist, Query could take care of the lifting automatically.
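To make the proposal concrete, here is a minimal sketch of what such a wrapper could look like. Everything in it is an assumption for illustration: the type NonMissing and the helpers wrap and unwrap are hypothetical names, not part of Missings.jl, and only + is lifted.

```julia
# Hypothetical wrapper type; the name follows the thread, but no such
# type is assumed to exist in Missings.jl.
struct NonMissing{T}
    value::T
end

# Wrap on the way into a row iterator, unwrap before collecting.
wrap(x) = NonMissing(x)
wrap(::Missing) = missing
unwrap(x::NonMissing) = x.value
unwrap(::Missing) = missing

# Lift + the same way Missing does: any missing operand gives missing.
Base.:+(a::NonMissing, b::NonMissing) = NonMissing(a.value + b.value)
Base.:+(::NonMissing, ::Missing) = missing
Base.:+(::Missing, ::NonMissing) = missing

row = map(wrap, (a = 1, b = missing))   # NamedTuple with wrapped fields
total = row.a + NonMissing(2)           # result stays wrapped
lifted = row.a + row.b                  # missing propagates
```

A real implementation would need the same lifting for every function that Missing lifts, which is the part Query might be able to automate.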

Ok… some progress here. I’ve got a proof-of-concept package up called KeyedTables. To get it to work you need the master version of Keys.jl and ZippedArrays.jl. It makes use of collect_zipped from ZippedArrays to implement essentially the “struct of arrays” strategy. It can efficiently generate a tuple of arrays from an iterator which yields tuples, even if that iterator contains Union{Missing, T}s. It only works on 0.7.

julia> using KeyedTables

julia> using ChainRecursive: @chain

julia> using LazyCall: @~

julia> using Test: @inferred

julia> test() = @chain begin
           @keyed_tuple(a = [1, 2, missing], b = ["a", missing, "c"])
           rows(it)
           Iterators.filter(@~ ismissing((~it).a) || (~it).a > 1)
           columns(it)
       end

julia> @inferred test()
(a = Union{Missing, Int64}[2, missing], b = Union{Missing, String}[missing, "c"])

Why are the final inferred entries different in the second case below?


julia-0.7> test() = @chain begin
               @keyed_tuple(a = [1, 2, missing], b = [true, missing, false] , c = [3//4, 2//4, missing])
               rows(it)
               Iterators.filter(@~ ismissing((~it).a) || (~it).a > 1)
               columns(it)
           end
test (generic function with 1 method)

julia-0.7> @inferred test()
(a = Union{Missing, Int64}[2, missing], b = Union{Missing, Bool}[missing, false], c = Union{Missing, Rational{Int64}}[1//2, missing])

julia-0.7> test() = @chain begin
               @keyed_tuple(a = [1, 2, missing], b = [missing, missing, missing] , c = [3//4, 2//4, missing])
               rows(it)
               Iterators.filter(@~ ismissing((~it).a) || (~it).a > 1)
               columns(it)
           end
test (generic function with 1 method)

julia-0.7> @inferred test()
(a = Union{Missing, Int64}[2, missing], b = Missing[missing, missing], c = Any[1//2, missing])


Hmm dunno I’ll look into that.

It’s a problem in map over tuples in Base. You can trace it down to these specific lines:

Core.SSAValue(71) = (Base.Iterators.tuple)(Core.SSAValue(78), $(QuoteNode(missing)), Core.SSAValue(102))::Tuple{Union{Missing, Int64},Missing,Union{Missing, Rational{Int64}}}
Core.SSAValue(123) = (Base.tail)(Core.SSAValue(71))::Tuple{Missing,Any}

That’s actually a very frustrating problem… having a type stable map over tuples is critical for this whole thing to work…

Ref https://github.com/JuliaLang/julia/issues/26610

I think a hybrid approach in the spirit of @piever’s suggestion is very much worth exploring.

I do think that for the Query.jl, JuliaDB.jl and other iteration case we actually need a typed missing value for the inference free implementations. If you have a source with a column that has element type Union{T,Missing} and you do a projection that selects that column into the result, you really want the resulting column to be of type Union{T,Missing}, including when the source column has a) no missing values or b) only missing values. The Union{NotMissing{T},Missing} case would allow us to get a), but not b). So if one wanted to achieve that with a Union, it would have to be something like Union{NotMissing{T}, Missing{T}}. That seems pretty unwieldy and something like DataValue{T} just seems simpler in that case.
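Cases a) and b) can be seen with Base alone. The sketch below uses map(identity, col) as a stand-in for a projection (the name project is made up for illustration); map picks the result element type from the values it actually encounters, so the declared Union{Int, Missing} is not preserved in either case.

```julia
# A projection that selects a column unchanged; `project` is a hypothetical name.
project(col) = map(identity, col)

no_missing  = Union{Int, Missing}[1, 2, 3]            # case a)
all_missing = Union{Int, Missing}[missing, missing]   # case b)

eltype_a = eltype(project(no_missing))    # only Int values were seen
eltype_b = eltype(project(all_missing))   # only Missing values were seen
```

In case a) the Missing part of the source element type is dropped, and in case b) the Int part is dropped.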

I think at the end of the day there are simply situations where it is helpful to have a typed missing value, and other situations where it is a pain and easier to also have an untyped missing value. Now, DataValues.jl actually is designed to support both of these cases: DataValue{T}() constructs a typed missing value, and DataValue{Union{}}() constructs an untyped missing value (that is also accessible as NA currently). I haven’t really talked much about that aspect of DataValues.jl, but you can pretty much use it in a Union spirit, where the union looks like Union{T,DataValue{Union{}}}, if you want to.

I just found a comment by @jameson over on github that is probably relevant for the discussion here:

Someone else is welcome to explore this, and optimize some of these cases during v1.x. But as the guy who said “we can have fast small unions” (and did so, mostly), I’m also acutely aware that it works best in cases where the union is of completely unrelated items (like Tuple and Nothing), and that the work does not generalize to container constructors (or any other large union, such as is represented by Tuple{Union…}). That would be a much more difficult problem. Afaik, supporting that would require a static type system. (Optimal performance currently depends heavily on creating all 2^n specializations – although as Jeff notes in Missing data and NamedTuple compatibility - #34 by jeff.bezanson we won’t actually need them – so we just instead make sure that inference won’t do that work or compute a return type and instead defer that analysis for runtime).

I’m not entirely clear how I should interpret his comment, but it does sound to me as if it might be quite challenging to get this design fast, and as if it won’t happen in 1.0, and that someone else would have to do it if it was to happen in 1.x.

I think this statement is self-contradictory. Due to the “counterfactual return type problem”, you cannot know the type non-missing values would have had when you only have missing values without relying on inference. DataValue doesn’t allow working around that.
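A small sketch of the counterfactual return type problem, using only Base. The function f is an arbitrary example, and Base.promote_op is an internal helper that queries inference; it is used here only to show where the information would have to come from, not as a recommended API.

```julia
f(x) = x / 2   # arbitrary example: an Int input would give a Float64

all_missing = Union{Int, Missing}[missing, missing]
result = map(f, all_missing)

# At runtime only missing was ever produced, so nothing reveals Float64.
observed = eltype(result)

# The type f would have returned for an Int is only available by asking
# the compiler, i.e. by relying on inference.
counterfactual = Base.promote_op(f, Int)
```

If f were not type-stable, the second query would return an abstract type or Any, which is the limitation being pointed out.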

But I wonder whether it’s a real problem in practice. Have you already encountered a situation where it would have been problematic to return a column which can only contain missing values? It’s quite rare to assign non-missing values to a column which contains only missing values.

As I see it, Union{NotMissing{T},Missing} would be a good replacement for DataValue which wouldn’t rely on inference at all and would be closer to the missing values machinery defined in Base.

I mostly agree with @davidanthoff’s points (to understand what he meant I had to go through the experience of actually implementing these inference-free algorithms on iterators, and then it became increasingly clear that it’s much more challenging to do so with the Union{T, Missing} approach) and I believe that now we need to see what’s the best “hybrid” approach.

In principle one would want a typed missing to also solve the problem when all data is missing but:

  • as noted above, this still relies on inference, even though in a different way. For example, if f is not type-stable, what is f(DataValue{T}()) ?
  • I think this is less of a concern (at least for JuliaDB) in practice. Also performance-wise, copying a column of only missings to also allow non-missings of a certain type can probably be made reasonably cheap
  • I’m afraid this is not going to fly when there is only untyped Missing in Base: there would need to be a lot of discussion to convince everybody of the relevance of this approach or at least I don’t see a simple way of implementing it alongside Missing. Unless we redefine Missing to be an alias for DataValue{Union{}} and then also allow using DataValue… If there is a way to do this that is not too confusing for users, we could maybe think about it.
  • Finally, one extra thing that is relevant to keep in mind: even though Union{NotMissing{T}, Missing{T}} looks a bit unwieldy, it has the advantage over DataValue that it can be stored efficiently in Arrays due to the Julia optimizations without a custom container type (in this case DataValueArray). It’d be nice to keep this feature.
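The storage point in the last bullet can be checked roughly as follows. This is a sketch under assumptions: NotMissing is a hypothetical wrapper, the union is written with the untyped Missing from Base (there is no Missing{T}), and Base.isbitsunion is an internal, unexported predicate.

```julia
# Hypothetical wrapper from the discussion, not an existing type.
struct NotMissing{T}
    value::T
end

# Small unions of bits types are stored inline in a plain Array.
plain_inline   = Base.isbitsunion(Union{Int, Missing})
wrapped_inline = Base.isbitsunion(Union{NotMissing{Int}, Missing})

# A union involving a non-bits type such as String is not stored inline.
string_inline  = Base.isbitsunion(Union{String, Missing})

# No custom container is needed: this is an ordinary Vector.
col = Union{NotMissing{Int}, Missing}[NotMissing(1), missing]
```

Wrapping an Int in an immutable struct keeps it a bits type, so the wrapped union would get the same inline layout as Union{Int, Missing}.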

As a final consideration I wanted to emphasize that there is work in progress both in the Query-verse and in JuliaDB to make the implementations work for any iterators without relying on inference and it’d be sad to lose this due to the choice of implementation of missing data: I really hope we can find a solution flexible enough to accommodate everybody’s needs and at the moment this solution (admittedly a bit of a compromise) seems to be allowing an optional NotMissing wrapper. Once again, it would probably be best to store things normally and then only wrap them in this NotMissing thing when Query or JuliaDB need to iterate through the data.

I don’t want to push this thread off topic, but where can I find more about this optimization feature? In particular, why doesn’t it require a custom container type?

It’s not very explicit, but it’s been implemented by this PR. Also see issues referred to in the description.

I know this comment is not especially constructive, but to give the perspective of an outsider…

This whole thread and the uncertainty over whether missing/named tuple/union programming patterns will be typesafe for the v1.0.X (or soon after) is absolutely terrifying. I am not worried that things may be slow for 3-6 months after release as the compiler gets optimized… I am worried that efficiency will require a fundamental redesign and make the v1.0 untenable as a sufficiently backwards compatible release. Either the union with missing will become pervasive (and then could require a fundamental redesign) or there might be insufficient trust among data-oriented package developers, so that they use something different or decide Julia isn’t data ready.

At one point the JuliaComputing crew asked if there was anything that they should focus on while Keno finishes the compiler optimizations. My answer is: this.

What do you mean with “typesafe”? Do you mean “perform well enough”?

I consider @jlperla’s comments to be a caring effort to preclude a potentially pernicious pseudomeme. As many professional software developers have learned, there is no skating through uncertainty. This is one situation wherein percepts drive reality.

Whatever the resolution, it need be compellingly simple to apply and cannot exhibit roughness that derides our core contributors’ efforts.

Were I appropriately versed … am not though. Over to you who are.

Best, Jeffrey

Yeah, something like “perform well enough” in the short-term and “no overhead at all” in the 12 month timeframe would have been a better way to phrase it.

But take the fact that I don’t know how to phrase it, don’t really understand the problem, and am unlikely to even be able to understand the details if I studied it, to be a symptom of the bigger problem: uncertainty. That is the last thing you want as you move towards a release intended to be highlighted for its backwards compatibility and stability.

If this requires a redesign of missing, type-invariance of named tuples, or all of the other things I have heard discussed (and barely understand) then I don’t think it can wait until after v1.0.

What should have no overhead and overhead compared to what?

I think it is important to be aware that comments like this have a chance of seeding uncertainty (typically called FUD) on their own. So I think it is important to reflect on that before making such a post where one perhaps does not “understand the problem”.

Compared to whatever hand-coded alternative people have in mind. Those details are missing the point.

I am not commenting on the specifics of the problem, but rather on (1) the number of months this discussion has gone on for; (2) the mass of details in the back and forth on this topic; and (3) the time-frames discussed for a solution. I know that lots of smart people are working on this, but I don’t get the sense this has ever been thought by Julia Computing to be a showstopper for v1.0.

If I have created FUD where none should exist, then I sincerely apologize - but I have a funny feeling I am not the only person outside of the discussion who is scared. If this is a minor and self-contained issue on a peripheral part of the Julia ecosystem with no chance of fragmenting the ecosystems, then it hasn’t come across that way in the discussions.

I am entirely an observer in this general theme of missing data in Julia, but I would tend to agree with @jlperla: the confusion about how to handle missing data that has persisted now for a few years has made me wonder whether the “data munging” ecosystem in Julia will ever be usable, at least for someone who is accustomed to doing this work in R. My biggest concern is that the two authors of packages that are explicitly designed to create a sane data-munging workflow (@piever and @davidanthoff) both seem to think that the path that the core Julia team is taking with missing values will make it really hard for dplyr-like (or even data.table-like) pipelines to be performant. I of course don’t know whether “performant” by the standards of these package authors is unrealistic, and I haven’t recently tried to do straightforward import/clean type work using Query.jl or the newly released JuliaDBMeta.jl, but it was definitely the case about a year ago that this was (a) harder in julia and (b) definitely slower in julia, at least if you had to deal with missing data.

I am a huge fan of julia, but if I am going to convince my co-authors to eventually switch from STATA or R, they are definitely going to want to see something like dplyr or data.table, and they are going to demand that it be roughly as fast as R. The reason for this, at least for an empirical economist like myself, is that most of the time I spend in front of the computer is getting data ready for a model. Yes, faster models are nice, and julia is unquestionably faster and nicer to program with than R or MATLAB or (shudder) STATA, but most of my time is just spent getting data ready. If that is slow, or a PITA because of an inconsistent missing values story, it’s going to be super hard to bring my colleagues in from the dark side…

I have been following the Missing saga, as well. I don’t understand it well enough to have an opinion, and I certainly don’t want to spread FUD. But, I have to say that this discussion (not just this thread) is the most unsettling thing I’ve experienced in my few years of using Julia. It may or may not be a completely irrational unsettledness. But, rational or not, I guess that @jlperla and I are not the only ones who feel uneasy.

btw @tcovert was composing a post at the same time that I was, so I did not read it yet.

I would actually prefer to keep this discussion focused on a concrete solution strategy, and I would encourage moving comments on the gravity (or lack of gravity) of the situation to a different thread, as this discussion is quite long already but has been, at least in my view, quite productive.

Replying to the merit of the comments, I mostly agree with @jlperla that we could take advantage of the extra time we have pre 0.7 to focus on this situation. To summarize my understanding of the situation, there are, broadly speaking, three types of data manipulations:

  1. Column based (like DataFrames)
  2. Row based but with complete type information (DataStreams, some cases of JuliaDB)
  3. Row based but with incomplete type information (inference free design of Query, some cases of JuliaDB)

Case 3 is particularly complex because the sink has to be created based on the first element that is returned and expanded as new element types are encountered. Cases 1 and 2 work extremely well with Union{T, Missing} in Julia 0.7, and that is the officially recommended missing data implementation.
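To illustrate what “created based on the first element and expanded as new element types are encountered” means, here is a minimal single-column sketch. The names push_widen and collect_widening are made up for this example; the real implementations in Query and JuliaDB work on rows of named tuples and are considerably more involved.

```julia
# Append x to col, widening the element type when x does not fit.
function push_widen(col::Vector{T}, x) where {T}
    if x isa T
        push!(col, x)
        return col
    end
    S = Union{T, typeof(x)}
    wider = convert(Vector{S}, col)   # copies into a new, wider vector
    push!(wider, x)
    return wider
end

# Collect any iterator without consulting its eltype or inference.
function collect_widening(itr)
    next = iterate(itr)
    next === nothing && return Any[]  # nothing observed, nothing known
    x, state = next
    col = typeof(x)[x]                # sink typed after the first element
    next = iterate(itr, state)
    while next !== nothing
        x, state = next
        col = push_widen(col, x)
        next = iterate(itr, state)
    end
    return col
end

mixed = collect_widening(Any[1, 2, missing, 4])
only_missing = collect_widening(Any[missing, missing])
```

The second call shows the all-missing situation discussed earlier in the thread: with no non-missing value ever observed, the sink can only be typed as Missing.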

Case 3 has not been implemented with Union{T, Missing} yet (both JuliaDB and Query use DataValue) and there are strong reasons to believe such an implementation will be very challenging. This discussion tries to find a solution for case 3. Cases 1 and 2 already work very well.

It seems that some consensus is arising that a completely unified missing data representation will not satisfy all needs and we would need some sort of hybrid. Union{T, Missing} is very good for storing data, but for concatenating lazy operations on iterables of rows it is not ideal (for a series of technical reasons). Based on the implementation of collect_columns (see here) I tend to believe that Union{NonMissing{T}, Missing} would be a good compromise to maintain performance and not rely on inference. Mostly this wrapping and unwrapping could be invisible to the user and be done by JuliaDB, but we need to understand what things need to be implemented for this to happen and this is one of the purposes of this discussion. Query is a very complex piece of software and I’m not sure whether Union{NonMissing{T}, Missing} would suffice there (I simply don’t know that package well enough to have an informed opinion).

Note that now we are already in a hybrid situation (DataValues on one side and Union{T, Missing} on the other side) and this is an effort to remedy that.
