Query.jl with filtering by missing values doesn't seem to work?


#1

I’m probably missing something but I can’t figure out why this doesn’t work.

I have a DataFrame and I want to use Query.jl to filter it by various criteria, including keeping/removing rows where certain columns are missing.

Now, let’s take just one row as an example:

julia> clean_df = @from b in df begin
       @where b.Location_Id == "0474285-05-001"
       @select {b.DBA_Name, b.Business_End_Date}
       @collect DataFrame
       end

1×2 DataFrames.DataFrame
│ Row │ DBA_Name        │ Business_End_Date │
├─────┼─────────────────┼───────────────────┤
│ 1   │ Pressed Juicery │ missing           │

Sure enough, Business_End_Date is missing:

julia> clean_df[1, :Business_End_Date] |> ismissing
true

However, attempting to filter by the values where Business_End_Date fails:

julia> @from b in df begin
       @where b.Location_Id == "0474285-05-001" && ismissing(b.Business_End_Date)
       @select {b.DBA_Name, b.Business_End_Date}
       @collect DataFrame
       end

0×2 DataFrames.DataFrame

:face_with_raised_eyebrow:


#2

Hmmm, for some reason, within the query statements block, ismissing is false:

julia> @from b in df begin
       @where b.Location_Id == "0474285-05-001"
       @select {b.DBA_Name, b.Business_End_Date, ismissing(b.Business_End_Date)}
       @collect DataFrame
       end

1×3 DataFrames.DataFrame
│ Row │ DBA_Name        │ Business_End_Date │ _3_   │
├─────┼─────────────────┼───────────────────┼───────┤
│ 1   │ Pressed Juicery │ missing           │ false │

#3

Digging in it turned out that Query.jl does not have support for missing, instead using a “proprietary” DataValues.DataValue. The check should be performed using isnull.


#4

That is right. The current design of Missing is simple, but can’t be used for something like Query.jl. Various folks have been trying to find a solution to this, but no workable plan has emerged so far, so we’ll have to see how this plays out.


#5

As a quick workaround, do you think there would be any value in having the API extended so that:

  1. an ismissing method is defined for DataValues.DataValue (or as an alias for isnull).
  2. optionally define a DataValues.Missing as an alias to DataValues.DataValue.

Like a simple form of duck-typing so that the minimalistic Missings API “would work”.


#6

So I could certainly add a method to ismissing so that it also works with DataValue. I’m a bit on the fence whether it is a good idea: the semantics of Missing and DataValue are different in a number of instances, and I’m just not sure whether it might be better to keep the nomenclatur of the two separate to make it clear to users that they actually have different semantics, or not. I think if we do any of this I’d do it in the release that goes along with julia 0.7.

I think defining DataValues.Missing is probably not a good idea, I think that would just lead to a very confusing situation where the same term has different meanings in different contexts and no one would know anymore what is going on.


#7

Slightly off-topic, but I was wondering whether a first step could be to replace the content of a DataValue with something like:

struct DataValue{T}
    value::Union{T, Missing}
end

which hopefully in 0.7 should be as performant as the previous implementation. I don’t have a strong intuition on these things but I believe this could also have the correct layout in an Array without having to use a DataValueArray.

Then one would simply argue that Query needs a wrapper around Missing for technical reasons (as a temporary solution, before the Missing story becomes compatible with the Query design) and it would make sense to try and make this wrapper as invisible as possible to the user.


#8

It probably would have to be value::Union{Some{T}, Missing}. I tried that a few weeks ago, and it segfaulted julia 0.7 :wink: I think that is fixed now on master, though.

I would be surprised if this would give the correct array layout, though. I think one would end up with an efficient representation of the union in the struct, but that would probably be all.

I think I still need a different sentinal NA value for DataValue, no matter what. Right now that is defined as const NA = DataValue{Union{}}(), and I don’t think that could easily be replaced with the existing missing definition.

I’ve also been thinking briefly about a design where we had

struct Missing{T}
    value::Union{Some{T}, Void}
end

const missing = Missing{Union{}}()

and then one could have Union{T, Missing{Union{}} as the simple representation of missingness. It would be some sort of hybrid between the current Missing design and the DataValue design. I haven’t thought it through and it might well be a silly idea. But it is probably kind of pointless to even consider it at this point, given that the Missing design is essentially baked into stone until juila 2.0 at this point, as far as I can tell.

Well, that is still the question, whether that can happen. I’ll need to follow up on the other thread, but I still think there is just a really deep design mismatch. Query’s design is all inspired by the kind of monadic composability that folks like Erik Meijer and his MS friends have pushed in the early 2000s, and the Union approach just doesn’t seem to have those properties. But, sorry for even bringing this up here, lets keep that discussion in the other thread we have :slight_smile:


#9

Yeah, using ismissing for DataValue would kind of make sense, given that it’s supposed to represent missing values and that isnull doesn’t exist anymore in Base – especially since Query converts missing to an empty DataValue automatically. It doesn’t seem very useful to bug users with this difference in function names, which AFAICT is really the main significant difference in behavior between DaatValue and missing.