Query - missing vs. isna

Hi all,

I have a question about ismissing and isna. I found Query.jl - filtering on missing data so I started using that.
I thought that isna is really needed whenever I use Query.jl.

But that’s not the case. Example with correct results:

julia> df = DataFrame(x = [100, missing])
2×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 100     │
│ 2   │ missing │

julia> df[!, :x]       |> @filter(!ismissing(_))   |> collect
1-element Array{Union{Missing, Int64},1}:
 100

julia> df[!, [:x]]     |> @filter(!isna(_.x))      |> collect
1-element Array{NamedTuple{(:x,),Tuple{DataValues.DataValue{Int64}}},1}:
 (x = DataValue{Int64}(100),)

If I use isna instead of ismissing and vice versa, it gives incorrect results.

I understand that one is array, and the other is DataFrame. But shouldn’t that behave the same way? If ismissing is hard to implement, then isna should work in both cases imho.

I haven’t found isna documented and also documentation about Query.jl doesn’t mention that.

Can you share the incorrect results? You don’t say what’s incorrect about them, so hard to know the issue is.

I don’t know enough about Query to know when it makes more sense to use isna vs ismissing, but I know @davidanthoff and colleagues have put a lot of effort into making it as clear and consistent as possible. If such an explanation is missing from the docs, I’m certain they would be happy to explain and grateful for a PR to add it in.

Sure, the incorrect values look like this:

julia> df = DataFrame(x = [100, missing])
2×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 100     │
│ 2   │ missing │

julia> df[!, :x]       |> @filter(!isna(_))   |> collect
2-element Array{Union{Missing, Int64},1}:
 100
    missing

julia> df[!, [:x]]     |> @filter(!ismissing(_.x))      |> collect
2-element Array{NamedTuple{(:x,),Tuple{DataValue{Int64}}},1}:
 (x = DataValue{Int64}(100),)
 (x = DataValue{Int64}(),)

I expect to get only 1 item, not 2 (that one with value 100).

Ah, yeah looks like the conversation of missing (which needs ismissing) to DataValue (which needs isna) is only happening when you pass a DataFrame to the query. I agree this seems inconsistent.

In principle, it should not be hard to either

  1. Always convert to DataValue for a query
  2. Overload ismissing to work as expected on DataValue

It’s technically feasible to overload isna to work on missing, but would be considered type piracy.

I’ll leave it to the people more familiar with the package to explain if this is a considered choice or something that should be changed.

1 Like

Now I don’t know if to add to this post or make a new question…
The reason why I’m battling with isna and ismissing is something like this:

julia> df = DataFrame(x = [1, missing, missing], y = [missing, 10, missing])
3×2 DataFrame
│ Row │ x       │ y       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ missing │
│ 2   │ missing │ 10      │
│ 3   │ missing │ missing │

julia> df[!, :anyvalQueryMap] =
           df[!, [:x, :y]] |>
           @map(ifelse(!isna(_.x), _.x,
                ifelse(!isna(_.y), _.y,
                0))) |>
           collect;

julia> df[!, :anyvalCoreMap] =
           map(row -> ifelse(!ismissing(row.x), row.x,
                      ifelse(!ismissing(row.y), row.y,
                      0)),
               eachrow(df[!, [:x, :y]]));

julia> df
3×4 DataFrame
│ Row │ x       │ y       │ anyvalQueryMap │ anyvalCoreMap │
│     │ Int64?  │ Int64?  │ Union…         │ Int64         │
├─────┼─────────┼─────────┼────────────────┼───────────────┤
│ 1   │ 1       │ missing │ 1              │ 1             │
│ 2   │ missing │ 10      │ 10             │ 10            │
│ 3   │ missing │ missing │ 0              │ 0             │

This works well for map, but if I would like to convert it to Query’s @map, I dont get the expected results.
I added 2 columns just to show difference between map and @map. I need to have the column of type Int64, but

julia> df[!, :anyvalQueryMap]
3-element Array{Union{Int64, DataValue{Int64}},1}:
  DataValue{Int64}(1)
  DataValue{Int64}(10)
 0

It could go either way. You can always change the title to be more inclusive of this question for the sake of future searches.

As to your actual question, if I’m being honest, this kind of confusion with DataValues is why I don’t use Query. Using the stuff that comes out of the box with DataFrames is sufficient for my needs, especially now with v0.21.

The way I’d do what you’re showing is

df[!, :anyvalCoreMap] = map(eachrow(df[!, [:x, :y]])) do row
     !ismissing(row.x) ? row.x :
     !ismissing(row.y) ? row.y : 0
end

or maybe, if I have a bunch of columns

df[!, :anyvalCoreMap] = map(eachrow(df[!, [:x, :y]])) do row
     col = findfirst(!ismissing, row)
     isnothing(col) ? 0 : row[col]
end

(Edit: there’s probably even a nicer way to do this with the new DataFrames transform or something, but I haven’t had a chance to play much with it yet)

1 Like

Your solution looks pretty good. I need to read more on syntax and using do etc.

Why I try to use Query is that I come from .NET world and LINQ is very nice concept. It wasn’t very performant, but it’s pretty powerfull. And the performance is getting better and better.

Also I like piping a lot, so doing df |> filter .. |> map ... |> orderby .. is still most appealing to me :slight_smile:

Basically, anywhere you’d put an anonymous function as the first argument, you can use do. So

somefuction(x-> #do stuff#, 
    container)

becomes

somefunction(container) do x
    #do stuff#
end

Takes a little getting used to, but it’s awesome.

That’s a solid reason. There’s also DataFramesMeta.jl that has @linq and uses missing natively, but it’s a bit out of step with the most recent version of DataFrames at the moment (though being updated).

That said, I’m guessing it will only take a little bit of effort to wrap your head around DataValue and its quirks to continue using Query.jl. I just never expended that effort because I’m not super attached to that syntax.

1 Like