Query - missing vs. isna

stej · May 22, 2020, 10:41am

Hi all,

I have a question about ismissing and isna. I found the thread “Query.jl - filtering on missing data”, so I started using isna.
I thought that isna was needed whenever I use Query.jl.

But that’s not the case. Example with correct results:

julia> df = DataFrame(x = [100, missing])
2×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 100     │
│ 2   │ missing │

julia> df[!, :x]       |> @filter(!ismissing(_))   |> collect
1-element Array{Union{Missing, Int64},1}:
 100

julia> df[!, [:x]]     |> @filter(!isna(_.x))      |> collect
1-element Array{NamedTuple{(:x,),Tuple{DataValues.DataValue{Int64}}},1}:
 (x = DataValue{Int64}(100),)

If I use isna instead of ismissing and vice versa, it gives incorrect results.

I understand that one is an array and the other is a DataFrame. But shouldn’t they behave the same way? If ismissing is hard to implement, then isna should work in both cases imho.

I haven’t found any documentation for isna, and the Query.jl documentation doesn’t mention it either.

kevbonham · May 22, 2020, 11:14am

Can you share the incorrect results? You don’t say what’s incorrect about them, so it’s hard to know what the issue is.

I don’t know enough about Query to know when it makes more sense to use isna vs ismissing, but I know @davidanthoff and colleagues have put a lot of effort into making it as clear and consistent as possible. If such an explanation is missing from the docs, I’m certain they would be happy to explain and grateful for a PR to add it in.

stej · May 22, 2020, 11:17am

Sure, the incorrect values look like this:

julia> df = DataFrame(x = [100, missing])
2×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 100     │
│ 2   │ missing │

julia> df[!, :x]       |> @filter(!isna(_))   |> collect
2-element Array{Union{Missing, Int64},1}:
 100
    missing

julia> df[!, [:x]]     |> @filter(!ismissing(_.x))      |> collect
2-element Array{NamedTuple{(:x,),Tuple{DataValue{Int64}}},1}:
 (x = DataValue{Int64}(100),)
 (x = DataValue{Int64}(),)

I expected to get only one item (the one with value 100), not two.

kevbonham · May 22, 2020, 11:30am

Ah, yeah, looks like the conversion of missing (which needs ismissing) to DataValue (which needs isna) only happens when you pass a DataFrame to the query. I agree this seems inconsistent.

In principle, it should not be hard to either:

- Always convert to DataValue for a query
- Overload ismissing to work as expected on DataValue

It’s technically feasible to overload isna to work on missing, but that would be considered type piracy.

I’ll leave it to the people more familiar with the package to explain if this is a considered choice or something that should be changed.
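
To make the difference concrete, here is a minimal sketch of how the two checks behave on a DataValue. It assumes DataValues.jl is installed (the type shows up as DataValues.DataValue in the outputs above); the variable names are just for illustration.

```julia
using DataValues  # assumption: the DataValues.jl package is installed

na_value = DataValue{Int}()   # a DataValue with no value stored ("NA")
present  = DataValue(100)     # a DataValue wrapping 100

# isna is the DataValue-aware check.
@show isna(na_value)
@show isna(present)

# ismissing only recognizes Base's `missing`; a DataValue is a different type,
# so ismissing falls back to `false` even for an NA DataValue.
@show ismissing(na_value)
@show ismissing(missing)
```

That is consistent with the DataFrame query above keeping the `DataValue{Int64}()` row when the filter uses `!ismissing(_.x)`.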

stej · May 22, 2020, 11:33am

Now I don’t know whether to add to this post or open a new question…
The reason why I’m battling with isna and ismissing is something like this:

julia> df = DataFrame(x = [1, missing, missing], y = [missing, 10, missing])
3×2 DataFrame
│ Row │ x       │ y       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ missing │
│ 2   │ missing │ 10      │
│ 3   │ missing │ missing │

julia> df[!, :anyvalQueryMap] =
           df[!, [:x, :y]] |>
           @map(ifelse(!isna(_.x), _.x,
                ifelse(!isna(_.y), _.y,
                0))) |>
           collect;

julia> df[!, :anyvalCoreMap] =
           map(row -> ifelse(!ismissing(row.x), row.x,
                      ifelse(!ismissing(row.y), row.y,
                      0)),
               eachrow(df[!, [:x, :y]]));

julia> df
3×4 DataFrame
│ Row │ x       │ y       │ anyvalQueryMap │ anyvalCoreMap │
│     │ Int64?  │ Int64?  │ Union…         │ Int64         │
├─────┼─────────┼─────────┼────────────────┼───────────────┤
│ 1   │ 1       │ missing │ 1              │ 1             │
│ 2   │ missing │ 10      │ 10             │ 10            │
│ 3   │ missing │ missing │ 0              │ 0             │

This works well for map, but when I convert it to Query’s @map, I don’t get the expected results.
I added two columns just to show the difference between map and @map. I need the column to be of type Int64, but I get:

julia> df[!, :anyvalQueryMap]
3-element Array{Union{Int64, DataValue{Int64}},1}:
  DataValue{Int64}(1)
  DataValue{Int64}(10)
 0

kevbonham · May 22, 2020, 1:02pm

It could go either way. You can always change the title to be more inclusive of this question for the sake of future searches.

As to your actual question, if I’m being honest, this kind of confusion with DataValues is why I don’t use Query. Using the stuff that comes out of the box with DataFrames is sufficient for my needs, especially now with v0.21.

The way I’d do what you’re showing is

df[!, :anyvalCoreMap] = map(eachrow(df[!, [:x, :y]])) do row
     !ismissing(row.x) ? row.x :
     !ismissing(row.y) ? row.y : 0
end

or maybe, if I have a bunch of columns

df[!, :anyvalCoreMap] = map(eachrow(df[!, [:x, :y]])) do row
     col = findfirst(!ismissing, row)
     isnothing(col) ? 0 : row[col]
end

(Edit: there’s probably even a nicer way to do this with the new DataFrames transform or something, but I haven’t had a chance to play much with it yet)
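
One candidate for that nicer way, which needs neither Query nor the new transform: Base’s `coalesce` returns its first non-missing argument, and it can be broadcast over the columns. A minimal sketch on plain vectors holding the same data as the x and y columns above (the name `anyval` is just illustrative):

```julia
# Same data as the x and y columns above, as plain vectors.
x = [1, missing, missing]
y = [missing, 10, missing]

# For each row, take the first non-missing value among x, y, and the fallback 0.
anyval = coalesce.(x, y, 0)

@show anyval
@show eltype(anyval)
```

With a DataFrame the same call would be `coalesce.(df.x, df.y, 0)`. Since every row ends up with an integer, the element type of the result no longer includes Missing.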

stej · May 22, 2020, 1:40pm

Your solution looks pretty good. I need to read more on syntax and using do etc.

The reason I’m trying to use Query is that I come from the .NET world, and LINQ is a very nice concept. It wasn’t very performant, but it’s pretty powerful. And the performance is getting better and better.

Also I like piping a lot, so doing df |> filter .. |> map ... |> orderby .. is still most appealing to me

kevbonham · May 22, 2020, 2:00pm

Basically, anywhere you’d put an anonymous function as the first argument, you can use do. So

somefunction(x-> #do stuff#, 
    container)

becomes

somefunction(container) do x
    #do stuff#
end

Takes a little getting used to, but it’s awesome.
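
A concrete, runnable version of the same equivalence, using `map` as the function (the variable names are made up for the example):

```julia
xs = [1, 2, 3]

# Anonymous function passed as the first argument...
doubled_lambda = map(x -> 2x, xs)

# ...and the equivalent do-block form: the block becomes that first argument.
doubled_do = map(xs) do x
    2x
end

@show doubled_lambda == doubled_do
```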

As for why you want Query: that’s a solid reason. There’s also DataFramesMeta.jl, which has @linq and uses missing natively, but it’s a bit out of step with the most recent version of DataFrames at the moment (though it is being updated).

That said, I’m guessing it will only take a little bit of effort to wrap your head around DataValue and its quirks to continue using Query.jl. I just never expended that effort because I’m not super attached to that syntax.
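
As a starting point, here is a rough sketch of one of those quirks. In the @map example above, `ifelse` returns `_.x` itself, which is still a DataValue, so the column ends up as Union{Int64, DataValue{Int64}}. DataValues.jl provides a two-argument `get(dv, default)` that unwraps the value or returns the default. The sketch below assumes DataValues.jl is installed and uses hand-built rows in place of a real query; `first_value` is a hypothetical helper name.

```julia
using DataValues  # assumption: the DataValues.jl package is installed

# Hand-built stand-ins for the rows a Query pipeline would see.
rows = [
    (x = DataValue(1),     y = DataValue{Int}()),
    (x = DataValue{Int}(), y = DataValue(10)),
    (x = DataValue{Int}(), y = DataValue{Int}()),
]

# Unwrap x if present, otherwise y, otherwise fall back to 0.
first_value(row) = get(row.x, get(row.y, 0))

anyval = map(first_value, rows)

@show anyval
@show eltype(anyval)
```

Presumably the same expression would work inside the query as `@map(get(_.x, get(_.y, 0)))`, but I have not run that here.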
