Basic function usage in Query @filter

Is the @filter macro from Query.jl only useful for simple things like _.a > 1 and so on?
I’ve tried to use ismissing, which just returns a boolean, but the query returns no rows (and no error either). Broadcasting ismissing doesn’t help: it results in an error.

Here is my example with commented output beneath each comparing Query.jl syntax and basic DataFrames.jl syntax (which is successful):

using DataFrames, Query

example = DataFrame(
	alph = ["a", "b", "c", "d"],
	name = ["apple", "banana", "carrot", "date"],
	prct = [0.2, 1, missing, 10]
)

example |>
@filter(ismissing(_.prct)) |>
DataFrame
# Output: [], no rows

example |>
x -> subset(x, :prct => ByRow(ismissing))
# Output:
#	alph name     prct
# 1	 "c" "carrot" missing

# Edit, LINQ style just for completeness
@from i in example begin
    @where ismissing(i.prct)
    @select i
    @collect DataFrame
end
# Output: [], no rows

I also tried anonymous function syntax with no luck. Am I using @filter wrong?

There seems to be some lack of documentation: Docs for missing values · Issue #306 · queryverse/Query.jl · GitHub

The docs about this: Getting Started · Query.jl

Still, nobody explains how it is supposed to work. I experimented and found that the following works, but I don’t know if it is the right way:

julia> using DataValues

julia> example |>
       @filter(_.prct == DataValue(missing))
1x3 query result
alph │ name   │ prct
─────┼────────┼─────
c    │ carrot │ #NA

Let’s ping those who know: @baggepinnen :wink:

It looks like Query.jl treats missing values as NA:

julia> example |> @filter(isna(_.prct))
1x3 query result
alph │ name   │ prct
─────┼────────┼─────
c    │ carrot │ #NA

See Query.jl - filtering on missing data

Interesting! Yes, I kept rereading the doc line about the filter macro: if it returns true, then the element is retained. I would not have considered doing col == True, but it’s nice to see that something works!

So I guess by that logic none of the missing values are missing according to Query? What’s strange is that I’ve done a decent amount of manipulation on dataframes containing missing values and the instances of missing have always been retained, but perhaps that’s because by collecting back into a DataFrame we never see the intermediate NA type?

Here is a quick check that indeed ismissing is called on a DataValue:

julia> using TraceFuns

julia> @trace (example |> @filter(ismissing(_.prct)) |> DataFrame) ismissing
                                       10: ismissing(DataValue{Float64}(0.2)) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
                                       10: ismissing(DataValue{Float64}(0.2)) -> false
                                       10: ismissing(DataValue{Float64}(1.0)) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
                                       10: ismissing(DataValue{Float64}(1.0)) -> false
                                       10: ismissing(DataValue{Float64}()) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
                                       10: ismissing(DataValue{Float64}()) -> false
                                       10: ismissing(DataValue{Float64}(10.0)) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
                                       10: ismissing(DataValue{Float64}(10.0)) -> false
0×3 DataFrame
 Row │ alph    name    prct     
     │ String  String  Float64? 
─────┴──────────────────────────

Imho, there are two issues with this:

  1. Query operators change the type of data columns and I can just hope that operators forward to DataValue{T} as expected
  2. Missing values are treated differently (which is one of the motivations for DataValues, as far as I understand), and in particular ismissing is not forwarded as expected.
    Would it be good practice to define the following?
     Base.ismissing(::DataValue) = error("ismissing called on DataValue. This is most likely not what you want, consider calling isna instead")
    

PS: Just found another example where DataValue’s wrapping might be problematic:

julia> 1.2 isa Real
true

julia> DataValue(1.2) isa Real
false

Note that this is pervasive, since it also affects dispatch!
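
One way to cope with the dispatch problem is to unwrap the value explicitly before calling a method that is restricted to Real. A minimal sketch, assuming DataValues is installed as a direct dependency; half and half_or_missing are hypothetical helpers made up for illustration:

```julia
using DataValues

# Hypothetical helper that only accepts real numbers.
half(x::Real) = x / 2

# half(DataValue(1.2)) would throw a MethodError, because
# DataValue{Float64} is not a subtype of Real.

# Unwrap explicitly: return missing for NA, otherwise call half on the payload.
half_or_missing(x::DataValue) = isna(x) ? missing : half(get(x))

println(half_or_missing(DataValue(1.2)))
println(half_or_missing(DataValue{Float64}()))
```

Whether to map NA to missing or to some default value is a design choice; the two-argument form get(x, default) may be more convenient when a fallback number is wanted.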

It sounds like Queryverse diverges from the rest of the Julia DataFrames ecosystem in some ways, to provide semantics that work with a broader set of table-like structures. One of those ways is using a replacement for missing. Unfortunately, that results in situations like this, where ismissing silently returns false.

I think most things have been said above, but just to confirm: Union{Missing,T} columns in tables are converted to DataValue{T} inside the query, and then converted back to Union{Missing,T} if you materialize into, say, a DataFrame at the end. So inside the query you need to use things like isna.
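
A minimal sketch of that round trip, reusing the example table from the first post. It loads DataValues explicitly for isna, which assumes the package is installed as a direct dependency:

```julia
using DataFrames, Query, DataValues

example = DataFrame(
    alph = ["a", "b", "c", "d"],
    name = ["apple", "banana", "carrot", "date"],
    prct = [0.2, 1, missing, 10],
)

# Inside the query, prct arrives as a DataValue{Float64}, so test it with isna.
na_rows = example |> @filter(isna(_.prct)) |> DataFrame

# After materializing into a DataFrame, the entry is an ordinary missing again.
println(ismissing(na_rows.prct[1]))

# The complement: keep only the rows that have a value.
complete = example |> @filter(!isna(_.prct)) |> DataFrame
println(nrow(complete))
```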

The biggest difference between Missing and DataValue is probably that traditional predicates stay predicates with DataValue, i.e. something like == will always return a Bool.
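
To illustrate the difference, here is a small sketch comparing the two outside of any query. It assumes DataValues is installed; the variable names are just for illustration:

```julia
using DataValues

# Base's missing propagates through ==, so the result is not a Bool.
r_missing = missing == 1
println(ismissing(r_missing))

# DataValue comparisons stay predicates: the result is a Bool.
na = DataValue{Int}()   # an NA with element type Int
x  = DataValue(1)
r_dv = na == x
println(r_dv isa Bool)
```

This is why a DataValue comparison can be used directly as an @filter condition, whereas a comparison that yields missing could not be used as a plain Bool condition.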

There is a bit of docs at Getting Started · Query.jl and GitHub - queryverse/DataValues.jl: Missing values for julia.

Thanks! I can’t really tell how to interact with DataValues.jl (even with Query loaded), so it might be beneficial for something like this to be added to the docs for Query, even if it’s as minor as a suggested replacement function for ismissing, etc.

I really like the utilization of named tuples as a behind-the-scenes data structure and I really like the way macros work in Query!

However, in the pursuit of consistency, I don’t see where the behavior of missing could be improved upon - anything to do with missing should always return missing, other than ismissing, right?

Maybe DataValues.jl solves problems that existed in prior versions of Julia that have now been addressed?