Is the @filter macro from Query.jl only useful for simple things like _.a > 1 and so on?
I’ve tried to use ismissing, which just returns a boolean, but the filter returns no rows (and also raises no errors). Broadcasting ismissing doesn’t help either (it results in an error).
Here is my example comparing Query.jl syntax with basic DataFrames.jl syntax (which is successful), with the output shown in a comment beneath each:
using DataFrames, Query
example = DataFrame(
alph = ["a", "b", "c", "d"],
name = ["apple", "banana", "carrot", "date"],
prct = [0.2, 1, missing, 10]
)
example |>
@filter(ismissing(_.prct)) |>
DataFrame
# Output: [], no rows
example |>
x -> subset(x, :prct => ByRow(ismissing))
# Output:
# alph name prct
# 1 "c" "carrot" missing
# Edit, LINQ style just for completeness
@from i in example begin
@where ismissing(i.prct)
@select i
@collect DataFrame
end
# Output: [], no rows
I also tried anonymous function syntax with no luck. Am I using @filter wrong?
Still, nobody has explained how it is supposed to work. I experimented and found that the following works, but I don’t know if it is the right way:
julia> using DataValues
julia> example |>
@filter(_.prct == DataValue(missing))
1x3 query result
alph │ name │ prct
─────┼────────┼─────
c │ carrot │ #NA
Interesting! Yes, I kept rereading the doc line about the filter macro: if it returns true, then the element is retained. I would not have considered doing col == true, but it's nice to see that something works!
So I guess by that logic none of the missing values are missing according to Query? What’s strange is that I’ve done a decent amount of manipulation on dataframes containing missing values and the instances of missing have always been retained, but perhaps that’s because by collecting back into a DataFrame we never see the intermediate NA type?
Here is a quick check that indeed ismissing is called on a DataValue:
julia> using TraceFuns
julia> @trace (example |> @filter(ismissing(_.prct)) |> DataFrame) ismissing
10: ismissing(DataValue{Float64}(0.2)) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
10: ismissing(DataValue{Float64}(0.2)) -> false
10: ismissing(DataValue{Float64}(1.0)) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
10: ismissing(DataValue{Float64}(1.0)) -> false
10: ismissing(DataValue{Float64}()) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
10: ismissing(DataValue{Float64}()) -> false
10: ismissing(DataValue{Float64}(10.0)) -- Method ismissing(x) @ Base essentials.jl:1010 of ismissing
10: ismissing(DataValue{Float64}(10.0)) -> false
0×3 DataFrame
Row │ alph name prct
│ String String Float64?
─────┴──────────────────────────
Imho, there are two issues with this:
Query operators change the type of data columns and I can just hope that operators forward to DataValue{T} as expected
Missing values are treated differently (which is one of the motivations for DataValues, as far as I understand), and in particular ismissing is not forwarded as expected.
Would it be good practice to define?
Base.ismissing(::DataValue) = error("ismissing called on DataValue. This is most likely not what you want, consider calling isna instead")
PS: Just found another example where DataValue’s wrapping might be problematic:
julia> 1.2 isa Real
true
julia> DataValue(1.2) isa Real
false
Note that this is pervasive, since it also affects dispatch!
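To make the dispatch point concrete, here is a small sketch of unwrapping a DataValue before handing it to code that expects a Real. It assumes DataValues.jl provides `get(x)` and `get(x, default)` for unwrapping, which is my understanding of its API; the variable names and the fallback of 0.0 are just illustrative choices.

```julia
using DataValues

wrapped  = DataValue(1.2)          # a DataValue holding a Float64
na_value = DataValue{Float64}()    # a DataValue with no value, as in the trace above

# The wrapper itself is not a Real, so methods defined for Real will not match it.
wrapper_is_real = wrapped isa Real

# Unwrapping gives back the underlying Float64, which is a Real.
unwrapped = get(wrapped)
unwrapped_is_real = unwrapped isa Real

# For a value that may be NA, supply a fallback instead of unwrapping directly.
fallback = get(na_value, 0.0)

println(wrapper_is_real, " ", unwrapped_is_real, " ", fallback)
```

So inside a query, anything that dispatches on the element type has to either unwrap first or have a method for DataValue.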
It sounds like Queryverse diverges from the rest of the Julia DataFrames ecosystem in some ways, to provide semantics that work with a broader set of table-like structures. One of those ways is using a replacement for missing. Unfortunately, that results in situations like this, where ismissing silently returns false.
I think most things have been said above, but just to confirm: Union{Missing,T} columns in tables are converted to DataValue{T} inside the query, and then converted back to Union{Missing,T} if you materialize into say a DataFrame at the end. So inside the query you need to use things like isna etc.
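For reference, a minimal sketch of what that looks like on the example table from the first post. It assumes DataValues is installed alongside Query so that isna is definitely in scope; the names only_missing and without_missing are mine.

```julia
using DataFrames, Query, DataValues

example = DataFrame(
    alph = ["a", "b", "c", "d"],
    name = ["apple", "banana", "carrot", "date"],
    prct = [0.2, 1, missing, 10],
)

# Inside the query, prct is a DataValue{Float64}, so test it with isna.
only_missing = example |> @filter(isna(_.prct)) |> DataFrame

# The complement: keep only rows where prct has a value.
without_missing = example |> @filter(!isna(_.prct)) |> DataFrame

println(only_missing)
println(without_missing)
```

Once the result is materialized back into a DataFrame, the column holds ordinary missing values again, so ismissing works as usual outside the query.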
The biggest difference between Missing and DataValue is probably that traditional predicates stay predicates with DataValue, i.e. something like == will always return a Bool.
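A short sketch of that difference, comparing the same test against missing and against an empty DataValue. It only assumes DataValues is installed; the variable names are illustrative.

```julia
using DataValues

na_value = DataValue{Float64}()    # a DataValue with no value

# With Missing, comparisons use three-valued logic: the result is itself missing.
result_missing = missing == 1.0

# With DataValue, the comparison stays a predicate and yields a plain Bool.
result_na = na_value == 1.0

println(typeof(result_missing))
println(typeof(result_na))

# isna recognizes the empty DataValue, while ismissing does not (as the trace above shows).
println(isna(na_value), " ", ismissing(na_value))
```

This is why a condition like _.prct > 1 can be used directly in @filter without wrapping it in coalesce, whereas the same comparison on a Union{Missing,Float64} value could produce missing instead of true or false.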
Thanks! I can’t really tell how to interact with DataValues.jl (even with Query loaded), so it might be beneficial for something like this to be added to the docs for Query, even if it’s as minor as a suggested replacement function for ismissing, etc.
I really like the utilization of named tuples as a behind-the-scenes data structure and I really like the way macros work in Query!
However, in the pursuit of consistency, I don’t see where the behavior of missing could be improved upon - anything to do with missing should always return missing, other than ismissing, right?
Maybe DataValues.jl solves problems that existed in prior versions of Julia and that have since been addressed?