Help setting a filter that can return missing or non-missing values

Hi all,
Is there no way to specifically filter for missing values in a DataFrame column that is compatible with filtering for non-missing values (e.g. so a single filter can loop through non-missing and missing?)

Take this code:

df = DataFrame(status=["employee", "employee", "contractor", missing], name=["Bill", "Bob", "John", "Joe"], salary =[1000,5000,3000,1200])

Filtering for employee:

filter(:status => n->n=="employee",df)

Must be written:

filter(:status => n->n=="employee",coalesce.(df,false))

This I guess is fine if I only want to filter for a non-missing value. But if I want to loop through the β€œstatuses” column, what’s my best approach? This code, as expected, does not work:

status_types = unique(df[:,:status])
for s in status_types
    filter(:status => n->n==s,df)
    println("I just filtered $s")
    #do other stuff
end

And neither does this:

status_types = unique(df[:,:status])
for s in status_types
    filter(:status => n->n==s,coalesce.(df,false))
    println("I just filtered $s")
    #do other stuff
end

There’s probably another bit here to understand why missing==missing does not equal true, but for now I’m mostly concerned with how to actually filter in a way where I can get missing values but also retrieve any other value I want in the column.

Thanks for your help.

edit
Of course 2 minutes after I post I find the solution (after Kagi searching "Why doesn’t missing==missing = true in Julia).
I can get to my expected answer using three equal signs, e.g.:

for s in status_types
    new_df = filter(:status => n->n===s,df)
    println("I just filtered $s")
    print(new_df)
    #do other stuff
end

But, is this the optimal way? Is there any danger to using === here?
When looking at the Documentation on === it’s not really clear to me when I will get a true value or a false. Looking at a & b, visually, they look the same, and I would expect the return to be true.

First, note that your solution is not correct

filter(:status => n->n=="employee",coalesce.(df,false))

coalesce.(df, false) replaces all missing values in the data frame with false. So the only reason your result works is because you are comparing "employee" == false, which returns false. But you could have gotten the same results with coalesce.(df, "xxx"). This, of course, would be a very bad idea because it affects all columns in your data frame, even non-numeric ones.

A few solutions

  1. Use subset, or DataFramesMeta.jl’s @rsubset, both of which have better missing values handling
julia> using DataFramesMeta

julia> df = DataFrame(status=["employee", "employee", "contractor", missing], name=["Bill", "Bob", "John", "Joe"], salary =[1000,5000,3000,1200])
4Γ—3 DataFrame
 Row β”‚ status      name    salary 
     β”‚ String?     String  Int64  
─────┼────────────────────────────
   1 β”‚ employee    Bill      1000
   2 β”‚ employee    Bob       5000
   3 β”‚ contractor  John      3000
   4 β”‚ missing     Joe       1200

julia> @rsubset df :status == "employee"
2Γ—3 DataFrame
 Row β”‚ status    name    salary 
     β”‚ String?   String  Int64  
─────┼──────────────────────────
   1 β”‚ employee  Bill      1000
   2 β”‚ employee  Bob       5000
  1. Use the package MissingsAsFalse.jl (disclaimer, I wrote MissingsAsFalse.jl and maintain DataFramesMeta)
julia> using MissingsAsFalse

julia> @mfalse filter(:status => n -> n == "employee", df)
2Γ—3 DataFrame
 Row β”‚ status    name    salary 
     β”‚ String?   String  Int64  
─────┼──────────────────────────
   1 β”‚ employee  Bill      1000
   2 β”‚ employee  Bob       5000
  1. Just use more complicated syntax in general
julia> filter(:status => n -> ismissing(n) ? false : n == "employee", df)
2Γ—3 DataFrame
 Row β”‚ status    name    salary 
     β”‚ String?   String  Int64  
─────┼──────────────────────────
   1 β”‚ employee  Bill      1000
   2 β”‚ employee  Bob       5000

Generally speaking when dealing with missing it’s useful to reach for isequal, which treats comparisons to missing as false:

julia> filter(:status => isequal("employee"), df)
2Γ—3 DataFrame
 Row β”‚ status    name    salary
     β”‚ String?   String  Int64
─────┼──────────────────────────
   1 β”‚ employee  Bill      1000
   2 β”‚ employee  Bob       5000

Thanks! Was just coming back to my thread to edit and say isequal also works in my loop, and would suspect it’s probably my best solution:

status_types = unique(df[:,:status])
for s in status_types
    new_df = filter(:status => n->isequal.(n,s),df)
    println("I just filtered $s")
    println(new_df)
    #do other stuff
end

There’s no need for broadcasting here as filter already operates row-wise, and you can use the infix method

filter(:status => isequal(s), df)