I really like to use missing, but there are some instances where it becomes annoying. First, logical indexing. Second, when a package hasn’t implemented a safety method f(missing) = missing.
For example, suppose I read in a DataFrame from a file and that file contains missing values. Now I want to subset using where or in on one of the columns containing missing values. The understandably conservative logic of missing requires two steps rather than one: first I must eliminate the rows with missing, then I can do the filtering operation.
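using DataFrames
using DataFramesMeta  # @where comes from DataFramesMeta, not DataFrames
df = DataFrame(a=[1, missing, 3], b=["low", missing, "high"])
@where(df, :a .> 0)                         # error: the mask contains missing
@where(df, in.(:b, Ref(("low", "high"))))   # error: in(missing, ...) is missing
# necessary (?): rule out the missing rows first, then filter
@where(df, .!ismissing.(:a), :a .> 0)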
Is there a way to change the logic so that (missing > 0) == false or (missing in (0,1)) == false? I’m not proposing to change the default behavior; I’m just wondering if there is a way for me to basically make missing behave more like NaN, but not just for numeric columns.
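One option that needs nothing beyond Base is to wrap the comparison in coalesce, which replaces missing with a fallback value. The sketch below uses plain vectors standing in for the two columns; the names keep_a and keep_b are just illustrative, and I'm assuming the resulting Bool mask can then be handed to whatever row-filtering tool you use.

```julia
# Plain vectors standing in for the columns :a and :b.
a = [1, missing, 3]
b = ["low", missing, "high"]

# Comparisons propagate missing, so this mask is not a pure Bool vector.
mask = a .> 0                        # contains true, missing, true

# coalesce returns its first non-missing argument, so missing becomes false.
keep_a = coalesce.(a .> 0, false)
keep_b = coalesce.(in.(b, Ref(("low", "high"))), false)

# Both masks are plain Bool and can be used directly for logical indexing.
a[keep_a]
b[keep_b]
```

This keeps the default three-valued logic intact everywhere else and only makes the "treat missing as false" decision explicit at the point of filtering.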
Additionally, not to pick on any package specifically, but here is an example:
using Distributions
pdf(Normal(0,1),missing) # errors
What am I to do in this example? Well, this again becomes a two-step procedure rather than one. I need to define a wrapper function like pdf2(x) = ismissing(x) ? missing : pdf(Normal(0,1),x) or something.
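Rather than writing one wrapper per function, the wrapping itself can be factored out. Below is a minimal Base-only sketch: lift_missing is a hypothetical helper name, and stdnormal_pdf is a hand-written stand-in for pdf(Normal(0,1), x) so the example runs without any packages. As far as I know, the Missings.jl package provides passmissing, which does essentially this.

```julia
# Hypothetical helper: wrap f so that any missing positional argument
# short-circuits to missing instead of reaching f.
lift_missing(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

# Stand-in for pdf(Normal(0, 1), x); like the real method, it has no
# method for missing, so stdnormal_pdf(missing) throws a MethodError.
stdnormal_pdf(x::Real) = exp(-x^2 / 2) / sqrt(2π)

pdf_or_missing = lift_missing(stdnormal_pdf)

pdf_or_missing(0.0)        # ordinary value: the density at 0
pdf_or_missing(missing)    # missing, no error

# Broadcasting works as usual over a column with missing values.
densities = pdf_or_missing.([0.0, missing, 1.0])
```

With the real package function, the same pattern would be lift_missing(x -> pdf(Normal(0,1), x)).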