Changing missing logic

I really like to use missing, but there are some instances where it becomes annoying. First, logical indexing. Second, when a package hasn’t implemented a safety method f(missing)=missing.

For example, suppose I read in a DataFrame from a file and that file contains missing values. Now I want to subset using where or in on one of the columns containing missing values. The understandably conservative logic of missing requires two steps rather than one. First, I must eliminate rows with missing, then I can do the filtering operation.

using DataFrames
df = DataFrame(a=[1,missing,3],b=["low",missing,"high"])
@where(df,:a .> 0) # error
@where(df,in.(:b,Ref(("low","high")))) # error

# necessary (?)
@where(df,.!ismissing.(:a),:a.>0)

Is there a way to change the logic so that missing > 0 == false or missing \in (0,1) == false? I’m not proposing to change the default behavior, I’m just wondering if there is a way for me to basically make missing behave more like NaN but not just for numeric columns.

Additionally, and not to pick on any package specifically, but to give an example.

using Distributions
pdf(Normal(0,1),missing) # errors

What am I to do in this example? Well, this again becomes a two step procedure rather than one. I need to define an anonymous function like pdf2(x) = ismissing(x) ? missing : pdf(Normal(0,1),x) or something.

1 Like

For your first example, yeah I think that is would be the right way to keep all observations that aren’t missing and greater than 0. I agree that the .!ismissing.(:a) is not the most convenient syntax. A convenient workaround would be to simply define

notmissing(x) = ismissing(x) == false

The way to deal with your second example is though the function passmissing defined in Missings.jl. It is a mapping Function -> Function that returns missing more frequently.

For your last example you would do

julia> passmissing(pdf)(Normal(0, 1), missing)
missing

I sometimes use @where(df, coalesce.(:a.>0, false))

I was not aware of passmissing, which does look like a nice utility. That still basically amounts to my existing solution just in a much more compact way.

The issue with the first is not the .!ismissing.(:a) is ugly. It is that it is necessary before trying :a .> 0. Not terrible if you are just doing this for one column, but if you are chaining things together it gets out of hand. My DataFramesMeta @linq calls basically now have one long line of where(.!ismissing.(:a),.!ismissing.(:b), ....) |> before I move on to actual filtering.

That does work better for that situation but I can’t seem to get it to work for the second scenario using in

That’s a good point. Maybe do this?

ulia> @where(df, notmissing.(:a.>0))

It’s not good that the order matters here. You should file an issue with DataFramesMeta and see if there is anything that should be changed.

You could also use Query.jl for this. It uses DataValues.jl for missing value handling, which is based on the C# semantics for missing values, so your first examples just work like

df |> @filter(_.a > 0)

and

df |> @filter(_.b in ("low", "high"))

. If you want a DataFrame with your results, you just pipe it into one:

df |> @filter(_.a > 0) |> DataFrame
2 Likes