Changing missing logic

tbeason · December 18, 2019, 4:01pm

I really like to use missing, but there are some instances where it becomes annoying. First, logical indexing. Second, when a package hasn’t implemented a safety method f(missing)=missing.

For example, suppose I read in a DataFrame from a file and that file contains missing values. Now I want to subset using where or in on one of the columns containing missing values. The understandably conservative logic of missing requires two steps rather than one. First, I must eliminate rows with missing, then I can do the filtering operation.

using DataFrames
df = DataFrame(a=[1,missing,3],b=["low",missing,"high"])
@where(df,:a .> 0) # error
@where(df,in.(:b,Ref(("low","high")))) # error

# necessary (?)
@where(df,.!ismissing.(:a),:a.>0)

Is there a way to change the logic so that missing > 0 == false or missing \in (0,1) == false? I’m not proposing to change the default behavior, I’m just wondering if there is a way for me to basically make missing behave more like NaN but not just for numeric columns.

Additionally, and not to pick on any package specifically, but to give an example.

using Distributions
pdf(Normal(0,1),missing) # errors

What am I to do in this example? Well, this again becomes a two step procedure rather than one. I need to define an anonymous function like pdf2(x) = ismissing(x) ? missing : pdf(Normal(0,1),x) or something.

pdeffebach · December 18, 2019, 4:22pm

For your first example, yeah I think that is would be the right way to keep all observations that aren’t missing and greater than 0. I agree that the .!ismissing.(:a) is not the most convenient syntax. A convenient workaround would be to simply define

notmissing(x) = ismissing(x) == false

The way to deal with your second example is though the function passmissing defined in Missings.jl. It is a mapping Function -> Function that returns missing more frequently.

For your last example you would do

julia> passmissing(pdf)(Normal(0, 1), missing)
missing

cstjean · December 18, 2019, 4:26pm

I sometimes use @where(df, coalesce.(:a.>0, false))

tbeason · December 18, 2019, 4:40pm

I was not aware of passmissing, which does look like a nice utility. That still basically amounts to my existing solution just in a much more compact way.

The issue with the first is not the .!ismissing.(:a) is ugly. It is that it is necessary before trying :a .> 0. Not terrible if you are just doing this for one column, but if you are chaining things together it gets out of hand. My DataFramesMeta @linq calls basically now have one long line of where(.!ismissing.(:a),.!ismissing.(:b), ....) |> before I move on to actual filtering.

tbeason · December 18, 2019, 4:40pm

That does work better for that situation but I can’t seem to get it to work for the second scenario using in

pdeffebach · December 18, 2019, 4:42pm

That’s a good point. Maybe do this?

ulia> @where(df, notmissing.(:a.>0))

It’s not good that the order matters here. You should file an issue with DataFramesMeta and see if there is anything that should be changed.

davidanthoff · December 18, 2019, 8:45pm

You could also use Query.jl for this. It uses DataValues.jl for missing value handling, which is based on the C# semantics for missing values, so your first examples just work like

df |> @filter(_.a > 0)

and

df |> @filter(_.b in ("low", "high"))

. If you want a DataFrame with your results, you just pipe it into one:

df |> @filter(_.a > 0) |> DataFrame

Topic		Replies	Views
Filter a number when missing values are in column New to Julia dataframes , missing-values	23	2599	February 11, 2022
Function occursin() needs a method that accepts Union{String, Missing} General Usage	12	1806	August 28, 2018
How to filter out missings (using DataFramesMeta @where) New to Julia dataframes	2	691	July 22, 2019
Query.jl - filtering on missing data Data	7	1557	September 21, 2018
Select observations in a DataFrame from variable with some missing observations Data dataframes	15	521	April 17, 2021

Changing missing logic

Related topics