Finding DataFrame rows with `missing` values in specific columns?

forestacorn · February 4, 2022, 5:21pm

I have a DataFrame, like the following example, where I am trying to find the rows where the columns x2 and x3 are missing values:

df = DataFrame([1 2 3; 4 5 6; 7 missing missing; 10 11 12; 13 missing missing; 16 17 18], :auto)

    x1      x2       x3
    Int64?  Int64?   Int64?
1   1       2        3
2   4       5        6
3   7       missing  missing
4   10      11       12
5   13      missing  missing
6   16      17       18

After much trial and error I found the following code would work:

e = [:x2, :x3]

[all(r) for r in eachrow(ismissing.(df[:,e]))]

producing a desired result:

[false false true false true false]

However the code seems overly complicated for what it does and I believe I am missing a much simpler solution. Can anyone help?

pdeffebach · February 4, 2022, 5:31pm

I guess this can go one of two ways. Do you know that :x2 and :x3 are the columns you need to analyze, or are you looking for a solution where you don’t know the columns that are important ex-ante and need a general solution?

If you know it’s :x2 and :x3 than maybe

ismissing.(df.x2) .&& ismissing.(df.x3)

(which works in 1.7)

If you don’t know the columns ex-ante, then the current solution is not bad. I would maybe use map though.

julia> e = [:x2, :x3];

julia> map(eachrow(df)) do r
           all(ismissing, r[e])
       end
6-element Vector{Bool}:
 0
 0
 1
 0
 1
 0

From the perspective of DataFramesMeta , for the first option you can write

julia> @with df @byrow ismissing(:x2) && ismissing(:x3)
6-element Vector{Bool}:
 0
 0
 1
 0
 1
 0

or

julia> @with df @. ismissing(:x2) && ismissing(:x3)
6-element BitVector:
 0
 0
 1
 0
 1
 0

With DataFramesMeta in the 2nd perspective, that exact feature exists on master, but is not fully implemented.

oheil · February 4, 2022, 5:36pm

This works also in 1.6 and 1.8:

ismissing.(df.x2) .& ismissing.(df.x3)
#or
ismissing.(df[:,:x2]) .& ismissing.(df[:,:x3])

oheil · February 4, 2022, 5:47pm

And another one:

all.(map.(ismissing,eachrow(df[:,e])))

which is the same as @pdeffebach map version but in one line.

pdeffebach · February 4, 2022, 6:08pm

They are not the same.

df[:, e]

makes a copy of all the columns. This will be expensive for large data frames.

Similarly, & is not the same as &&. 1 & 2 is a bit-wise operation, 1 && 2 will fail.

oheil · February 4, 2022, 7:36pm

Ah, great, thanx for the clarification, didn’t pay attention to this.

This I know, I didn’t say they are the same. Why is & a problem here? Or why is && to be prefered?

pdeffebach · February 4, 2022, 7:39pm

It’s a not problem in this exact instance, but it is less safe to use in general. If you were to forget to write ismissing.(x) and just did x, the code would still run, but the result would be nonsense.

forestacorn · February 4, 2022, 9:34pm

Thanks for your replies, they have helped my understanding, leading me to try this code:

map(r -> all(ismissing, r[e]), eachrow(df))

as suggested by the do loop documentation and the help of @pdeffebach’s code . It takes around 22\mu s in Pluto on my old system as does @pdeffebach’s do loop code, @oheil 's version takes about 92\mu s, which although slower is helpful for seeing the effect of different approaches. My original code takes 60\mu s.

pdeffebach · February 4, 2022, 9:36pm

If you really need good performance, instead of eachrow(df) use Tables.namedtupleiterator(df[!, e]), which will be faster. Assuming e isn’t too big, hopefully not bigger than 20.

Of course, there are even faster ways to go about this that we can dig into if you need.

rocco_sprmnt21 · February 5, 2022, 3:12pm

ismissing.(coalesce.(df.x2, df.x3))

rocco_sprmnt21 · February 5, 2022, 5:12pm

try this, for the generic case

e=eachcol(df[!,names(df, !=("x1"))])

ismissing.(reduce((x,y)->coalesce.(x,y), e))

rocco_sprmnt21 · February 5, 2022, 7:17pm

cols1=eachcol(df1[!,names(df1, !=("x1"))])
mapreduce(x->findall(ismissing,x), intersect, cols1)

Although not as fast as the use of broadcast coalesce on columns, the filter and subset functions would do the job.

filter(x -> all(ismissing,x[2:end]), df)

subset(df, names(df,!=("x1"))=>ByRow((r...)->all(ismissing,r)) )

PS
the syntax of the subset function in this case is much less intuitive and direct (it is necessary to slurp the input variables and it is not the first thing that comes to mind)

pdeffebach · February 7, 2022, 6:01pm

That’s what AsTable is for

Topic		Replies	Views
Find DataFrame row with missing values present General Usage dataframes	8	687	January 9, 2023
Detecting missing in DataFrame columns New to Julia	6	5782	April 6, 2021
Show only rows with missing in one column New to Julia	5	1512	September 1, 2018
How do I drop only rows that are fully filled with missing values? General Usage question , package , dataframes	3	183	January 24, 2023
How to compare non-missing elements of two DataFrames New to Julia	3	567	July 1, 2020

Finding DataFrame rows with `missing` values in specific columns?

Related topics