Finding DataFrame rows with `missing` values in specific columns?

I have a DataFrame, like the following example, where I am trying to find the rows where the columns x2 and x3 are missing values:

df = DataFrame([1 2 3; 4 5 6; 7 missing missing; 10 11 12; 13 missing missing; 16 17 18], :auto)

    x1      x2       x3
    Int64?  Int64?   Int64?
1   1       2        3
2   4       5        6
3   7       missing  missing
4   10      11       12
5   13      missing  missing
6   16      17       18

After much trial and error I found the following code would work:

e = [:x2, :x3]

[all(r) for r in eachrow(ismissing.(df[:,e]))]

producing a desired result:

[false false true false true false]

However the code seems overly complicated for what it does and I believe I am missing a much simpler solution. Can anyone help?

I guess this can go one of two ways. Do you know that :x2 and :x3 are the columns you need to analyze, or are you looking for a solution where you don’t know the columns that are important ex-ante and need a general solution?

If you know it’s :x2 and :x3 than maybe

ismissing.(df.x2) .&& ismissing.(df.x3)

(which works in 1.7)

If you don’t know the columns ex-ante, then the current solution is not bad. I would maybe use map though.

julia> e = [:x2, :x3];

julia> map(eachrow(df)) do r
           all(ismissing, r[e])
       end
6-element Vector{Bool}:
 0
 0
 1
 0
 1
 0

From the perspective of DataFramesMeta , for the first option you can write

julia> @with df @byrow ismissing(:x2) && ismissing(:x3)
6-element Vector{Bool}:
 0
 0
 1
 0
 1
 0

or

julia> @with df @. ismissing(:x2) && ismissing(:x3)
6-element BitVector:
 0
 0
 1
 0
 1
 0

With DataFramesMeta in the 2nd perspective, that exact feature exists on master, but is not fully implemented.

3 Likes

This works also in 1.6 and 1.8:

ismissing.(df.x2) .& ismissing.(df.x3)
#or
ismissing.(df[:,:x2]) .& ismissing.(df[:,:x3])

And another one:

all.(map.(ismissing,eachrow(df[:,e])))

which is the same as @pdeffebach map version but in one line.

1 Like

They are not the same.

df[:, e]

makes a copy of all the columns. This will be expensive for large data frames.

Similarly, & is not the same as &&. 1 & 2 is a bit-wise operation, 1 && 2 will fail.

2 Likes

Ah, great, thanx for the clarification, didn’t pay attention to this.

This I know, I didn’t say they are the same. Why is & a problem here? Or why is && to be prefered?

It’s a not problem in this exact instance, but it is less safe to use in general. If you were to forget to write ismissing.(x) and just did x, the code would still run, but the result would be nonsense.

1 Like

Thanks for your replies, they have helped my understanding, leading me to try this code:

map(r -> all(ismissing, r[e]), eachrow(df))

as suggested by the do loop documentation and the help of @pdeffebach’s code . It takes around 22\mu s in Pluto on my old system as does @pdeffebach’s do loop code, @oheil 's version takes about 92\mu s, which although slower is helpful for seeing the effect of different approaches. My original code takes 60\mu s.

1 Like

If you really need good performance, instead of eachrow(df) use Tables.namedtupleiterator(df[!, e]), which will be faster. Assuming e isn’t too big, hopefully not bigger than 20.

Of course, there are even faster ways to go about this that we can dig into if you need.

3 Likes
ismissing.(coalesce.(df.x2, df.x3))
1 Like

try this, for the generic case

e=eachcol(df[!,names(df, !=("x1"))])

ismissing.(reduce((x,y)->coalesce.(x,y), e))
cols1=eachcol(df1[!,names(df1, !=("x1"))])
mapreduce(x->findall(ismissing,x), intersect, cols1)

Although not as fast as the use of broadcast coalesce on columns, the filter and subset functions would do the job.

filter(x -> all(ismissing,x[2:end]), df)

subset(df, names(df,!=("x1"))=>ByRow((r...)->all(ismissing,r)) )

PS
the syntax of the subset function in this case is much less intuitive and direct (it is necessary to slurp the input variables and it is not the first thing that comes to mind)

That’s what AsTable is for

1 Like