Filter a number when missing values are in column

Hi i want filter the rows having perticular value, but the column contains missing values as well. can some body help me to do this
data frame i have

df = DataFrame()
df.a = ([3,5,6,0,missing])
df.b = 1:5

i want to filter rows with value 5 in column df.a

df = filter(r -> r.a == 5 , df)

this code gives me an non-boolean error

data frame i want is

df = DataFrame()
df.a = 5
df.b = 2

Thanks

df = filter( r -> !ismissing(r.a) && r.a == 5 , df)
2 Likes

For working with missing imho the best thing is generally to use isequal, because

julia> 5 == missing
missing

julia> isequal(5, missing)
false

so

julia> df[isequal.(df.a, 5), :]
1Γ—2 DataFrame
 Row β”‚ a       b     
     β”‚ Int64?  Int64 
─────┼───────────────
   1 β”‚      5      2

(Incidentally, are you aware that you can assign columns directly in the DataFrame construtor, i.e. DataFrame(a = [3, 5, 6, 0, missing], b = 1:5])?

4 Likes

What’s the rationale behind that?

Doesn’t explain it but there is some logic inside:

julia> missing == missing
missing

julia> missing === missing
true

The idea is that you don’t know - missing represents a value that’s missing in a statistical sense, i.e. the value exists but is not known (for example because it was not measured, someone didn’t answer that survey question etc.). You therefore can’t be sure that it’s not 5, so returning false wouldn’t be correct.

(There’s been a lot of ink spilled over this and how it can make working with DataFrames in Julia less ergonomic than some other popular languages, but it is what it is!)

5 Likes

Thanks, seems useful in many contexts. Of course it is then arbitrary that isequal returns false, but that allows one to choose what one wants, and probably that is the reason for the two alternatives.

Yes, at least that’s what the docstring suggests:

help?> isequal

  isequal(x, y)

  Similar to ==, except for the treatment of floating point numbers and of missing values. 
isequal treats all floating-point NaN values as equal to each other, treats -0.0 as unequal 
to 0.0, and missing as equal to missing. Always returns a Bool value.
2 Likes

thanks @oheil @nilshg @lmiq for your help.

1 Like

how can remove only df.a = 5 row, keep the remianing data frame as such.

I wrote a package, MissingsAsFalse.jl which solves this exact problem via a macro.

@mfalse missing == 5
1 Like

I would rather recommend (I ignore performance considerations):

filter(r -> coalesce(r.a == 5, false) , df)

for the following reasons:

  • by passing true you can decide to keep these rows
  • it is more explicit what is going on
  • == and isequal have a bit different behavior, so if you want == it is best to stick to using ==
2 Likes

MissingsAsFalse.jl is of course a valid alternative. I was commenting on using isequal.

1 Like

Couple more tips

get a view of the dataframe with no missings in any column (no duplicating data)

julia> dfv  = dropmissing(df; view=true)
4Γ—2 SubDataFrame
 Row β”‚ a       b     
     β”‚ Int64?  Int64 
─────┼───────────────
   1 β”‚      3      1
   2 β”‚      5      2
   3 β”‚      6      3
   4 β”‚      0      4

or just particular columns

julia> dfv  = dropmissing(df, [:a]; view=true)
4Γ—2 SubDataFrame
 Row β”‚ a       b     
     β”‚ Int64?  Int64 
─────┼───────────────
   1 β”‚      3      1
   2 β”‚      5      2
   3 β”‚      6      3
   4 β”‚      0      4

specifying just the column in the filter is more efficient

julia> filter(:a=>a -> a == 5 , dfv)
1Γ—2 DataFrame
 Row β”‚ a       b     
     β”‚ Int64?  Int64 
─────┼───────────────
   1 β”‚      5      2

or multiples

julia> filter([:a, :b]=>(a,b) -> a >b, dfv)
3Γ—2 DataFrame
 Row β”‚ a       b     
     β”‚ Int64?  Int64 
─────┼───────────────
   1 β”‚      3      1
   2 β”‚      5      2
   3 β”‚      6      3

and finally, again you can filter to a view

julia> filter([:a]=>a -> a == 5 , dfv; view=true)
1Γ—2 SubDataFrame
 Row β”‚ a       b     
     β”‚ Int64?  Int64 
─────┼───────────────
   1 β”‚      5      2

how to remove only df.a = 5, so ill get a dataframe

df2 = DataFrame(a = [3,6,0,missing], b = [1,3,4,5] )

I’m going to give an answer using DataFramesMeta.jl

@rsubset df @mfalse :a != 5
1 Like

i have loaded DataFramesMeta , still if shows @mfalse not defined


julia> using DataFramesMeta

julia> amik9d = @rsubset amik9 @mfalse :RATE != 0
ERROR: LoadError: UndefVarError: @mfalse not defined

You also need to add the package MissingsAsFalse.jl as well. Though actually the @rsubset command will work even without @mfalse.

1 Like

it removed both missing and 5 from df.a, i wanted to remove only 5 and keep missing as such

Ah interesting! That would indeed happen and is a bit of a corner case I hadn’t thought of. You would have to specify this directly, unfortunately

@rsubset df @mfalse ismissing(:a) || :a != 5

I will have to think about a better syntax to support this.

EDIT: This isn’t going to work either. You will have to use if ... else commands.

@rsubset df ismissing(:a) ? true : :a != 5
1 Like