Filter a number when missing values are in column

sai_matcha · February 11, 2022, 2:11pm

Hi i want filter the rows having perticular value, but the column contains missing values as well. can some body help me to do this
data frame i have

df = DataFrame()
df.a = ([3,5,6,0,missing])
df.b = 1:5

i want to filter rows with value 5 in column df.a

df = filter(r -> r.a == 5 , df)

this code gives me an non-boolean error

data frame i want is

df = DataFrame()
df.a = 5
df.b = 2

Thanks

oheil · February 11, 2022, 2:14pm

df = filter( r -> !ismissing(r.a) && r.a == 5 , df)

nilshg · February 11, 2022, 2:19pm

For working with missing imho the best thing is generally to use isequal, because

julia> 5 == missing
missing

julia> isequal(5, missing)
false

so

julia> df[isequal.(df.a, 5), :]
1×2 DataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      5      2

(Incidentally, are you aware that you can assign columns directly in the DataFrame construtor, i.e. DataFrame(a = [3, 5, 6, 0, missing], b = 1:5])?

lmiq · February 11, 2022, 2:20pm

What’s the rationale behind that?

oheil · February 11, 2022, 2:22pm

Doesn’t explain it but there is some logic inside:

julia> missing == missing
missing

julia> missing === missing
true

nilshg · February 11, 2022, 2:22pm

The idea is that you don’t know - missing represents a value that’s missing in a statistical sense, i.e. the value exists but is not known (for example because it was not measured, someone didn’t answer that survey question etc.). You therefore can’t be sure that it’s not 5, so returning false wouldn’t be correct.

(There’s been a lot of ink spilled over this and how it can make working with DataFrames in Julia less ergonomic than some other popular languages, but it is what it is!)

lmiq · February 11, 2022, 2:25pm

Thanks, seems useful in many contexts. Of course it is then arbitrary that isequal returns false, but that allows one to choose what one wants, and probably that is the reason for the two alternatives.

nilshg · February 11, 2022, 2:27pm

Yes, at least that’s what the docstring suggests:

help?> isequal

  isequal(x, y)

  Similar to ==, except for the treatment of floating point numbers and of missing values. 
isequal treats all floating-point NaN values as equal to each other, treats -0.0 as unequal 
to 0.0, and missing as equal to missing. Always returns a Bool value.

sai_matcha · February 11, 2022, 2:39pm

thanks @oheil @nilshg @lmiq for your help.

sai_matcha · February 11, 2022, 2:40pm

how can remove only df.a = 5 row, keep the remianing data frame as such.

pdeffebach · February 11, 2022, 2:42pm

I wrote a package, MissingsAsFalse.jl which solves this exact problem via a macro.

@mfalse missing == 5

bkamins · February 11, 2022, 2:43pm

I would rather recommend (I ignore performance considerations):

filter(r -> coalesce(r.a == 5, false) , df)

for the following reasons:

by passing true you can decide to keep these rows
it is more explicit what is going on
== and isequal have a bit different behavior, so if you want == it is best to stick to using ==

bkamins · February 11, 2022, 2:44pm

MissingsAsFalse.jl is of course a valid alternative. I was commenting on using isequal.

lawless-m · February 11, 2022, 2:45pm

Couple more tips

get a view of the dataframe with no missings in any column (no duplicating data)

julia> dfv  = dropmissing(df; view=true)
4×2 SubDataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      3      1
   2 │      5      2
   3 │      6      3
   4 │      0      4

or just particular columns

julia> dfv  = dropmissing(df, [:a]; view=true)
4×2 SubDataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      3      1
   2 │      5      2
   3 │      6      3
   4 │      0      4

specifying just the column in the filter is more efficient

julia> filter(:a=>a -> a == 5 , dfv)
1×2 DataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      5      2

or multiples

julia> filter([:a, :b]=>(a,b) -> a >b, dfv)
3×2 DataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      3      1
   2 │      5      2
   3 │      6      3

and finally, again you can filter to a view

julia> filter([:a]=>a -> a == 5 , dfv; view=true)
1×2 SubDataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      5      2

sai_matcha · February 11, 2022, 2:50pm

how to remove only df.a = 5, so ill get a dataframe

df2 = DataFrame(a = [3,6,0,missing], b = [1,3,4,5] )

pdeffebach · February 11, 2022, 2:53pm

I’m going to give an answer using DataFramesMeta.jl

@rsubset df @mfalse :a != 5

sai_matcha · February 11, 2022, 3:00pm

i have loaded DataFramesMeta , still if shows @mfalse not defined


julia> using DataFramesMeta

julia> amik9d = @rsubset amik9 @mfalse :RATE != 0
ERROR: LoadError: UndefVarError: @mfalse not defined

pdeffebach · February 11, 2022, 3:05pm

You also need to add the package MissingsAsFalse.jl as well. Though actually the @rsubset command will work even without @mfalse.

sai_matcha · February 11, 2022, 3:09pm

it removed both missing and 5 from df.a, i wanted to remove only 5 and keep missing as such

pdeffebach · February 11, 2022, 3:11pm

Ah interesting! That would indeed happen and is a bit of a corner case I hadn’t thought of. You would have to specify this directly, unfortunately

@rsubset df @mfalse ismissing(:a) || :a != 5

I will have to think about a better syntax to support this.

EDIT: This isn’t going to work either. You will have to use if ... else commands.

@rsubset df ismissing(:a) ? true : :a != 5

Topic		Replies	Views
Find DataFrame row with missing values present General Usage dataframes	8	690	January 9, 2023
How to remove rows containing missing from DataFrame? New to Julia	6	13217	July 22, 2019
How do I drop only rows that are fully filled with missing values? General Usage question , package , dataframes	3	183	January 24, 2023
How to filter out missings (using DataFramesMeta @where) New to Julia dataframes	2	692	July 22, 2019
Function occursin() needs a method that accepts Union{String, Missing} General Usage	12	1807	August 28, 2018

Filter a number when missing values are in column

Related topics