Filter a number when missing values are in column

Hi i want filter the rows having perticular value, but the column contains missing values as well. can some body help me to do this
data frame i have

``````df = DataFrame()
df.a = ([3,5,6,0,missing])
df.b = 1:5
``````

i want to filter rows with value 5 in column df.a

``````df = filter(r -> r.a == 5 , df)
``````

this code gives me an non-boolean error

data frame i want is

``````df = DataFrame()
df.a = 5
df.b = 2
``````

Thanks

``````df = filter( r -> !ismissing(r.a) && r.a == 5 , df)
``````
2 Likes

For working with `missing` imho the best thing is generally to use `isequal`, because

``````julia> 5 == missing
missing

julia> isequal(5, missing)
false
``````

so

``````julia> df[isequal.(df.a, 5), :]
1Γ2 DataFrame
Row β a       b
β Int64?  Int64
ββββββΌβββββββββββββββ
1 β      5      2
``````

(Incidentally, are you aware that you can assign columns directly in the DataFrame construtor, i.e. `DataFrame(a = [3, 5, 6, 0, missing], b = 1:5])`?

4 Likes

Whatβs the rationale behind that?

Doesnβt explain it but there is some logic inside:

``````julia> missing == missing
missing

julia> missing === missing
true
``````

The idea is that you donβt know - `missing` represents a value thatβs missing in a statistical sense, i.e. the value exists but is not known (for example because it was not measured, someone didnβt answer that survey question etc.). You therefore canβt be sure that itβs not 5, so returning `false` wouldnβt be correct.

(Thereβs been a lot of ink spilled over this and how it can make working with DataFrames in Julia less ergonomic than some other popular languages, but it is what it is!)

4 Likes

Thanks, seems useful in many contexts. Of course it is then arbitrary that `isequal` returns `false`, but that allows one to choose what one wants, and probably that is the reason for the two alternatives.

Yes, at least thatβs what the docstring suggests:

``````help?> isequal

isequal(x, y)

Similar to ==, except for the treatment of floating point numbers and of missing values.
isequal treats all floating-point NaN values as equal to each other, treats -0.0 as unequal
to 0.0, and missing as equal to missing. Always returns a Bool value.
``````
2 Likes

thanks @oheil @nilshg @lmiq for your help.

1 Like

how can remove only df.a = 5 row, keep the remianing data frame as such.

I wrote a package, MissingsAsFalse.jl which solves this exact problem via a macro.

``````@mfalse missing == 5
``````
1 Like

I would rather recommend (I ignore performance considerations):

``````filter(r -> coalesce(r.a == 5, false) , df)
``````

for the following reasons:

• by passing `true` you can decide to keep these rows
• it is more explicit what is going on
• `==` and `isequal` have a bit different behavior, so if you want `==` it is best to stick to using `==`
2 Likes

MissingsAsFalse.jl is of course a valid alternative. I was commenting on using `isequal`.

1 Like

Couple more tips

get a view of the dataframe with no missings in any column (no duplicating data)

``````julia> dfv  = dropmissing(df; view=true)
4Γ2 SubDataFrame
Row β a       b
β Int64?  Int64
ββββββΌβββββββββββββββ
1 β      3      1
2 β      5      2
3 β      6      3
4 β      0      4
``````

or just particular columns

``````julia> dfv  = dropmissing(df, [:a]; view=true)
4Γ2 SubDataFrame
Row β a       b
β Int64?  Int64
ββββββΌβββββββββββββββ
1 β      3      1
2 β      5      2
3 β      6      3
4 β      0      4
``````

specifying just the column in the filter is more efficient

``````julia> filter(:a=>a -> a == 5 , dfv)
1Γ2 DataFrame
Row β a       b
β Int64?  Int64
ββββββΌβββββββββββββββ
1 β      5      2

``````

or multiples

``````julia> filter([:a, :b]=>(a,b) -> a >b, dfv)
3Γ2 DataFrame
Row β a       b
β Int64?  Int64
ββββββΌβββββββββββββββ
1 β      3      1
2 β      5      2
3 β      6      3

``````

and finally, again you can filter to a view

``````julia> filter([:a]=>a -> a == 5 , dfv; view=true)
1Γ2 SubDataFrame
Row β a       b
β Int64?  Int64
ββββββΌβββββββββββββββ
1 β      5      2

``````

how to remove only df.a = 5, so ill get a dataframe

``````df2 = DataFrame(a = [3,6,0,missing], b = [1,3,4,5] )
``````

Iβm going to give an answer using DataFramesMeta.jl

``````@rsubset df @mfalse :a != 5
``````
1 Like

i have loaded DataFramesMeta , still if shows @mfalse not defined

``````
julia> using DataFramesMeta

julia> amik9d = @rsubset amik9 @mfalse :RATE != 0
ERROR: LoadError: UndefVarError: @mfalse not defined
``````

You also need to add the package MissingsAsFalse.jl as well. Though actually the `@rsubset` command will work even without `@mfalse`.

1 Like

it removed both missing and 5 from df.a, i wanted to remove only 5 and keep missing as such

Ah interesting! That would indeed happen and is a bit of a corner case I hadnβt thought of. You would have to specify this directly, unfortunately

``````@rsubset df @mfalse ismissing(:a) || :a != 5
``````

I will have to think about a better syntax to support this.

EDIT: This isnβt going to work either. You will have to use `if ... else` commands.

``````@rsubset df ismissing(:a) ? true : :a != 5
``````
1 Like