Hi i want filter the rows having perticular value, but the column contains missing values as well. can some body help me to do this
data frame i have
df = DataFrame()
df.a = ([3,5,6,0,missing])
df.b = 1:5
i want to filter rows with value 5 in column df.a
df = filter(r -> r.a == 5 , df)
this code gives me an non-boolean error
data frame i want is
df = DataFrame()
df.a = 5
df.b = 2
Thanks
1 Like
oheil
February 11, 2022, 2:14pm
2
df = filter( r -> !ismissing(r.a) && r.a == 5 , df)
2 Likes
nilshg
February 11, 2022, 2:19pm
3
For working with missing
imho the best thing is generally to use isequal
, because
julia> 5 == missing
missing
julia> isequal(5, missing)
false
so
julia> df[isequal.(df.a, 5), :]
1Γ2 DataFrame
Row β a b
β Int64? Int64
ββββββΌβββββββββββββββ
1 β 5 2
(Incidentally, are you aware that you can assign columns directly in the DataFrame construtor, i.e. DataFrame(a = [3, 5, 6, 0, missing], b = 1:5])
?
4 Likes
lmiq
February 11, 2022, 2:20pm
4
Whatβs the rationale behind that?
oheil
February 11, 2022, 2:22pm
5
Doesnβt explain it but there is some logic inside:
julia> missing == missing
missing
julia> missing === missing
true
nilshg
February 11, 2022, 2:22pm
6
The idea is that you donβt know - missing
represents a value thatβs missing in a statistical sense, i.e. the value exists but is not known (for example because it was not measured, someone didnβt answer that survey question etc.). You therefore canβt be sure that itβs not 5, so returning false
wouldnβt be correct.
(Thereβs been a lot of ink spilled over this and how it can make working with DataFrames in Julia less ergonomic than some other popular languages, but it is what it is!)
5 Likes
lmiq
February 11, 2022, 2:25pm
7
Thanks, seems useful in many contexts. Of course it is then arbitrary that isequal
returns false
, but that allows one to choose what one wants, and probably that is the reason for the two alternatives.
nilshg
February 11, 2022, 2:27pm
8
Yes, at least thatβs what the docstring suggests:
help?> isequal
isequal(x, y)
Similar to ==, except for the treatment of floating point numbers and of missing values.
isequal treats all floating-point NaN values as equal to each other, treats -0.0 as unequal
to 0.0, and missing as equal to missing. Always returns a Bool value.
2 Likes
thanks @oheil @nilshg @lmiq for your help.
1 Like
how can remove only df.a = 5 row, keep the remianing data frame as such.
I wrote a package, MissingsAsFalse.jl which solves this exact problem via a macro.
@mfalse missing == 5
1 Like
bkamins
February 11, 2022, 2:43pm
12
I would rather recommend (I ignore performance considerations):
filter(r -> coalesce(r.a == 5, false) , df)
for the following reasons:
by passing true
you can decide to keep these rows
it is more explicit what is going on
==
and isequal
have a bit different behavior, so if you want ==
it is best to stick to using ==
2 Likes
bkamins
February 11, 2022, 2:44pm
13
MissingsAsFalse.jl is of course a valid alternative. I was commenting on using isequal
.
1 Like
Couple more tips
get a view of the dataframe with no missings in any column (no duplicating data)
julia> dfv = dropmissing(df; view=true)
4Γ2 SubDataFrame
Row β a b
β Int64? Int64
ββββββΌβββββββββββββββ
1 β 3 1
2 β 5 2
3 β 6 3
4 β 0 4
or just particular columns
julia> dfv = dropmissing(df, [:a]; view=true)
4Γ2 SubDataFrame
Row β a b
β Int64? Int64
ββββββΌβββββββββββββββ
1 β 3 1
2 β 5 2
3 β 6 3
4 β 0 4
specifying just the column in the filter is more efficient
julia> filter(:a=>a -> a == 5 , dfv)
1Γ2 DataFrame
Row β a b
β Int64? Int64
ββββββΌβββββββββββββββ
1 β 5 2
or multiples
julia> filter([:a, :b]=>(a,b) -> a >b, dfv)
3Γ2 DataFrame
Row β a b
β Int64? Int64
ββββββΌβββββββββββββββ
1 β 3 1
2 β 5 2
3 β 6 3
and finally, again you can filter to a view
julia> filter([:a]=>a -> a == 5 , dfv; view=true)
1Γ2 SubDataFrame
Row β a b
β Int64? Int64
ββββββΌβββββββββββββββ
1 β 5 2
how to remove only df.a = 5, so ill get a dataframe
df2 = DataFrame(a = [3,6,0,missing], b = [1,3,4,5] )
Iβm going to give an answer using DataFramesMeta.jl
@rsubset df @mfalse :a != 5
1 Like
pdeffebach:
ataFrames
i have loaded DataFramesMeta , still if shows @mfalse not defined
julia> using DataFramesMeta
julia> amik9d = @rsubset amik9 @mfalse :RATE != 0
ERROR: LoadError: UndefVarError: @mfalse not defined
You also need to add the package MissingsAsFalse.jl as well. Though actually the @rsubset
command will work even without @mfalse
.
1 Like
it removed both missing and 5 from df.a, i wanted to remove only 5 and keep missing as such
Ah interesting! That would indeed happen and is a bit of a corner case I hadnβt thought of. You would have to specify this directly, unfortunately
@rsubset df @mfalse ismissing(:a) || :a != 5
I will have to think about a better syntax to support this.
EDIT: This isnβt going to work either. You will have to use if ... else
commands.
@rsubset df ismissing(:a) ? true : :a != 5
1 Like