DataFrames: How to remove rows containing NaNs when there are also missings

Assume I have the following DataFrame and want to remove rows containing NaN:

df = DataFrame(a=[NaN, 1.1, NaN, missing, missing], b=[1.1, 2, 3, missing, NaN], c='a':'e');

For just one column I could do something like:
filter(x->(ismissing(x.a) || !isnan(x.a)), df)

To extend this to all columns I tried to use the subset function in combination with the usual DataFrame transformation syntax, but couldn’t get it to work:
subset(df, :a => ByRow(x->(ismissing(x) || !isnan(x)))) (works)
subset(df, names(df, Union{Float64, Missing}) .=> ByRow(x->(ismissing(x) || !isnan(x)))) (doesn’t work)

The simplest is probably:

filter(row -> all(x -> !(x isa Number && isnan(x)), row), df)

You can also write:

subset(df, (names(df) .=> ByRow(x -> !(x isa Number && isnan(x))))...)

Note that names(df, Union{Float64, Missing}) is not fully correct, as your column could have e.g. Any type and still contain NaN.
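
To illustrate that last point, here is a small sketch using only Base Julia. The vector `col` is a made-up stand-in for a column with element type Any, and the helper name `isnanvalue` is hypothetical (it is not part of DataFrames.jl).

```julia
# Made-up stand-in for a DataFrame column whose element type is Any
col = Any[1.0, NaN, "text", missing]

# The element type is Any, so a selector based on Union{Float64, Missing} would skip it
eltype(col) <: Union{Float64, Missing}   # false

# Hypothetical helper: inspect each value instead of relying on the column type
isnanvalue(x) = x isa Number && isnan(x)

nan_positions = findall(isnanvalue, col)   # [2]
```

Checking the value rather than the column type is what lets the same predicate be applied to every column, including non-numeric ones.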

Sorry for reviving this topic, but why is there no dropna function like in pandas? We have dropmissing but no dropna.

Because pandas historically did not have first-class support for missing values, so it used NaN as a surrogate.

In DataFrames.jl, missing values are properly supported by design, so we have dropmissing. In Julia, NaN should not be used to indicate missingness.

But if you mmap array files on disk and merge them into a DataFrame without copying, you cannot use a column type that is a Union with Missing. Also, imagine a new user coming from Python with a deep-rooted habit of using NaN as missing. There could still be value in providing a dropna function by default.
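
For reference, such a helper could be sketched in a couple of lines on top of the filter approach shown earlier in the thread. This is only a sketch, assuming DataFrames.jl is installed; `dropnan` and `isnanvalue` are hypothetical names and are not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper: true only for numeric values that are NaN
isnanvalue(x) = x isa Number && isnan(x)

# Hypothetical dropna-like function: keep rows in which no value is NaN.
# Rows containing missing are kept; dropmissing handles those separately.
dropnan(df::AbstractDataFrame) = filter(row -> !any(isnanvalue, row), df)

df = DataFrame(a=[NaN, 1.1, NaN, missing, missing],
               b=[1.1, 2, 3, missing, NaN],
               c='a':'e')

cleaned = dropnan(df)   # keeps the rows where c is 'b' and 'd'
```

Note that `filter` returns a new DataFrame, so this sketch does copy the retained rows.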

Is there a reason (or reasons) why isnan() is not defined for characters?

julia> any(isnan, df[4,:])
ERROR: MethodError: no method matching isnan(::Char)

By definition, NaN is a value of floating-point representations, so it is nonsensical to ask whether a non-floating-point value is NaN. Similarly, there is no iszero for Char.
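
In practice this means the check has to be guarded by a type test when a row mixes numbers with other values. A minimal Base-only sketch, where `row_values` is a made-up tuple mirroring row 4 of the example DataFrame:

```julia
# isnan has no method for Char, so calling it on a Char throws a MethodError
applicable(isnan, 'd')    # false
applicable(isnan, 1.5)    # true

# Made-up values mirroring row 4 of the example DataFrame
row_values = (missing, missing, 'd')

# The type guard short-circuits, so isnan is never called on a non-number
has_nan = any(x -> x isa Number && isnan(x), row_values)   # false
```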

Seems like a legit choice to me.

Hmm … maybe so.
But would you agree that, at least from an “aesthetic” point of view, it sounds strange that isa(NaN, Number) == true?

filter(row -> all(x -> !(x === NaN), row), df)

is okay, since

isa(1im, Number) == true

and NaN is a special number. You can actually get it returned by math ops:

isnan(@fastmath sqrt(-2.0))

Of course, other choices could be made, but this is one of them.
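
One caveat worth spelling out for the `x === NaN` check used in the filter above: it only matches the exact `Float64` value `NaN`. A short Base-only sketch of how the different comparisons behave:

```julia
x = NaN

x == x          # false: NaN never compares equal with ==, not even to itself
x === NaN       # true: === compares the bits, and these are identical
NaN32 === NaN   # false: a Float32 NaN has a different type
-NaN === NaN    # false: the sign bit differs
isnan(NaN32) && isnan(-NaN)   # true: isnan catches every variant
```

So `isnan` behind a type guard is the more robust test when the data may contain NaNs of other float types or bit patterns.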

I just noticed that it reads like:

“Is a Not_a_Number a Number?”
“yes, Not_a_Number is a Number”
