Hello,
I am trying to filter a dataframe to exclude NaN values but it's not working so far. How can I deal with NaNs?
I am trying:
filter(!isnan(df.salary))
What are my options for dealing with NaNs?
Thanks
Perhaps you mean to have `missing` instead of NaNs. You can use `dropmissing` in that case.
Nope, they were NaN. I tried `skipmissing` and the NaN values were still there. I replaced the NaNs with zeros and then filtered values > 0, but there should be a smarter way to do this. Thanks!
Are you reading from a csv or what?
yes, i am reading from a csv.
You can use `missingstring=` when reading the file. Or do the NaNs have another meaning in your dataset?

I am not sure if you know the enron.csv dataset; it's a dataset from Kaggle. In this dataset there's a column named salary. When I read it, some of the values appear as NaNs. Reading the csv in LibreOffice they appear as NaN as well. So my approach is just to delete them.
In this case it's just 164 values, but my concern about my approach (replacing NaNs with zeros, then filtering) is that it would run slowly with larger datasets.
Figure out how the NaNs are represented in the text of the file, then use `missingstring=` to turn those values into `missing`s, and then use `dropmissing` to drop those rows.
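A minimal sketch of that workflow, assuming the CSV.jl and DataFrames.jl packages (the tiny temp file below just stands in for your real CSV):

```julia
using CSV, DataFrames

# Write a tiny CSV where "NaN" is the literal text used for absent salaries
path, io = mktemp()
write(io, "name,salary\nalice,100\nbob,NaN\n")
close(io)

# missingstring= turns that text into `missing` while parsing
df = CSV.read(path, DataFrame; missingstring="NaN")

# dropmissing then removes the affected rows
clean = dropmissing(df, :salary)
```

Note that `dropmissing(df)` without a column selector drops rows with `missing` in any column.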
Using `filter`, you could write something like this:
julia> df = DataFrame(x=rand(10), y=rand((1, 2, NaN), 10).*rand(10))
10×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼──────────────────────
1 │ 0.214967 0.955162
2 │ 0.717033 NaN
3 │ 0.313037 NaN
4 │ 0.925829 0.876826
5 │ 0.77564 0.466991
6 │ 0.524314 0.50265
7 │ 0.564011 NaN
8 │ 0.495634 0.611079
9 │ 0.366369 0.761782
10 │ 0.949455 1.41635
julia> filter(:y => !isnan, df)
7×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼────────────────────
1 │ 0.214967 0.955162
2 │ 0.925829 0.876826
3 │ 0.77564 0.466991
4 │ 0.524314 0.50265
5 │ 0.495634 0.611079
6 │ 0.366369 0.761782
7 │ 0.949455 1.41635
This creates a new DataFrame. If instead you want to delete the rows in the original DataFrame, you can use `delete!`, for example like so:
julia> delete!(df, isnan.(df.y))
7×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼────────────────────
1 │ 0.214967 0.955162
2 │ 0.925829 0.876826
3 │ 0.77564 0.466991
4 │ 0.524314 0.50265
5 │ 0.495634 0.611079
6 │ 0.366369 0.761782
7 │ 0.949455 1.41635
To remove all rows with NaNs in any column, are there simpler methods than using bitvectors, as follows?
# remove all rows with NaNs in any column:
df = DataFrame(x=rand((-1,1,NaN),10), y=rand((-2,2,NaN),10), z=rand((-3,3,NaN),10))
ix = [(!isnan).(c) for c in eachcol(df)]
ix0 = ix[1]
for i in 2:length(ix)
    ix0 = ix0 .& ix[i]
end
df[ix0,:]
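If it helps, the manual accumulation loop can be collapsed with `reduce` from Base; a small sketch (the toy data frame is just for illustration):

```julia
using DataFrames

df = DataFrame(x=[-1.0, NaN, 1.0], y=[2.0, -2.0, NaN])

# Build a per-column "not NaN" mask, then AND them all together in one shot
mask = reduce((a, b) -> a .& b, [(!isnan).(c) for c in eachcol(df)])

df[mask, :]  # keeps only the first row here
```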
I didn't manage to make this work, @rafael.guerra, dunno why though.
Uhm… Here it works fine in Julia 1.6.1 on Win10 for both DataFrames v0.21.8 and v1.0.0.
Results below:
julia> df
10×3 DataFrame
│ Row │ x │ y │ z │
│ │ Real │ Real │ Real │
├─────┼──────┼──────┼──────┤
│ 1 │ -1 │ 2 │ -3 │
│ 2 │ -1 │ NaN │ -3 │
│ 3 │ -1 │ NaN │ 3 │
│ 4 │ NaN │ 2 │ NaN │
│ 5 │ -1 │ -2 │ 3 │
│ 6 │ -1 │ NaN │ NaN │
│ 7 │ NaN │ NaN │ 3 │
│ 8 │ NaN │ NaN │ 3 │
│ 9 │ 1 │ 2 │ -3 │
│ 10 │ -1 │ 2 │ -3 │
julia> df[ix0,:]
4×3 DataFrame
│ Row │ x │ y │ z │
│ │ Real │ Real │ Real │
├─────┼──────┼──────┼──────┤
│ 1 │ -1 │ 2 │ -3 │
│ 2 │ -1 │ -2 │ 3 │
│ 3 │ 1 │ 2 │ -3 │
│ 4 │ -1 │ 2 │ -3 │
OP, could you explain a bit more why you are using `NaN`s here? There are lots of convenience functions of this sort for `missing` values, so you may have an easier time if you work with `missing`s.
Here's one way using `filter`:
# Using a lambda that receives a named tuple for each row
filter(AsTable(:) => row->!any(isnan(x) for x in row), df)
You could also write `filter(row->!any(isnan(x) for x in row), df)`, but that's slower. Or you could write `filter((:) => (values...)->!any(isnan.(values)), df)` to use a lambda that receives the row as one parameter per column, but that's even slower apparently.
When I read the CSV the NaNs are there. Julia gives me a df with lots of NaNs (not `missing`) in numeric columns. I can't just delete these rows because I need most of them.
You can do `missingstring="NaN"` when you import the CSV.
It's worth thinking conceptually about whether your data is `missing` or `NaN`. They have different meanings: `missing` means "I don't know what this value is", while `NaN` means "this is not a number", the result of an operation like `0/0` or similar.
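The distinction is easy to check at the REPL; a quick sketch using only Base:

```julia
# NaN is a floating-point value produced by arithmetic
x = 0/0
isnan(x)                 # true: 0/0 is NaN, not an error

# NaN compares unequal to everything, including itself
NaN == NaN               # false

# missing instead propagates through operations
ismissing(1 + missing)   # true

# and skipmissing lets reductions ignore it
sum(skipmissing([1, missing, 2]))  # 3
```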
@sijo, your second option (`filter(row->!any(isnan(x) for x in row), df)`) seems to be the fastest for the example provided?
With DataFrames 1.0 and a 1 million row data frame I get the following timings:
julia> using DataFrames
julia> using BenchmarkTools
julia> df = DataFrame(x=rand((-1,1,NaN),1000000), y=rand((-2,2,NaN),1000000), z=rand((-3,3,NaN),1000000));
julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
77.555 ms (61 allocations: 32.03 MiB)
julia> @btime filter(row->!any(isnan(x) for x in row), $df);
405.556 ms (7033060 allocations: 148.65 MiB)
julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
1.159 s (3704585 allocations: 111.44 MiB)
`dropmissing` is also fairly quick and certainly the easiest. You have to first convert the `NaN`s to `missing` (although this can be done when you read the csv, as mentioned previously).
allowmissing!(df)
foreach(c->replace!(c,NaN=>missing),eachcol(df))
dropmissing(df)
Benchmark results for posterity:
julia> function dropnan(df)
           dff = copy(df)
           allowmissing!(dff)
           foreach(c -> replace!(c, NaN => missing), eachcol(dff))
           dropmissing!(dff)
           return dff
       end
dropnan (generic function with 1 method)
julia> @btime dropnan($df)
271.735 ms (6771432 allocations: 161.38 MiB)
julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
64.271 ms (61 allocations: 32.05 MiB)
julia> @btime filter(row->!any(isnan(x) for x in row), $df);
333.526 ms (7034381 allocations: 148.70 MiB)
julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
868.403 ms (3703999 allocations: 111.45 MiB)