Hello,
I am trying to filter a dataframe to exclude NaN values but it's not working so far. How can I deal with NaNs?
I am trying:
filter(!isnan(df.salary))
What are my options for dealing with NaNs?
Thanks
Perhaps you mean to have `missing` instead of NaNs. You can use `dropmissing` in that case.
Nope, they were NaN. I tried `skipmissing` and the NaN values were still there. I replaced the NaNs with zeros and then filtered values > 0, but there should be a smarter way to do this. Thanks!
Are you reading from a csv or what?
yes, i am reading from a csv.
You can use `missingstring=` when reading the file. Or do the NaNs have another meaning in your dataset?

I am not sure if you know the enron.csv dataset; it's a dataset from Kaggle. In this dataset there's a column named salary. When I read it, some of the values appear as NaNs. Reading the csv in LibreOffice they appear as NaN as well. So my approach is just to delete them.
In this case it's just 164 values, but my concern about my approach (replacing NaNs with zeros, then filtering) is that it would run slowly with larger datasets.
Figure out how the NaNs are represented in the text of the file, then use `missingstring=` to turn those values into `missing`s, and then use `dropmissing` to drop those rows.
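A minimal sketch of that workflow, assuming the CSV.jl and DataFrames.jl packages (the tiny temp file below just stands in for your real CSV):

```julia
using CSV, DataFrames

# Write a tiny CSV where "NaN" is the literal text used for absent salaries
path, io = mktemp()
write(io, "name,salary\nalice,100\nbob,NaN\n")
close(io)

# missingstring= turns that text into `missing` while parsing
df = CSV.read(path, DataFrame; missingstring="NaN")

# dropmissing then removes the affected rows
clean = dropmissing(df, :salary)
```

Note that `dropmissing(df)` without a column selector drops rows with `missing` in any column.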
Using `filter`, you could write something like this:
julia> df = DataFrame(x=rand(10), y=rand((1, 2, NaN), 10).*rand(10))
10×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼──────────────────────
1 │ 0.214967 0.955162
2 │ 0.717033 NaN
3 │ 0.313037 NaN
4 │ 0.925829 0.876826
5 │ 0.77564 0.466991
6 │ 0.524314 0.50265
7 │ 0.564011 NaN
8 │ 0.495634 0.611079
9 │ 0.366369 0.761782
10 │ 0.949455 1.41635
julia> filter(:y => !isnan, df)
7×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼────────────────────
1 │ 0.214967 0.955162
2 │ 0.925829 0.876826
3 │ 0.77564 0.466991
4 │ 0.524314 0.50265
5 │ 0.495634 0.611079
6 │ 0.366369 0.761782
7 │ 0.949455 1.41635
This creates a new DataFrame. If instead you want to delete the rows in the original DataFrame, you can use `delete!`, for example like so:
julia> delete!(df, isnan.(df.y))
7×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼────────────────────
1 │ 0.214967 0.955162
2 │ 0.925829 0.876826
3 │ 0.77564 0.466991
4 │ 0.524314 0.50265
5 │ 0.495634 0.611079
6 │ 0.366369 0.761782
7 │ 0.949455 1.41635
To remove all rows with NaNs in any column, are there simpler methods than using bitvectors, as follows?
# remove all rows with NaNs in any column:
df = DataFrame(x=rand((-1,1,NaN),10), y=rand((-2,2,NaN),10), z=rand((-3,3,NaN),10))
ix = [(!isnan).(c) for c in eachcol(df)]
ix0 = ix[1]
for i in 2:length(ix)
    ix0 = ix0 .& ix[i]
end
df[ix0,:]
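If it helps, the manual accumulation loop can be collapsed with `reduce` from Base; a small sketch (the toy data frame is just for illustration):

```julia
using DataFrames

df = DataFrame(x=[-1.0, NaN, 1.0], y=[2.0, -2.0, NaN])

# Build a per-column "not NaN" mask, then AND them all together in one shot
mask = reduce((a, b) -> a .& b, [(!isnan).(c) for c in eachcol(df)])

df[mask, :]  # keeps only the first row here
```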
I didn't manage to make this work, @rafael.guerra, dunno why though.
Uhm… Here it works fine in Julia 1.6.1 on Win10 for both DataFrames v0.21.8 and v1.0.0.
Results below:
julia> df
10×3 DataFrame
│ Row │ x │ y │ z │
│ │ Real │ Real │ Real │
├─────┼──────┼──────┼──────┤
│ 1 │ -1 │ 2 │ -3 │
│ 2 │ -1 │ NaN │ -3 │
│ 3 │ -1 │ NaN │ 3 │
│ 4 │ NaN │ 2 │ NaN │
│ 5 │ -1 │ -2 │ 3 │
│ 6 │ -1 │ NaN │ NaN │
│ 7 │ NaN │ NaN │ 3 │
│ 8 │ NaN │ NaN │ 3 │
│ 9 │ 1 │ 2 │ -3 │
│ 10 │ -1 │ 2 │ -3 │
julia> df[ix0,:]
4×3 DataFrame
│ Row │ x │ y │ z │
│ │ Real │ Real │ Real │
├─────┼──────┼──────┼──────┤
│ 1 │ -1 │ 2 │ -3 │
│ 2 │ -1 │ -2 │ 3 │
│ 3 │ 1 │ 2 │ -3 │
│ 4 │ -1 │ 2 │ -3 │
OP, could you explain a bit more why you are using `NaN`s here? There are lots of convenience functions of this sort for `missing` values, so you may have an easier time if you work with `missing`s.
Here's one way using `filter`:
# Using a lambda that receives a named tuple for each row
filter(AsTable(:) => row->!any(isnan(x) for x in row), df)
You could also write `filter(row->!any(isnan(x) for x in row), df)`, but that's slower. Or you could write `filter((:) => (values...)->!any(isnan.(values)), df)` to use a lambda that receives the row as one parameter per column, but that's even slower apparently.
When I read the CSV the NaNs are there. Julia gives me a df with lots of NaNs (not `missing`) in numeric columns. I can't just delete these rows because I need most of them.
You can do `missingstring="NaN"` when you import the CSV.
It's worth thinking conceptually about whether your data is `missing` or `NaN`. They have different meanings: `missing` means "I don't know what this value is", while `NaN` means "this is not a number", the result of an operation like `0/0` or similar.
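The distinction is easy to check at the REPL; a quick sketch using only Base:

```julia
# NaN is a floating-point value produced by arithmetic
x = 0/0
isnan(x)                 # true: 0/0 is NaN, not an error

# NaN compares unequal to everything, including itself
NaN == NaN               # false

# missing instead propagates through operations
ismissing(1 + missing)   # true

# and skipmissing lets reductions ignore it
sum(skipmissing([1, missing, 2]))  # 3
```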
@sijo, your second option (`filter(row->!any(isnan(x) for x in row), df)`) seems to be the fastest for the example provided?
With DataFrames 1.0 and a 1 million row data frame I get the following timings:
julia> using DataFrames
julia> using BenchmarkTools
julia> df = DataFrame(x=rand((-1,1,NaN),1000000), y=rand((-2,2,NaN),1000000), z=rand((-3,3,NaN),1000000));
julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
77.555 ms (61 allocations: 32.03 MiB)
julia> @btime filter(row->!any(isnan(x) for x in row), $df);
405.556 ms (7033060 allocations: 148.65 MiB)
julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
1.159 s (3704585 allocations: 111.44 MiB)
`dropmissing` is also fairly quick and certainly the easiest. You have to first convert the `NaN`s to `missing` (although this can be done when you read the csv, as mentioned previously).
allowmissing!(df)
foreach(c->replace!(c,NaN=>missing),eachcol(df))
dropmissing(df)
Benchmark results for posterity:
julia> function dropnan(df)
           dff = copy(df)
           allowmissing!(dff)
           foreach(c -> replace!(c, NaN => missing), eachcol(dff))
           dropmissing!(dff)
           return dff
       end
dropnan (generic function with 1 method)
julia> @btime dropnan($df)
271.735 ms (6771432 allocations: 161.38 MiB)
julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
64.271 ms (61 allocations: 32.05 MiB)
julia> @btime filter(row->!any(isnan(x) for x in row), $df);
333.526 ms (7034381 allocations: 148.70 MiB)
julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
868.403 ms (3703999 allocations: 111.45 MiB)