Dealing with NaN's

Hello,

I am trying to filter a dataframe to exclude NaN’a values but not working so far. How can i deal with NaN’s?

I am trying:

filter(!isnan(df.salary))

What are my options to deal with NaN’s?

Thanks

Perhaps you mean to have missing instead of NaNs. You can dropmissing in that case.

1 Like

Nope. They were NaN. I tried skipmissing and the NaN values was still there. I replaced NaN’s with zeros then i filtered values > 0 but there should be a smarter way to do this. thx

Are you reading from a csv or what?

yes, i am reading from a csv.

  1. In the actual csv file, what do the nans look like?
  2. What do the nans represent? Are they unknown values? If so, use missingstring=. Or do they have another meaning in your dataset?
1 Like

I am not sure if you know the enron.csv dataset, its a dataset from kaggle. In this dataset theres a column named salary. When I read it, some of the values appears as NaN’s. Reading the csv in LIbreOffice they appear as NaN as well. So my approach is just to delete them.

In this case, its just 164 values, but my concern about my approach (replacing NaN’s with Zeros then filter) is that this would run slow with larger datasets.

Figure out how the nans are represented in the text of the file, and then use missingstring= to turn those values into missings, and then use dropmissing to drop those rows.

2 Likes

Using filter, you could write something like this:

julia> df = DataFrame(x=rand(10), y=rand((1, 2, NaN), 10).*rand(10))
10×2 DataFrame
 Row │ x         y          
     │ Float64   Float64    
─────┼──────────────────────
   1 │ 0.214967    0.955162
   2 │ 0.717033  NaN
   3 │ 0.313037  NaN
   4 │ 0.925829    0.876826
   5 │ 0.77564     0.466991
   6 │ 0.524314    0.50265
   7 │ 0.564011  NaN
   8 │ 0.495634    0.611079
   9 │ 0.366369    0.761782
  10 │ 0.949455    1.41635
julia> filter(:y => !isnan, df)
7×2 DataFrame
 Row │ x         y        
     │ Float64   Float64  
─────┼────────────────────
   1 │ 0.214967  0.955162
   2 │ 0.925829  0.876826
   3 │ 0.77564   0.466991
   4 │ 0.524314  0.50265
   5 │ 0.495634  0.611079
   6 │ 0.366369  0.761782
   7 │ 0.949455  1.41635

This creates a new DataFrame. If instead you want to delete the rows in the original DataFrame, you can use delete!, for example like so:

julia> delete!(df, isnan.(df.y))
7×2 DataFrame
 Row │ x         y        
     │ Float64   Float64  
─────┼────────────────────
   1 │ 0.214967  0.955162
   2 │ 0.925829  0.876826
   3 │ 0.77564   0.466991
   4 │ 0.524314  0.50265
   5 │ 0.495634  0.611079
   6 │ 0.366369  0.761782
   7 │ 0.949455  1.41635
7 Likes

Thanks @jzr and @ffevotte for the help. xD

To remove all rows with NaNs in any column, what are simpler methods than using bitvectors as follows:

# remove all rows with NaNs in any column:
df = DataFrame(x=rand((-1,1,NaN),10), y=rand((-2,2,NaN),10), z=rand((-3,3,NaN),10))

ix = [(!isnan).(c) for c in eachcol(df)]
ix0 = ix[1]
for i in 2:length(ix)
   ix0 = ix0 .& ix[i]
end
df[ix0,:]

I didn’t manage to make this work @rafael.guerra dunno why tho.

Uhm… Here it works fine in Julia 1.6.1 on Win10 for both DataFrames v0.21.8 and v1.0.0.
Results below:

julia> df
10×3 DataFrame
│ Row │ x    │ y    │ z    │
│     │ Real │ Real │ Real │
├─────┼──────┼──────┼──────┤
│ 1   │ -1   │ 2    │ -3   │
│ 2   │ -1   │ NaN  │ -3   │
│ 3   │ -1   │ NaN  │ 3    │
│ 4   │ NaN  │ 2    │ NaN  │
│ 5   │ -1   │ -2   │ 3    │
│ 6   │ -1   │ NaN  │ NaN  │
│ 7   │ NaN  │ NaN  │ 3    │
│ 8   │ NaN  │ NaN  │ 3    │
│ 9   │ 1    │ 2    │ -3   │
│ 10  │ -1   │ 2    │ -3   │

julia> df[ix0,:]
4×3 DataFrame
│ Row │ x    │ y    │ z    │
│     │ Real │ Real │ Real │
├─────┼──────┼──────┼──────┤
│ 1   │ -1   │ 2    │ -3   │
│ 2   │ -1   │ -2   │ 3    │
│ 3   │ 1    │ 2    │ -3   │
│ 4   │ -1   │ 2    │ -3   │

OP, could you explain a bit more why you are using NaNs here? there are lots of convenience functions of this sort for missing values. So you may have an easier time if you work with missings.

1 Like

Here’s one way using filter:

# Using a lambda that receives a named tuple for each row
filter(AsTable(:) => row->!any(isnan(x) for x in row), df)

You could also write filter(row->!any(isnan(x) for x in row), df) but that’s slower.

Or you could write filter((:) => (values...)->!any(isnan.(values)), df) to use a lambda that receives the row as one parameter per column, but that’s even slower apparently.

2 Likes

When I read the CSV the NaN’s are there. Julia gives me a df with lots of NaN’s (not missing) in numeric columns. I cant just delete these rows because i need most of them.

You can do missingstring="NaN" when you import the CSV.

It’s worth thinking conceptually about whether your data is missing or NaN. They have different meanings. missing means “I don’t know what this value is”. But NaN means “This is not a number, from an operation like 1/0 or similar.”

2 Likes

@sijo, your second option (filter(row->!any(isnan(x) for x in row), df)) seems to be the fastest for the example provided?

With DataFrames 1.0 and a 1 million row data frame I get the following timings:

julia> using DataFrames

julia> using BenchmarkTools

julia> df = DataFrame(x=rand((-1,1,NaN),1000000), y=rand((-2,2,NaN),1000000), z=rand((-3,3,NaN),1000000));

julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
  77.555 ms (61 allocations: 32.03 MiB)

julia> @btime filter(row->!any(isnan(x) for x in row), $df);
  405.556 ms (7033060 allocations: 148.65 MiB)

julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
  1.159 s (3704585 allocations: 111.44 MiB)

dropmissing is also fairly quick and certainly the easiest. You have to first convert the NaN to missing (although this can be done when you read the csv as mentioned previously).

allowmissing!(df)
foreach(c->replace!(c,NaN=>missing),eachcol(df))
dropmissing(df)

benchmark results for posterity:

julia> function dropnan(df)
       dff = copy(df)
       allowmissing!(dff)
       foreach(c->replace!(c,NaN=>missing),eachcol(dff))
       dropmissing!(dff)
       return dff
       end
dropnan (generic function with 1 method)

julia> @btime dropnan($df)
  271.735 ms (6771432 allocations: 161.38 MiB)

julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
  64.271 ms (61 allocations: 32.05 MiB)

julia> @btime filter(row->!any(isnan(x) for x in row), $df);
  333.526 ms (7034381 allocations: 148.70 MiB)

julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
  868.403 ms (3703999 allocations: 111.45 MiB)
1 Like