Replace all NaN's with zeros in DataFrame

tk3369 · March 27, 2018, 5:47am

What’s the best way to replace all NaN’s in a DataFrame with zero? I can write a nested for-loop and check every cell but I thought there may be a simpler way to do that…

Tamas_Papp · March 27, 2018, 5:53am

Eg

using DataFrames
replace_nan(v) = map(x -> isnan(x) ? zero(x) : x, v)
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
df2 = map(replace_nan, eachcol(df))

nalimilan · March 27, 2018, 10:07am

Note that this version allocates new columns, so you may want to use map! instead.

tk3369 · March 27, 2018, 2:02pm

This only maps the last row… seems unintuitive to use last here. Are you using a different version of Julia or DataFrame?

julia> df
3×2 DataFrames.DataFrame
│ Row │ a   │ b   │
├─────┼─────┼─────┤
│ 1   │ NaN │ 4.0 │
│ 2   │ 2.0 │ 5.0 │
│ 3   │ 3.0 │ NaN │

julia> map(replace_nan ∘ last, eachcol(df))
1×2 DataFrames.DataFrame
│ Row │ a   │ b   │
├─────┼─────┼─────┤
│ 1   │ 3.0 │ 0.0 │

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> Pkg.installed("DataFrames")
v"0.11.5"

Tamas_Papp · March 27, 2018, 2:40pm

I missed the fact that DataFrames defined a map for DFColumnIterator. So the version

using DataFrames
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
replace_nan(v::AbstractVector) = map(x -> isnan(x) ? zero(x) : x, v)
replace_nan!(v::AbstractVector) = map!(x -> isnan(x) ? zero(x) : x, v, v)
map(replace_nan, eachcol(df))
map(replace_nan!, eachcol(df))

works as is. Sorry for the confusion.

tbeason · March 27, 2018, 4:06pm

FYI, this only works if every column contains Floats. Otherwise isnan will throw an error.

EDIT: To be a bit more helpful, the general idea of the map function is what I use, but I actually loop over each column and check that it is Vector{<:AbstractFloat} first before I apply the NaN to 0 (or missing, in my case) conversion.

Tamas_Papp · March 27, 2018, 4:29pm

Not quite,

julia> methods(isnan)                       
# 6 methods for generic function "isnan":   
isnan(x::BigFloat) in Base.MPFR at mpfr.jl:828                                          
isnan(x::Float16) in Base at float.jl:522   
isnan(x::AbstractFloat) in Base at float.jl:521                                         
isnan(x::Real) in Base at float.jl:523      
isnan(z::Complex) in Base at complex.jl:118 
isnan(x::AbstractArray{T,N} where N) where T<:Number in Base at deprecated.jl:56

pasha · March 27, 2018, 4:30pm

This is where a function like R’s dplyr::mutate_if() will be awesome someday in Julia. I imagine that someday one of us will build that into Query.jl or DataFramesMeta.jl. Until then, the alternatives are not bad at all.

tbeason · March 27, 2018, 4:42pm

Sorry, a subtype of Number I guess. Still, String columns would cause an error here. I suppose the anonymous function could have an additional logical layer that checks for proper element type first, and that would make this a robust option.

pdeffebach · March 27, 2018, 5:43pm

I agree that a mutate_if function is important, but with DataFramesMeta you can also use the @byrow! macro to get similar results.

oheil · March 29, 2018, 12:17pm

Thhis function is what I use for this:

function checkForNotANumber(x::Any)
    (!isa(x,Integer) && !isa(x,Real)) || isnan(x)
end

Tamas_Papp · March 29, 2018, 12:32pm

Note that Integer <: Real, so checking for the latter is sufficient.

Tamas_Papp · March 29, 2018, 12:51pm

Thinking about this discussion,

# FIXME get letter of marque for type piracy 
Base.map(f, df::AbstractDataFrame) = map(col -> map(f, col), eachcol(df))

would solve this problem 90% of the time (when I don’t want to do something different for columns). Eg

map(x -> x isa Real && isnan(x) ? zero(x) : x, df)

sylvaticus · September 2, 2019, 9:14am

For missing (replace ismissing() with isnan() for NaNs) you can use also this trick, that makes all missing in numeric columns equal to 0, and all missing in strinng columns equal to "":

[df[ismissing.(df[!,i]), i] .= 0 for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: Number]
[df[ismissing.(df[!,i]), i] .= "" for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: String]

nalimilan · September 5, 2019, 8:46pm

In recent versions you can now do this:

mapcols(col -> replace!(col, NaN=>0), df) # In-place

or

ifelse.(isnan.(df), 0, df)

bkamins · September 5, 2019, 8:53pm

Yes, also it will work in place: df .= ifelse.(isnan.(df), 0, df).

VivMendes · August 3, 2021, 4:00pm

Sorry for a late question related to this topic. In my case, it is not NaN but negative values. I have a time series with positive and negative entries. I need to transform the negative values into zero. I do not want to eliminate them; I need to keep them as observations, but with a value of zero. Help will be very much appreciated. Thanks.

pdeffebach · August 3, 2021, 4:05pm

This will do it

julia> df = DataFrame(a = randn(5), b = randn(5))
5×2 DataFrame
 Row │ a          b         
     │ Float64    Float64   
─────┼──────────────────────
   1 │  0.8805     0.667461
   2 │  0.17179   -0.618585
   3 │ -0.667805  -0.32467
   4 │ -0.517509  -0.321862
   5 │  1.64746   -0.344586

julia> mapcols(t -> ifelse.(t .< 0, 0, t), df)
5×2 DataFrame
 Row │ a        b        
     │ Real     Real     
─────┼───────────────────
   1 │ 0.8805   0.667461
   2 │ 0.17179  0
   3 │ 0        0
   4 │ 0        0
   5 │ 1.64746  0

VivMendes · August 3, 2021, 4:14pm

@pdeffebach Thank you very much. After two hours of stumbling by me, it took you just a minute to do it. Thanks.

bkamins · August 3, 2021, 4:54pm

or even shorter:

julia> df = DataFrame(a = randn(5), b = randn(5))
5×2 DataFrame
 Row │ a          b
     │ Float64    Float64
─────┼───────────────────────
   1 │ -0.386397   0.392352
   2 │ -0.476617  -0.0270584
   3 │ -0.218456  -0.224436
   4 │ -1.17403   -0.520317
   5 │ -1.70785    0.0390936

julia> ifelse.(df .< 0, 0, df)
5×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼──────────────────
   1 │     0  0.392352
   2 │     0  0.0
   3 │     0  0.0
   4 │     0  0.0
   5 │     0  0.0390936

Topic		Replies	Views
Replacing missing and NaN values in dataframe New to Julia question , dataframes , missing-values	6	4093	March 29, 2022
How array with many NaN values convert to DataFrames with Zero instead of NaN New to Julia	2	586	September 2, 2019
Replace missing values with zeros and nonmissing values with one in Dataframe New to Julia	1	364	January 20, 2021
Iterate over all numeric columns in DataFrames Data	21	4854	February 11, 2018
Dealing with NaN's General Usage dataframes	21	5541	April 27, 2021

Replace all NaN's with zeros in DataFrame

Related topics