Replace all NaN's with zeros in DataFrame

What’s the best way to replace all NaN’s in a DataFrame with zero? I can write a nested for-loop and check every cell but I thought there may be a simpler way to do that…

1 Like

Eg

using DataFrames
replace_nan(v) = map(x -> isnan(x) ? zero(x) : x, v)
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
df2 = map(replace_nan, eachcol(df))
1 Like

Note that this version allocates new columns, so you may want to use map! instead.

This only maps the last row… seems unintuitive to use last here. Are you using a different version of Julia or DataFrame?

julia> df
3×2 DataFrames.DataFrame
│ Row │ a   │ b   │
├─────┼─────┼─────┤
│ 1   │ NaN │ 4.0 │
│ 2   │ 2.0 │ 5.0 │
│ 3   │ 3.0 │ NaN │

julia> map(replace_nan ∘ last, eachcol(df))
1×2 DataFrames.DataFrame
│ Row │ a   │ b   │
├─────┼─────┼─────┤
│ 1   │ 3.0 │ 0.0 │

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> Pkg.installed("DataFrames")
v"0.11.5"

I missed the fact that DataFrames defined a map for DFColumnIterator. So the version

using DataFrames
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
replace_nan(v::AbstractVector) = map(x -> isnan(x) ? zero(x) : x, v)
replace_nan!(v::AbstractVector) = map!(x -> isnan(x) ? zero(x) : x, v, v)
map(replace_nan, eachcol(df))
map(replace_nan!, eachcol(df))

works as is. Sorry for the confusion.

FYI, this only works if every column contains Floats. Otherwise isnan will throw an error.

EDIT: To be a bit more helpful, the general idea of the map function is what I use, but I actually loop over each column and check that it is Vector{<:AbstractFloat} first before I apply the NaN to 0 (or missing, in my case) conversion.

Not quite,

julia> methods(isnan)                       
# 6 methods for generic function "isnan":   
isnan(x::BigFloat) in Base.MPFR at mpfr.jl:828                                          
isnan(x::Float16) in Base at float.jl:522   
isnan(x::AbstractFloat) in Base at float.jl:521                                         
isnan(x::Real) in Base at float.jl:523      
isnan(z::Complex) in Base at complex.jl:118 
isnan(x::AbstractArray{T,N} where N) where T<:Number in Base at deprecated.jl:56        

This is where a function like R’s dplyr::mutate_if() will be awesome someday in Julia. I imagine that someday one of us will build that into Query.jl or DataFramesMeta.jl. Until then, the alternatives are not bad at all.

Sorry, a subtype of Number I guess. Still, String columns would cause an error here. I suppose the anonymous function could have an additional logical layer that checks for proper element type first, and that would make this a robust option.

I agree that a mutate_if function is important, but with DataFramesMeta you can also use the @byrow! macro to get similar results.

Thhis function is what I use for this:

function checkForNotANumber(x::Any)
    (!isa(x,Integer) && !isa(x,Real)) || isnan(x)
end

Note that Integer <: Real, so checking for the latter is sufficient.

Thinking about this discussion,

# FIXME get letter of marque for type piracy 
Base.map(f, df::AbstractDataFrame) = map(col -> map(f, col), eachcol(df))

would solve this problem 90% of the time (when I don’t want to do something different for columns). Eg

map(x -> x isa Real && isnan(x) ? zero(x) : x, df)

For missing (replace ismissing() with isnan() for NaNs) you can use also this trick, that makes all missing in numeric columns equal to 0, and all missing in strinng columns equal to "":

[df[ismissing.(df[!,i]), i] .= 0 for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: Number]
[df[ismissing.(df[!,i]), i] .= "" for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: String]

In recent versions you can now do this:

mapcols(col -> replace!(col, NaN=>0), df) # In-place

or

ifelse.(isnan.(df), 0, df)
2 Likes

Yes, also it will work in place: df .= ifelse.(isnan.(df), 0, df).

4 Likes

Sorry for a late question related to this topic. In my case, it is not NaN but negative values. I have a time series with positive and negative entries. I need to transform the negative values into zero. I do not want to eliminate them; I need to keep them as observations, but with a value of zero. Help will be very much appreciated. Thanks.

This will do it

julia> df = DataFrame(a = randn(5), b = randn(5))
5×2 DataFrame
 Row │ a          b         
     │ Float64    Float64   
─────┼──────────────────────
   1 │  0.8805     0.667461
   2 │  0.17179   -0.618585
   3 │ -0.667805  -0.32467
   4 │ -0.517509  -0.321862
   5 │  1.64746   -0.344586

julia> mapcols(t -> ifelse.(t .< 0, 0, t), df)
5×2 DataFrame
 Row │ a        b        
     │ Real     Real     
─────┼───────────────────
   1 │ 0.8805   0.667461
   2 │ 0.17179  0
   3 │ 0        0
   4 │ 0        0
   5 │ 1.64746  0
2 Likes

@pdeffebach Thank you very much. After two hours of stumbling by me, it took you just a minute to do it. Thanks.

or even shorter:

julia> df = DataFrame(a = randn(5), b = randn(5))
5×2 DataFrame
 Row │ a          b
     │ Float64    Float64
─────┼───────────────────────
   1 │ -0.386397   0.392352
   2 │ -0.476617  -0.0270584
   3 │ -0.218456  -0.224436
   4 │ -1.17403   -0.520317
   5 │ -1.70785    0.0390936

julia> ifelse.(df .< 0, 0, df)
5×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼──────────────────
   1 │     0  0.392352
   2 │     0  0.0
   3 │     0  0.0
   4 │     0  0.0
   5 │     0  0.0390936
2 Likes