Replace all NaN's with zeros in DataFrame


#1

What’s the best way to replace all NaN’s in a DataFrame with zero? I can write a nested for-loop and check every cell but I thought there may be a simpler way to do that…


#2

E.g.:

using DataFrames
replace_nan(v) = map(x -> isnan(x) ? zero(x) : x, v)
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
df2 = map(replace_nan, eachcol(df))

#3

Note that this version allocates new columns, so you may want to use map! instead.
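
To illustrate the difference, here is a minimal sketch on a plain vector using only Base. The names `nan_to_zero`, `v`, and `w` are made up for this example and do not come from the posts above.

```julia
# Hypothetical helper: maps NaN to zero of the same type, leaves other values alone.
nan_to_zero(x) = isnan(x) ? zero(x) : x

v = [NaN, 2.0, 3.0]

# map allocates and returns a new vector.
w = map(nan_to_zero, v)

# map! writes the results into its first argument; passing v as both
# destination and source updates v in place without allocating a new vector.
map!(nan_to_zero, v, v)
```

After this runs, `w` and `v` hold the same values but are distinct arrays.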


#5

This only maps the last row, and using last here seems unintuitive. Are you using a different version of Julia or DataFrames?

julia> df
3×2 DataFrames.DataFrame
│ Row │ a   │ b   │
├─────┼─────┼─────┤
│ 1   │ NaN │ 4.0 │
│ 2   │ 2.0 │ 5.0 │
│ 3   │ 3.0 │ NaN │

julia> map(replace_nan ∘ last, eachcol(df))
1×2 DataFrames.DataFrame
│ Row │ a   │ b   │
├─────┼─────┼─────┤
│ 1   │ 3.0 │ 0.0 │

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> Pkg.installed("DataFrames")
v"0.11.5"

#6

I missed the fact that DataFrames defined a map for DFColumnIterator. So the version

using DataFrames
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
replace_nan(v::AbstractVector) = map(x -> isnan(x) ? zero(x) : x, v)
replace_nan!(v::AbstractVector) = map!(x -> isnan(x) ? zero(x) : x, v, v)
map(replace_nan, eachcol(df))
map(replace_nan!, eachcol(df))

works as is. Sorry for the confusion.


#7

FYI, this only works if every column contains Floats. Otherwise isnan will throw an error.

EDIT: To be a bit more helpful, the general idea of the map function is what I use, but I actually loop over each column and check that it is Vector{<:AbstractFloat} first before I apply the NaN to 0 (or missing, in my case) conversion.
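
Roughly, the loop looks like the sketch below. To keep it free of package dependencies it uses a plain list of vectors (`cols`, a made-up name) as a stand-in for the DataFrame's columns, and it converts NaN to zero rather than to missing.

```julia
# Stand-in for DataFrame columns: a plain list of vectors (an assumption made
# so the example needs no packages).
cols = Any[[NaN, 2.0, 3.0], ["x", "y", "z"], [1, 2, 3]]

for col in cols
    # Only touch columns whose elements are floating-point numbers;
    # string and integer columns are skipped, so isnan is never called on them.
    if col isa AbstractVector{<:AbstractFloat}
        map!(x -> isnan(x) ? zero(x) : x, col, col)
    end
end
```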


#8

Not quite,

julia> methods(isnan)                       
# 6 methods for generic function "isnan":
isnan(x::BigFloat) in Base.MPFR at mpfr.jl:828
isnan(x::Float16) in Base at float.jl:522
isnan(x::AbstractFloat) in Base at float.jl:521
isnan(x::Real) in Base at float.jl:523
isnan(z::Complex) in Base at complex.jl:118
isnan(x::AbstractArray{T,N} where N) where T<:Number in Base at deprecated.jl:56

#9

This is where a function like R’s dplyr::mutate_if() will be awesome someday in Julia. I imagine that someday one of us will build that into Query.jl or DataFramesMeta.jl. Until then, the alternatives are not bad at all.


#10

Sorry, a subtype of Number I guess. Still, String columns would cause an error here. I suppose the anonymous function could have an additional logical layer that checks for proper element type first, and that would make this a robust option.
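
A quick sketch of the two cases, using only Base (the variable names are just for illustration):

```julia
# isnan has a method for any Real, so integers are accepted (and are never NaN).
int_result = isnan(1)

# There is no isnan method for String, so this call throws a MethodError.
string_throws = try
    isnan("a")
    false
catch err
    err isa MethodError
end
```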


#11

I agree that a mutate_if function is important, but with DataFramesMeta you can also use the @byrow! macro to get similar results.


#12

This function is what I use for this:

function checkForNotANumber(x::Any)
    (!isa(x,Integer) && !isa(x,Real)) || isnan(x)
end

#13

Note that Integer <: Real, so checking for the latter is sufficient.
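
For example, a simplified variant of the check above could look like this sketch (`is_not_a_number` is a hypothetical name, not something from a package):

```julia
# Integer is a subtype of Real, so `x isa Real` already covers every integer.
integer_is_real = Integer <: Real

# Hypothetical simplified check with the redundant Integer test dropped:
# true for values that are not real numbers, and for NaN.
is_not_a_number(x) = !isa(x, Real) || isnan(x)
```

It should behave the same as the original for every input, since any value that is an Integer is also a Real.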


#14

Thinking about this discussion,

# FIXME get letter of marque for type piracy 
Base.map(f, df::AbstractDataFrame) = map(col -> map(f, col), eachcol(df))

would solve this problem 90% of the time (when I don’t want to do something different for columns). E.g.:

map(x -> x isa Real && isnan(x) ? zero(x) : x, df)