What’s the best way to replace all NaN’s in a DataFrame with zero? I can write a nested for-loop and check every cell but I thought there may be a simpler way to do that…
Eg
using DataFrames
replace_nan(v) = map(x -> isnan(x) ? zero(x) : x, v)
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
df2 = map(replace_nan, eachcol(df))
Note that this version allocates new columns, so you may want to use map!
instead.
This only maps the last row… seems unintuitive to use last
here. Are you using a different version of Julia or DataFrame?
julia> df
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼─────┼─────┤
│ 1 │ NaN │ 4.0 │
│ 2 │ 2.0 │ 5.0 │
│ 3 │ 3.0 │ NaN │
julia> map(replace_nan ∘ last, eachcol(df))
1×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼─────┼─────┤
│ 1 │ 3.0 │ 0.0 │
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
julia> Pkg.installed("DataFrames")
v"0.11.5"
I missed the fact that DataFrames
defined a map
for DFColumnIterator
. So the version
using DataFrames
df = DataFrame(a = [NaN, 2.0, 3.0], b = [4.0, 5.0, NaN])
replace_nan(v::AbstractVector) = map(x -> isnan(x) ? zero(x) : x, v)
replace_nan!(v::AbstractVector) = map!(x -> isnan(x) ? zero(x) : x, v, v)
map(replace_nan, eachcol(df))
map(replace_nan!, eachcol(df))
works as is. Sorry for the confusion.
FYI, this only works if every column contains Floats. Otherwise isnan
will throw an error.
EDIT: To be a bit more helpful, the general idea of the map
function is what I use, but I actually loop over each column and check that it is Vector{<:AbstractFloat}
first before I apply the NaN
to 0 (or missing
, in my case) conversion.
Not quite,
julia> methods(isnan)
# 6 methods for generic function "isnan":
isnan(x::BigFloat) in Base.MPFR at mpfr.jl:828
isnan(x::Float16) in Base at float.jl:522
isnan(x::AbstractFloat) in Base at float.jl:521
isnan(x::Real) in Base at float.jl:523
isnan(z::Complex) in Base at complex.jl:118
isnan(x::AbstractArray{T,N} where N) where T<:Number in Base at deprecated.jl:56
This is where a function like R’s dplyr::mutate_if()
will be awesome someday in Julia. I imagine that someday one of us will build that into Query.jl or DataFramesMeta.jl. Until then, the alternatives are not bad at all.
Sorry, a subtype of Number
I guess. Still, String
columns would cause an error here. I suppose the anonymous function could have an additional logical layer that checks for proper element type first, and that would make this a robust option.
I agree that a mutate_if
function is important, but with DataFramesMeta
you can also use the @byrow!
macro to get similar results.
Thhis function is what I use for this:
function checkForNotANumber(x::Any)
(!isa(x,Integer) && !isa(x,Real)) || isnan(x)
end
Note that Integer <: Real
, so checking for the latter is sufficient.
Thinking about this discussion,
# FIXME get letter of marque for type piracy
Base.map(f, df::AbstractDataFrame) = map(col -> map(f, col), eachcol(df))
would solve this problem 90% of the time (when I don’t want to do something different for columns). Eg
map(x -> x isa Real && isnan(x) ? zero(x) : x, df)
For missing
(replace ismissing()
with isnan()
for NaN
s) you can use also this trick, that makes all missing
in numeric columns equal to 0, and all missing
in strinng columns equal to ""
:
[df[ismissing.(df[!,i]), i] .= 0 for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: Number]
[df[ismissing.(df[!,i]), i] .= "" for i in names(df) if Base.nonmissingtype(eltype(df[!,i])) <: String]
In recent versions you can now do this:
mapcols(col -> replace!(col, NaN=>0), df) # In-place
or
ifelse.(isnan.(df), 0, df)
Yes, also it will work in place: df .= ifelse.(isnan.(df), 0, df)
.
Sorry for a late question related to this topic. In my case, it is not NaN but negative values. I have a time series with positive and negative entries. I need to transform the negative values into zero. I do not want to eliminate them; I need to keep them as observations, but with a value of zero. Help will be very much appreciated. Thanks.
This will do it
julia> df = DataFrame(a = randn(5), b = randn(5))
5×2 DataFrame
Row │ a b
│ Float64 Float64
─────┼──────────────────────
1 │ 0.8805 0.667461
2 │ 0.17179 -0.618585
3 │ -0.667805 -0.32467
4 │ -0.517509 -0.321862
5 │ 1.64746 -0.344586
julia> mapcols(t -> ifelse.(t .< 0, 0, t), df)
5×2 DataFrame
Row │ a b
│ Real Real
─────┼───────────────────
1 │ 0.8805 0.667461
2 │ 0.17179 0
3 │ 0 0
4 │ 0 0
5 │ 1.64746 0
@pdeffebach Thank you very much. After two hours of stumbling by me, it took you just a minute to do it. Thanks.
or even shorter:
julia> df = DataFrame(a = randn(5), b = randn(5))
5×2 DataFrame
Row │ a b
│ Float64 Float64
─────┼───────────────────────
1 │ -0.386397 0.392352
2 │ -0.476617 -0.0270584
3 │ -0.218456 -0.224436
4 │ -1.17403 -0.520317
5 │ -1.70785 0.0390936
julia> ifelse.(df .< 0, 0, df)
5×2 DataFrame
Row │ a b
│ Int64 Float64
─────┼──────────────────
1 │ 0 0.392352
2 │ 0 0.0
3 │ 0 0.0
4 │ 0 0.0
5 │ 0 0.0390936