# Iterate over all numeric columns in DataFrames

beginnerβs questions:

• how do I iterate over all numeric columns in a DataFrame, e.g., to calculate means or sum-cubed?

• bonus question: can DataFrames consider NaN to be a missing value for Float32 and Float64, too, as in completecases for example?

Do you know which columns you want beforehand? If so one of these methods would work:

``````Main> df = DataFrame(A=[1,3,4,5,6,432], B=[1,3,4,5,6,3], C=[1,3,5,564,454,654])
6Γ3 DataFrames.DataFrame
β Row β A   β B β C   β
βββββββΌββββββΌββββΌββββββ€
β 1   β 1   β 1 β 1   β
β 2   β 3   β 3 β 3   β
β 3   β 4   β 4 β 5   β
β 4   β 5   β 5 β 564 β
β 5   β 6   β 6 β 454 β
β 6   β 432 β 3 β 654 β

Main> [mean(x[2]) for x in eachcol(df[[:A, :B]])] #[2] because eachcol returns a tuple
2-element Array{Float64,1}:
75.1667
3.66667

Main> colwise(mean,df[[:B, :C]])
2-element Array{Float64,1}:
3.66667
280.167

``````

The only accepted value is `missing` so you can not use NaN as a missing value. However, because you can test for a missing value you can replace NaN with missing (Assuming your column can accept missing values).

``````Main> using Missings

Main> df = DataFrame(C=[missing,NaN,3432.34,432.2,NaN, 43.])
6Γ1 DataFrames.DataFrame
β Row β C       β
βββββββΌββββββββββ€
β 1   β missing β
β 2   β NaN     β
β 3   β 3432.34 β
β 4   β 432.2   β
β 5   β NaN     β
β 6   β 43.0    β

Main> df[:C] = [ isnan(x) ? missing : x for x in df[:C] ]
6-element Array{Any,1}:
missing
missing
3432.34
432.2
missing
43.0

Main> df
6Γ1 DataFrames.DataFrame
β Row β C       β
βββββββΌββββββββββ€
β 1   β missing β
β 2   β missing β
β 3   β 3432.34 β
β 4   β 432.2   β
β 5   β missing β
β 6   β 43.0    β
``````
4 Likes

Also forgot to add the case where you do not know if a column is a number or not you can use something that expands on this basic case

``````Main> df = DataFrame(A=[1,3,4,5,6,432], B=[1,3,4,5,6,3], C=["a", "b", "c", "d", "e", "f"])
6Γ3 DataFrames.DataFrame
β Row β A   β B β C β
βββββββΌββββββΌββββΌββββ€
β 1   β 1   β 1 β a β
β 2   β 3   β 3 β b β
β 3   β 4   β 4 β c β
β 4   β 5   β 5 β d β
β 5   β 6   β 6 β e β
β 6   β 432 β 3 β f β

Main> df[ :, colwise(x -> (eltype(x) <: Number),df)]
6Γ2 DataFrames.DataFrame
β Row β A   β B β
βββββββΌββββββΌββββ€
β 1   β 1   β 1 β
β 2   β 3   β 3 β
β 3   β 4   β 4 β
β 4   β 5   β 5 β
β 5   β 6   β 6 β
β 6   β 432 β 3 β

``````
2 Likes
``````srand(0)
using DataFrames
df = DataFrame(a = 1:5, b = rand(5), c = 'a':'e')
categorical!(df, :c)
df[2,:b] = NaN # Make an observation NaN
# Intermediate Step
allowmissing!(df)
# Helpers
consideredmissing(obj::Any) = ismissing(obj)
consideredmissing(obj::Real) = ismissing(obj) | isnan(obj)
function nonmissing!(obj::AbstractDataFrame)
for (name, col) β eachcol(df)
if !any(ismissing.(col))
obj[name] = disallowmissing(col)
end
end
end
function isreal_col(obj::AbstractVector) # Assumes you Real (could be Number)
T = eltype(obj)
return T <: Real | T <: Union{Real, Missing}
end
mean_rmmissing(obj::AbstractVector) = mean(skipmissing(obj))
sumcubicpower_rmmissing(obj::AbstractVector) = sum(elem -> elem^3, skipmissing(obj))
# Replace NaN with Missings
colwise(col -> col[consideredmissing.(col)] = missing, df)
# Assuming your functions want to exclude missing values (or previous NaN)
map(mean_rmmissing, filter(isreal_col, df.columns))
map(sumcubicpower_rmmissing, filter(isreal_col, df.columns))
``````
2 Likes

I think we should improve `colwise` so that when a `MethodError` is thrown for one column, it uses `missing` for that column. That way you could just skip columns with `missing` after doing the computation. Or at least it should be an option.

2 Likes

I would go with an option and even then I am not sure. The suggestion introduces ambiguity in a result that one would then need to test. This is because missing in a resultant vector could mean:

• func(col) β missing
or
• func(col) threw a method error

It would be impossible to tell which one unless you go back and re-evaluate all columns that resulted in missing.

As an add on, something I would like to see, especially now that missing is part of base, is aggregation functions surfacing a remove missing option.

Computing the `mean` of a categorical variable should throw a `MethodError`. However, more generally, it would be nice to expand methods for `AbstractArray{T} where T` such that `func(obj::AbstractVector{T}) where {T <: Union{S, Missing}}` can work either by returning `missing` or `MethodError` (i.e., Γ  la Rβs `, na.rm = True)`.

Yes, thatβs the main limit. Pandas does this IIRC (using `NaN`). Iβm not sure it would really be problematic if it was an option.

The official way to skip missing values is to use `skipmissing`. The dominant opinion is that itβs not a good idea to add arguments to all reduction functions given that `skipmissing` is universal, efficient and as easy to use as a keyword argument.

Indeed. AFAIK thatβs the case for most Base functions, including `sum` and `mean`. Thatβs not the case for many functions in StatsBase, though: see this issue.

1 Like

skipmissing is nice, but I think it should have an option (not a default) to consider NaN missing, too. After all, NaN s a much faster hardware way to designate missing observations in Floats. Many ops take three times as long when they have to deal with missing as special cases.

Itβs really easy to create your own `skipnan` function. Actually you can just do `Base.Iterators.filter(!isnan, x)`, I suspect it will be as efficient as `skipmissing`.

There was a lengthly discussion on `missing` and `NaN` not too long ago. They really represent different concepts. I would argue that the semantics and proper representation of the data beats the slight computational efficiency it might attain. It should be better to work with the `missing` rather than `NaN`, but if what you truly want is whether the data is available for computation you could do something like

``````using Missings
isvalidforcomputation(obj::Any) = false
isvalidforcomputation(obj::Real) = !isnan(obj)
isvalidforcomputation.([one(Int64), one(Float64), 'a', "hi", complex(one(Float64)), NaN, missing])
``````

then you can make a `skipnv` function as @nalimilan suggested for `skipnan`.

1 Like

apologies. more beginnerβs question. so I want to write some functions, the julia way (yet still not too clever to remain readable), that convert every NaN to missings, where appropriate.

``````julia> using DataFrames ## includes Missing

julia> XFloat64= eltype([ 1.0, missing ])
Union{Float64, Missings.Missing}

julia> VXFloat64= Vector{XFloat64}; VFloat64= Vector{Float64};

julia> function nan2missing!(x::VXFloat64)::VXFloat64
x= ifelse.( isnan.(x), missing, x )
end#function nan2missing
nan2missing! (generic function with 1 method)

julia> function nan2missing!(x::VFloat64)::VXFloat64
nan2missing!(convert(VXFloat64, x))
end#function nan2missing
nan2missing! (generic function with 1 method)
## can the above two function defs with one?

julia> nan2missing!( [ 1.0, NaN, 2.0] )
3-element Array{Union{Float64, Missings.Missing},1}:
1.0
missing
2.0

julia> nan2missing!( [ 1.0, NaN, missing, 3.0 ])
``````

Ultimately, I want to go here:

``````julia>  function nan2missing!(din::DataFrames.DataFrame)::DataFrames.DataFrame
for col in eachcol(din)
if (eltype(col) <: Float64)
df[Symbol(col)]=nan2missing!(df[Symbol(col)])
end#if
end#for
din
end#function nan2missing

julia> df= DataFrame( n1=[1,2,3,4], n2=x1 )  ## example

julia> nan2missing!(df); df
3Γ2 DataFrames.DataFrame
β Row β n1 β n2  β
βββββββΌβββββΌββββββ€
β 1   β 1  β 1.0 β
β 2   β 2  β missing β
β 3   β 3  β 3.0 β

``````

Apologiesβthis reflects multiple levels of lack of understanding by a beginner.

The problem is that this function does not change `x`, it just assigns a new vector to the local variable `x` and returns it. In fact you cannot change the type of `x` from inside the function, you can only change its values.

When looping over the data frame, instead of:

``````df[Symbol(col)]=nan2missing!(df[Symbol(col)])
``````

do:

``````v = allowmissing(df[Symbol(col)])
v[isnan.(v)] = missing
df[Symbol(col)] = v
``````

or just (using `recode` from CategoricalArrays):

``````df[Symbol(col)] = recode(df[Symbol(col)], NaN => missing)
``````

dear nalimilanβunfortunately, these were not the problems.

``````using DataFrames ## includes Missing

( XFloat64= eltype([ 1.0, missing ]);
VXFloat64= Vector{XFloat64}; VFloat64= Vector{Float64}; )

function nan2missing!(x::VXFloat64)::VXFloat64
x= allowmissing(x)  ## random attempt; makes no difference
ifelse.( isnan.(x), missing, x )
end#function nan2missing!(VXFloat64)

function nan2missing!(x::VFloat64)::VXFloat64
nan2missing!(convert(VXFloat64, x))
end#function nan2missing!(VFloat64)

println( "plain vector works: ", nan2missing!( [ 1.0, NaN, 2.0] ) )
println( "recode works: ", recode( [1.0, NaN, missing, 4.0], NaN => missing ))

try
println( "union-vector works: ", nan2missing!( [ 1.0, NaN, missing, 3.0 ]) )
catch ; println("union-vector fails ") ; end

df= DataFrame( n1=[ 1,2,3,missing ], n2=[ 1.0,2.0,3.0,missing ],
n3=[ 1.0, NaN, 2.0, missing ], n4=[ NaN, 2.0, 3.0, 4.0 ] )

function nan2missing!(din::DataFrames.DataFrame)::DataFrames.DataFrame
for col in eachcol(din)
if (eltype(col) <: Float64)
df[Symbol(col)]= recode(df[Symbol(col)], NaN => missing)
## or v=allowmissing(df[Symbol(col)]); v[isnan.(v)= missing; df[Symbol(col)]= v
end#if
end#for
din
end#function nan2missing!(Dataframes)

println(nan2missing!(df))  ## return value
println(df)  ## altered value

(any(isnan.(df[ :n3 ]))) && println("NaNs were not replaced.")
``````

relevant output is

``````plain vector works: Union{Float64, Missings.Missing}[1.0, missing, 2.0]
recode works: Union{Float64, Missings.Missing}[1.0, missing, missing, 4.0]
union-vector fails

4Γ4 DataFrames.DataFrame
β Row β n1      β n2      β n3      β n4  β
βββββββΌββββββββββΌββββββββββΌββββββββββΌββββββ€
β 1   β 1       β 1.0     β 1.0     β NaN β
β 2   β 2       β 2.0     β NaN     β 2.0 β
β 3   β 3       β 3.0     β 2.0     β 3.0 β
β 4   β missing β missing β missing β 4.0 β

NaNs were not replaced.
``````

Try with `for (col, v) in eachcol(din)`.

sorry, the answer is not at all obvious. (

For an in-place function on needs to override the values rather than allocate a new object. The assignment Γ  la `df = DataFrame()` does not over-wrtie `df`, but `df[:,:] = DataFrame()` does as you are replacing the values in `df`. If you look at my in-place function or @nalimilan suggestion:

``````function replace_NaN_to_missing!(df::AbstractDataFrame)
for (name, col) β eachcol(df)
df[name] = recode(col, NaN => missing)
end
end
``````

The way this in-place function works is that rather than assigning `df` an object it overrides each of its values (i.e., each column of df with the desired values).

1 Like

thanks. I think the following works for me

``````function NaN2missing!(df::DataFrame)
for (name, col) in eachcol(df)
if ((eltype(col) == Float64) && (any(isnan.(col))))
df[Symbol(name)]= allowmissing(df[Symbol(name)])
end#if
df[name] = recode(col, NaN => missing)  ## also works on Union type
end#for
df
end#function NaN_to_missing!
``````

It does not work on NaN32.

The change from AbstractDataFrame to DataFrame is due to my lack of understanding where else the former would be used, or what the advantages are.

thank you everybody.

`AbstractDataFrame` is just the βproperβ abstract type. Its children are `DataFrames` and `SubDataFrame`. If the code should work with `SubDataFrame` as well, then `AbstractDataFrame` should be used rather than `DataFrame`.
`allowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}})` allows for in-place transformation of the column which might be desirable.
You donβt need the `Symbol(name)` as `name` is already the `Symbol` to query the column.
In order to generalize the code for `Float64` and `Float32`, you could do something like

``````if ((eltype(col) <: AbstractFloat) && (any(isnan.(col))))
if ((eltype(col) <: Real) && (any(isnan.(col))))
``````

thx, N. this was great help. I hope the following is a well-coded function in the julia way of life, using its facilities proper. I hope the recode use:

``````using DataFrames;

function NaN2missing!(df::DataFrame)
for (name, col) in eachcol(df)
if ((eltype(col) <: AbstractFloat) && (any(isnan.(col))))
allowmissing!(df, name)
end#if
recode!(df[name], NaN => missing)  ## also works on Union type
end#for
df
end#function NaN2missing!
``````

Here is a sample use:

``````df= DataFrame( n1=[ 1,2,3,missing ], n2=[ 1.0,2.0,3.0,missing ],
n3=[ 1.0, NaN, 2.0, missing ], n4=[ NaN, 2.0, 3.0, 4.0 ],
n5=Vector{Float32}( [1.0,NaN,3.0,4.0] ) )

showcols(df)
println("\n\n", NaN2missing!(df), "\n\n" );
showcols(df)
``````

producing output of

``````4Γ5 DataFrames.DataFrame
β Col # β Name β Eltype                           β Missing β Values          β
βββββββββΌβββββββΌβββββββββββββββββββββββββββββββββββΌββββββββββΌββββββββββββββββββ€
β 1     β n1   β Union{Int64, Missings.Missing}   β 1       β 1  β¦  missing   β
β 2     β n2   β Union{Float64, Missings.Missing} β 1       β 1.0  β¦  missing β
β 3     β n3   β Union{Float64, Missings.Missing} β 1       β 1.0  β¦  missing β
β 4     β n4   β Float64                          β 0       β NaN  β¦  4.0     β
β 5     β n5   β Float32                          β 0       β 1.0  β¦  4.0     β
4Γ5 DataFrames.DataFrame
β Row β n1      β n2      β n3      β n4      β n5      β
βββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββ€
β 1   β 1       β 1.0     β 1.0     β missing β 1.0     β
β 2   β 2       β 2.0     β missing β 2.0     β missing β
β 3   β 3       β 3.0     β 2.0     β 3.0     β 3.0     β
β 4   β missing β missing β missing β 4.0     β 4.0     β

4Γ5 DataFrames.DataFrame
β Col # β Name β Eltype                           β Missing β Values          β
βββββββββΌβββββββΌβββββββββββββββββββββββββββββββββββΌββββββββββΌββββββββββββββββββ€
β 1     β n1   β Union{Int64, Missings.Missing}   β 1       β 1  β¦  missing   β
β 2     β n2   β Union{Float64, Missings.Missing} β 1       β 1.0  β¦  missing β
β 3     β n3   β Union{Float64, Missings.Missing} β 2       β 1.0  β¦  missing β
β 4     β n4   β Union{Float64, Missings.Missing} β 1       β missing  β¦  4.0 β
β 5     β n5   β Union{Float32, Missings.Missing} β 1       β 1.0  β¦  4.0     β
``````
1 Like