Iterate over all numeric columns in DataFrames


#1

beginner’s questions:

  • how do I iterate over all numeric columns in a DataFrame, e.g., to calculate means or sum-cubed?

  • bonus question: can DataFrames consider NaN to be a missing value for Float32 and Float64, too, as in completecases for example?


#2

Do you know which columns you want beforehand? If so one of these methods would work:

Main> df = DataFrame(A=[1,3,4,5,6,432], B=[1,3,4,5,6,3], C=[1,3,5,564,454,654])
6Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ A   β”‚ B β”‚ C   β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1   β”‚ 1 β”‚ 1   β”‚
β”‚ 2   β”‚ 3   β”‚ 3 β”‚ 3   β”‚
β”‚ 3   β”‚ 4   β”‚ 4 β”‚ 5   β”‚
β”‚ 4   β”‚ 5   β”‚ 5 β”‚ 564 β”‚
β”‚ 5   β”‚ 6   β”‚ 6 β”‚ 454 β”‚
β”‚ 6   β”‚ 432 β”‚ 3 β”‚ 654 β”‚

Main> [mean(x[2]) for x in eachcol(df[[:A, :B]])] #[2] because eachcol returns a tuple
2-element Array{Float64,1}:
 75.1667
  3.66667

Main> colwise(mean,df[[:B, :C]])
2-element Array{Float64,1}:
   3.66667
 280.167

The only accepted value is missing so you can not use NaN as a missing value. However, because you can test for a missing value you can replace NaN with missing (Assuming your column can accept missing values).

Main> using Missings

Main> df = DataFrame(C=[missing,NaN,3432.34,432.2,NaN, 43.])
6Γ—1 DataFrames.DataFrame
β”‚ Row β”‚ C       β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ missing β”‚
β”‚ 2   β”‚ NaN     β”‚
β”‚ 3   β”‚ 3432.34 β”‚
β”‚ 4   β”‚ 432.2   β”‚
β”‚ 5   β”‚ NaN     β”‚
β”‚ 6   β”‚ 43.0    β”‚

Main> df[:C] = [ isnan(x) ? missing : x for x in df[:C] ]
6-element Array{Any,1}:
     missing
     missing
 3432.34
  432.2
     missing
   43.0

Main> df
6Γ—1 DataFrames.DataFrame
β”‚ Row β”‚ C       β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ missing β”‚
β”‚ 2   β”‚ missing β”‚
β”‚ 3   β”‚ 3432.34 β”‚
β”‚ 4   β”‚ 432.2   β”‚
β”‚ 5   β”‚ missing β”‚
β”‚ 6   β”‚ 43.0    β”‚

#3

Also forgot to add the case where you do not know if a column is a number or not you can use something that expands on this basic case

Main> df = DataFrame(A=[1,3,4,5,6,432], B=[1,3,4,5,6,3], C=["a", "b", "c", "d", "e", "f"])
6Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ A   β”‚ B β”‚ C β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”Όβ”€β”€β”€β”€
β”‚ 1   β”‚ 1   β”‚ 1 β”‚ a β”‚
β”‚ 2   β”‚ 3   β”‚ 3 β”‚ b β”‚
β”‚ 3   β”‚ 4   β”‚ 4 β”‚ c β”‚
β”‚ 4   β”‚ 5   β”‚ 5 β”‚ d β”‚
β”‚ 5   β”‚ 6   β”‚ 6 β”‚ e β”‚
β”‚ 6   β”‚ 432 β”‚ 3 β”‚ f β”‚

Main> df[ :, colwise(x -> (eltype(x) <: Number),df)]
6Γ—2 DataFrames.DataFrame
β”‚ Row β”‚ A   β”‚ B β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€
β”‚ 1   β”‚ 1   β”‚ 1 β”‚
β”‚ 2   β”‚ 3   β”‚ 3 β”‚
β”‚ 3   β”‚ 4   β”‚ 4 β”‚
β”‚ 4   β”‚ 5   β”‚ 5 β”‚
β”‚ 5   β”‚ 6   β”‚ 6 β”‚
β”‚ 6   β”‚ 432 β”‚ 3 β”‚


#4
srand(0)
using DataFrames
df = DataFrame(a = 1:5, b = rand(5), c = 'a':'e')
categorical!(df, :c)
df[2,:b] = NaN # Make an observation NaN
# Intermediate Step
allowmissing!(df)
# Helpers
consideredmissing(obj::Any) = ismissing(obj)
consideredmissing(obj::Real) = ismissing(obj) | isnan(obj)
function nonmissing!(obj::AbstractDataFrame)
    for (name, col) ∈ eachcol(df)
        if !any(ismissing.(col))
            obj[name] = disallowmissing(col)
        end
    end
end
function isreal_col(obj::AbstractVector) # Assumes you Real (could be Number)
    T = eltype(obj)
    return T <: Real | T <: Union{Real, Missing}
end
mean_rmmissing(obj::AbstractVector) = mean(skipmissing(obj))
sumcubicpower_rmmissing(obj::AbstractVector) = sum(elem -> elem^3, skipmissing(obj))
# Replace NaN with Missings
colwise(col -> col[consideredmissing.(col)] = missing, df)
# Assuming your functions want to exclude missing values (or previous NaN)
map(mean_rmmissing, filter(isreal_col, df.columns))
map(sumcubicpower_rmmissing, filter(isreal_col, df.columns))

#5

I think we should improve colwise so that when a MethodError is thrown for one column, it uses missing for that column. That way you could just skip columns with missing after doing the computation. Or at least it should be an option.


#6

I would go with an option and even then I am not sure. The suggestion introduces ambiguity in a result that one would then need to test. This is because missing in a resultant vector could mean:

  • func(col) -> missing
    or
  • func(col) threw a method error

It would be impossible to tell which one unless you go back and re-evaluate all columns that resulted in missing.

As an add on, something I would like to see, especially now that missing is part of base, is aggregation functions surfacing a remove missing option.


#7

Computing the mean of a categorical variable should throw a MethodError. However, more generally, it would be nice to expand methods for AbstractArray{T} where T such that func(obj::AbstractVector{T}) where {T <: Union{S, Missing}} can work either by returning missing or MethodError (i.e., Γ  la R’s , na.rm = True).


#8

Yes, that’s the main limit. Pandas does this IIRC (using NaN). I’m not sure it would really be problematic if it was an option.

The official way to skip missing values is to use skipmissing. The dominant opinion is that it’s not a good idea to add arguments to all reduction functions given that skipmissing is universal, efficient and as easy to use as a keyword argument.

Indeed. AFAIK that’s the case for most Base functions, including sum and mean. That’s not the case for many functions in StatsBase, though: see this issue.


#9

skipmissing is nice, but I think it should have an option (not a default) to consider NaN missing, too. After all, NaN s a much faster hardware way to designate missing observations in Floats. Many ops take three times as long when they have to deal with missing as special cases.


#10

It’s really easy to create your own skipnan function. Actually you can just do Base.Iterators.filter(!isnan, x), I suspect it will be as efficient as skipmissing.


#11

There was a lengthly discussion on missing and NaN not too long ago. They really represent different concepts. I would argue that the semantics and proper representation of the data beats the slight computational efficiency it might attain. It should be better to work with the missing rather than NaN, but if what you truly want is whether the data is available for computation you could do something like

using Missings
isvalidforcomputation(obj::Any) = false
isvalidforcomputation(obj::Real) = !isnan(obj)
isvalidforcomputation.([one(Int64), one(Float64), 'a', "hi", complex(one(Float64)), NaN, missing])

then you can make a skipnv function as @nalimilan suggested for skipnan.


#12

apologies. more beginner’s question. so I want to write some functions, the julia way (yet still not too clever to remain readable), that convert every NaN to missings, where appropriate.

julia> using DataFrames ## includes Missing

julia> XFloat64= eltype([ 1.0, missing ])
Union{Float64, Missings.Missing}

julia> VXFloat64= Vector{XFloat64}; VFloat64= Vector{Float64};

julia> function nan2missing!(x::VXFloat64)::VXFloat64
           x= ifelse.( isnan.(x), missing, x )
       end#function nan2missing
nan2missing! (generic function with 1 method)

julia> function nan2missing!(x::VFloat64)::VXFloat64
            nan2missing!(convert(VXFloat64, x))
       end#function nan2missing
nan2missing! (generic function with 1 method)
## can the above two function defs with one?

julia> nan2missing!( [ 1.0, NaN, 2.0] )
3-element Array{Union{Float64, Missings.Missing},1}:
 1.0
  missing
 2.0

julia> nan2missing!( [ 1.0, NaN, missing, 3.0 ])
## epic fail about ifelse

Ultimately, I want to go here:

julia>  function nan2missing!(din::DataFrames.DataFrame)::DataFrames.DataFrame
          for col in eachcol(din)
             if (eltype(col) <: Float64)
               df[Symbol(col)]=nan2missing!(df[Symbol(col)]) 
             end#if
          end#for
          din
        end#function nan2missing

julia> df= DataFrame( n1=[1,2,3,4], n2=x1 )  ## example

julia> nan2missing!(df); df
3Γ—2 DataFrames.DataFrame
β”‚ Row β”‚ n1 β”‚ n2  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1  β”‚ 1.0 β”‚
β”‚ 2   β”‚ 2  β”‚ missing β”‚
β”‚ 3   β”‚ 3  β”‚ 3.0 β”‚

Apologiesβ€”this reflects multiple levels of lack of understanding by a beginner.


#13

The problem is that this function does not change x, it just assigns a new vector to the local variable x and returns it. In fact you cannot change the type of x from inside the function, you can only change its values.

When looping over the data frame, instead of:

df[Symbol(col)]=nan2missing!(df[Symbol(col)])

do:

v = allowmissing(df[Symbol(col)])
v[isnan.(v)] = missing
df[Symbol(col)] = v

or just (using recode from CategoricalArrays):

df[Symbol(col)] = recode(df[Symbol(col)], NaN => missing)

#14

dear nalimilanβ€”unfortunately, these were not the problems.

using DataFrames ## includes Missing

( XFloat64= eltype([ 1.0, missing ]);
  VXFloat64= Vector{XFloat64}; VFloat64= Vector{Float64}; )

function nan2missing!(x::VXFloat64)::VXFloat64
    x= allowmissing(x)  ## random attempt; makes no difference
    ifelse.( isnan.(x), missing, x )
end#function nan2missing!(VXFloat64)

function nan2missing!(x::VFloat64)::VXFloat64
    nan2missing!(convert(VXFloat64, x))
end#function nan2missing!(VFloat64)

println( "plain vector works: ", nan2missing!( [ 1.0, NaN, 2.0] ) )
println( "recode works: ", recode( [1.0, NaN, missing, 4.0], NaN => missing ))

try
    println( "union-vector works: ", nan2missing!( [ 1.0, NaN, missing, 3.0 ]) )
catch ; println("union-vector fails ") ; end


df= DataFrame( n1=[ 1,2,3,missing ], n2=[ 1.0,2.0,3.0,missing ],
               n3=[ 1.0, NaN, 2.0, missing ], n4=[ NaN, 2.0, 3.0, 4.0 ] )

function nan2missing!(din::DataFrames.DataFrame)::DataFrames.DataFrame
    for col in eachcol(din)
        if (eltype(col) <: Float64)
            df[Symbol(col)]= recode(df[Symbol(col)], NaN => missing)
            ## or v=allowmissing(df[Symbol(col)]); v[isnan.(v)= missing; df[Symbol(col)]= v
        end#if
    end#for
    din
end#function nan2missing!(Dataframes)

println(nan2missing!(df))  ## return value
println(df)  ## altered value

(any(isnan.(df[ :n3 ]))) && println("NaNs were not replaced.")

relevant output is

plain vector works: Union{Float64, Missings.Missing}[1.0, missing, 2.0]
recode works: Union{Float64, Missings.Missing}[1.0, missing, missing, 4.0]
union-vector fails 

4Γ—4 DataFrames.DataFrame
β”‚ Row β”‚ n1      β”‚ n2      β”‚ n3      β”‚ n4  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1       β”‚ 1.0     β”‚ 1.0     β”‚ NaN β”‚
β”‚ 2   β”‚ 2       β”‚ 2.0     β”‚ NaN     β”‚ 2.0 β”‚
β”‚ 3   β”‚ 3       β”‚ 3.0     β”‚ 2.0     β”‚ 3.0 β”‚
β”‚ 4   β”‚ missing β”‚ missing β”‚ missing β”‚ 4.0 β”‚

NaNs were not replaced.

#15

Try with for (col, v) in eachcol(din).


#16

sorry, the answer is not at all obvious. (


#17

For an in-place function on needs to override the values rather than allocate a new object. The assignment Γ  la df = DataFrame() does not over-wrtie df, but df[:,:] = DataFrame() does as you are replacing the values in df. If you look at my in-place function or @nalimilan suggestion:

function replace_NaN_to_missing!(df::AbstractDataFrame)
    for (name, col) ∈ eachcol(df)
        df[name] = recode(col, NaN => missing)
    end
end

The way this in-place function works is that rather than assigning df an object it overrides each of its values (i.e., each column of df with the desired values).


#18

thanks. I think the following works for me

function NaN2missing!(df::DataFrame)
           for (name, col) in eachcol(df)
               if ((eltype(col) == Float64) && (any(isnan.(col))))
                   df[Symbol(name)]= allowmissing(df[Symbol(name)])
               end#if
               df[name] = recode(col, NaN => missing)  ## also works on Union type
           end#for
           df
       end#function NaN_to_missing!

It does not work on NaN32.

The change from AbstractDataFrame to DataFrame is due to my lack of understanding where else the former would be used, or what the advantages are.

thank you everybody.


#19

AbstractDataFrame is just the β€œproper” abstract type. Its children are DataFrames and SubDataFrame. If the code should work with SubDataFrame as well, then AbstractDataFrame should be used rather than DataFrame.
allowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}}) allows for in-place transformation of the column which might be desirable.
You don’t need the Symbol(name) as name is already the Symbol to query the column.
In order to generalize the code for Float64 and Float32, you could do something like

if ((eltype(col) <: AbstractFloat) && (any(isnan.(col))))
if ((eltype(col) <: Real) && (any(isnan.(col))))

#20

thx, N. this was great help. I hope the following is a well-coded function in the julia way of life, using its facilities proper. I hope the recode use:

using DataFrames;

function NaN2missing!(df::DataFrame)
    for (name, col) in eachcol(df)
        if ((eltype(col) <: AbstractFloat) && (any(isnan.(col))))
            allowmissing!(df, name)
        end#if
        recode!(df[name], NaN => missing)  ## also works on Union type
    end#for
    df
end#function NaN2missing!

Here is a sample use:

df= DataFrame( n1=[ 1,2,3,missing ], n2=[ 1.0,2.0,3.0,missing ],
               n3=[ 1.0, NaN, 2.0, missing ], n4=[ NaN, 2.0, 3.0, 4.0 ],
               n5=Vector{Float32}( [1.0,NaN,3.0,4.0] ) )

showcols(df)
println("\n\n", NaN2missing!(df), "\n\n" );
showcols(df)

producing output of

4Γ—5 DataFrames.DataFrame
β”‚ Col # β”‚ Name β”‚ Eltype                           β”‚ Missing β”‚ Values          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1     β”‚ n1   β”‚ Union{Int64, Missings.Missing}   β”‚ 1       β”‚ 1  …  missing   β”‚
β”‚ 2     β”‚ n2   β”‚ Union{Float64, Missings.Missing} β”‚ 1       β”‚ 1.0  …  missing β”‚
β”‚ 3     β”‚ n3   β”‚ Union{Float64, Missings.Missing} β”‚ 1       β”‚ 1.0  …  missing β”‚
β”‚ 4     β”‚ n4   β”‚ Float64                          β”‚ 0       β”‚ NaN  …  4.0     β”‚
β”‚ 5     β”‚ n5   β”‚ Float32                          β”‚ 0       β”‚ 1.0  …  4.0     β”‚
4Γ—5 DataFrames.DataFrame
β”‚ Row β”‚ n1      β”‚ n2      β”‚ n3      β”‚ n4      β”‚ n5      β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1       β”‚ 1.0     β”‚ 1.0     β”‚ missing β”‚ 1.0     β”‚
β”‚ 2   β”‚ 2       β”‚ 2.0     β”‚ missing β”‚ 2.0     β”‚ missing β”‚
β”‚ 3   β”‚ 3       β”‚ 3.0     β”‚ 2.0     β”‚ 3.0     β”‚ 3.0     β”‚
β”‚ 4   β”‚ missing β”‚ missing β”‚ missing β”‚ 4.0     β”‚ 4.0     β”‚


4Γ—5 DataFrames.DataFrame
β”‚ Col # β”‚ Name β”‚ Eltype                           β”‚ Missing β”‚ Values          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1     β”‚ n1   β”‚ Union{Int64, Missings.Missing}   β”‚ 1       β”‚ 1  …  missing   β”‚
β”‚ 2     β”‚ n2   β”‚ Union{Float64, Missings.Missing} β”‚ 1       β”‚ 1.0  …  missing β”‚
β”‚ 3     β”‚ n3   β”‚ Union{Float64, Missings.Missing} β”‚ 2       β”‚ 1.0  …  missing β”‚
β”‚ 4     β”‚ n4   β”‚ Union{Float64, Missings.Missing} β”‚ 1       β”‚ missing  …  4.0 β”‚
β”‚ 5     β”‚ n5   β”‚ Union{Float32, Missings.Missing} β”‚ 1       β”‚ 1.0  …  4.0     β”‚