Standardize all columns of DataFrame

Surely this is answered somewhere, but what is the easiest way to standardize all columns of a DataFrame? There are a bunch of methods in pandas to do it, or you can just

df = (df - df.mean())/df.std()

how to do the same with DataFrames.jl?

If you mean all of them you can do:

using Statistics, DataFrames

df = DataFrame(:a => 1:10.0, :b => rand(10))
foreach(c -> c .= (c .- mean(c)) ./ std(c), eachcol(df))

But this needs the columns to already accept floating point numbers beforehand.

1 Like

One difference from pandas is that the transform function will act on the entire table of values if you only give it the transforming function. So you have to select all columns and then map each to the transformer so that it takes them in turn.

using StatsBase  # may require installation, has zscore()
transform(df,zscore)  # fails
transform(df,All() .=> zscore)  # appends new columns to old
select(df,All() .=> zscore)  # only new columns
3 Likes

You can also use the keyword argument renamecols = false

julia> using DataFrames, StatsBase

julia> df = DataFrame(rand(10, 4), :auto)
10Ɨ4 DataFrame
 Row ā”‚ x1         x2         x3          x4
     ā”‚ Float64    Float64    Float64     Float64
ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€
   1 ā”‚ 0.399058   0.422599   0.00699281  0.0635379
   2 ā”‚ 0.265661   0.787594   0.217078    0.675266
   3 ā”‚ 0.292278   0.0477572  0.0380437   0.0680113
   4 ā”‚ 0.301224   0.740652   0.683419    0.97809
   5 ā”‚ 0.0729834  0.927029   0.606946    0.0202961
   6 ā”‚ 0.992137   0.203939   0.425782    0.547135
   7 ā”‚ 0.321505   0.151547   0.819317    0.627988
   8 ā”‚ 0.790473   0.326784   0.24795     0.465694
   9 ā”‚ 0.284565   0.663169   0.725925    0.499605
  10 ā”‚ 0.282419   0.340454   0.147984    0.842851

julia> mapcols(zscore, df)
10Ɨ4 DataFrame
 Row ā”‚ x1           x2         x3         x4
     ā”‚ Float64      Float64    Float64    Float64
ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€
   1 ā”‚ -0.00425657  -0.128547  -1.28255   -1.24632
   2 ā”‚ -0.488579     1.08844   -0.582604   0.589441
   3 ā”‚ -0.391941    -1.37836   -1.1791    -1.2329
   4 ā”‚ -0.359462     0.931921   0.971114   1.4982
   5 ā”‚ -1.18813      1.55335    0.716329  -1.37609
   6 ā”‚  2.14903     -0.857614   0.112739   0.204928
   7 ā”‚ -0.285827    -1.0323     1.42389    0.447564
   8 ā”‚  1.41685     -0.448018  -0.479747  -0.0394733
   9 ā”‚ -0.419944     0.673572   1.11274    0.062291
  10 ā”‚ -0.427735    -0.402439  -0.812807   1.09236
2 Likes

In DataFrameMacros.jl you can also use the quick rename feature for this, "{}_z" just appends z to every input column name. I find that simplifies things sometimes, because the auto-chosen column names can get long.

julia> df = DataFrame(:a => 1:10.0, :b => rand(10))
10Ɨ2 DataFrame
 Row ā”‚ a        b        
     ā”‚ Float64  Float64  
ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€
   1 ā”‚     1.0  0.279875
   2 ā”‚     2.0  0.916205
   3 ā”‚     3.0  0.50468
   4 ā”‚     4.0  0.555019
   5 ā”‚     5.0  0.126688
   6 ā”‚     6.0  0.163041
   7 ā”‚     7.0  0.394592
   8 ā”‚     8.0  0.38917
   9 ā”‚     9.0  0.812061
  10 ā”‚    10.0  0.620486

julia> @select(df, "{}_z" = @c zscore(All()))
10Ɨ2 DataFrame
 Row ā”‚ a_z        b_z       
     ā”‚ Float64    Float64   
ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€
   1 ā”‚ -1.4863    -0.75574
   2 ā”‚ -1.15601    1.69399
   3 ā”‚ -0.825723   0.109712
   4 ā”‚ -0.495434   0.303506
   5 ā”‚ -0.165145  -1.34547
   6 ā”‚  0.165145  -1.20552
   7 ā”‚  0.495434  -0.314101
   8 ā”‚  0.825723  -0.334976
   9 ā”‚  1.15601    1.29306
  10 ā”‚  1.4863     0.55554
1 Like