Standardize all columns of DataFrame

Surely this is answered somewhere, but what is the easiest way to standardize all columns of a DataFrame? There are a bunch of methods in pandas to do it, or you can just

df = (df - df.mean())/df.std()

how to do the same with DataFrames.jl?

1 Like

If you mean all of them you can do:

using Statistics, DataFrames

df = DataFrame(:a => 1:10.0, :b => rand(10))
foreach(c -> c .= (c .- mean(c)) ./ std(c), eachcol(df))

But this needs the columns to already accept floating point numbers beforehand.

1 Like

One difference from pandas is that the transform function will act on the entire table of values if you only give it the transforming function. So you have to select all columns and then map each to the transformer so that it takes them in turn.

using StatsBase  # may require installation, has zscore()
transform(df,zscore)  # fails
transform(df,All() .=> zscore)  # appends new columns to old
select(df,All() .=> zscore)  # only new columns
3 Likes

You can also use the keyword argument renamecols = false

julia> using DataFrames, StatsBase

julia> df = DataFrame(rand(10, 4), :auto)
10ร—4 DataFrame
 Row โ”‚ x1         x2         x3          x4
     โ”‚ Float64    Float64    Float64     Float64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ 0.399058   0.422599   0.00699281  0.0635379
   2 โ”‚ 0.265661   0.787594   0.217078    0.675266
   3 โ”‚ 0.292278   0.0477572  0.0380437   0.0680113
   4 โ”‚ 0.301224   0.740652   0.683419    0.97809
   5 โ”‚ 0.0729834  0.927029   0.606946    0.0202961
   6 โ”‚ 0.992137   0.203939   0.425782    0.547135
   7 โ”‚ 0.321505   0.151547   0.819317    0.627988
   8 โ”‚ 0.790473   0.326784   0.24795     0.465694
   9 โ”‚ 0.284565   0.663169   0.725925    0.499605
  10 โ”‚ 0.282419   0.340454   0.147984    0.842851

julia> mapcols(zscore, df)
10ร—4 DataFrame
 Row โ”‚ x1           x2         x3         x4
     โ”‚ Float64      Float64    Float64    Float64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ -0.00425657  -0.128547  -1.28255   -1.24632
   2 โ”‚ -0.488579     1.08844   -0.582604   0.589441
   3 โ”‚ -0.391941    -1.37836   -1.1791    -1.2329
   4 โ”‚ -0.359462     0.931921   0.971114   1.4982
   5 โ”‚ -1.18813      1.55335    0.716329  -1.37609
   6 โ”‚  2.14903     -0.857614   0.112739   0.204928
   7 โ”‚ -0.285827    -1.0323     1.42389    0.447564
   8 โ”‚  1.41685     -0.448018  -0.479747  -0.0394733
   9 โ”‚ -0.419944     0.673572   1.11274    0.062291
  10 โ”‚ -0.427735    -0.402439  -0.812807   1.09236
5 Likes

In DataFrameMacros.jl you can also use the quick rename feature for this, "{}_z" just appends z to every input column name. I find that simplifies things sometimes, because the auto-chosen column names can get long.

julia> df = DataFrame(:a => 1:10.0, :b => rand(10))
10ร—2 DataFrame
 Row โ”‚ a        b        
     โ”‚ Float64  Float64  
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1.0  0.279875
   2 โ”‚     2.0  0.916205
   3 โ”‚     3.0  0.50468
   4 โ”‚     4.0  0.555019
   5 โ”‚     5.0  0.126688
   6 โ”‚     6.0  0.163041
   7 โ”‚     7.0  0.394592
   8 โ”‚     8.0  0.38917
   9 โ”‚     9.0  0.812061
  10 โ”‚    10.0  0.620486

julia> @select(df, "{}_z" = @c zscore(All()))
10ร—2 DataFrame
 Row โ”‚ a_z        b_z       
     โ”‚ Float64    Float64   
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ -1.4863    -0.75574
   2 โ”‚ -1.15601    1.69399
   3 โ”‚ -0.825723   0.109712
   4 โ”‚ -0.495434   0.303506
   5 โ”‚ -0.165145  -1.34547
   6 โ”‚  0.165145  -1.20552
   7 โ”‚  0.495434  -0.314101
   8 โ”‚  0.825723  -0.334976
   9 โ”‚  1.15601    1.29306
  10 โ”‚  1.4863     0.55554
1 Like

Hello @bkamins
Is there any way to do this but with dataframes containing missings values in some (or all) columns?

Not so clean:

mapcols(col -> (col .- mean(skipmissing(col)) / std(skipmissing(col))), df)
2 Likes

But you could also define your own zscore function that ignored missings and then the mapcols call is clean again. Whatever floats your boat, as they say.

1 Like

Yes, and I have opened zscore with missing values ยท Issue #898 ยท JuliaStats/StatsBase.jl ยท GitHub to discuss this. If you have an opinion please comment there.

2 Likes