Standardize all columns of DataFrame

markmbaum · March 11, 2022, 10:56pm

Surely this is answered somewhere, but what is the easiest way to standardize all columns of a DataFrame? There are a bunch of methods in pandas to do it, or you can just

df = (df - df.mean())/df.std()

how to do the same with DataFrames.jl?

Henrique_Becker · March 11, 2022, 11:33pm

If you mean all of them you can do:

using Statistics, DataFrames

df = DataFrame(:a => 1:10.0, :b => rand(10))
foreach(c -> c .= (c .- mean(c)) ./ std(c), eachcol(df))

But this needs the columns to already accept floating point numbers beforehand.

tobydriscoll · March 11, 2022, 11:36pm

One difference from pandas is that the transform function will act on the entire table of values if you only give it the transforming function. So you have to select all columns and then map each to the transformer so that it takes them in turn.

using StatsBase  # may require installation, has zscore()
transform(df,zscore)  # fails
transform(df,All() .=> zscore)  # appends new columns to old
select(df,All() .=> zscore)  # only new columns

pdeffebach · March 12, 2022, 1:15am

You can also use the keyword argument renamecols = false

bkamins · March 12, 2022, 9:57am

julia> using DataFrames, StatsBase

julia> df = DataFrame(rand(10, 4), :auto)
10×4 DataFrame
 Row │ x1         x2         x3          x4
     │ Float64    Float64    Float64     Float64
─────┼─────────────────────────────────────────────
   1 │ 0.399058   0.422599   0.00699281  0.0635379
   2 │ 0.265661   0.787594   0.217078    0.675266
   3 │ 0.292278   0.0477572  0.0380437   0.0680113
   4 │ 0.301224   0.740652   0.683419    0.97809
   5 │ 0.0729834  0.927029   0.606946    0.0202961
   6 │ 0.992137   0.203939   0.425782    0.547135
   7 │ 0.321505   0.151547   0.819317    0.627988
   8 │ 0.790473   0.326784   0.24795     0.465694
   9 │ 0.284565   0.663169   0.725925    0.499605
  10 │ 0.282419   0.340454   0.147984    0.842851

julia> mapcols(zscore, df)
10×4 DataFrame
 Row │ x1           x2         x3         x4
     │ Float64      Float64    Float64    Float64
─────┼───────────────────────────────────────────────
   1 │ -0.00425657  -0.128547  -1.28255   -1.24632
   2 │ -0.488579     1.08844   -0.582604   0.589441
   3 │ -0.391941    -1.37836   -1.1791    -1.2329
   4 │ -0.359462     0.931921   0.971114   1.4982
   5 │ -1.18813      1.55335    0.716329  -1.37609
   6 │  2.14903     -0.857614   0.112739   0.204928
   7 │ -0.285827    -1.0323     1.42389    0.447564
   8 │  1.41685     -0.448018  -0.479747  -0.0394733
   9 │ -0.419944     0.673572   1.11274    0.062291
  10 │ -0.427735    -0.402439  -0.812807   1.09236

jules · March 12, 2022, 12:21pm

In DataFrameMacros.jl you can also use the quick rename feature for this, "{}_z" just appends z to every input column name. I find that simplifies things sometimes, because the auto-chosen column names can get long.

julia> df = DataFrame(:a => 1:10.0, :b => rand(10))
10×2 DataFrame
 Row │ a        b        
     │ Float64  Float64  
─────┼───────────────────
   1 │     1.0  0.279875
   2 │     2.0  0.916205
   3 │     3.0  0.50468
   4 │     4.0  0.555019
   5 │     5.0  0.126688
   6 │     6.0  0.163041
   7 │     7.0  0.394592
   8 │     8.0  0.38917
   9 │     9.0  0.812061
  10 │    10.0  0.620486

julia> @select(df, "{}_z" = @c zscore(All()))
10×2 DataFrame
 Row │ a_z        b_z       
     │ Float64    Float64   
─────┼──────────────────────
   1 │ -1.4863    -0.75574
   2 │ -1.15601    1.69399
   3 │ -0.825723   0.109712
   4 │ -0.495434   0.303506
   5 │ -0.165145  -1.34547
   6 │  0.165145  -1.20552
   7 │  0.495434  -0.314101
   8 │  0.825723  -0.334976
   9 │  1.15601    1.29306
  10 │  1.4863     0.55554

javier_mj · October 2, 2023, 4:27pm

Hello @bkamins
Is there any way to do this but with dataframes containing missings values in some (or all) columns?

bkamins · October 2, 2023, 6:30pm

Not so clean:

mapcols(col -> (col .- mean(skipmissing(col)) / std(skipmissing(col))), df)

tbeason · October 2, 2023, 7:26pm

But you could also define your own zscore function that ignored missings and then the mapcols call is clean again. Whatever floats your boat, as they say.

bkamins · October 2, 2023, 8:07pm

Yes, and I have opened zscore with missing values · Issue #898 · JuliaStats/StatsBase.jl · GitHub to discuss this. If you have an opinion please comment there.

Topic		Replies	Views
Apply transform() to all DataFrame columns of a certain type? General Usage dataframes	1	356	April 16, 2021
Change datatype for subset of DataFrame columns Data dataframes	6	559	February 6, 2022
Recommended way to select multiple DataFrame columns parametrically, i.e. not by the names New to Julia dataframes	9	2233	January 8, 2022
Sum the columns in a dataframe New to Julia	6	8555	June 24, 2021
Replacing one column with three (DataFrames.jl) Data dataframes	3	595	August 3, 2021

Standardize all columns of DataFrame

Related topics