Surely this is answered somewhere, but what is the easiest way to standardize all columns of a DataFrame
? There are a bunch of methods in pandas to do it, or you can just
df = (df - df.mean())/df.std()
how to do the same with DataFrames.jl
?
Surely this is answered somewhere, but what is the easiest way to standardize all columns of a DataFrame
? There are a bunch of methods in pandas to do it, or you can just
df = (df - df.mean())/df.std()
how to do the same with DataFrames.jl
?
If you mean all of them you can do:
using Statistics, DataFrames
df = DataFrame(:a => 1:10.0, :b => rand(10))
foreach(c -> c .= (c .- mean(c)) ./ std(c), eachcol(df))
But this needs the columns to already accept floating point numbers beforehand.
One difference from pandas is that the transform
function will act on the entire table of values if you only give it the transforming function. So you have to select all columns and then map each to the transformer so that it takes them in turn.
using StatsBase # may require installation, has zscore()
transform(df,zscore) # fails
transform(df,All() .=> zscore) # appends new columns to old
select(df,All() .=> zscore) # only new columns
You can also use the keyword argument renamecols = false
julia> using DataFrames, StatsBase
julia> df = DataFrame(rand(10, 4), :auto)
10ร4 DataFrame
Row โ x1 x2 x3 x4
โ Float64 Float64 Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 0.399058 0.422599 0.00699281 0.0635379
2 โ 0.265661 0.787594 0.217078 0.675266
3 โ 0.292278 0.0477572 0.0380437 0.0680113
4 โ 0.301224 0.740652 0.683419 0.97809
5 โ 0.0729834 0.927029 0.606946 0.0202961
6 โ 0.992137 0.203939 0.425782 0.547135
7 โ 0.321505 0.151547 0.819317 0.627988
8 โ 0.790473 0.326784 0.24795 0.465694
9 โ 0.284565 0.663169 0.725925 0.499605
10 โ 0.282419 0.340454 0.147984 0.842851
julia> mapcols(zscore, df)
10ร4 DataFrame
Row โ x1 x2 x3 x4
โ Float64 Float64 Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ -0.00425657 -0.128547 -1.28255 -1.24632
2 โ -0.488579 1.08844 -0.582604 0.589441
3 โ -0.391941 -1.37836 -1.1791 -1.2329
4 โ -0.359462 0.931921 0.971114 1.4982
5 โ -1.18813 1.55335 0.716329 -1.37609
6 โ 2.14903 -0.857614 0.112739 0.204928
7 โ -0.285827 -1.0323 1.42389 0.447564
8 โ 1.41685 -0.448018 -0.479747 -0.0394733
9 โ -0.419944 0.673572 1.11274 0.062291
10 โ -0.427735 -0.402439 -0.812807 1.09236
In DataFrameMacros.jl you can also use the quick rename feature for this, "{}_z"
just appends z
to every input column name. I find that simplifies things sometimes, because the auto-chosen column names can get long.
julia> df = DataFrame(:a => 1:10.0, :b => rand(10))
10ร2 DataFrame
Row โ a b
โ Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโ
1 โ 1.0 0.279875
2 โ 2.0 0.916205
3 โ 3.0 0.50468
4 โ 4.0 0.555019
5 โ 5.0 0.126688
6 โ 6.0 0.163041
7 โ 7.0 0.394592
8 โ 8.0 0.38917
9 โ 9.0 0.812061
10 โ 10.0 0.620486
julia> @select(df, "{}_z" = @c zscore(All()))
10ร2 DataFrame
Row โ a_z b_z
โ Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโ
1 โ -1.4863 -0.75574
2 โ -1.15601 1.69399
3 โ -0.825723 0.109712
4 โ -0.495434 0.303506
5 โ -0.165145 -1.34547
6 โ 0.165145 -1.20552
7 โ 0.495434 -0.314101
8 โ 0.825723 -0.334976
9 โ 1.15601 1.29306
10 โ 1.4863 0.55554
Hello @bkamins
Is there any way to do this but with dataframes containing missings values in some (or all) columns?
Not so clean:
mapcols(col -> (col .- mean(skipmissing(col)) / std(skipmissing(col))), df)
But you could also define your own zscore
function that ignored missings and then the mapcols
call is clean again. Whatever floats your boat, as they say.
Yes, and I have opened zscore with missing values ยท Issue #898 ยท JuliaStats/StatsBase.jl ยท GitHub to discuss this. If you have an opinion please comment there.