Normalizing DataFrame column by group

jbr · October 25, 2022, 3:14am

I’ve been going through some of the DataFrames documentation and I feel I haven’t fully grokked the split-apply-combine strategy. Concretely, what I’d like to do is:

Group a DataFrame by one column
Normalize the data contained in another column within each group
Recombine the grouped DataFrame into one of the same size as the original

Say I have df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3]) and I’d like to group by values of b, so I do dfg = groupby(df,:b). So far, so good. What I’d like to do next is: for each vector element x in column a, calculate mean m and standard deviation s over all vector elements within the group and apply the transformation x -> (x-m)/s and finally reconstitute the original dataframe.

For instance, for the group given by b=1, we get m = mean([1,2,3,4,5,6]) = 3.5 and similarly s = std([1,2,3,4,5,6]) = sqrt(3.5) ≈ 1.87, which would transform the vector [1,2,3] into [-1.34, -0.80, -0.27] and the vector [4,5,6] into [0.27, 0.80, 1.34].
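The numbers in this example can be checked directly. The following is a minimal sketch, assuming the sample standard deviation (dividing by n - 1), which is the default behavior of std in Julia's Statistics standard library; the variable names pooled, z1 and z2 are only illustrative.

```julia
using Statistics

# All elements of the two vectors in the b = 1 group, pooled together
pooled = [1, 2, 3, 4, 5, 6]

m = mean(pooled)   # group mean
s = std(pooled)    # sample standard deviation (divides by n - 1)

# Standardize each original vector with the pooled group statistics
z1 = ([1, 2, 3] .- m) ./ s
z2 = ([4, 5, 6] .- m) ./ s
```

Here the sum of squared deviations from 3.5 is 17.5, and 17.5 / 5 = 3.5, which is why s equals sqrt(3.5).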

My solution so far is a for loop over the subdataframes of dfg, but I feel like there must be a better way using the split-apply-combine strategy?

Any advice would be much appreciated.

alfaromartino · October 25, 2022, 4:32am

This is my solution. I don’t usually work with DataFrames having arrays as elements, so I bet there’s a briefer way to do this.

using DataFrames, StatsBase
df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3]) 
dfg = groupby(df,:b)

# Collect all the elements of a vector of vectors into a single flat vector
flattened_values(x) = collect(Iterators.flatten(x))

transform!(dfg, :a => mean ∘ flattened_values => :mean_a, 
                :a => std  ∘ flattened_values => :std_a)
select!(dfg, [:a,:mean_a,:std_a] => ByRow( (x,y,z) -> (x .- y) ./ z) => :normalized_a)

bkamins · October 25, 2022, 8:32am

What you ask for is typically called standardization.

Is this what you want?

julia> using DataFrames, Statistics

julia> df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3])
5×2 DataFrame
 Row │ a             b
     │ Array…        Int64
─────┼─────────────────────
   1 │ [1, 2, 3]         1
   2 │ [4, 5, 6]         1
   3 │ [7, 8, 9]         2
   4 │ [9, 10, 11]       2
   5 │ [10, 11, 12]      3

julia> function standardize(vx)
           m = mean(Iterators.flatten(vx))
           s = std(Iterators.flatten(vx))
           return [(v .- m) ./ s for v in vx]
       end
standardize (generic function with 1 method)

julia> combine(groupby(df, :b), :a => standardize)
5×2 DataFrame
 Row │ b      a_standardize
     │ Int64  Array…
─────┼─────────────────────────────────────────
   1 │     1  [-1.33631, -0.801784, -0.267261]
   2 │     1  [0.267261, 0.801784, 1.33631]
   3 │     2  [-1.41421, -0.707107, 0.0]
   4 │     2  [0.0, 0.707107, 1.41421]
   5 │     3  [-1.0, 0.0, 1.0]
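Since the original question asks for a result of the same size as the original, a variant using transform instead of combine may also be useful. This is a sketch under the assumption that the original column a should be kept alongside the standardized one; the output column name :a_std is arbitrary and not part of the answer above.

```julia
using DataFrames, Statistics

df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]], b=[1,1,2,2,3])

# Same function as in the REPL session above
function standardize(vx)
    m = mean(Iterators.flatten(vx))
    s = std(Iterators.flatten(vx))
    return [(v .- m) ./ s for v in vx]
end

# transform keeps every original column and the original row order,
# and adds the standardized vectors as a new column
result = transform(groupby(df, :b), :a => standardize => :a_std)
```

Unlike combine, which returns rows in group order, transform on a GroupedDataFrame returns the rows in the order of the parent data frame, which matters when the groups are not already contiguous.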

jbr · October 25, 2022, 7:22pm

Thank you both for your solutions, that’s very helpful. I found the documentation for combine() somewhat confusing, but your examples helped. Much appreciated!

bkamins · October 25, 2022, 8:23pm

Could you point me to what was confusing, so that we can fix this? Thank you!

jbr · October 30, 2022, 2:47pm

It’s possible I missed something but I wasn’t able to find a short self-contained description of how combine() actually combines the groups.

For instance, the help accessible by typing ?combine states “combine: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups in GroupedDataFrame…” which doesn’t specify exactly how the groups are combined. I think adding a description something like “combine() applies a function to each group in a GroupedDataFrame and concatenates the results into a single DataFrame” would be helpful.
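To illustrate the proposed description, here is a conceptual sketch that applies a function to each group by hand and concatenates the results, then compares that with combine. It is not how DataFrames.jl implements combine internally, and pooled_mean is a hypothetical helper introduced only for this example.

```julia
using DataFrames, Statistics

df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]], b=[1,1,2,2,3])
gdf = groupby(df, :b)

# Hypothetical per-group function: the mean over all vector elements in a group
pooled_mean(vx) = mean(Iterators.flatten(vx))

# By hand: build one small data frame per group, then concatenate them
pieces = [DataFrame(b = [sdf.b[1]], a_pooled_mean = [pooled_mean(sdf.a)]) for sdf in gdf]
manual = reduce(vcat, pieces)

# With combine: the same result in one call, rows ordered by group
auto = combine(gdf, :a => pooled_mean)
```

In this small case manual == auto holds, with one row per group in the order of the groups in gdf.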
