Normalizing DataFrame column by group

I’ve been going through some of the DataFrames documentation and I feel I haven’t fully grokked the split-apply-combine strategy. Concretely, what I’d like to do is:

  1. group a DataFrame by one column
  2. Normalize the data contained in another column within each group
  3. Recombine the grouped dataframe into one of the same size as the original

Say I have df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3]) and I’d like to group by values of b, so I do dfg = groupby(df,:b). So far, so good. What I’d like to do next is: for each vector element x in column a, calculate mean m and standard deviation s over all vector elements within the group and apply the transformation x -> (x-m)/s and finally reconstitute the original dataframe.

For instance, for the group given by b=1, we get m=mean([1,2,3,4,5,6])=3.5 and similarly s = sqrt(3.5) which would transform the [1,2,3] vector into [-1.34 -0.80 -0.27] and the vector [4,5,6] into [0.27 0.80 1.34].

My solution so far is a for loop over the subdataframes of dfg, but I feel like there must be a better way using the split-apply-combine strategy?

Any advice would be much appreciated.

This is my solution. I don’t usually work with DataFrames having arrays as elements, so I bet there’s a briefer way to do this.

using DataFrames, StatsBase
df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3]) 
dfg = groupby(df,:b)

#function that collects all the elements of vectors comprising vectors
flattened_values(x) = collect(Iterators.flatten(x))

transform!(dfg, :a => mean ∘ flattened_values => :mean_a, 
                :a => std  ∘ flattened_values => :std_a)
select!(dfg, [:a,:mean_a,:std_a] => ByRow( (x,y,z) -> (x .- y) ./ z) => :normalized_a)

What you ask for is typically called standardization.

Is this what you want?

julia> df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3])
5×2 DataFrame
 Row │ a             b
     │ Array…        Int64
─────┼─────────────────────
   1 │ [1, 2, 3]         1
   2 │ [4, 5, 6]         1
   3 │ [7, 8, 9]         2
   4 │ [9, 10, 11]       2
   5 │ [10, 11, 12]      3

julia> function standardize(vx)
           m = mean(Iterators.flatten(vx))
           s = std(Iterators.flatten(vx))
           return [(v .- m) ./ s for v in vx]
       end
standardize (generic function with 1 method)

julia> combine(groupby(df, :b), :a => standardize)
5×2 DataFrame
 Row │ b      a_standardize
     │ Int64  Array…
─────┼─────────────────────────────────────────
   1 │     1  [-1.33631, -0.801784, -0.267261]
   2 │     1  [0.267261, 0.801784, 1.33631]
   3 │     2  [-1.41421, -0.707107, 0.0]
   4 │     2  [0.0, 0.707107, 1.41421]
   5 │     3  [-1.0, 0.0, 1.0]
1 Like

Thank you both for your solutions, that’s very helpful. I found the documentation for combine() somewhat confusing, but your examples helped. Much appreciated!

Could you point me to what was confusing, so that we can fix this? Thank you!

It’s possible I missed something but I wasn’t able to find a short self-contained description of how combine() actually combines the groups.

For instance, the help accessible by typing ?combine states “combine: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups in GroupedDataFrame…” which doesn’t specify exactly how the groups are combined. I think adding a description something like “combine() applies a function to each group in a GroupedDataFrame and concatenates the results into a single DataFrame” would be helpful.