I’ve been going through some of the DataFrames documentation and I feel I haven’t fully grokked the split-apply-combine strategy. Concretely, what I’d like to do is:
- Group a DataFrame by one column
- Normalize the data contained in another column within each group
- Recombine the grouped dataframe into one of the same size as the original
Say I have `df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3])` and I’d like to group by values of `b`, so I do `dfg = groupby(df,:b)`. So far, so good. What I’d like to do next is: for each vector element `x` in column `a`, calculate the mean `m` and standard deviation `s` over all vector elements within the group, apply the transformation `x -> (x-m)/s`, and finally reconstitute the original dataframe.
For instance, for the group given by `b=1`, we get `m = mean([1,2,3,4,5,6]) = 3.5` and similarly `s = std([1,2,3,4,5,6]) = sqrt(3.5)`, which would transform the `[1,2,3]` vector into `[-1.34 -0.80 -0.27]` and the vector `[4,5,6]` into `[0.27 0.80 1.34]`.
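To make the arithmetic for the `b=1` group concrete, here is a minimal sketch using only the `Statistics` standard library. It assumes `std` with its default (n-1) normalization, which is what gives `sqrt(3.5)`; the name `pooled` is just an illustrative choice.

```julia
using Statistics

# The two vectors from column `a` that belong to the group b = 1
group_vectors = [[1, 2, 3], [4, 5, 6]]

# Pool every vector element of the group into one flat vector
pooled = reduce(vcat, group_vectors)

m = mean(pooled)   # mean over all six elements
s = std(pooled)    # sample standard deviation (default, divides by n-1)

# Apply x -> (x - m) / s elementwise to each vector of the group
normalized = [(x .- m) ./ s for x in group_vectors]

# Rounded to two digits for comparison with the values quoted above
rounded = [round.(v, digits=2) for v in normalized]
```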
My solution so far is a for loop over the subdataframes of `dfg`, but I feel like there must be a better way using the split-apply-combine strategy.
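For reference, a loop along these lines is one way to write that approach. This is only a sketch, not necessarily the exact code in question: the output column name `a_norm` is made up here, and it relies on `parentindices` to map each subdataframe back to its rows in `df`.

```julia
using DataFrames, Statistics

df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]], b=[1,1,2,2,3])
dfg = groupby(df, :b)

# Output column (hypothetical name); normalization produces floats,
# so the element type is Vector{Float64} rather than Vector{Int}.
a_norm = Vector{Vector{Float64}}(undef, nrow(df))

for sdf in dfg
    pooled = reduce(vcat, sdf.a)       # all vector elements in this group
    m, s = mean(pooled), std(pooled)   # group-wide mean and standard deviation
    rows = parentindices(sdf)[1]       # row positions of this group in df
    for (i, x) in zip(rows, sdf.a)
        a_norm[i] = (x .- m) ./ s      # x -> (x - m) / s, elementwise
    end
end

# Attach the result so the dataframe keeps its original size and row order
df.a_norm = a_norm
```

Writing into a preallocated vector by row index keeps the result aligned with the original rows even when the groups are not contiguous.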
Any advice would be much appreciated.