I’ve been going through some of the DataFrames documentation and I feel I haven’t fully grokked the split-apply-combine strategy. Concretely, what I’d like to do is:
- group a DataFrame by one column
- normalize the data contained in another column within each group
- recombine the grouped DataFrame into one of the same size as the original
Say I have df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]],b=[1,1,2,2,3]) and I’d like to group by the values of b, so I do dfg = groupby(df,:b). So far, so good. Next, within each group, I’d like to compute the mean m and standard deviation s over all elements of all vectors in column a, apply the transformation x -> (x-m)/s to every element x, and finally reconstitute a DataFrame with the same rows as the original.
For instance, for the group b=1 we get m = mean([1,2,3,4,5,6]) = 3.5 and similarly s = std([1,2,3,4,5,6]) = sqrt(3.5) ≈ 1.87, which transforms the vector [1,2,3] into [-1.34, -0.80, -0.27] and the vector [4,5,6] into [0.27, 0.80, 1.34].
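To make the arithmetic for the b=1 group concrete, here is a minimal sketch using only the Statistics standard library. The names group, pooled and normalized are just illustrative, and I am assuming the sample standard deviation (the default of std) is the one intended.

```julia
using Statistics

# The two vectors that belong to the group b = 1
group = [[1, 2, 3], [4, 5, 6]]

# Pool all vector elements of the group into one flat vector
pooled = reduce(vcat, group)

m = mean(pooled)   # 3.5
s = std(pooled)    # sample standard deviation, sqrt(3.5)

# Apply x -> (x - m) / s elementwise to each vector of the group
normalized = [(x .- m) ./ s for x in group]
```

Rounded to two decimals, normalized holds the two vectors quoted above.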
My solution so far is a for loop over the subdataframes of dfg, but I feel like there must be a better way using the split-apply-combine strategy.
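For reference, the loop approach looks roughly like the sketch below. This is an illustrative reconstruction rather than my exact code: the column name a_norm is hypothetical, and I am assuming that sdf.a_norm on a subdataframe gives a view into the parent column, so that writing to it updates df.

```julia
using DataFrames, Statistics

df = DataFrame(a=[[1,2,3],[4,5,6],[7,8,9],[9,10,11],[10,11,12]], b=[1,1,2,2,3])

# Hypothetical output column, pre-filled with Float64 copies of a
df.a_norm = [float.(x) for x in df.a]

for sdf in groupby(df, :b)
    # Mean and standard deviation over all vector elements in this group
    pooled = reduce(vcat, sdf.a)
    m, s = mean(pooled), std(pooled)

    # View into df.a_norm restricted to this group's rows (assumption, see above)
    col = sdf.a_norm
    for i in eachindex(col)
        col[i] = (sdf.a[i] .- m) ./ s
    end
end
```

After the loop, df has the same five rows as before, with the normalized vectors in a_norm.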
Any advice would be much appreciated.