How to `combine` row vectors

Hi All,

Just wondering how I can apply a function to rows containing vectors when doing a groupby → combine:

using DataFrames
df = DataFrame(group = ["A", "A", "B", "B"], v1 = [ [1, 2], [3, 4], [5, 6], [7, 8] ], v2 = [ [1, 2], [3, 4], [5, 6], [7, 8] ])
gdf = groupby(df, :group)
DataFrames.combine(gdf, 2:3 .=> sum)

returns

4×3 DataFrame
 Row │ group   v1_sum  v2_sum 
     │ String  Int64   Int64  
─────┼────────────────────────
   1 │ A            4       4
   2 │ A            6       6
   3 │ B           12      12
   4 │ B           14      14

but I would like two rows, with column 2 and 3 containing vectors of length 2

The following works:

julia> DataFrames.combine(gdf, 2:3 .=> Ref∘sum)
2×3 DataFrame
 Row │ group   v1_Ref_sum  v2_Ref_sum 
     │ String  Array…      Array…     
─────┼────────────────────────────────
   1 │ A       [4, 6]      [4, 6]
   2 │ B       [12, 14]    [12, 14]

But there does seem to be some conceptual problem.

4 Likes

If it’s easier for you to follow, the following form also produces what you ask for.

combine(g->(;v1=[sum(g.v1)], v2=[sum(g.v2)]),gdf)

PS
I, like Dan, have some difficulty imagining the logic behind the request

PPS
I don’t know if in this case some variant of the unstack function (that accepts multiple columns) can get the result in one pass.

2 Likes

another way: split-(apply)combine-

dfspl=select(df, :group,[:v1,:v2] .=>ByRow(x->(x...,)).=>x->x.*string.([1,2]) )

cdfspl=combine(groupby(dfspl,:group), Cols(2:5).=>sum .=>identity)

jcdf=select(cdfspl,:group, [Cols(2:3),Cols(4:5)].=>ByRow((x...)->Base.vect(x)).=>x->x[1][1:end-1])

The data frame that I’m working with has columns that store time series vectors. My goal is simply to groupby and sum retaining the original dimensionality of the time series. Is that sensible?

I would have never thought of
Ref ∘ sum

julia> df = DataFrame(group = ["A", "A", "B", "B"], v1 = [ [1, 2], [3, 4], [5, 6], [7, 8] ], v2 = [ [11, 21], [31, 41], [51, 61], [71, 81] ])   
4×3 DataFrame
 Row │ group   v1      v2       
     │ String  Array…  Array…
─────┼──────────────────────────
   1 │ A       [1, 2]  [11, 21]
   2 │ A       [3, 4]  [31, 41]
   3 │ B       [5, 6]  [51, 61]
   4 │ B       [7, 8]  [71, 81]

julia> sdf=stack(df,[:v1,:v2])
8×3 DataFrame
 Row │ group   variable  value    
     │ String  String    Array…
─────┼────────────────────────────
   1 │ A       v1        [1, 2]
   2 │ A       v1        [3, 4]
   3 │ B       v1        [5, 6]
   4 │ B       v1        [7, 8]
   5 │ A       v2        [11, 21]
   6 │ A       v2        [31, 41]
   7 │ B       v2        [51, 61]
   8 │ B       v2        [71, 81]

julia> unstack(sdf,:group,:variable,:value, combine=sum)
2×3 DataFrame
 Row │ group   v1        v2
     │ String  Array…?   Array…?
─────┼──────────────────────────────
   1 │ A       [4, 6]    [42, 62]
   2 │ B       [12, 14]  [122, 142]

the inverse operation

flatten(unstack(sdf,:group,:variable,:value, combine=copy),[:v1,:v2])
1 Like