DataFrames.jl: keep group by result in every row

Using R data.table I’m used to saving the result of a group by aggregation in each row of each group, e.g.:

> d = data.table(a=c(1,2,3,4), b=c(1,1,2,2))
> d
   a b
1: 1 1
2: 2 1
3: 3 2
4: 4 2
> d[,s:=sum(a), b]
> d
   a b s
1: 1 1 3
2: 2 1 3
3: 3 2 7
4: 4 2 7

The last command groups by the column b, sums the values in a and writes the result in each row of the groups from b.

Using DataFrames.jl, I’ve currently always been doing this:

julia> d = DataFrame(a=[1,2,3,4], b=[1,1,2,2])
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 1     │
│ 3   │ 3     │ 2     │
│ 4   │ 4     │ 2     │

julia> join(by(d, :b, g -> sum(g[:, :a])), d, on=:b)
4×3 DataFrame
│ Row │ b     │ x1    │ a     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 3     │ 1     │
│ 2   │ 1     │ 3     │ 2     │
│ 3   │ 2     │ 7     │ 3     │
│ 4   │ 2     │ 7     │ 4     │

This seems to me to be a bit complicated to write, hard to read and probably inefficient, due to the unnecessary join.

Is there a better way?

1 Like

DataFramesMeta’s @transform macro on a grouped DataFrame will do what you want.

This kind of transformation is something that will hopefully be added to DataFrames soon.

1 Like

Thanks, in case anybody is interested, here it is:

> using DataFramesMeta
> @transform(groupby(d, :b), s=sum(:a))
4×3 DataFrame
│ Row │ a     │ b     │ s     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 3     │
│ 2   │ 2     │ 1     │ 3     │
│ 3   │ 3     │ 2     │ 7     │
│ 4   │ 4     │ 2     │ 7     │

2 Likes

Here’s another solution:

sumfill(x) = (a=x, s=fill(sum(x), length(x))) 
by(d, :b, :a=>sumfill)
1 Like

I’ve just made a small macro package that approximates data.table syntax: [ANN] FilteredGroupbyMacro.jl

It has the assignment syntax you want as well, although it uses a join in the background. At least you don’t have to write that out.

Your example would be: (the grouping variable comes second)

@by d[!, :b, s := sum(:a)]
2 Likes