Reapply groupby on a GroupedDataFrame

Is there a reason for groupby not supporting a GroupedDataFrame as an input?

In my use case, applying transforms based on different groupings in sequence raises an error:

dfg = groupby(df, "eid")
transform!(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum")
dfg = groupby(dfg, "date")

ERROR: MethodError: no method matching groupby(::GroupedDataFrame{DataFrame}, ::Array{String,1})
Closest candidates are:
  groupby(::AbstractDataFrame, ::Any; sort, skipmissing) at C:\Users\jerem\.julia\packages\DataFrames\oQ5c7\src\groupeddataframe\groupeddataframe.jl:187

One solution is to use the non-mutating transform with ungroup=true, but on large data this is very costly in both time and RAM:

dfg = groupby(df, "eid")
df1 = transform(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum", ungroup=true)
dfg = groupby(df1, "date")

One way around this seems to be calling the second groupby on the parent of the GroupedDataFrame:

dfg = groupby(df, "eid")
transform!(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum")
dfg = groupby(dfg.parent, "date")

However, I am not sure whether this latter approach is exposed to undesired side effects. If it's legitimate, wouldn't it be desirable for groupby to accept a GroupedDataFrame as input?

I may be anchored on the behavior of R's data.table, where transformations can be applied sequentially on different groupings through dt[, ..., by = "eid"] and dt[, ..., by = "date"]. In DataFrames.jl, can it be assumed that removing the grouping key with df = dfg.parent is equivalent to data.table's setkey(dt, NULL)?

Wait, I don't observe that behavior:

julia> x = DataFrame(a =[1, 2], b = [100, 200])
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    100
   2 │     2    200

julia> gd = groupby(x, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    100
⋮
Last Group (1 row): a = 2
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2    200

julia> transform!(gd, :a => first => :c)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1    100      1
   2 │     2    200      2

transform! returns a DataFrame


You're totally correct!
I think I stuck to a narrow interpretation of mutation in transform! and wrongly assumed it only modified its input. Assigning the result of the call, df = transform!(dfg...), effectively behaves as desired.
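
For anyone landing here later, a minimal sketch of the resulting workflow. The table contents are made up for illustration; only the column names come from the original question. It assumes a recent DataFrames.jl, where transform! on a GroupedDataFrame with the default ungroup=true returns the parent DataFrame itself, and where parent(dfg) is the accessor for what was written above as dfg.parent.

```julia
using DataFrames

# Illustrative data; the values are assumptions, not from the original post.
df = DataFrame(eid = [1, 1, 2, 2],
               date = ["d1", "d2", "d1", "d2"],
               amount = [10, 20, 30, 40])

dfg = groupby(df, "eid")

# transform! adds the new column to the parent in place and, with the
# default ungroup=true, returns that parent DataFrame.
df2 = transform!(dfg, "amount" => cumsum => "amount_cumsum")

# The returned object is the original table, so no copy was made.
same_object = df2 === df
same_parent = parent(dfg) === df

# Regroup on a different key using the returned DataFrame.
dfg_date = groupby(df2, "date")
```

Since the returned value and parent(dfg) are the same object in this sketch, regrouping either one should be equivalent, which suggests the dfg.parent workaround was not doing anything unsafe.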