Reapply groupby on a GroupedDataFrame

jeremiedb · March 12, 2021, 9:58pm

Is there a reason for groupby not supporting a GroupedDataFrame as an input?

In one use case, when I try to apply transforms based on different groupings, I hit an error:

dfg = groupby(df, "eid")
transform!(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum")
dfg = groupby(dfg, "date")

ERROR: MethodError: no method matching groupby(::GroupedDataFrame{DataFrame}, ::Array{String,1})
Closest candidates are:
  groupby(::AbstractDataFrame, ::Any; sort, skipmissing) at C:\Users\jerem\.julia\packages\DataFrames\oQ5c7\src\groupeddataframe\groupeddataframe.jl:187

One solution is to use the non-mutating transform along with ungroup=true, but on large data this is a very costly operation (in both time and RAM):

dfg = groupby(df, "eid")
df1 = transform(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum", ungroup=true)
dfg = groupby(df1, "date")

A way to circumvent this seems to be to call the second groupby on the parent of the GroupedDataFrame:

dfg = groupby(df, "eid")
transform!(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum")
dfg = groupby(dfg.parent, "date")

However, I have doubts about whether this latter approach might be exposed to undesired side effects. If it’s legitimate, wouldn’t it be desirable for groupby to handle a GroupedDataFrame as an input?

I might be anchored to the behavior of R’s data.table, where transformations can be called sequentially on different groupings through dt[, ..., by = "eid"] and dt[, ..., by = "date"]. In DataFrames.jl, can it be assumed that, to remove the grouping key, doing df = dfg.parent is equivalent to data.table’s setkey(dt, NULL)?

pdeffebach · March 12, 2021, 10:11pm

Wait, I don’t observe that behavior:

julia> x = DataFrame(a =[1, 2], b = [100, 200])
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1    100
   2 │     2    200

julia> gd = groupby(x, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1    100
⋮
Last Group (1 row): a = 2
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     2    200

julia> transform!(gd, :a => first => :c)
2×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1    100      1
   2 │     2    200      2

transform! returns a DataFrame.

jeremiedb · March 12, 2021, 10:39pm

You’re totally correct!
I think I stuck to a narrow interpretation of the mutation concept in transform! and wrongly assumed it only modified its input. Assigning the result of the call, as in df = transform!(dfg...), effectively behaves as desired.
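
For reference, here is a minimal sketch of the pattern the thread settled on. The toy data and the variable names (gd_eid, df2, gd_date) are made up for illustration; the column names follow the original question. It assumes a recent DataFrames.jl, where transform! on a GroupedDataFrame with the default ungroup=true updates the parent in place and returns that parent.

```julia
using DataFrames

# Hypothetical toy data; only the column names come from the question above.
df = DataFrame(eid = [1, 1, 2, 2],
               date = ["d1", "d2", "d1", "d2"],
               amount = [10, 20, 30, 40])

gd_eid = groupby(df, "eid")

# transform! adds the new column to the parent data frame in place and,
# with the default ungroup=true, returns that parent DataFrame
# rather than a GroupedDataFrame.
df2 = transform!(gd_eid, "amount" => cumsum => "amount_cumsum")

# The returned DataFrame can be regrouped directly on another key.
gd_date = groupby(df2, "date")
```

Since df2 is the same object as df, no copy of the data is made between the two groupings. parent(gd_eid) should give the same data frame through the public accessor, as an alternative to reading the .parent field.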
