Is there a reason for groupby not supporting a GroupedDataFrame as an input?
In one use case, when I try to apply transforms based on different groupings, I hit an error:
dfg = groupby(df, "eid")
transform!(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum")
dfg = groupby(dfg, "date")
ERROR: MethodError: no method matching groupby(::GroupedDataFrame{DataFrame}, ::Array{String,1})
Closest candidates are:
groupby(::AbstractDataFrame, ::Any; sort, skipmissing) at C:\Users\jerem\.julia\packages\DataFrames\oQ5c7\src\groupeddataframe\groupeddataframe.jl:187
A solution could be to use the non-mutating transform along with ungroup=true, but on large data it is a very costly operation (in both time and RAM):
dfg = groupby(df, "eid")
df1 = transform(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum", ungroup=true)
dfg = groupby(df1, "date")
A way to circumvent this seems to be to call the second groupby on the parent of the GroupedDataFrame:
dfg = groupby(df, "eid")
transform!(dfg, "amount" => ((x) -> cumsum(x)) => "amount_cumsum")
dfg = groupby(dfg.parent, "date")
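For reference, here is a minimal sketch of the same idea on toy data, using the exported parent function instead of accessing the field directly. The column names mirror the ones above, the values are made up, and dfg_date is just an illustrative name; this is only meant to show what I expect to happen, not a claim about the recommended approach.

```julia
using DataFrames

# Toy data: column names follow the example above, values are invented.
df = DataFrame(eid = [1, 1, 2, 2],
               date = ["d1", "d2", "d1", "d2"],
               amount = [10, 20, 30, 40])

dfg = groupby(df, "eid")

# Mutates the parent data frame in place, adding the per-eid cumulative sum.
transform!(dfg, "amount" => cumsum => "amount_cumsum")

# parent(dfg) returns the data frame the groups were built from, without copying.
@assert parent(dfg) === df

# Regroup the same (now mutated) data frame by a different key.
dfg_date = groupby(parent(dfg), "date")
```

Since parent(dfg) is the very same object as df here, calling groupby(df, "date") directly should behave identically.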
However, I have doubts about whether this latter approach might be exposed to some undesired side effects. If it's legit, then wouldn't it be desirable for groupby to handle a GroupedDataFrame as an input?
I might have some anchoring with regard to the behavior of R's data.table, where transformations can be called sequentially on different groupings through dt[, ..., by = "eid"] and dt[, ..., by = "date"]. In DataFrames.jl, can it be assumed that, to remove the grouping key, doing df = dfg.parent would be equivalent to data.table's setkey(dt, NULL)?