Stack overflow in DataFrames group by

Yes, maybe, but there are several levels of complexity regarding what the function returns for each group:

  1. a single scalar
  2. several scalars (e.g. as a tuple, NamedTuple or a single-row DataFrame)
  3. a multiple-row DataFrame (possibly with a varying number of rows depending on the group)
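
To make the three cases concrete, here is a rough sketch in plain Python (not DataFrames.jl or Pandas code). The helper names `group_rows` and `combine` and the placeholder column name `x1` are made up for illustration. The point is only that a general `by`-like function has to inspect what each group returned before it can stack the results.

```python
from collections import defaultdict

def group_rows(rows, key):
    """Split a list of dict rows into groups keyed by row[key]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

def combine(rows, key, f):
    """Hypothetical `by`-like helper: apply f to each group, then stack the results."""
    out = []
    for k, group in group_rows(rows, key).items():
        res = f(group)
        if isinstance(res, list):      # case 3: any number of rows per group
            new_rows = res
        elif isinstance(res, dict):    # case 2: several named scalars, i.e. one row
            new_rows = [res]
        else:                          # case 1: a single scalar
            new_rows = [{"x1": res}]   # "x1" is a placeholder column name
        out.extend({key: k, **r} for r in new_rows)
    return out

rows = [
    {"g": "a", "v": 1},
    {"g": "a", "v": 3},
    {"g": "b", "v": 5},
]

scalar = combine(rows, "g", lambda grp: sum(r["v"] for r in grp))
named = combine(rows, "g", lambda grp: {"lo": min(r["v"] for r in grp),
                                        "hi": max(r["v"] for r in grp)})
multi = combine(rows, "g", lambda grp: [{"v2": r["v"] * 2} for r in grp])
```

In a dynamically typed sketch like this the branch on the return value is cheap to write. In Julia the same branch is where type stability matters, since the result columns cannot be allocated with a concrete type until the first group has been processed.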

Each of these cases can be either type-stable or not, which makes things even more tricky.

FWIW, Pandas provides three different functions:

  • aggregate to return a single value for each column and each group (similar to our aggregate, which is a bit more general)
  • transform to return a DataFrame with the same shape as the original for each group
  • apply for general transformations (similar to our by)

The advantage of transform over apply is that the size of the result is known in advance, so if the output type can also be predicted, the result can be written directly into a preallocated column instead of going through a temporary copy. The same holds for aggregate, which can be even more efficient since it can operate column by column (making inference and specialization easier). The fact that we allow aggregate to return either a scalar or a vector makes it harder to optimize, cf. this PR.
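
A minimal sketch of that difference, again in plain Python with made-up helper names (`transform_like`, `apply_like`, `group_indices`); neither library is necessarily implemented this way. The transform-style function can allocate its output once because each group returns as many values as it received, while the apply-style function has to collect per-group pieces and concatenate them afterwards.

```python
def group_indices(keys):
    """Map each group key to the list of row positions belonging to it."""
    idx = {}
    for i, k in enumerate(keys):
        idx.setdefault(k, []).append(i)
    return idx

def transform_like(keys, values, f):
    """f must return one value per input value, so the output size is known up front."""
    out = [None] * len(values)  # single allocation, same length as the input
    for ix in group_indices(keys).values():
        res = f([values[i] for i in ix])
        if len(res) != len(ix):
            raise ValueError("transform must preserve the group length")
        for i, r in zip(ix, res):
            out[i] = r          # write in place, in the original row order
    return out

def apply_like(keys, values, f):
    """f may return any number of values, so pieces are collected and then concatenated."""
    pieces = [(k, f([values[i] for i in ix]))
              for k, ix in group_indices(keys).items()]
    out_keys, out_vals = [], []
    for k, res in pieces:       # total size is only known once every group has run
        out_keys.extend([k] * len(res))
        out_vals.extend(res)
    return out_keys, out_vals

def demean(vs):
    m = sum(vs) / len(vs)
    return [v - m for v in vs]

keys = ["a", "b", "a", "b"]
values = [1.0, 10.0, 3.0, 30.0]

centered = transform_like(keys, values, demean)
big_keys, big_vals = apply_like(keys, values, lambda vs: [v for v in vs if v > 2.0])
```

Python lists are untyped, so the "predict the output type" half of the argument does not show up here. In Julia the preallocated column would additionally need a concrete element type, which is why inference matters.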

See JuliaLang/julia issue #21672 on GitHub: `stack(vec_of_vecs)` for `vcat(vec_of_vecs...)`.
