Stack overflow in DataFrames group by

nalimilan · October 10, 2017, 7:53pm

Yes, maybe, but there are several levels of complexity regarding what the function returns for each group:

a single scalar
several scalars (e.g. as a tuple, NamedTuple or a single-row DataFrame)
a multiple-row DataFrame (possibly with a varying number of rows depending on the group)

Each of these cases can be either type-stable or not, which makes things even trickier.
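To make the three cases concrete, here is a rough sketch in plain Python (not DataFrames.jl or Pandas code). The helper names `group_rows`, `normalize` and `by`, and the default column names `x1`, `x2`, ... are all made up for illustration. It only shows that a general `by`-like function has to inspect what each group returned before it can build the result:

```python
from collections import defaultdict

def group_rows(rows, key):
    """Split a list of dict rows into {key value: [rows]}, keeping first-seen order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

def normalize(result):
    """Turn whatever the per-group function returned into a list of row dicts."""
    if isinstance(result, list):      # case 3: several rows (list of dicts)
        return [dict(r) for r in result]
    if isinstance(result, dict):      # case 2: several named scalars, one row
        return [dict(result)]
    if isinstance(result, tuple):     # case 2: several unnamed scalars, one row
        return [{f"x{i + 1}": v for i, v in enumerate(result)}]
    return [{"x1": result}]           # case 1: a single scalar

def by(rows, key, f):
    """Hypothetical general group-apply: call f on each group, then stack the rows."""
    out = []
    for k, grp in group_rows(rows, key).items():
        for r in normalize(f(grp)):
            out.append({key: k, **r})
    return out

rows = [{"g": "a", "v": 1}, {"g": "a", "v": 3}, {"g": "b", "v": 10}]

scalar = by(rows, "g", lambda grp: sum(r["v"] for r in grp))
named = by(rows, "g", lambda grp: {"lo": min(r["v"] for r in grp),
                                   "hi": max(r["v"] for r in grp)})
multi = by(rows, "g", lambda grp: [{"v2": r["v"] * 2} for r in grp])
```

Because the shape is only discovered at run time, inside `normalize`, nothing about the output can be decided before the first group has been processed. That is roughly where the type-stability question comes in.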

FWIW, Pandas provides three different functions:

`aggregate` to return a single value for each column and each group (similar to our `aggregate`, which is a bit more general)
`transform` to return a DataFrame with the same shape as the original for each group
`apply` for general transformations (similar to our `by`)

The advantage of `transform` over `apply` is that you know in advance the size of the result, so you can avoid allocating a temporary copy if you can predict the output type. Same for `aggregate`, which can be even more efficient since it can operate by columns (making inference and specialization easier). The fact that we allow `aggregate` to return either a scalar or a vector makes it harder to optimize, cf. this PR.
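The size argument can be sketched in plain Python. This is not how Pandas or DataFrames.jl implement it, and `group_indices`, `transform_like` and `apply_like` are hypothetical names. The transform-style function allocates its output once and fills it in place, while the apply-style function has to collect per-group pieces and concatenate them afterwards:

```python
def group_indices(keys):
    """Map each key to the list of row positions where it occurs."""
    idx = {}
    for i, k in enumerate(keys):
        idx.setdefault(k, []).append(i)
    return idx

def transform_like(keys, values, f):
    """f returns one value per input row, so the output length is known up front."""
    out = [None] * len(values)        # allocated once, filled in place
    for positions in group_indices(keys).values():
        res = f([values[i] for i in positions])
        if len(res) != len(positions):
            raise ValueError("transform must preserve the group length")
        for i, r in zip(positions, res):
            out[i] = r
    return out

def apply_like(keys, values, f):
    """f may return any number of values per group, so pieces are collected first."""
    pieces = []
    for k, positions in group_indices(keys).items():
        pieces.append((k, f([values[i] for i in positions])))
    out_keys, out_vals = [], []       # final size only known after all groups ran
    for k, piece in pieces:
        out_keys.extend([k] * len(piece))
        out_vals.extend(piece)
    return out_keys, out_vals

keys = ["a", "b", "a", "b", "a"]
values = [1.0, 10.0, 2.0, 30.0, 3.0]

# Subtract the group mean: same number of rows out as in.
demeaned = transform_like(keys, values, lambda v: [x - sum(v) / len(v) for x in v])

# Keep the two largest values per group: the row count depends on the function.
top_keys, top_vals = apply_like(keys, values, lambda v: sorted(v)[-2:])
```

In a typed setting the preallocation additionally requires knowing the element type of the output, which is the "if you can predict the output type" caveat above.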

See `stack(vec_of_vecs)` for `vcat(vec_of_vecs...)` · Issue #21672 · JuliaLang/julia · GitHub.
