Recommended equivalent to map / broadcast on GroupedDataFrame

If we have a function f1 that takes a DataFrame as an argument and returns a scalar, how can a reduction using f1 be applied to each SubDataFrame of a grouped data frame?

using DataFrames

# reduce function
f1(df::AbstractDataFrame) = sum(df[!, :x1])

# data with `y` as key
df1 = DataFrame(rand(100, 3), :auto)
df1.y = rand(1:3, 100)
df1g = groupby(df1, :y)

Ideally, it would be possible to pass each SubDataFrame directly as input to the aggregating function in combine, something like:

combine(df1g, AsDF() => f1 => :z)

The following works, but is an inefficient workaround:

combine(df1g, AsTable(:) => (x -> f1(DataFrame(x))) => :z)

Otherwise, map and broadcasting seemed like natural options, but they are reserved operations:

julia> map(f1, df1g)
ERROR: ArgumentError: using map over `GroupedDataFrame`s is reserved

julia> f1.(df1g)
ERROR: ArgumentError: broadcasting over `GroupedDataFrame`s is reserved

Going through a loop also appears to be an efficient option, but it doesn’t seem like an “elegant” solution:

out = Pair[]
for idx in eachindex(df1g)
    push!(out, idx => f1(df1g[idx]))
end

Is there a more straightforward way to perform the desired reduction over a grouped data frame that I missed?

If I understood the requirement correctly, you could try the following form:

combine(f1,df1g)

If you want to name the new column, you could have the function return a NamedTuple, like this:

f2(df::AbstractDataFrame) = (;z=sum(df[!, :x1]))
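As a minimal sketch of the two forms above, here is a small deterministic example. The four-row data frame is made up for illustration and stands in for the random data in the question; f1 and f2 are the functions defined in this thread.

```julia
using DataFrames

# Small made-up data set (an assumption for illustration), grouped by `y`
df1 = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0], y = [1, 1, 2, 2])
df1g = groupby(df1, :y)

# Scalar-returning reduction from the question
f1(df::AbstractDataFrame) = sum(df[!, :x1])

# Same reduction, but returning a NamedTuple so the output column is named `z`
f2(df::AbstractDataFrame) = (; z = sum(df[!, :x1]))

# Function-first form: each SubDataFrame is passed to the function as a whole
res1 = combine(f1, df1g)  # one row per group; result column name is auto-generated
res2 = combine(f2, df1g)  # one row per group; result column is named `z`
```

In both cases the grouping column y is kept in the output alongside the reduced value.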

Yes - combine is the intended method if you want a data frame in the output.
If you want a vector then do:

[f(sdf) for sdf in gdf]

This ambiguity (the type of the resulting object) is the reason why map is not implemented for GroupedDataFrame.
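To make the ambiguity concrete, here is a short sketch of the output shapes one might want from the same reduction. The data is made up for illustration, and the names as_vector, as_pairs and as_table are hypothetical; pairs and NamedTuple on a group key are used here on the assumption that a recent DataFrames.jl version is installed.

```julia
using DataFrames

# Small made-up data set (an assumption for illustration), grouped by `y`
df1 = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0], y = [1, 1, 2, 2])
df1g = groupby(df1, :y)

f1(df::AbstractDataFrame) = sum(df[!, :x1])

# Plain vector of results, one element per group (group keys are dropped)
as_vector = [f1(sdf) for sdf in df1g]

# Vector of key => value pairs, keeping track of which group produced which value
as_pairs = [NamedTuple(k) => f1(sdf) for (k, sdf) in pairs(df1g)]

# Data frame with the grouping column and the result
as_table = combine(f1, df1g)
```

Each of these is a reasonable candidate for what map could return, which is why the choice is left explicit.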


Thank you! I had completely overlooked the combine(fun, df) method, as I had internalized these verbs as being strictly of the form combine(df, ops...). Much appreciated!

map(f1,collect(df1g))

also works fine, and returns a vector.