Easier way to split-apply-combine in DataFrames.jl

stefanjwojcik · December 13, 2020, 10:35pm

I am updating some code that uses DataFrames.jl. I formerly used by and aggregate to apply functions over multiple columns, but it looks like both were deprecated recently. The replacements seem a bit clunky.

If I have a dataframe like this:

julia> using DataFrames

julia> using Statistics

julia> df = DataFrame(x=rand(5), z=rand(5), y=["this", "that", "this", "that", "this"])
5×3 DataFrame
 Row │ x         z         y      
     │ Float64   Float64   String 
─────┼────────────────────────────
   1 │ 0.1203    0.562386  this
   2 │ 0.828021  0.355487  that
   3 │ 0.218044  0.859215  this
   4 │ 0.500774  0.329637  that
   5 │ 0.95807   0.39004   this

To group by y and get the means, I used to do this:

julia> aggregate(df, :y, mean)

2×3 DataFrame
 Row │ y       x_mean    z_mean   
     │ String  Float64   Float64  
─────┼────────────────────────────
   1 │ this    0.289291  0.432779
   2 │ that    0.886162  0.66099

Now it looks like the canonical way to do it is this:

julia> combine(groupby(df, :y), [:x, :z] .=> mean)
2×3 DataFrame
 Row │ y       x_mean    z_mean   
     │ String  Float64   Float64  
─────┼────────────────────────────
   1 │ this    0.289291  0.432779
   2 │ that    0.886162  0.66099

Did I miss something or is this the most code-efficient way of achieving this?

pdeffebach · December 13, 2020, 10:36pm

That’s the correct way!

stefanjwojcik · December 13, 2020, 10:38pm

I can see some advantages when applying different functions to different subsets of columns, but it seems like two functions are replacing one for quick aggregations. Perhaps I’m missing some deeper reasoning here.
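For reference, the case of applying different functions to different columns might look like the sketch below. The fixed column values and the output names (x_mean, z_max, count) are made up for illustration; only groupby, combine, and nrow come from DataFrames.jl.

```julia
using DataFrames
using Statistics

# Fixed values instead of rand() so the result is reproducible.
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 8.0],
               z = [10.0, 20.0, 30.0, 40.0, 50.0],
               y = ["this", "that", "this", "that", "this"])

gdf = groupby(df, :y)

# One pair per output column: source => function => new column name.
# nrow => :count adds the number of rows in each group.
result = combine(gdf,
                 :x => mean => :x_mean,
                 :z => maximum => :z_max,
                 nrow => :count)
```

Each pair is independent, so adding or removing an aggregation is a one-line change.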

pdeffebach · December 14, 2020, 12:53am

I think the deeper reasoning is that the scenario where all your columns are <:Real is a bit of an edge case. It’s not worth maintaining aggregate just for that scenario when combine works just as well.

kevbonham · December 14, 2020, 2:43am

I wasn’t involved in the decision, and I initially thought it was a bit clunky too. But the promise of this approach is a powerful, concise, and consistent syntax for all kinds of transformations. Fwiw, after using it for a bit I think this promise has been fulfilled.

Now the API is unified for v1.0. I suspect that some conveniences will be added back for simple stuff like this, but even if not, I think the benefits of the new system are well worth it.
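As a rough illustration of that consistency, the same source => function => target pairs used with combine also work with transform and select. This is only a sketch: the data and the new column names (x_group_mean, x_sqrt) are invented for the example.

```julia
using DataFrames
using Statistics

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 8.0],
               y = ["this", "that", "this", "that", "this"])

# transform keeps every original row and column, and adds the
# per-group mean of x as a new column.
with_means = transform(groupby(df, :y), :x => mean => :x_group_mean)

# select keeps only the listed columns; ByRow applies sqrt to each
# element of x rather than to the whole column.
rooted = select(df, :y, :x => ByRow(sqrt) => :x_sqrt)
```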

nalimilan · December 14, 2020, 10:03pm

Yes, the idea is that it’s better to have a single function to perform the combine step, which is general enough to cover all use cases. Having both combine and aggregate makes the API more difficult to master, especially since the two names are very similar and nothing indicates which does what. By following this principle everywhere, we get a package which is much easier to use.

Also, following what @pdeffebach said, if your data suddenly gets a second string column, combine(groupby(df, :y), [:x, :z] .=> mean) will continue to work, unlike aggregate, which applied the function to every non-grouping column. To compute the mean of all numeric columns, you can also write combine(groupby(df, :y), names(df, Real) .=> mean).
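A small sketch of that second form, assuming a hypothetical extra string column called label (not part of the original example):

```julia
using DataFrames
using Statistics

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 8.0],
               z = [10.0, 20.0, 30.0, 40.0, 50.0],
               y = ["this", "that", "this", "that", "this"],
               label = ["a", "b", "c", "d", "e"])  # hypothetical second string column

# names(df, Real) returns the names of columns whose element type is a
# subtype of Real, so the string columns y and label are skipped.
numeric_cols = names(df, Real)

# Broadcasting .=> builds one "column => mean" pair per numeric column.
means = combine(groupby(df, :y), numeric_cols .=> mean)
```

Note that a column containing missing values has element type Union{Missing, Float64}, which is not a subtype of Real, so names(df, Real) would leave it out.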
