Easier way to split-apply-combine in DataFrames.jl

I am updating some code that uses DataFrames.jl. I used to rely on the by and aggregate functions to apply operations across multiple columns, but both appear to have been deprecated recently, and their replacements seem a bit clunky.

If I have a dataframe like this:

julia> using DataFrames
julia> using Statistics
julia> df = DataFrame(x=rand(5), z=rand(5), y=["this", "that", "this", "that", "this"])
5×3 DataFrame
 Row │ x         z         y      
     │ Float64   Float64   String 
─────┼────────────────────────────
   1 │ 0.1203    0.562386  this
   2 │ 0.828021  0.355487  that
   3 │ 0.218044  0.859215  this
   4 │ 0.500774  0.329637  that
   5 │ 0.95807   0.39004   this

If I want to group by y and get the means, I used to do this:

julia> aggregate(df, :y, mean)
2×3 DataFrame
 Row │ y       x_mean    z_mean   
     │ String  Float64   Float64  
─────┼────────────────────────────
   1 │ this    0.289291  0.432779
   2 │ that    0.886162  0.66099

Now it looks like the canonical way to do it is this:

julia> combine(groupby(df, :y), [:x, :z] .=> mean)
2×3 DataFrame
 Row │ y       x_mean    z_mean   
     │ String  Float64   Float64  
─────┼────────────────────────────
   1 │ this    0.289291  0.432779
   2 │ that    0.886162  0.66099

Did I miss something, or is this the most concise way of achieving this?

That’s the correct way!

I can see some advantages when applying different functions to different subsets of columns, but for quick aggregations it seems like two functions are now replacing one. Perhaps I’m missing some deeper reasoning here.
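For reference, here is a minimal sketch of that "different functions for different columns" case. The toy values are made up for illustration (they are not the data from the original post), and the aggregations chosen (mean, maximum, nrow) are just examples:

```julia
using DataFrames
using Statistics

# Toy data with fixed values (assumed for illustration only)
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 8.0],
               z = [10.0, 20.0, 30.0, 40.0, 50.0],
               y = ["this", "that", "this", "that", "this"])

gdf = groupby(df, :y)

# One call, a different aggregation per column:
# mean of x, maximum of z, and the number of rows in each group.
res = combine(gdf, :x => mean, :z => maximum, nrow)

# The same pair syntax also accepts an explicit output name.
renamed = combine(gdf, :x => mean => :x_avg)
```

By default the result columns are named after the source column and the function (x_mean, z_maximum), plus nrow for the row count.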

I think the deeper reasoning is that the scenario where all your columns are <:Real is a bit of an edge case. It’s not worth maintaining aggregate just for that scenario when combine works just as well.

I wasn’t involved in the decision, and I initially thought it was a bit clunky too. But the promise of this approach is a powerful, concise, and consistent syntax for all kinds of transformations. Fwiw, after using it for a bit I think this promise has been fulfilled.

Now the API is unified for v1.0. I suspect that some conveniences will be added back for simple stuff like this, but even if not, I think the benefits of the new system are well worth it.

Yes, the idea is that it’s better to have a single function to perform the combine step, which is general enough to cover all use cases. Having both combine and aggregate makes the API more difficult to master, especially since the two names are very similar and nothing indicates which does what. By following this principle everywhere, we get a package which is much easier to use.

Also, following what @pdeffebach said, if your data suddenly gets a second string column, combine(groupby(df, :y), [:x, :z] .=> mean) will continue to work, whereas aggregate(df, :y, mean) would try to take the mean of the new string column and fail. To compute the mean of all numeric columns without listing them, you can use something like combine(groupby(df, :y), names(df, Real) .=> mean).
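A small sketch of that situation, assuming a toy data frame with a hypothetical extra string column w (the column name and all values are invented for illustration):

```julia
using DataFrames
using Statistics

# Toy data; w is a hypothetical second string column
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               z = [10.0, 20.0, 30.0, 40.0],
               y = ["this", "that", "this", "that"],
               w = ["a", "b", "c", "d"])

# names(df, Real) returns the names of the columns whose element type is <: Real,
# so the string columns y and w are skipped.
numeric_cols = names(df, Real)

# Broadcasting .=> builds one `column => mean` pair per numeric column.
res = combine(groupby(df, :y), numeric_cols .=> mean)
```

One thing to watch for: if the grouping column were itself numeric, names(df, Real) would include it too, so you may want to exclude it from the selection in that case.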
