I am updating some code that uses DataFrames.jl. I formerly used the by and aggregate commands to apply functions over multiple columns, but it looks like these commands were deprecated recently. The replacements for these functions seem a bit clunky.
Suppose I have a dataframe like this:
using DataFrames
using Statistics
julia> df = DataFrame(x=rand(5), z=rand(5), y=["this", "that", "this", "that", "this"])
5×3 DataFrame
 Row │ x         z         y
     │ Float64   Float64   String
─────┼────────────────────────────
   1 │ 0.1203    0.562386  this
   2 │ 0.828021  0.355487  that
   3 │ 0.218044  0.859215  this
   4 │ 0.500774  0.329637  that
   5 │ 0.95807   0.39004   this
If I want to group by y and get the means, I used to do this:
julia> aggregate(df, :y, mean)
2×3 DataFrame
 Row │ y       x_mean    z_mean
     │ String  Float64   Float64
─────┼────────────────────────────
   1 │ this    0.289291  0.432779
   2 │ that    0.886162  0.66099
Now it looks like the canonical way to do it is this:
julia> combine(groupby(df, :y), [:x, :z] .=> mean)
2×3 DataFrame
 Row │ y       x_mean    z_mean
     │ String  Float64   Float64
─────┼────────────────────────────
   1 │ this    0.289291  0.432779
   2 │ that    0.886162  0.66099
Did I miss something or is this the most code-efficient way of achieving this?
I can see some advantages when applying different functions to different subsets of columns, but it seems like two functions are now replacing one for quick aggregations. Perhaps I’m missing some deeper reasoning here.
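For reference, here is a minimal sketch of the "different functions for different columns" case mentioned above. The table values are made up so the result is reproducible, and the choice of mean, maximum and nrow is only an illustration, not something taken from the original code:

```julia
using DataFrames
using Statistics

# Small fixed table (illustrative values, not the random data from the post).
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 5.0],
               z = [10.0, 20.0, 30.0, 40.0, 50.0],
               y = ["this", "that", "this", "that", "this"])

# One combine call, a different function per column:
# mean of x, maximum of z, plus the number of rows in each group.
res = combine(groupby(df, :y), :x => mean, :z => maximum, nrow)
```

The result should have the columns y, x_mean, z_maximum and nrow, with one row per group. This is the kind of call that aggregate could not express, since it applied the same function to every non-grouping column.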
I think the deeper reasoning is that the scenario where all your columns are <:Real is a bit of an edge case. It’s not worth maintaining aggregate just for that scenario when combine works just as well.
I wasn’t involved in the decision, and I initially thought it was a bit clunky too. But the promise of this approach is a powerful, concise, and consistent syntax for all kinds of transformations. Fwiw, after using it for a bit I think this promise has been fulfilled.
Now the API is unified for v1.0. I suspect that some conveniences will be added back for simple stuff like this, but even if not, I think the benefits of the new system are well worth it.
Yes, the idea is that it’s better to have a single function to perform the combine step, which is general enough to cover all use cases. Having both combine and aggregate makes the API more difficult to master, especially since the two names are very similar and nothing indicates which does what. By following this principle everywhere, we get a package which is much easier to use.
Also, following what @pdeffebach said, if your data suddenly gets a second string column, combine(groupby(df, :y), [:x, :z] .=> mean) will continue to work, unlike aggregate. Things like combine(groupby(df, :y), names(df, Real) .=> mean) can also be used to compute the mean of all numeric columns.
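To make that concrete, a small sketch with a second string column added. The column name label and all the values are hypothetical, chosen only so the example is reproducible:

```julia
using DataFrames
using Statistics

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               z = [10.0, 20.0, 30.0, 40.0],
               y = ["this", "that", "this", "that"],
               label = ["a", "b", "c", "d"])  # hypothetical extra string column

# Select only the columns whose element type is a subtype of Real.
num_cols = names(df, Real)

# Broadcasting .=> builds one `column => mean` pair per numeric column,
# so the string column label is simply never touched.
res = combine(groupby(df, :y), num_cols .=> mean)
```

Here num_cols should come out as ["x", "z"], so the result has the columns y, x_mean and z_mean. One caveat: if the grouping column were itself numeric, names(df, Real) would pick it up as well, so you would probably want to exclude it from the selection first.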