Column selection in DataFrames comb

I think I am misunderstanding the column selection in grouped DataFrames. I would like to select a range of variables in a combine operation but I cannot make it work with any of the DataAPI selectors:

using DataFrames, Dates, Statistics
df = DataFrame(
    g = ['a', 'a', 'a', 'a', 'c', 'c', 'c'],
    date = [Date(2021,1,1), Date(2021,1,2), Date(2021,1,2), Date(2021,1,4), Date(2021,1,1), Date(2021,1,3), Date(2021,1,7)],
    v = rand(7),
    v1 = rand(7),
    v2 = rand(7)
)

df[:, :week_date] = firstdayofweek.(df.date)
gdf = groupby(df, [:g, :week_date])
# Works:
cols = [:v, :v1, :v2]
combine(gdf, cols .=> mean)
combine(gdf, names(gdf)[occursin.(r"^v", names(gdf))] .=> mean)
# Does not work:
combine(gdf, r"^v" .=> mean)
combine(gdf, Between(:v, :v2) .=> mean)

After reading the documentation, it is not clear to me why there should be a difference. Could someone please clear this up for me?
Thanks!

There’s no magic in the DataFrames API: it’s regular Julia syntax. So what does combine(gdf, cols .=> mean) mean?

Julia will first evaluate cols .=> mean, then pass the result to combine. This broadcasting operation is equivalent to [col => mean for col in cols]. Check it in the REPL:

julia> cols .=> mean
3-element Vector{Pair{Symbol, typeof(mean)}}:
  :v => Statistics.mean
 :v1 => Statistics.mean
 :v2 => Statistics.mean

The combine function will understand this and calculate the three means as desired.
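
To check the equivalence claimed above without involving DataFrames at all, here is a minimal sketch that compares the broadcasted form with the explicit comprehension. Only Statistics is needed, and the variable names broadcasted and explicit are just for illustration:

```julia
using Statistics

cols = [:v, :v1, :v2]

# Broadcasting `=>` over a vector pairs each element with `mean`.
broadcasted = cols .=> mean

# The same pairs written out as an explicit comprehension.
explicit = [col => mean for col in cols]

# Both are vectors holding the same three pairs.
println(broadcasted == explicit)
```

So by the time combine is called, it only ever sees an ordinary vector of pairs.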

Now what does combine(gdf, r"^v" .=> mean) do? It must first evaluate the argument. Let’s see in the REPL:

julia> r"^v" .=> mean
r"^v" => Statistics.mean

Indeed, r"^v" and mean are both “scalars”, so the broadcast does nothing: it’s like calling combine(gdf, r"^v" => mean). That is totally valid DataFrames syntax, but it means that the function mean should be called once with several arguments (all the columns matching r"^v"). Not what we want! But it would work with + for example, to sum all these columns:

julia> combine(gdf, r"^v" => +)
7×3 DataFrame
 Row │ g     week_date   v_v1_v2_+ 
     │ Char  Date        Float64   
─────┼─────────────────────────────
   1 │ a     2020-12-28    1.61341
   2 │ a     2020-12-28    2.06179
   3 │ a     2020-12-28    1.07186
   4 │ a     2021-01-04    2.16722
   5 │ c     2020-12-28    1.4146
   6 │ c     2020-12-28    1.84385
   7 │ c     2021-01-04    1.64552
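
The difference between + and mean here can be seen with plain vectors, outside of combine. This is only a sketch: the vectors v, v1 and v2 below are made-up stand-ins for the three matched columns, not the data from the question.

```julia
using Statistics

# Hypothetical stand-ins for the three columns matched by r"^v".
v  = [1.0, 2.0]
v1 = [3.0, 4.0]
v2 = [5.0, 6.0]

# `+` accepts any number of arguments, so the three vectors are summed elementwise.
total = +(v, v1, v2)
println(total)

# `mean` has no method that takes three vectors as positional arguments.
println(applicable(mean, v, v1, v2))
```

That is the call shape a pair like r"^v" => fun asks for: all matched columns are passed to fun as separate positional arguments.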

For the same reason broadcasting on Between(:v, :v2) doesn’t work: it’s a simple value of Between type. This type doesn’t implement a smart broadcasting that finds the correct columns: it can’t because the Between value is not linked to a particular data frame.

So how can you use fancy column selectors like r"^v" and Between in broadcasting? You need the actual column names (not an abstract specification like r"^v") and that is what names is for: it accepts all the fancy column selectors:

julia> combine(gdf, names(gdf, r"^v") .=> mean)
4×5 DataFrame
 Row │ g     week_date   v_mean    v1_mean   v2_mean  
     │ Char  Date        Float64   Float64   Float64  
─────┼────────────────────────────────────────────────
   1 │ a     2020-12-28  0.388976  0.572689  0.620689
   2 │ a     2021-01-04  0.868798  0.676683  0.621742
   3 │ c     2020-12-28  0.609329  0.754561  0.265337
   4 │ c     2021-01-04  0.356272  0.980407  0.308845
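
The same approach should carry over to Between. Below is a small sketch on a made-up three-row table (not the data from the question, and grouped by :g only to keep it short), showing that names resolves either selector to concrete column names before the broadcast happens:

```julia
using DataFrames, Statistics

# Small hypothetical table with the same value columns as in the question.
df = DataFrame(
    g  = ['a', 'a', 'c'],
    v  = [1.0, 3.0, 5.0],
    v1 = [2.0, 4.0, 6.0],
    v2 = [0.0, 2.0, 4.0],
)
gdf = groupby(df, :g)

# `names` resolves a selector against a concrete table and returns strings.
by_regex = names(df, r"^v")
by_range = names(df, Between(:v, :v2))
println(by_regex == by_range)

# Broadcasting over the resolved names gives one `name => mean` pair per column.
res = combine(gdf, by_range .=> mean)
println(names(res))
```

Because by_range is a plain vector of strings, the broadcast behaves exactly like cols .=> mean in the working example from the question.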

To better understand how DataFrames.jl uses => you might enjoy this excellent blog post by one of the developers: DataFrames.jl minilanguage explained | Blog by Bogumił Kamiński

Thank you very much that clears things up!

An excellent explanation. Bravo!

As a small development note: in the future we might add broadcasting support for Between(:a, :b) .=> fun, so that combine would understand that it should be rewritten to names(gdf, Between(:a, :b)) .=> fun.
