Name of dataframe column created by function

jo-fleck · December 1, 2020, 6:30pm

I am computing weighted means of subgroups using the groupby and transform approach. See below for an illustration.

My understanding is that the new name is sourcevar1_sourcevar2_function because the weighted mean function does not return a single value or vector (explained here).

I need to get weighted means of several columns and so I am wondering if there is any way to set the column name within the transform command? Or does this have to be done in a separate step?

Thanks for helping with this!


using DataFrames, Statistics, StatsBase   # mean is from Statistics, weights from StatsBase
df = DataFrame(Region = ["state1", "state1", "state1", "state2", "state2", "state2"], Income = [10, 7, 12, 10, 7, 12], Weight = [51, 20, 86, 75, 125, 16])
gdf = groupby(df, :Region)

df_reg_mean_unweighted = transform(gdf, :Income => mean => :Region_mean_income_unweighted)     # Income unweighted

│ Row │ Region │ Income │ Weight │ Region_mean_income_unweighted │
│     │ String │ Int64  │ Int64  │ Float64                       │
├─────┼────────┼────────┼────────┼───────────────────────────────┤
│ 1   │ state1 │ 10     │ 51     │ 9.66667                       │
│ 2   │ state1 │ 7      │ 20     │ 9.66667                       │
│ 3   │ state1 │ 12     │ 86     │ 9.66667                       │
│ 4   │ state2 │ 10     │ 75     │ 9.66667                       │
│ 5   │ state2 │ 7      │ 125    │ 9.66667                       │
│ 6   │ state2 │ 12     │ 16     │ 9.66667                       │

df_reg_mean_weighted   = transform(gdf, [:Income, :Weight] => (x, y) -> (mean(x, weights(y)))) # Income weighted

│ Row │ Region │ Income │ Weight │ Income_Weight_function │
│     │ String │ Int64  │ Int64  │ Float64                │
├─────┼────────┼────────┼────────┼────────────────────────┤
│ 1   │ state1 │ 10     │ 51     │ 10.7134                │
│ 2   │ state1 │ 7      │ 20     │ 10.7134                │
│ 3   │ state1 │ 12     │ 86     │ 10.7134                │
│ 4   │ state2 │ 10     │ 75     │ 8.41204                │
│ 5   │ state2 │ 7      │ 125    │ 8.41204                │
│ 6   │ state2 │ 12     │ 16     │ 8.41204                │

nilshg · December 1, 2020, 8:10pm

I’m not sure I understand the question. Your example already shows how to specify the name of the created column, and you can do the same for your weighted mean function:

julia> transform(gdf, [:Income, :Weight] => ((x, y) -> (mean(x, weights(y)))) => :Income_weighted) # Income weighted
6×4 DataFrame
 Row │ Region  Income  Weight  Income_weighted 
     │ String  Int64   Int64   Float64         
─────┼─────────────────────────────────────────
   1 │ state1      10      51         10.7134
   2 │ state1       7      20         10.7134
   3 │ state1      12      86         10.7134
   4 │ state2      10      75          8.41204
   5 │ state2       7     125          8.41204
   6 │ state2      12      16          8.41204

jo-fleck · December 1, 2020, 8:58pm

Sorry - I should have been clearer in the problem description.

I tried your suggested approach before asking but it didn’t work for me:

df_reg_mean_weighted2   = transform(gdf, [:Income, :Weight] => (x, y) -> (mean(x, weights(y))) => :Income_weighted)

6×4 DataFrame
│ Row │ Region │ Income │ Weight │ Income_Weight_function    │
│     │ String │ Int64  │ Int64  │ Pair{Float64,Symbol}      │
├─────┼────────┼────────┼────────┼───────────────────────────┤
│ 1   │ state1 │ 10     │ 51     │ 10.7134=>:Income_weighted │
│ 2   │ state1 │ 7      │ 20     │ 10.7134=>:Income_weighted │
│ 3   │ state1 │ 12     │ 86     │ 10.7134=>:Income_weighted │
│ 4   │ state2 │ 10     │ 75     │ 8.41204=>:Income_weighted │
│ 5   │ state2 │ 7      │ 125    │ 8.41204=>:Income_weighted │
│ 6   │ state2 │ 12     │ 16     │ 8.41204=>:Income_weighted │

However, your suggestion has an extra set of parentheses around the function (which I missed). They do the trick - many thanks!

nilshg · December 1, 2020, 9:22pm

Ah yes, this is an operator precedence issue: you need to enclose the anonymous function in parentheses. Glad it’s working now!
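One way to sidestep the parentheses entirely is to give the function a name, which also helps when several columns need the same treatment. The following is only a sketch under assumptions: `wmean` is a hypothetical hand-written helper standing in for `mean(x, weights(y))`, and the `Wealth` column is made up for illustration.

```julia
using DataFrames

# Hypothetical helper: a hand-rolled weighted mean, so the sketch needs only DataFrames.
wmean(x, w) = sum(x .* w) / sum(w)

df = DataFrame(Region = ["state1", "state1", "state1", "state2", "state2", "state2"],
               Income = [10, 7, 12, 10, 7, 12],
               Wealth = [100, 70, 120, 90, 60, 110],   # made-up column, not in the original post
               Weight = [51, 20, 86, 75, 125, 16])
gdf = groupby(df, :Region)

# A named function sits between the two `=>` without any extra parentheses.
# One source => function => target triple is built per column to be averaged.
cols = [:Income, :Wealth]
ops = [[c, :Weight] => wmean => Symbol(c, "_weighted") for c in cols]

result = transform(gdf, ops...)
```

With this, `result` should contain `Income_weighted` and `Wealth_weighted` alongside the original columns, with the names set inside the `transform` call itself.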

jo-fleck · December 1, 2020, 9:40pm

Is there a reason why this isn’t mentioned in the documentation? Seems like an easy way to name columns created by anonymous functions.

nilshg · December 2, 2020, 7:10am

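I think it’s just a case of this being the general way anonymous functions and operator precedence work, and not DataFrames specific. The body of `->` extends as far to the right as it can, so without parentheses the trailing `=> :b` becomes part of the function body: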

julia> :a => sum => :b
:a => (sum => :b)

julia> :a => x -> sum(x) => :b
:a => var"#21#22"()

julia> :a => (x -> sum(x)) => :b
:a => (var"#23#24"() => :b)

I don’t think anyone would object to a small note in the docs on transform and friends that highlights this quirk around anonymous functions. It comes up regularly and trips up new users.
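To see what each form actually builds, the pairs can be inspected in plain Julia without DataFrames. This is a small sketch; the names `p_bad` and `p_good` are just illustrative.

```julia
# Without parentheses, `=> :b` is swallowed into the anonymous function's body.
p_bad = :a => x -> sum(x) => :b

# With parentheses, the function ends before `=> :b`, which stays a target name.
p_good = :a => (x -> sum(x)) => :b

# p_bad is source => function, and that function returns a Pair when called.
bad_result = p_bad.second([1, 2, 3])

# p_good is source => (function => target name).
good_fun = p_good.second.first
good_name = p_good.second.second
good_result = good_fun([1, 2, 3])
```

Here `bad_result` is the pair `6 => :b`, which matches the `Pair{Float64,Symbol}` column seen earlier in the thread, while `good_result` is the plain number and `good_name` is the symbol `:b`.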
