Name of dataframe column created by function

I am computing weighted means of subgroups using the groupby and transform approach. See below for an illustration.

My understanding is that the new name is sourcevar1_sourcevar2_function because the weighted mean function does not return a single value or vector (explained here).

I need to get weighted means of several columns and so I am wondering if there is any way to set the column name within the transform command? Or does this have to be done in a separate step?

Thanks for helping with this!


using DataFrames, Statistics, StatsBase   # mean comes from Statistics, weights from StatsBase
df = DataFrame(Region = ["state1", "state1", "state1", "state2", "state2", "state2"], Income = [10, 7, 12, 10, 7, 12], Weight = [51, 20, 86, 75, 125, 16])
gdf = groupby(df, :Region)

df_reg_mean_unweighted = transform(gdf, :Income => mean => :Region_mean_income_unweighted)     # Income unweighted

│ Row │ Region │ Income │ Weight │ Region_mean_income_unweighted │
│     │ String │ Int64  │ Int64  │ Float64                       │
├─────┼────────┼────────┼────────┼───────────────────────────────┤
│ 1   │ state1 │ 10     │ 51     │ 9.66667                       │
│ 2   │ state1 │ 7      │ 20     │ 9.66667                       │
│ 3   │ state1 │ 12     │ 86     │ 9.66667                       │
│ 4   │ state2 │ 10     │ 75     │ 9.66667                       │
│ 5   │ state2 │ 7      │ 125    │ 9.66667                       │
│ 6   │ state2 │ 12     │ 16     │ 9.66667                       │

df_reg_mean_weighted   = transform(gdf, [:Income, :Weight] => (x, y) -> (mean(x, weights(y)))) # Income weighted

│ Row │ Region │ Income │ Weight │ Income_Weight_function │
│     │ String │ Int64  │ Int64  │ Float64                │
├─────┼────────┼────────┼────────┼────────────────────────┤
│ 1   │ state1 │ 10     │ 51     │ 10.7134                │
│ 2   │ state1 │ 7      │ 20     │ 10.7134                │
│ 3   │ state1 │ 12     │ 86     │ 10.7134                │
│ 4   │ state2 │ 10     │ 75     │ 8.41204                │
│ 5   │ state2 │ 7      │ 125    │ 8.41204                │
│ 6   │ state2 │ 12     │ 16     │ 8.41204                │


I'm not sure I understand the question: your example already shows how to specify the name of the created column, and you can do the same for your weighted mean function.

julia> transform(gdf, [:Income, :Weight] => ((x, y) -> (mean(x, weights(y)))) => :Income_weighted) # Income weighted
6×4 DataFrame
 Row β”‚ Region  Income  Weight  Income_weighted 
     β”‚ String  Int64   Int64   Float64         
─────┼─────────────────────────────────────────
   1 β”‚ state1      10      51         10.7134
   2 β”‚ state1       7      20         10.7134
   3 β”‚ state1      12      86         10.7134
   4 β”‚ state2      10      75          8.41204
   5 β”‚ state2       7     125          8.41204
   6 β”‚ state2      12      16          8.41204


Sorry - I should have been clearer in the problem description.

I tried your suggested approach before asking but it didn't work for me:

df_reg_mean_weighted2   = transform(gdf, [:Income, :Weight] => (x, y) -> (mean(x, weights(y))) => :Income_weighted)

6×4 DataFrame
│ Row │ Region │ Income │ Weight │ Income_Weight_function    │
│     │ String │ Int64  │ Int64  │ Pair{Float64,Symbol}      │
├─────┼────────┼────────┼────────┼───────────────────────────┤
│ 1   │ state1 │ 10     │ 51     │ 10.7134=>:Income_weighted │
│ 2   │ state1 │ 7      │ 20     │ 10.7134=>:Income_weighted │
│ 3   │ state1 │ 12     │ 86     │ 10.7134=>:Income_weighted │
│ 4   │ state2 │ 10     │ 75     │ 8.41204=>:Income_weighted │
│ 5   │ state2 │ 7      │ 125    │ 8.41204=>:Income_weighted │
│ 6   │ state2 │ 12     │ 16     │ 8.41204=>:Income_weighted │

However, your suggestion has an extra set of parentheses around the function (which I missed). They do the trick - many thanks!

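Ah yes, this is an operator precedence issue: without parentheses the body of the anonymous function extends as far to the right as it can, so the trailing => :Income_weighted becomes part of the function body. You need to enclose the anonymous function in parentheses. Glad it's working now!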
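To see why the unparenthesized version produced a column of Pair values, here is a small sketch using only Base Julia. The variable names are mine, purely for illustration:

```julia
# Without parentheses, the trailing `=> :b` is parsed as part of the
# anonymous function's body.
unparenthesized = :a => x -> sum(x) => :b

# With parentheses, the function ends where the parentheses close.
parenthesized = :a => (x -> sum(x)) => :b

# Unparenthesized: Symbol => Function, and calling the function returns a Pair.
unparenthesized.second isa Function          # true
result = unparenthesized.second([1, 2, 3])   # 6 => :b

# Parenthesized: Symbol => (Function => Symbol), the source => function => target form.
parenthesized.second isa Pair                # true
parenthesized.second.second                  # :b
```

So in the unparenthesized call, transform received a function that returns value => :Income_weighted for each group, and it stored those pairs under the auto-generated column name.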

Is there a reason why this isn't mentioned in the documentation? Seems like an easy way to name columns created by anonymous functions.

I think it's just a case of this being the general way anonymous functions and operator precedence work in Julia, rather than anything specific to DataFrames:

julia> :a => sum => :b
:a => (sum => :b)

julia> :a => x -> sum(x) => :b
:a => var"#21#22"()

julia> :a => (x -> sum(x)) => :b
:a => (var"#23#24"() => :b)

I don't think anyone would object to a small note in the docs on transform and friends that highlights this particular quirk around anonymous functions. It certainly comes up regularly as something that trips up new users.
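Since you mentioned needing weighted means of several columns, another way to sidestep the quirk is to give the function a name, because a named function has no body that can swallow the => :name part. A rough sketch, assuming DataFrames and StatsBase are installed; wmean and value_cols are hypothetical names I made up for this example:

```julia
using DataFrames, Statistics, StatsBase

# Hypothetical helper: a named function needs no extra parentheses.
wmean(x, w) = mean(x, weights(w))

df = DataFrame(Region = ["state1", "state1", "state1", "state2", "state2", "state2"],
               Income = [10, 7, 12, 10, 7, 12],
               Weight = [51, 20, 86, 75, 125, 16])
gdf = groupby(df, :Region)

# Columns to average; append further column names here as needed.
value_cols = [:Income]

# One `source => function => target` transformation per column,
# e.g. [:Income, :Weight] => wmean => :Income_weighted.
ops = [[c, :Weight] => wmean => Symbol(c, :_weighted) for c in value_cols]

out = transform(gdf, ops...)
```

With only :Income in value_cols this should give the same Income_weighted column as the parenthesized anonymous function above. Adding more symbols to value_cols yields one named output column per input column in a single transform call.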
