How to average column values in a dataframe based on multiple other matching columns?

Hi all,

I’ve created a DataFrame as seen below:

using DataFrames # package dependency 

original_df = DataFrame(; Year=Int[], Name=String[], 
Surname=String[], Rating=Int[], ValueOne=Float64[], ValueTwo=Float64[])

push!(original_df, (1, "Dick", "Jones", 1, 2.3, 2.9))
push!(original_df, (1, "Dick", "Grayson", 1, 3.1, 6.9))
push!(original_df, (1, "Dick", "Grayson", 3, 2.4, 6.9))
push!(original_df, (2, "Dick", "Grayson", 3, 7.4, 5.3))
push!(original_df, (2, "Bob", "Jeff", 3, 4.83, 6.29))
push!(original_df, (1, "Bob", "Jeff", 1, 3.3, 1.9))
push!(original_df, (2, "Bob", "Jeff", 3, 10.2, 1.01))
push!(original_df, (2, "Bob", "Man", 3, 10.9, 0.03))

The original_df DataFrame should look like this:
image

I want to perform an operation on original_df such that it will average the :ValueOne and :ValueTwo columns (separately) when the following columns share the same value: [:Year, :Name, :Rating]. The resulting DataFrame should look like this if my mental math is correct:

using DataFrames # package dependency

resulting_df = DataFrame(; Year=Int[], Name=String[], Rating=Int[], 
ValueOne=Float64[], ValueTwo=Float64[])

push!(resulting_df, (1, "Dick", 1, 2.7, 4.9))
push!(resulting_df, (1, "Dick", 3, 2.4, 6.9))
push!(resulting_df, (2, "Dick", 3, 7.4, 5.3))
push!(resulting_df, (2, "Bob", 3, 8.643, 2.443))
push!(resulting_df, (1, "Bob", 1, 3.3, 1.9))

The resulting_df DataFrame should look like this:
image
Note that I don’t care what the sorting of the rows in resulting_df looks like; I just care that it has those rows.

Would anybody be able to instruct me in how the following operation can be done please? The actual DataFrame that I’m going to do this on will have 15+ columns and there’ll be 10 that need to match for the averaging to be done, so preferably if I had a way to do the operation where I didn’t have to type in multiple conditions separated by the && operator then that would be good please. Regardless, the code examples above should be a decent starting point I think.

Thank you in advance.

combine(groupby(df, [:Year, :Name, :Rating]), [:ValueOne, :ValueTwo] .=> mean; renamecols = false)

If you need most of your columns to group by it might be more concise to use something like names(df[!, Not([:ValueOne, :ValueTwo])]) in groupby instead of listing them manually.

1 Like

Thanks for this, the code is working well. Yea, doing it with the names functions will be a good idea.