How to average column values in a dataframe based on multiple other matching columns?

Name1 · February 22, 2023, 4:31am

Hi all,

I’ve created a DataFrame as seen below:

using DataFrames # package dependency 

original_df = DataFrame(; Year=Int[], Name=String[], 
Surname=String[], Rating=Int[], ValueOne=Float64[], ValueTwo=Float64[])

push!(original_df, (1, "Dick", "Jones", 1, 2.3, 2.9))
push!(original_df, (1, "Dick", "Grayson", 1, 3.1, 6.9))
push!(original_df, (1, "Dick", "Grayson", 3, 2.4, 6.9))
push!(original_df, (2, "Dick", "Grayson", 3, 7.4, 5.3))
push!(original_df, (2, "Bob", "Jeff", 3, 4.83, 6.29))
push!(original_df, (1, "Bob", "Jeff", 1, 3.3, 1.9))
push!(original_df, (2, "Bob", "Jeff", 3, 10.2, 1.01))
push!(original_df, (2, "Bob", "Man", 3, 10.9, 0.03))

The original_df DataFrame should look like this:

I want to perform an operation on original_df such that it will average the :ValueOne and :ValueTwo columns (separately) when the following columns share the same value: [:Year, :Name, :Rating]. The resulting DataFrame should look like this if my mental math is correct:

using DataFrames # package dependency

resulting_df = DataFrame(; Year=Int[], Name=String[], Rating=Int[], 
ValueOne=Float64[], ValueTwo=Float64[])

push!(resulting_df, (1, "Dick", 1, 2.7, 4.9))
push!(resulting_df, (1, "Dick", 3, 2.4, 6.9))
push!(resulting_df, (2, "Dick", 3, 7.4, 5.3))
push!(resulting_df, (2, "Bob", 3, 8.643, 2.443))
push!(resulting_df, (1, "Bob", 1, 3.3, 1.9))

The resulting_df DataFrame should look like this:

Note that I don’t care what the sorting of the rows in resulting_df looks like; I just care that it has those rows.

Would anybody be able to instruct me in how the following operation can be done please? The actual DataFrame that I’m going to do this on will have 15+ columns and there’ll be 10 that need to match for the averaging to be done, so preferably if I had a way to do the operation where I didn’t have to type in multiple conditions separated by the && operator then that would be good please. Regardless, the code examples above should be a decent starting point I think.

Thank you in advance.

nilshg · February 22, 2023, 6:05am

combine(groupby(df, [:Year, :Name, :Rating]), [:ValueOne, :ValueTwo] .=> mean; renamecols = false)

If you need most of your columns to group by it might be more concise to use something like names(df[!, Not([:ValueOne, :ValueTwo])]) in groupby instead of listing them manually.

Name1 · February 22, 2023, 6:32am

Thanks for this, the code is working well. Yea, doing it with the names functions will be a good idea.

Topic		Replies	Views
How to take the mean of entries across an array of DataFrames conditional upon the value of a separate column? General Usage statistics , dataframes	5	1310	May 11, 2022
Collapsing data into the level of certain variables and taking averages New to Julia dataframes	1	716	March 6, 2021
Creating new dataframe column! General Usage question , dataframes	3	295	April 21, 2021
How to calcule the mean of values considering their tuples of another value: General Usage dataframes	6	254	November 14, 2022
Grouping a DataFrame by something other than an existing column Data data	2	857	August 6, 2017

How to average column values in a dataframe based on multiple other matching columns?

Related topics