I have a vector of DataFrames that all have the same number of rows and the same column names, and I would like to compute statistics across the collection, producing a single data frame of the same size as each original one. For instance, given the example below, how can I most efficiently compute the mean of c1 and the median of c2 for each row, across all data frames?
using DataFrames
n = 10
dfs = DataFrame[]  # concretely typed; a plain [] would give a Vector{Any}
for i = 1:n
    push!(dfs, DataFrame(name=["a","b","c","d","e"], c1=rand(5), c2=rand(5)))
end
Just so we are clear, do you want mean(df.c1) separately for each value of name in the data frame, or do you want a mean for each row of the variables :c1 and :c2?
The easiest way might be to pull the columns you want to work on out of all the dfs into a vector of vectors and run the aggregation on that. That way it is at least type stable.
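To sketch what that could look like (a minimal example, assuming every data frame shares the same row order, and using `Statistics` from the standard library; collecting the columns into a matrix via `reduce(hcat, ...)` instead of a vector of vectors, since that makes the row-wise reductions one-liners):

```julia
using DataFrames, Statistics

# toy data with the same shape as in the question
dfs = [DataFrame(name=["a","b","c","d","e"], c1=rand(5), c2=rand(5)) for _ in 1:10]

# gather each column of interest into a concretely typed 5×10 matrix
c1 = reduce(hcat, (df.c1 for df in dfs))
c2 = reduce(hcat, (df.c2 for df in dfs))

# row-wise statistics across all data frames
result = DataFrame(name      = dfs[1].name,
                   c1_mean   = vec(mean(c1, dims=2)),
                   c2_median = vec(median(c2, dims=2)))
```

This stays entirely in numeric arrays until the final assembly, so the reductions are type stable.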
In this case, I would just try to get all my data into one data frame and then use grouped aggregations:
df_long = reduce(vcat, dfs)
# or, if you want to keep track of which data frame each row came from
df_long = reduce(vcat, transform(df, [] => (() -> i) => :id) for (i, df) in enumerate(dfs))
combine(groupby(df_long, :name),
        :c1 => mean,
        :c2 => median)