Efficient computation of statistics across multiple data frames

I have a vector collection of DataFrames that have the same number of rows and same column names and I would like to compute statistics across all data frames to arrive at a data frame that is the same size as a single data frame in the original collection. For instance how can I most efficiently calculate the mean for each row of c1 and the median for each row of c2 across all data frames given the below example:

using DataFrames

n = 10
dfs =[]
for i = 1:n
    dfs = push!(dfs, DataFrame(name=["a","b","c","d","e"], c1=rand(5), c2=rand(5)))
end

Just so we are clear, do you want mean(df.c1) separately for each value of name in the data frame, or do you want a mean for each row of the variables :c1 and :c2?

Easiest way might be to make a vector of vectors out of the columns from all dfs you want to work on, run the aggregation on that. This way it’s at least type stable.

I think OP probably wants (using DataFramesMeta)

for i in 1:n
    @transform! dfs[n] begin 
        :c1_mean = mean(:c1)
        :c2_med = median(:c2)
    end
end

I’d like the mean for each row of c1 across all data frames. The result will be a single data frame with 5 rows.

the brute force way would look like this:

mean(hcat(dfs[1].c1, dfs[2].c1, dfs[3].c1, dfs[4].c1, dfs[5].c1, dfs[6].c1, dfs[7].c1, dfs[8].c1, dfs[9].c1, dfs[10].c1), dims = 2)

In this case, I would just try to get all my data into one data frame and then use grouped aggregations:

df_long = reduce(vcat, dfs)
# or if you want to keep track where each datum came from
df_long = reduce(vcat, transform(df, [] => (() -> i) => :id) for (i, df) in enumerate(dfs))

combine(groupby(df_long, :name),
        :c1 => mean,
        :c2 => median)
2 Likes

This is exactly what I was looking for!

The vcat(dfs...) method for DataFrames also has a special source keyword arg which does this for you.

3 Likes

This, although not very elegant, is quite efficient

m3=hvncat(3,Matrix.(dfs)...)[:,[2,3],:]
mc1=mean(m3[:,1:1,:],dims=3)[:]
mdc2=median(m3[:,2:2,:],dims=3)[:]
name='a':'e'
DataFrame((;name,mc1,mdc2))

PS
Furthermore there is the use of a little used function and with documentation that should be enriched with examples

Nice, thanks.