I have a vector of DataFrames that all have the same number of rows and the same column names, and I would like to compute statistics across the collection, producing a single data frame of the same size as each original one. For instance, given the example below, how can I most efficiently compute the mean of c1 and the median of c2 for each row, across all data frames?
using DataFrames
n = 10
dfs = DataFrame[]  # concretely typed; a plain [] would give a Vector{Any}
for i = 1:n
    push!(dfs, DataFrame(name=["a","b","c","d","e"], c1=rand(5), c2=rand(5)))
end
Just so we are clear, do you want mean(df.c1) separately for each value of name in the data frame, or do you want a mean for each row of the variables :c1 and :c2?
The easiest way might be to pull the columns you want to work on out of all the dfs into a vector of vectors and run the aggregation on that. That way it is at least type stable.
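To sketch what that could look like (a minimal example, assuming every data frame shares the same row order, and using `Statistics` from the standard library; collecting the columns into a matrix via `reduce(hcat, ...)` instead of a vector of vectors, since that makes the row-wise reductions one-liners):

```julia
using DataFrames, Statistics

# toy data with the same shape as in the question
dfs = [DataFrame(name=["a","b","c","d","e"], c1=rand(5), c2=rand(5)) for _ in 1:10]

# gather each column of interest into a concretely typed 5×10 matrix
c1 = reduce(hcat, (df.c1 for df in dfs))
c2 = reduce(hcat, (df.c2 for df in dfs))

# row-wise statistics across all data frames
result = DataFrame(name      = dfs[1].name,
                   c1_mean   = vec(mean(c1, dims=2)),
                   c2_median = vec(median(c2, dims=2)))
```

This stays entirely in numeric arrays until the final assembly, so the reductions are type stable.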
In this case, I would just try to get all my data into one data frame and then use grouped aggregations:
df_long = reduce(vcat, dfs)
# or, if you want to keep track of which data frame each row came from
df_long = reduce(vcat, transform(df, [] => (() -> i) => :id) for (i, df) in enumerate(dfs))
combine(groupby(df_long, :name),
        :c1 => mean,
        :c2 => median)