Mean of multiple DataFrames


we are given two or more DataFrames

using DataFrames, Statistics

df1 = DataFrame(A = 1:5, B = 3:7)
df2 = DataFrame(A = 10:14, B = 3:7)

What is a fast way to calculate the mean of these frames?

In the example above, the result should be also a dataframe where the entries of first column equals (1+10)/2, (2+11)/2… and the second column (3+3)/2,(4+4)/2,…

Thank you!

A simple way:

julia> (df1 .+ df2) ./ 2
5Γ—2 DataFrame
 Row β”‚ A        B       
     β”‚ Float64  Float64 
   1 β”‚     5.5      3.0 
   2 β”‚     6.5      4.0 
   3 β”‚     7.5      5.0 
   4 β”‚     8.5      6.0 
   5 β”‚     9.5      7.0

I do notice the names of the columns much match In Julia (seems sensible) and also length or you get:

ERROR: DimensionMismatch(β€œarrays could not be broadcast to a common size; got a dimension with lengths 5 and 4”)

so I checked in R, and it’s the same unless a multiple of the length:

A ← 1:2
B ← 1:3
[1] 2 4 4
Warning message:
In A + B : longer object length is not a multiple of shorter object length
B ← 1:4
[1] 2 4 4 6

I was maybe expecting NA or NaN for extra rows (is there an easy way?) but more importantly any idea about what’s the idea behind the repeating/multiple R behavior and if Julia’s DataFrames should support such (maybe optionally).

This is called recycling in R and imho is a terrible footgun - Julia is quite consistent in asking users to be explicit in their intentions rather than trying to guess and rely on DWIM, which I think is one of the strenghts of the language. It does mean more verbosity/less convenience in some situations, but I think it’s a tradeoff well worth making.

Incidentally I think this is an excellent example where the Julia behaviour makes life easier: the broadcasted dot makes it clear that addition happens elementwise, and it doesn’t make sense to do this for shapes that don’t match - rather than coming up with a β€œsolution” to this β€œproblem” for the user, DataFrames asks people to be explicit what they think should happen in these cases. Your missing suggestion would mean just pad out the smaller DataFrame with missing (where, btw - at the end? The start? Randomly in between?), but my guess is that in 9x% of all cases where this happens the fact that someone tries to add different-sized DataFrames is actually a bug in their code, and it’s helpful that an error is raised rather than a silent workaround performed in the background.


Another question. Consider an array of DataFrames e.g. [df1,df2,df3,…,dfn].

How would you calculate the mean in this case?

julia> reduce(.+, [df1, df2, df3]) ./ 3
5Γ—2 DataFrame
 Row β”‚ A         B       
     β”‚ Float64   Float64 
   1 β”‚  8.66667      3.0
   2 β”‚  9.66667      4.0
   3 β”‚ 10.6667       5.0
   4 β”‚ 11.6667       6.0
   5 β”‚ 12.6667       7.0

or more generally:

reduce(.+, df_array) ./ length(df_array)

Unfortunately, I got an error

using DataFrames, Statistics

df1 = DataFrame(A = 1:5, B = 3:7)
df2 = DataFrame(A = 10:14, B = 3:7)

reduce(.+, [df1, df2]) ./ 2

ERROR: LoadError: UndefVarError: .+ not defined

Using Julia Version 1.5.3 (2020-11-09)

Ah sorry, how quickly you forget… :slight_smile: The parsing of .+ has actually changed in Julia 1.6 (I’m using the release candidate), see point 6 on the release notes here: julia/ at release-1.6 Β· JuliaLang/julia Β· GitHub

To get something similar in 1.5 you can use an anonymous function:

julia> using DataFrames

julia> df_array = [DataFrame(A = rand(5), B = 1:5) for _ ∈ 1:10_000];

julia> reduce((x, y) -> x .+ y, df_array) ./ length(df_array)
5Γ—2 DataFrame
 Row β”‚ A         B       
     β”‚ Float64   Float64 
   1 β”‚ 0.50288       1.0
   2 β”‚ 0.5007        2.0
   3 β”‚ 0.49985       3.0
   4 β”‚ 0.499544      4.0
   5 β”‚ 0.502726      5.0

Here’s an alternative which may be useful.
The advantage is that any stat function could be applied

using Statistics

# Create a DF to store results into
# There are more efficient ways of doing this but here I just copy one of the existing DFs
mean_df = copy(df1)  

# Calculate the mean
mean_df[:, :] = mean(cat(map(Matrix, [df1, df2, df3]), dims=3))


The call to map() converts each DataFrame into a matrix.

cat(..., dims=3) then concatenates these along the third dimension, creating an N-dimensional matrix

We can then simply apply the stat function (mean() in this case, but could be any function) to the result.

mean_df[:, :] = assigns the values in the target DataFrame (note the [:, :])

1 Like

An alternative to the 3rd dimension could be to use this tip to use a tuple shield to broadcast over the array of matrices:

using DataFrames, Statistics
dfn = [DataFrame(i*ones(4,3),:auto) for i in 1:4] 
mβ‚€, = mean.((map(Matrix, dfn),))
Οƒβ‚˜, = std.((map(Matrix, dfn),))
1 Like

I am one year late to the party. Let me add that in Julia broadcasting has one special case when recycling is allowed. It is when one of the collections has length 1 in some dimension. Example:

julia> [1 2] .+ [11 12; 13 14]
2Γ—2 Matrix{Int64}:
 12  14
 14  16

and the same applies to DataFrames:

julia> df1 = DataFrame([1 2], :auto)
1Γ—2 DataFrame
 Row β”‚ x1     x2
     β”‚ Int64  Int64
   1 β”‚     1      2

julia> df2 = DataFrame([11 12; 13 14], :auto)
2Γ—2 DataFrame
 Row β”‚ x1     x2
     β”‚ Int64  Int64
   1 β”‚    11     12
   2 β”‚    13     14

julia> df1 .+ df2
2Γ—2 DataFrame
 Row β”‚ x1     x2
     β”‚ Int64  Int64
   1 β”‚    12     14
   2 β”‚    14     16
1 Like