Mean of multiple DataFrames

Sleepy · February 10, 2021, 11:30am

Hi,

we are given two or more DataFrames

using DataFrames, Statistics

df1 = DataFrame(A = 1:5, B = 3:7)
df2 = DataFrame(A = 10:14, B = 3:7)

What is a fast way to calculate the mean of these frames?

In the example above, the result should be also a dataframe where the entries of first column equals (1+10)/2, (2+11)/2… and the second column (3+3)/2,(4+4)/2,…

Thank you!

nilshg · February 10, 2021, 12:12pm

A simple way:

julia> (df1 .+ df2) ./ 2
5×2 DataFrame
 Row │ A        B       
     │ Float64  Float64 
─────┼──────────────────
   1 │     5.5      3.0 
   2 │     6.5      4.0 
   3 │     7.5      5.0 
   4 │     8.5      6.0 
   5 │     9.5      7.0

Palli · February 10, 2021, 12:26pm

I do notice the names of the columns much match In Julia (seems sensible) and also length or you get:

ERROR: DimensionMismatch(“arrays could not be broadcast to a common size; got a dimension with lengths 5 and 4”)

so I checked in R, and it’s the same unless a multiple of the length:

A ← 1:2
B ← 1:3
A+B
[1] 2 4 4
Warning message:
In A + B : longer object length is not a multiple of shorter object length
B ← 1:4
A+B
[1] 2 4 4 6

I was maybe expecting NA or NaN for extra rows (is there an easy way?) but more importantly any idea about what’s the idea behind the repeating/multiple R behavior and if Julia’s DataFrames should support such (maybe optionally).

nilshg · February 10, 2021, 12:34pm

This is called recycling in R and imho is a terrible footgun - Julia is quite consistent in asking users to be explicit in their intentions rather than trying to guess and rely on DWIM, which I think is one of the strenghts of the language. It does mean more verbosity/less convenience in some situations, but I think it’s a tradeoff well worth making.

Incidentally I think this is an excellent example where the Julia behaviour makes life easier: the broadcasted dot makes it clear that addition happens elementwise, and it doesn’t make sense to do this for shapes that don’t match - rather than coming up with a “solution” to this “problem” for the user, DataFrames asks people to be explicit what they think should happen in these cases. Your missing suggestion would mean just pad out the smaller DataFrame with missing (where, btw - at the end? The start? Randomly in between?), but my guess is that in 9x% of all cases where this happens the fact that someone tries to add different-sized DataFrames is actually a bug in their code, and it’s helpful that an error is raised rather than a silent workaround performed in the background.

Sleepy · February 10, 2021, 12:58pm

Another question. Consider an array of DataFrames e.g. [df1,df2,df3,…,dfn].

How would you calculate the mean in this case?

nilshg · February 10, 2021, 1:01pm


julia> reduce(.+, [df1, df2, df3]) ./ 3
5×2 DataFrame
 Row │ A         B       
     │ Float64   Float64 
─────┼───────────────────
   1 │  8.66667      3.0
   2 │  9.66667      4.0
   3 │ 10.6667       5.0
   4 │ 11.6667       6.0
   5 │ 12.6667       7.0

or more generally:

reduce(.+, df_array) ./ length(df_array)

Sleepy · February 10, 2021, 1:14pm

Unfortunately, I got an error

using DataFrames, Statistics

df1 = DataFrame(A = 1:5, B = 3:7)
df2 = DataFrame(A = 10:14, B = 3:7)

reduce(.+, [df1, df2]) ./ 2

ERROR: LoadError: UndefVarError: .+ not defined

Using Julia Version 1.5.3 (2020-11-09)

nilshg · February 10, 2021, 2:15pm

Ah sorry, how quickly you forget… The parsing of .+ has actually changed in Julia 1.6 (I’m using the release candidate), see point 6 on the release notes here: https://github.com/JuliaLang/julia/blob/release-1.6/NEWS.md#new-language-features

To get something similar in 1.5 you can use an anonymous function:

julia> using DataFrames

julia> df_array = [DataFrame(A = rand(5), B = 1:5) for _ ∈ 1:10_000];

julia> reduce((x, y) -> x .+ y, df_array) ./ length(df_array)
5×2 DataFrame
 Row │ A         B       
     │ Float64   Float64 
─────┼───────────────────
   1 │ 0.50288       1.0
   2 │ 0.5007        2.0
   3 │ 0.49985       3.0
   4 │ 0.499544      4.0
   5 │ 0.502726      5.0

ConnectedSystems · April 17, 2022, 1:21am

Here’s an alternative which may be useful.
The advantage is that any stat function could be applied

using Statistics

# Create a DF to store results into
# There are more efficient ways of doing this but here I just copy one of the existing DFs
mean_df = copy(df1)  

# Calculate the mean
mean_df[:, :] = mean(cat(map(Matrix, [df1, df2, df3]), dims=3))

Explanation:

The call to map() converts each DataFrame into a matrix.

cat(..., dims=3) then concatenates these along the third dimension, creating an N-dimensional matrix

We can then simply apply the stat function (mean() in this case, but could be any function) to the result.

mean_df[:, :] = assigns the values in the target DataFrame (note the [:, :])

rafael.guerra · April 17, 2022, 7:58am

Nice.
An alternative to the 3rd dimension could be to use this tip to use a tuple shield to broadcast over the array of matrices:

using DataFrames, Statistics
dfn = [DataFrame(i*ones(4,3),:auto) for i in 1:4] 
m₀, = mean.((map(Matrix, dfn),))
σₘ, = std.((map(Matrix, dfn),))

bkamins · April 17, 2022, 9:12am

I am one year late to the party. Let me add that in Julia broadcasting has one special case when recycling is allowed. It is when one of the collections has length 1 in some dimension. Example:

julia> [1 2] .+ [11 12; 13 14]
2×2 Matrix{Int64}:
 12  14
 14  16

and the same applies to DataFrames:

julia> df1 = DataFrame([1 2], :auto)
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df2 = DataFrame([11 12; 13 14], :auto)
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │    11     12
   2 │    13     14

julia> df1 .+ df2
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │    12     14
   2 │    14     16