using DataFrames, Statistics
df1 = DataFrame(A = 1:5, B = 3:7)
df2 = DataFrame(A = 10:14, B = 3:7)
What is a fast way to calculate the mean of these frames?
In the example above, the result should also be a DataFrame where the entries of the first column equal (1+10)/2, (2+11)/2, … and those of the second column equal (3+3)/2, (4+4)/2, …
I do notice that the column names must match in Julia (which seems sensible), and so must the lengths, or you get:
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 5 and 4")
So I checked in R, and it also complains unless the longer length is a multiple of the shorter one:
A <- 1:2
B <- 1:3
A+B
[1] 2 4 4
Warning message:
In A + B : longer object length is not a multiple of shorter object length
B <- 1:4
A+B
[1] 2 4 4 6
I was maybe expecting NA or NaN for the extra rows (is there an easy way?), but more importantly, any idea what the reasoning is behind R's repeating/multiple behavior, and whether Julia's DataFrames should support it (maybe optionally)?
This is called recycling in R and imho is a terrible footgun - Julia is quite consistent in asking users to be explicit about their intentions rather than trying to guess and rely on DWIM, which I think is one of the strengths of the language. It does mean more verbosity/less convenience in some situations, but I think it's a tradeoff well worth making.
Incidentally, I think this is an excellent example of where the Julia behaviour makes life easier: the broadcasted dot makes it clear that addition happens elementwise, and it doesn't make sense to do this for shapes that don't match - rather than coming up with a "solution" to this "problem" for the user, DataFrames asks people to be explicit about what they think should happen in these cases. Your suggestion would mean just padding out the smaller DataFrame with missing (where, btw - at the end? The start? Randomly in between?), but my guess is that in 9x% of all cases where this happens, the fact that someone tries to add different-sized DataFrames is actually a bug in their code, and it's helpful that an error is raised rather than a silent workaround performed in the background.
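To make "being explicit" concrete, here is a minimal sketch using the df1 and df2 from the question. The shorter frame df_short and the choice to pad at the end are assumptions made up for illustration, not something DataFrames does for you:

```julia
using DataFrames

df1 = DataFrame(A = 1:5, B = 3:7)
df2 = DataFrame(A = 10:14, B = 3:7)

# Elementwise mean of two frames with matching column names and row counts.
mean_df = (df1 .+ df2) ./ 2

# A shorter frame (hypothetical) cannot be broadcast against df1.
df_short = DataFrame(A = 10:13, B = 3:6)
threw = try
    df1 .+ df_short
    false
catch err
    err isa DimensionMismatch
end

# Explicit padding: allow missing values, append one row of missing at the end, then add.
padded = allowmissing(df_short)
push!(padded, (A = missing, B = missing))
padded_sum = df1 .+ padded   # the last row is missing in both columns
```

Where the padding goes (end, start, somewhere else) is decided by the code you write, which is the point made above.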
Here's an alternative which may be useful.
The advantage is that any stat function could be applied.
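using Statistics
# Create a DataFrame to store the results in.
# There are more efficient ways of doing this, but here I just copy one of the existing
# DataFrames, converted to floats so that non-integer means can be stored in place.
mean_df = float.(df1)
# Stack the frames along a third dimension and take the mean along it.
# Any further frames of the same shape can be added to the vector.
stacked = cat(map(Matrix, [df1, df2])..., dims=3)
mean_df[:, :] = dropdims(mean(stacked, dims=3), dims=3)
Explanation:
The call to map() converts each DataFrame into a matrix.
cat(..., dims=3) then concatenates these along the third dimension, creating a three-dimensional array (note the splat, since cat() takes the arrays as separate arguments).
We can then apply the stat function (mean() in this case, but it could be any function that reduces along a dimension) along dims=3, and dropdims() removes the resulting singleton dimension.
mean_df[:, :] = assigns the values in place in the target DataFrame (note the [:, :]), which is why the target columns must already be floating point.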
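To illustrate the claim that any stat function could be applied, the same idea can be wrapped in a small function. This is only a sketch: elementwise_stat is a hypothetical name, and it assumes all frames have the same shape, the same column order, and numeric columns.

```julia
using DataFrames, Statistics

# Hypothetical helper: apply a reduction `f` elementwise across same-shaped frames.
function elementwise_stat(f, dfs::AbstractDataFrame...)
    # Stack into a rows x columns x number-of-frames array.
    stacked = cat(map(Matrix, dfs)..., dims=3)
    # Apply `f` to each vector running along the third dimension.
    reduced = dropdims(mapslices(f, stacked, dims=3), dims=3)
    # Rebuild a DataFrame using the column names of the first frame.
    return DataFrame(reduced, names(first(dfs)))
end

df1 = DataFrame(A = 1:5, B = 3:7)
df2 = DataFrame(A = 10:14, B = 3:7)

mean_result = elementwise_stat(mean, df1, df2)
max_result = elementwise_stat(maximum, df1, df2)
```

Because mapslices hands f a plain vector, anything that maps a vector to a single number (median, std, and so on) should slot in the same way.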
I am one year late to the party. Let me add that in Julia, broadcasting has one special case where recycling is allowed: when one of the collections has length 1 in some dimension. Example:
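A minimal sketch of that rule with plain arrays (an illustration written for this point, not necessarily the code originally posted):

```julia
# A dimension of length 1 is stretched to match the other argument.
v = [1, 2, 3] .+ [10]        # 3-element vector plus 1-element vector
m = [1 2 3] .+ [10, 20]      # 1x3 row plus 2-element column gives a 2x3 matrix

# Lengths that differ and are not 1 are rejected rather than recycled.
mismatch_rejected = try
    [1, 2, 3, 4] .+ [1, 2]
    false
catch err
    err isa DimensionMismatch
end
```

So, unlike R, a length-2 vector is never silently repeated to fit a length-4 one; only singleton dimensions expand.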