How to take the mean of entries across an array of DataFrames conditional upon the value of a separate column?

I have a question regarding the correct way to store an n dimensional matrix in julia - and how one would take the mean across an axis based on conditions of a different column.

Suppose V is a vector of DataFrames consisting of entries like the following.

V[1] = 
 Row │ Name    Time     Data  
     │ String  Day      Float64 
   1 │ A      1 day      1
   2 │ A      2 days     2
   3 │ A      3 days     3
   4 │ A      4 days     4

V[2] = 

 Row │ Name    Time     Data  
     │ String  Day      Float64 
   1 │ B      1 day      5
   2 │ B      2 days     6
   3 │ B      3 days     7
   4 │ B      5 days     8

V[3] =

 Row │ Name    Time     Data  
     │ String  Day      Float64 
   1 │ C      1 day       9
   2 │ C      3 days     10
   3 │ C      6 days     11
   4 │ C      7 days     12

  1. Is this a good/efficient way to store this kind of data? It seems a poor choice because the “Name” column seems redundant, but I was unsure about what would be the appropriate alternative.

  2. Is there an efficient way to take an average of the entries in the Data Column across the DataFrames based one the value of a separate column? So suppose I wanted the 2 day average for the Data Column across the DataFrames in V, i.e. 4. How could this be efficiently achieved?

I saw on a separate posting that if I wanted to just average the Data Columns I could do something like. V[1][ : , 3 ] + V[2][ : , 3 ] +V[3][ : , 3 ]/3 however that would not work here because of the different days involved. I wanted to average only the values where the Time column values match.

I think I could go through each DataFrame with a nested loop which but my understanding is that is not a great practice and not particularly efficient. It would probably look something like this

# 1 collect a list of all the unique days  
   uniquedays = [ ] 
   for i in 1:size(V)[1]
         a = unique(V[i].Time)
         uniquedays = vcat(uniquedays,a)
   uniquedays = unique(uniquedays)

# 2 loop through each DataFrame Checking for Data on each of the unique days 
# and storing the results in a DataFrame     
   DF = DataFrame(Time = Day[], Average = Float64[])
   count = 1 
   while count < = size(uniquedays)[1]
        data = [ ]
        for i in 1:size(V)[1]
# add the datapoint of the correct day to data as a Float64 
            push!(data, V[i][V[i].Time.==Day(uniquedays[count]),3][1]) 
        avg = mean(data) 
        push!( DF, [uniquedays[count], avg])
        count +=1 

I’m not sure this is the best or even a good approach. Is there was a more efficient/straightforward way to do this type of conditional averaging across DataFrames?

Why are you working with a vector of DataFrames? Why not just

vdf = reduce(vcat, V)

and then

combine(groupby(vdf, :Time), :Data => mean)



Thank you so much! well …I think the short answer is because I am stupid…but would you mind walking me through the second snippet of code where you use combine ? what is going on there and why does it work?

This is the traditional split-apply-combine approach to data analysis. In DataFrames it is implemented by its own “minilanguage”, which is thoroughly explained here:

The tl;dr for your use case here is groupby(vdf, :Time) creates a GroupedDataFrame object, in which the full vdf DataFrame is essentially split into sub-DataFrames based on the value of the :Time column. The second part :Data => mean then says "for each of these sub-DataFrames, take the column :Data (as a vector) and apply the function mean to it.


if you start from a 3D array you could do something like this.

V=[[repeat(["A"],4) string.(1:4,"d") [1:4...]];;;
[repeat(["B"],4) string.(2:5,"d") [5:8...]];;;
[repeat(["C"],4) string.(8:-2:2,"d") [9:12...]]]



mean(V[map(i->i+ci, ci2d)])



selectdim(V,2,3)[selectdim(V, 2,2).=="2d"]

1 Like

Thank you this was very helpful!