How to take the mean of entries across an array of DataFrames conditional upon the value of a separate column?

phantom · May 11, 2022, 8:50am

I have a question about the right way to store an n-dimensional matrix in Julia, and about how to take the mean across an axis conditional on the values of a different column.

Suppose V is a vector of DataFrames consisting of entries like the following.

V[1] = 
 
 Row │ Name    Time     Data  
     │ String  Day      Float64 
─────┼─────────────────────────────────────
   1 │ A      1 day      1
   2 │ A      2 days     2
   3 │ A      3 days     3
   4 │ A      4 days     4

V[2] = 

 Row │ Name    Time     Data  
     │ String  Day      Float64 
─────┼─────────────────────────────────────
   1 │ B      1 day      5
   2 │ B      2 days     6
   3 │ B      3 days     7
   4 │ B      5 days     8

V[3] =

 Row │ Name    Time     Data  
     │ String  Day      Float64 
─────┼─────────────────────────────────────
   1 │ C      1 day       9
   2 │ C      3 days     10
   3 │ C      6 days     11
   4 │ C      7 days     12

Is this a good/efficient way to store this kind of data? It seems a poor choice because the “Name” column seems redundant, but I was unsure about what would be the appropriate alternative.
Is there an efficient way to take an average of the entries in the Data column across the DataFrames based on the value of a separate column? Suppose I wanted the 2-day average of the Data column across the DataFrames in V, i.e. 4 (the mean of 2 and 6, since V[3] has no 2-day row). How could this be done efficiently?

I saw in a separate post that if I just wanted to average the Data columns I could do something like (V[1][:, 3] + V[2][:, 3] + V[3][:, 3]) / 3. However, that would not work here because of the different days involved: I want to average only the values whose Time entries match.
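To make the problem with row-by-row averaging concrete, here is a minimal sketch that uses plain vectors in place of the Data and Time columns from the tables above. The names `datacols`, `timecols`, `positional` and `aligned` are made up for illustration, and Time is written as whole-day integers so the snippet needs no packages.

```julia
# Data and Time columns of V[1], V[2], V[3] as plain vectors
# (Time is in whole days; an assumption to keep the sketch dependency-free)
datacols = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
timecols = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 3, 6, 7]]

# Row-by-row average; the parentheses matter because `/` binds tighter than `+`
positional = (datacols[1] + datacols[2] + datacols[3]) / 3

# Rows in which all three Time values agree; only there does the
# row-by-row average compare like with like
aligned = [timecols[1][i] == timecols[2][i] == timecols[3][i] for i in 1:4]
```

With this data only the first row is aligned across all three tables, so the remaining row-by-row averages mix values from different days.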

I think I could go through each DataFrame with a nested loop, but my understanding is that this is not great practice and not particularly efficient. It would probably look something like this:

using DataFrames, Dates, Statistics

# 1. Collect a list of all the unique days across the DataFrames
uniquedays = Day[]
for df in V
    append!(uniquedays, df.Time)
end
unique!(uniquedays)

# 2. Loop through each DataFrame, checking for data on each of the unique days,
#    and store the results in a DataFrame
DF = DataFrame(Time = Day[], Average = Float64[])
for day in uniquedays
    data = Float64[]
    for df in V
        # A DataFrame may have no row for this day, so append whatever matches
        # (possibly nothing) instead of indexing the first match
        append!(data, df.Data[df.Time .== day])
    end
    push!(DF, (day, mean(data)))
end

I’m not sure this is the best or even a good approach. Is there a more efficient/straightforward way to do this type of conditional averaging across DataFrames?

nilshg · May 11, 2022, 9:12am

Why are you working with a vector of DataFrames? Why not just

vdf = reduce(vcat, V)

and then

combine(groupby(vdf, :Time), :Data => mean)

?

phantom · May 11, 2022, 9:29am

Thank you so much! Well, I think the short answer is because I am stupid... but would you mind walking me through the second snippet of code, where you use combine? What is going on there and why does it work?

nilshg · May 11, 2022, 9:47am

This is the traditional split-apply-combine approach to data analysis. In DataFrames it is implemented by its own “minilanguage”, which is thoroughly explained here:

The tl;dr for your use case here is that groupby(vdf, :Time) creates a GroupedDataFrame object, in which the full vdf DataFrame is essentially split into sub-DataFrames based on the value of the :Time column. The second part, :Data => mean, then says "for each of these sub-DataFrames, take the column :Data (as a vector) and apply the function mean to it".
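As a worked illustration of the two snippets above, here is a sketch that rebuilds the three example tables from the question and runs the same reduce/groupby/combine calls on them. It assumes DataFrames.jl is installed; the names `result` and `twoday` are only for illustration.

```julia
using DataFrames, Dates, Statistics

# Rebuild the three example tables from the question
V = [
    DataFrame(Name = fill("A", 4), Time = Day.([1, 2, 3, 4]), Data = [1.0, 2.0, 3.0, 4.0]),
    DataFrame(Name = fill("B", 4), Time = Day.([1, 2, 3, 5]), Data = [5.0, 6.0, 7.0, 8.0]),
    DataFrame(Name = fill("C", 4), Time = Day.([1, 3, 6, 7]), Data = [9.0, 10.0, 11.0, 12.0]),
]

# Stack all rows into one long table
vdf = reduce(vcat, V)

# Split by Time, apply mean to Data within each group,
# and combine into one row per distinct Time value
result = combine(groupby(vdf, :Time), :Data => mean)

# Look up the 2-day average; the output column is auto-named Data_mean
twoday = only(result[result.Time .== Day(2), :Data_mean])
```

Here `twoday` comes out as 4.0, matching the value expected in the question. In the stacked table the Name column is no longer redundant, since it records which original table each row came from.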

rocco_sprmnt21 · May 11, 2022, 5:07pm

If you start from a 3D array, you could do something like this.

V=[[repeat(["A"],4) string.(1:4,"d") [1:4...]];;;
[repeat(["B"],4) string.(2:5,"d") [5:8...]];;;
[repeat(["C"],4) string.(8:-2:2,"d") [9:12...]]]


ci2d=findall(==("2d"),V)

ci=CartesianIndex(0,1,0)

mean(V[map(i->i+ci, ci2d)])

or

V[:,3,:][V[:,2,:].=="2d"]

or

selectdim(V,2,3)[selectdim(V, 2,2).=="2d"]
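For reference, here is a self-contained sketch of the selectdim variant that also takes the mean. It builds the same 4×3×3 array slab by slab with cat(...; dims = 3) instead of the ;;; literal, which is a substitution made here for clarity; the names `A`, `B`, `C`, `mask`, `vals` and `m` are illustrative.

```julia
using Statistics

# One 4×3 slab per name: columns are name, time label, value
A = [repeat(["A"], 4) string.(1:4, "d") [1:4...]]
B = [repeat(["B"], 4) string.(2:5, "d") [5:8...]]
C = [repeat(["C"], 4) string.(8:-2:2, "d") [9:12...]]

# Stack the slabs along the third dimension
V = cat(A, B, C; dims = 3)

# Boolean mask marking, for each slab, the rows whose time label is "2d"
mask = selectdim(V, 2, 2) .== "2d"

# Values from the third column at the masked positions, then their mean
vals = selectdim(V, 2, 3)[mask]
m = mean(vals)
```

Note that this sample array uses different times than the tables in the question, so the "2d" values picked out here are 2, 5 and 12 rather than 2 and 6.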

phantom · May 11, 2022, 10:56pm

Thank you this was very helpful!
