Calculation question

I am trying to make a graph of over 200 emergence curves from a dataset that is structured below.

 Row │ Farm    Variety  Container  ED       Wasp  
     │ String  String   String     Float64  Int16 
─────┼────────────────────────────────────────────
   1 │ Talsma  Aurora   A1-1       15.375       1
   2 │ Talsma  Aurora   A1-1       15.5417      1
   3 │ Talsma  Aurora   A1-1       15.5417      1
   4 │ Talsma  Aurora   A1-1       16.5417      2
  ⋮  │   ⋮        ⋮         ⋮         ⋮       ⋮
 275 │ Talsma  Aurora   A9-5       16.5417      1
 276 │ Talsma  Aurora   A9-5       17.5417      1
 277 │ Talsma  Aurora   A9-5       17.5417      2
 278 │ Talsma  Aurora   A9-5       18.5417      1

ED is the elapsed days from setting them out.
So what I need to do is calculate the percentage of the total wasps (small non-stinging) from each Container that has emerged at the time points

I have imported the data into a DataFrame and have used

varDaySum=combine(varmean,:VarietyTotal=>sum)

Which results in a DataFrame that looks like this

48×2 DataFrame
 Row │ Container  Wasp_sum 
     │ String     Int64    
─────┼─────────────────────
   1 │ A1-1             55
   2 │ A1-2             21
   3 │ A1-3              8
   4 │ A1-4              4
  ⋮  │     ⋮         ⋮
  45 │ A9-2             11
  46 │ A9-3              1
  47 │ A9-4             11
  48 │ A9-5              5

Where varmean is a grouped DataFrame created from

varmean=groupby(dataset,:ED)

What I cannot figure out is how to calculate the cumlative percentages at each time point.

For example for the Container A1-1

Row │ ED       Wasp   Container 
     │ Float64  Int16  String    
─────┼───────────────────────────
   1 │ 15.375       1  A1-1
   2 │ 15.5417      1  A1-1
   3 │ 15.5417      1  A1-1
   4 │ 16.5417      2  A1-1
   5 │ 16.5417     38  A1-1
   6 │ 16.875       3  A1-1
   7 │ 16.875       3  A1-1
   8 │ 17.5417      2  A1-1
   9 │ 17.7083      1  A1-1
  10 │ 18.375       1  A1-1
  11 │ 19.5417      1  A1-1
  12 │ 21.875       1  A1-1

The series would be as follows

Time point        Cumlative count             Percent 
15.375                        1                            1/55=0.0181818
15.5417                       3                            3/55=0.0545454
16.5417                      43                           44/55=0.80000

etc

There must be an elegant coding solution and not a long series of for next loops building a DataFrame one column at time.

Mike Sergeant

You’re basically doing this:

julia> using DataFrames

julia> test = DataFrame(ED = [15.3, 15.5, 15.5, 16.5, 16.5], Wasp = [1, 1, 1, 2, 38])
5×2 DataFrame
 Row │ ED       Wasp  
     │ Float64  Int64 
─────┼────────────────
   1 │    15.3      1
   2 │    15.5      1
   3 │    15.5      1
   4 │    16.5      2
   5 │    16.5     38

julia> test_grouped = combine(groupby(test, :ED), :Wasp => sum)
3×2 DataFrame
 Row │ ED       Wasp_sum 
     │ Float64  Int64    
─────┼───────────────────
   1 │    15.3         1
   2 │    15.5         2
   3 │    16.5        40

julia> test_grouped[!, :cumulative] = cumsum(test_grouped.Wasp_sum);

julia> test_grouped[!, :Percent] = test_grouped.Wasp_sum ./ maximum(test_grouped.Wasp_sum);

julia> test_grouped
3×4 DataFrame
 Row │ ED       Wasp_sum  cumulative  Percent 
     │ Float64  Int64     Int64       Float64 
─────┼────────────────────────────────────────
   1 │    15.3         1           1    0.025
   2 │    15.5         2           3    0.05
   3 │    16.5        40          43    1.0

And then just apply this across Container (ie apply these steps to the data grouped by Container)

There’s definitely also some sort of @chain-type solution where you write out all these steps in one macro call without the intermediate test_grouped DataFrame - I generally prefer the simplicity of base DataFrames but I’m sure someone will be around soon to show the macro solution.

1 Like

like this ?

(for some values of “elegant”)

julia> d = DataFrame([[1,1,2,2,5], [1,2,3,4,5]], ["a", "b"])
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     2      3
   4 │     2      4
   5 │     5      5

julia> tw = sum(d[!, :b])
15

julia> a1 = combine(groupby(d, :a), :a=>sum, :b=>cumsum)
5×3 DataFrame
 Row │ a      a_sum  b_cumsum
     │ Int64  Int64  Int64
─────┼────────────────────────
   1 │     1      2         1
   2 │     1      2         3
   3 │     2      4         3
   4 │     2      4         7
   5 │     5      5         5

julia> a2 = combine(groupby(a1, :a), [:a_sum=>maximum, :b_cumsum=>maximum])
3×3 DataFrame
 Row │ a      a_sum_maximum  b_cumsum_maximum
     │ Int64  Int64          Int64
─────┼────────────────────────────────────────
   1 │     1              2                 3
   2 │     2              4                 7
   3 │     5              5                 5

julia> a3 = combine(a2, [:a=>(a)->a, :b_cumsum_maximum=>cumsum, :b_cumsum_maximum=>(b)->cumsum(b)/tw])
3×3 DataFrame
 Row │ a_function  b_cumsum_maximum_cumsum  b_cumsum_maximum_function
     │ Int64       Int64                    Float64
─────┼────────────────────────────────────────────────────────────────
   1 │          1                        3                   0.2
   2 │          2                       10                   0.666667
   3 │          5                       15                   1.0

or by using Pipe

 @pipe combine(groupby(d, :a), :a=>sum, :b=>cumsum) |>
       combine(groupby(_, :a), [:a_sum=>maximum, :b_cumsum=>maximum]) |>
       combine(_, [:a=>(a)->a, :b_cumsum_maximum=>cumsum, :b_cumsum_maximum=>(b)->cumsum(b)/tw])

Thank you. I think I basically had this but kept getting lost in the logic. Need to go through the coding to see where I missed.

Mike

1 Like

It took me a few gos and I do this stuff all day :slight_smile: