 # Calculation question

I am trying to make a graph of over 200 emergence curves from a dataset that is structured below.

`````` Row │ Farm    Variety  Container  ED       Wasp
│ String  String   String     Float64  Int16
─────┼────────────────────────────────────────────
1 │ Talsma  Aurora   A1-1       15.375       1
2 │ Talsma  Aurora   A1-1       15.5417      1
3 │ Talsma  Aurora   A1-1       15.5417      1
4 │ Talsma  Aurora   A1-1       16.5417      2
⋮  │   ⋮        ⋮         ⋮         ⋮       ⋮
275 │ Talsma  Aurora   A9-5       16.5417      1
276 │ Talsma  Aurora   A9-5       17.5417      1
277 │ Talsma  Aurora   A9-5       17.5417      2
278 │ Talsma  Aurora   A9-5       18.5417      1
``````

ED is the elapsed days from setting them out.
So what I need to do is calculate the percentage of the total wasps (small non-stinging) from each Container that has emerged at the time points

I have imported the data into a DataFrame and have used

``````varDaySum=combine(varmean,:VarietyTotal=>sum)
``````

Which results in a DataFrame that looks like this

``````48×2 DataFrame
Row │ Container  Wasp_sum
│ String     Int64
─────┼─────────────────────
1 │ A1-1             55
2 │ A1-2             21
3 │ A1-3              8
4 │ A1-4              4
⋮  │     ⋮         ⋮
45 │ A9-2             11
46 │ A9-3              1
47 │ A9-4             11
48 │ A9-5              5
``````

Where varmean is a grouped DataFrame created from

``````varmean=groupby(dataset,:ED)
``````

What I cannot figure out is how to calculate the cumlative percentages at each time point.

For example for the Container A1-1

``````Row │ ED       Wasp   Container
│ Float64  Int16  String
─────┼───────────────────────────
1 │ 15.375       1  A1-1
2 │ 15.5417      1  A1-1
3 │ 15.5417      1  A1-1
4 │ 16.5417      2  A1-1
5 │ 16.5417     38  A1-1
6 │ 16.875       3  A1-1
7 │ 16.875       3  A1-1
8 │ 17.5417      2  A1-1
9 │ 17.7083      1  A1-1
10 │ 18.375       1  A1-1
11 │ 19.5417      1  A1-1
12 │ 21.875       1  A1-1
``````

The series would be as follows

``````Time point        Cumlative count             Percent
15.375                        1                            1/55=0.0181818
15.5417                       3                            3/55=0.0545454
16.5417                      43                           44/55=0.80000
``````

etc

There must be an elegant coding solution and not a long series of for next loops building a DataFrame one column at time.

Mike Sergeant

You’re basically doing this:

``````julia> using DataFrames

julia> test = DataFrame(ED = [15.3, 15.5, 15.5, 16.5, 16.5], Wasp = [1, 1, 1, 2, 38])
5×2 DataFrame
Row │ ED       Wasp
│ Float64  Int64
─────┼────────────────
1 │    15.3      1
2 │    15.5      1
3 │    15.5      1
4 │    16.5      2
5 │    16.5     38

julia> test_grouped = combine(groupby(test, :ED), :Wasp => sum)
3×2 DataFrame
Row │ ED       Wasp_sum
│ Float64  Int64
─────┼───────────────────
1 │    15.3         1
2 │    15.5         2
3 │    16.5        40

julia> test_grouped[!, :cumulative] = cumsum(test_grouped.Wasp_sum);

julia> test_grouped[!, :Percent] = test_grouped.Wasp_sum ./ maximum(test_grouped.Wasp_sum);

julia> test_grouped
3×4 DataFrame
Row │ ED       Wasp_sum  cumulative  Percent
│ Float64  Int64     Int64       Float64
─────┼────────────────────────────────────────
1 │    15.3         1           1    0.025
2 │    15.5         2           3    0.05
3 │    16.5        40          43    1.0
``````

And then just apply this across `Container` (ie apply these steps to the data grouped by `Container`)

There’s definitely also some sort of `@chain`-type solution where you write out all these steps in one macro call without the intermediate `test_grouped` DataFrame - I generally prefer the simplicity of base DataFrames but I’m sure someone will be around soon to show the macro solution.

1 Like

like this ?

(for some values of “elegant”)

``````julia> d = DataFrame([[1,1,2,2,5], [1,2,3,4,5]], ["a", "b"])
5×2 DataFrame
Row │ a      b
│ Int64  Int64
─────┼──────────────
1 │     1      1
2 │     1      2
3 │     2      3
4 │     2      4
5 │     5      5

julia> tw = sum(d[!, :b])
15

julia> a1 = combine(groupby(d, :a), :a=>sum, :b=>cumsum)
5×3 DataFrame
Row │ a      a_sum  b_cumsum
│ Int64  Int64  Int64
─────┼────────────────────────
1 │     1      2         1
2 │     1      2         3
3 │     2      4         3
4 │     2      4         7
5 │     5      5         5

julia> a2 = combine(groupby(a1, :a), [:a_sum=>maximum, :b_cumsum=>maximum])
3×3 DataFrame
Row │ a      a_sum_maximum  b_cumsum_maximum
│ Int64  Int64          Int64
─────┼────────────────────────────────────────
1 │     1              2                 3
2 │     2              4                 7
3 │     5              5                 5

julia> a3 = combine(a2, [:a=>(a)->a, :b_cumsum_maximum=>cumsum, :b_cumsum_maximum=>(b)->cumsum(b)/tw])
3×3 DataFrame
Row │ a_function  b_cumsum_maximum_cumsum  b_cumsum_maximum_function
│ Int64       Int64                    Float64
─────┼────────────────────────────────────────────────────────────────
1 │          1                        3                   0.2
2 │          2                       10                   0.666667
3 │          5                       15                   1.0

``````

or by using Pipe

`````` @pipe combine(groupby(d, :a), :a=>sum, :b=>cumsum) |>
combine(groupby(_, :a), [:a_sum=>maximum, :b_cumsum=>maximum]) |>
combine(_, [:a=>(a)->a, :b_cumsum_maximum=>cumsum, :b_cumsum_maximum=>(b)->cumsum(b)/tw])
``````

Thank you. I think I basically had this but kept getting lost in the logic. Need to go through the coding to see where I missed.

Mike

1 Like

It took me a few gos and I do this stuff all day 