Groupby, aggregate with unstack on multiple columns

Hi everyone.

I’m working a little project, and I have to translate python code to julia.
For a toy example, here is my dataset: https://raw.githubusercontent.com/clementpoiret/Pingouin.jl/master/datasets/rm_anova2.csv

Sample:

6×4 DataFrame
│ Row │ Subject │ Time   │ Metric  │ Performance │
│     │ Int64   │ String │ String  │ Int64       │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1   │ 1       │ Pre    │ Product │ 13          │
│ 2   │ 2       │ Pre    │ Product │ 12          │
│ 3   │ 3       │ Pre    │ Product │ 17          │
│ 4   │ 4       │ Pre    │ Product │ 12          │
│ 5   │ 5       │ Pre    │ Product │ 19          │
│ 6   │ 6       │ Pre    │ Product │ 6           │

I want to groupby one or more column, then aggregate on :Performance (mean value), such as the end result if I group only by :Metric is (python output here):

Metric   Action  Client  Product
Subject                         
1          25.5    21.0     15.5
2          24.0    18.5      9.0
3          28.0    25.0     19.0
4          32.5    32.0     15.0
5          23.0    27.5     18.5
6          14.5    15.0      6.0
7          34.0    27.0     20.5
8          38.0    32.5     20.0
9          28.0    34.0     20.5
10         29.0    18.5     21.0

And if I have two factors (:Time, :Metric), I unstack by :Metric, groupby :Time and diff the mean value of :Performance, such as:

Metric   Action  Client  Product
Subject                         
1         -17.0   -18.0     -5.0
2         -12.0     1.0      6.0
3          -8.0   -12.0     -4.0
4         -15.0   -14.0     -6.0
5          -8.0    -1.0      1.0
6         -17.0    -6.0      0.0
7          -8.0   -18.0     -7.0
8          -4.0    -7.0     -4.0
9          -8.0    -8.0      5.0
10        -10.0   -13.0     -6.0

I’m quite new to Julia, and I can’t find a solution. The only solutions I get are grouping by values in only one column…
Do you have any clue?

Thank you,
Clément

1 Like

Check out this comparison with pandas on the latest version of the DataFrames documentation.

You can group on multiple columns with groupby(df, [:Time, :Metric])

4 Likes

With data being the dataframe of the OP, it sounds like this produces the correct output.

dv = :Performance
subject = :Subject
within = [:Time, :Metric]

grp = combine(groupby(data, [subject, within...], skipmissing=true), dv => mean => dv)
if length(within) == 1
    grp = unstack(grp, subject, within..., dv)
    data = grp[:, Not(subject)]
elseif length(within) == 2
    grp = groupby(grp, within[1])
    grp = [g[:, Not(within[1])] for g in grp]
    grp = unstack.(grp, subject, within[2], dv)
    grp = [g[:, Not(subject)] for g in grp]

    data = grp[1]
    for i in 2:length(grp)
        data .-= grp[i]
    end
end

To my eyes, this is not really an optimal solution, do you see any ways to improve the code?

Thanks :slight_smile:

I do have a solution! I think it’s easiest if you unstack at the very beginning. One of the benefits of DataFrames is that you don’t have to always work in the “tidy data” paradigm because its very easy to work with many columns.

julia> data_wide = unstack(data, "Metric", "Performance");

julia> combine(groupby(data_wide, [:Subject]), [:Action, :Client, :Product] .=> diff)
10×4 DataFrame
│ Row │ Subject │ Action_diff │ Client_diff │ Product_diff │
│     │ Int64   │ Int64       │ Int64       │ Int64        │
├─────┼─────────┼─────────────┼─────────────┼──────────────┤
│ 1   │ 1       │ -17         │ -18         │ -5           │
│ 2   │ 2       │ -12         │ 1           │ 6            │
│ 3   │ 3       │ -8          │ -12         │ -4           │
│ 4   │ 4       │ -15         │ -14         │ -6           │
│ 5   │ 5       │ -8          │ -1          │ 1            │
│ 6   │ 6       │ -17         │ -6          │ 0            │
│ 7   │ 7       │ -8          │ -18         │ -7           │
│ 8   │ 8       │ -4          │ -7          │ -4           │
│ 9   │ 9       │ -8          │ -8          │ 5            │
│ 10  │ 10      │ -10         │ -13         │ -6           │

I’m not 100% sure what you want in the length(within) == 1 case, but hopefully the technique is similar.

4 Likes

Much simpler code! Thanks mate :smiley: