Groupby, aggregate with unstack on multiple columns

Clement_POIRET · October 11, 2020, 1:37pm

Hi everyone.

I’m working a little project, and I have to translate python code to julia.
For a toy example, here is my dataset: https://raw.githubusercontent.com/clementpoiret/Pingouin.jl/master/datasets/rm_anova2.csv

Sample:

6×4 DataFrame
│ Row │ Subject │ Time   │ Metric  │ Performance │
│     │ Int64   │ String │ String  │ Int64       │
├─────┼─────────┼────────┼─────────┼─────────────┤
│ 1   │ 1       │ Pre    │ Product │ 13          │
│ 2   │ 2       │ Pre    │ Product │ 12          │
│ 3   │ 3       │ Pre    │ Product │ 17          │
│ 4   │ 4       │ Pre    │ Product │ 12          │
│ 5   │ 5       │ Pre    │ Product │ 19          │
│ 6   │ 6       │ Pre    │ Product │ 6           │

I want to groupby one or more column, then aggregate on :Performance (mean value), such as the end result if I group only by :Metric is (python output here):

Metric   Action  Client  Product
Subject                         
1          25.5    21.0     15.5
2          24.0    18.5      9.0
3          28.0    25.0     19.0
4          32.5    32.0     15.0
5          23.0    27.5     18.5
6          14.5    15.0      6.0
7          34.0    27.0     20.5
8          38.0    32.5     20.0
9          28.0    34.0     20.5
10         29.0    18.5     21.0

And if I have two factors (:Time, :Metric), I unstack by :Metric, groupby :Time and diff the mean value of :Performance, such as:

Metric   Action  Client  Product
Subject                         
1         -17.0   -18.0     -5.0
2         -12.0     1.0      6.0
3          -8.0   -12.0     -4.0
4         -15.0   -14.0     -6.0
5          -8.0    -1.0      1.0
6         -17.0    -6.0      0.0
7          -8.0   -18.0     -7.0
8          -4.0    -7.0     -4.0
9          -8.0    -8.0      5.0
10        -10.0   -13.0     -6.0

I’m quite new to Julia, and I can’t find a solution. The only solutions I get are grouping by values in only one column…
Do you have any clue?

Thank you,
Clément

pdeffebach · October 11, 2020, 1:42pm

Check out this comparison with pandas on the latest version of the DataFrames documentation.

You can group on multiple columns with groupby(df, [:Time, :Metric])

Clement_POIRET · October 11, 2020, 3:01pm

With data being the dataframe of the OP, it sounds like this produces the correct output.

dv = :Performance
subject = :Subject
within = [:Time, :Metric]

grp = combine(groupby(data, [subject, within...], skipmissing=true), dv => mean => dv)
if length(within) == 1
    grp = unstack(grp, subject, within..., dv)
    data = grp[:, Not(subject)]
elseif length(within) == 2
    grp = groupby(grp, within[1])
    grp = [g[:, Not(within[1])] for g in grp]
    grp = unstack.(grp, subject, within[2], dv)
    grp = [g[:, Not(subject)] for g in grp]

    data = grp[1]
    for i in 2:length(grp)
        data .-= grp[i]
    end
end

To my eyes, this is not really an optimal solution, do you see any ways to improve the code?

Thanks

pdeffebach · October 11, 2020, 3:47pm

I do have a solution! I think it’s easiest if you unstack at the very beginning. One of the benefits of DataFrames is that you don’t have to always work in the “tidy data” paradigm because its very easy to work with many columns.

julia> data_wide = unstack(data, "Metric", "Performance");

julia> combine(groupby(data_wide, [:Subject]), [:Action, :Client, :Product] .=> diff)
10×4 DataFrame
│ Row │ Subject │ Action_diff │ Client_diff │ Product_diff │
│     │ Int64   │ Int64       │ Int64       │ Int64        │
├─────┼─────────┼─────────────┼─────────────┼──────────────┤
│ 1   │ 1       │ -17         │ -18         │ -5           │
│ 2   │ 2       │ -12         │ 1           │ 6            │
│ 3   │ 3       │ -8          │ -12         │ -4           │
│ 4   │ 4       │ -15         │ -14         │ -6           │
│ 5   │ 5       │ -8          │ -1          │ 1            │
│ 6   │ 6       │ -17         │ -6          │ 0            │
│ 7   │ 7       │ -8          │ -18         │ -7           │
│ 8   │ 8       │ -4          │ -7          │ -4           │
│ 9   │ 9       │ -8          │ -8          │ 5            │
│ 10  │ 10      │ -10         │ -13         │ -6           │

I’m not 100% sure what you want in the length(within) == 1 case, but hopefully the technique is similar.

Clement_POIRET · October 12, 2020, 9:12am

Much simpler code! Thanks mate

Topic		Replies	Views
Combining a col from each DF group into a single DF New to Julia question , dataframes	5	296	August 25, 2022
Dataframes: Split combined result to different columns General Usage dataframes	3	310	December 13, 2021
Translation groupby and agg and join python to julia General Usage	4	934	April 8, 2021
Comparing DataFrames native API and Query Data	4	1525	September 1, 2017
JuliaDB groupby many columns General Usage juliadb	0	361	April 9, 2021

Groupby, aggregate with unstack on multiple columns

Related topics