# Nested groupby

Trying to understand the role of the `isgathered` keyword argument, I worked through this exercise.

``````
julia> ds = Dataset(id = [1,1,1,1,1,2,2,2,3,3,3],
                    date = Date.(["2019-03-05", "2019-03-12", "2019-04-10",
                                  "2019-04-29", "2019-05-10", "2019-03-20",
                                  "2019-04-22", "2019-05-04", "2019-11-01",
                                  "2019-11-10", "2019-12-12"]),
                    outcome = [false, false, false, true, false, false,
                               true, false, true, true, true])
11×3 Dataset
...

julia> gb=gatherby(ds, [1, 3], isgathered = true)
11×3 View of GatherBy Dataset, Gathered by: id ,outcome
id        date        outcome
identity  identity    identity
Int64?    Date?       Bool?
────────────────────────────────
1  2019-03-05     false
1  2019-03-12     false
1  2019-04-10     false
1  2019-04-29      true
1  2019-05-10     false
2  2019-03-20     false
2  2019-04-22      true
2  2019-05-04     false
3  2019-11-01      true
3  2019-11-10      true
3  2019-12-12      true

julia> combine(gb, (:) => last,   dropgroupcols = true)
7×3 Dataset
...

``````

I was wondering how many groups are generated by

`julia> gb = gatherby(ds, [1, 3], isgathered = true)`

because the output header does not report that information:

`11×3 View of GatherBy Dataset, Gathered by: id ,outcome`

I saw that the number can be obtained with `last(gb.groups)`, but I don’t think that is a recommended procedure.
Later, having read that `groupby` and `gatherby` accept the output of other grouping functions as input, I tried (in every variation that came to mind), unsuccessfully, to do something like this:

``````
combine(gb, (:) => x -> groupby(x, 1))
``````

So I tried to build the subgroups by hand and, after a few attempts, I found this “solution”:

``````
cgb1 = combine(gb, (:) => x -> [x], (:) => byrow(x -> Dataset(; zip([:a, :b, :c], x)...)), dropgroupcols = true)
``````

At this point the question arises of why the following “reduced” expression does not work:

``````
cgb1 = combine(gb, (:) => byrow(x -> Dataset(; zip([:a, :b, :c], [x])...)), dropgroupcols = true)
``````

I found this way to get more directly to the form I was looking for.

`cgb1 = combine(gb, (1,2,3) => (x...)-> Dataset(; zip([:a,:b,:c], x) ...))`

I don’t know if there are others even more direct.

So, at first thought, a `combine` method that passes the entire subgroup to a function for subsequent transformations would seem desirable to me.
For example:
`combine(gbds, x -> nrow(x))`
and in general:
`combine(gbds, sg -> fun(sg))`
But I don’t know if it has any unwanted side effects.
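
To make the idea concrete, here is a minimal sketch in plain Julia of what such a method might do. `map_runs` is a hypothetical helper, not part of InMemoryDatasets, and it assumes that a group is a run of consecutive rows sharing the same `(id, outcome)` key, which is how the 7 rows of the `combine` output above appear to arise.

```julia
# Hypothetical helper (not an InMemoryDatasets function): call `fun` once per
# run of consecutive rows whose keys are equal, passing the whole run.
function map_runs(fun, rowkeys, rows)
    results = []
    start = 1
    n = length(rowkeys)
    for i in 2:(n + 1)
        # Close the current run at the end of the data or when the key changes.
        if i > n || rowkeys[i] != rowkeys[start]
            push!(results, fun(rows[start:(i - 1)]))
            start = i
        end
    end
    return results
end

# Keys (id, outcome) of the 11 rows of `ds`, in their original order.
id      = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
outcome = [false, false, false, true, false, false, true, false, true, true, true]
rowkeys = collect(zip(id, outcome))

# Analogue of `combine(gbds, x -> nrow(x))`: the size of each subgroup.
group_sizes = map_runs(length, rowkeys, 1:length(rowkeys))
```

Here `fun` only receives row indices; a real method would hand it the rows of the subgroup itself, and whether that is feasible inside the package is a separate question.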

My take was this: `groupby` can group the output of another `groupby`, as in `groupby(groupby(ds, :x1), :x2)`.

My understanding is that 1:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:11 are the first and last rows of the groups, which means a total of 7 groups.
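
Those boundaries can be reproduced in plain Julia without the package. The sketch below assumes that, with `isgathered = true`, a new group starts whenever the `(id, outcome)` pair changes from one row to the next; that is an assumption about the behaviour, consistent with the 7 rows returned by `combine` earlier in the thread. The variable names are made up for the illustration.

```julia
# Keys (id, outcome) of the 11 rows of `ds`, in their original order.
id      = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
outcome = [false, false, false, true, false, false, true, false, true, true, true]
rowkeys = collect(zip(id, outcome))

# A group starts at row 1 and at every row whose key differs from the row before it.
starts = [i for i in eachindex(rowkeys) if i == 1 || rowkeys[i] != rowkeys[i - 1]]

# Each group ends just before the next one starts; the last ends at the final row.
stops = [starts[2:end] .- 1; length(rowkeys)]

group_ranges = [a:b for (a, b) in zip(starts, stops)]
ngroups = length(group_ranges)
```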

From the package documentation I have seen that in some cases the number of groups is shown in the output header.
I was wondering why it is not shown in other cases, and whether the outputs should be standardized in this respect.
I believe this information is more important for `gatherby` than for `groupby`:

``````
julia> groupby!(ds, 1)
6×2 Grouped Dataset with 2 groups
Grouped by: g
 Row │ g         x
     │ identity  identity
     │ Int64?    Float64?
─────┼────────────────────
   1 │        1      12.0
   2 │        1      11.0
   3 │        1      15.0
   4 │        2      12.3
   5 │        2      13.0
   6 │        2      13.2
``````

Yes, I understand this too. I just wanted to understand the possibilities of the package and how it works,
partly by reading the documentation (not much, though) and partly by running tests.
I still don’t quite understand what is passed to the functions in the `cols => fun => newcols` part in different situations.

I just look at `col => fun` as a pipe in Linux: “pipe `col` to `fun`”.

In many cases I have been able to figure out how the parameters are passed.

Working cases:
``````
ds=Dataset(a=rand(6),a_lim=0.75,b=rand(6),b_lim=0.33,c=rand(6),c_lim=0.25)

modify(compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless), 1:3=>byrow(count))
modify(compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless), 1:3=>byrow(sum))

modify(modify(compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless),
1:3=>x->Int.(x)), 1:3=>byrow(x->sum(x)))

cds=compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless)
mcds=modify(cds,1:3=>x->Int.(x))
modify(mcds, 1:3=>byrow(sum))
modify(mcds, 1:3=>byrow(x->sum(x)))

mmcds=modify(mcds, 1:3=>byrow(x->x .*2) =>:v1,1:3=>byrow(x->x .* 3)=>:v2)[:,[1,2,4,5]]
names(mmcds)
rename!(mmcds,[:a,:b,:v1,:v2])
modify(mmcds,[:v1,:v2]=>byrow(x->sum(x))=>:sum_v1)
modify(mmcds,:v1=>byrow(x->sum(x))=>:sum_v1)
modify(mmcds,:v1=>byrow(x->mean(x))=>:mean_v1)

modify(mmcds,[:v1,:v2]=>byrow(x->mean(x))=>:mean_v1_v2)
modify(mmcds,[:a,:b]=>byrow(x->mean(x))=>:mean_a_b)

modify(mmcds,[:v1,:v2]=>byrow(x->sum(x)))
modify(mmcds,[:v1,:v2]=>byrow(sum))
modify(mmcds,[:v2]=>byrow(sum))
modify(mmcds,:v2=>byrow(sum))
modify(mmcds,:v2=>byrow(x->sum(x)))

modify(mmcds,[:a,:b]=>byrow(x->sum(x)))
modify(mmcds,[:a,:b]=>byrow(sum))
modify(mmcds,[:b]=>byrow(sum))

modify(mmcds,[:v1,:v2]=>byrow((x,y)->y .+ x))

modify(mmcds,[:v1,:v2]=>byrow((x,y)->x .+y))
modify(mmcds,[:a,:b]=>byrow((x,y)->x .+y))

modify(mmcds,(:v1,:v2)=>(x...)->(x))
modify(mmcds,(:a,:b)=>(x...)->(x))

modify(mmcds,(:v1,:v2)=>(x,y)-> string(x...,y...)[end-30:end])
modify(mmcds,(:a,:b)=>(x,y)-> string(x...," | ",y...))

modify(mmcds,(:v1,:v2)=>(x,y)-> (x,y))
modify(mmcds,(:a,:b)=>(x,y)-> (x,y))

modify(mmcds,(:v1,:v2)=>(x,y)-> x[1:2])
modify(mmcds,(:a,:b)=>(x,y)-> x[1:2])

modify(mmcds,(:v1,:v2)=>(x,y)->sum.(zip(x,y)))

modify(mmcds,(:a,:b)=>(x,y)->sum.(zip(x,y)))

byrow(mmcds,sum, [:a,:b])

``````

But in this case I don’t understand why the form with `byrow()` doesn’t work:

``````
julia> mmcds=mmcds[:,[3,4]]
6×2 Dataset
 Row │ v1         v2
     │ identity   identity
     │ Array…?    Array…?
─────┼──────────────────────
   1 │ [0, 2, 0]  [0, 3, 0]
   2 │ [2, 0, 0]  [3, 0, 0]
   3 │ [2, 0, 2]  [3, 0, 3]
   4 │ [2, 2, 0]  [3, 3, 0]
   5 │ [2, 0, 0]  [3, 0, 0]
   6 │ [2, 0, 0]  [3, 0, 0]

julia> modify(mmcds,(:v1,:v2)=>(x,y)->cor.(x,y))
6×3 Dataset
 Row │ v1         v2         function_v1_v2
     │ identity   identity   identity
     │ Array…?    Array…?    Float64?
─────┼──────────────────────────────────────
   1 │ [0, 2, 0]  [0, 3, 0]             1.0
   2 │ [2, 0, 0]  [3, 0, 0]             1.0
   3 │ [2, 0, 2]  [3, 0, 3]             1.0
   4 │ [2, 2, 0]  [3, 3, 0]             1.0
   5 │ [2, 0, 0]  [3, 0, 0]             1.0
   6 │ [2, 0, 0]  [3, 0, 0]             1.0

julia> modify(mmcds,(:v1,:v2)=>byrow(cor))
ERROR: MethodError: no method matching normalize_modify!(::InMemoryDatasets.
``````

Use a vector of column names instead of a tuple:

``````
modify(mmcds,[:v1,:v2]=>byrow(cor))
``````
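
As a plain-Julia illustration of what the working tuple form `(:v1, :v2) => (x, y) -> cor.(x, y)` computes, here is a small sketch. It does not use InMemoryDatasets and says nothing about how `byrow` dispatches internally; the vectors are simply the first three rows of `mmcds` copied by hand.

```julia
using Statistics  # standard library; provides `cor`

# The two columns of `mmcds` (first three rows), each holding one vector per row.
v1 = [[0, 2, 0], [2, 0, 0], [2, 0, 2]]
v2 = [[0, 3, 0], [3, 0, 0], [3, 0, 3]]

# Column-wise form: the function gets both whole columns and broadcasts `cor`
# over them, pairing the i-th vector of `v1` with the i-th vector of `v2`.
colwise = ((x, y) -> cor.(x, y))(v1, v2)

# Row-wise form, written out by hand: one `cor` call per row.
rowwise = [cor(a, b) for (a, b) in zip(v1, v2)]
```

Both forms produce one correlation per row, so in this sketch the only difference is whether the function sees the whole columns at once or one row at a time.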