Nested groupby

Trying to understand the role of the kwarg isgathered, I worked through this exercise.

julia> using InMemoryDatasets, Dates

julia> ds = Dataset(id = [1,1,1,1,1,2,2,2,3,3,3],
                    date = Date.(["2019-03-05", "2019-03-12", "2019-04-10",
                            "2019-04-29", "2019-05-10", "2019-03-20",
                            "2019-04-22", "2019-05-04", "2019-11-01",
                            "2019-11-10", "2019-12-12"]),
                    outcome = [false, false, false, true, false, false,
                               true, false, true, true, true])
11×3 Dataset
...

julia> gb = gatherby(ds, [1, 3], isgathered = true)
11×3 View of GatherBy Dataset, Gathered by: id ,outcome
 id        date        outcome  
 identity  identity    identity
 Int64?    Date?       Bool?
────────────────────────────────
        1  2019-03-05     false
        1  2019-03-12     false
        1  2019-04-10     false
        1  2019-04-29      true
        1  2019-05-10     false
        2  2019-03-20     false
        2  2019-04-22      true
        2  2019-05-04     false
        3  2019-11-01      true
        3  2019-11-10      true
        3  2019-12-12      true

julia> combine(gb, (:) => last, dropgroupcols = true)
7×3 Dataset
...

I was wondering how many groups are generated by

julia> gb = gatherby(ds, [1, 3], isgathered = true)

because the output header does not report the information:

11×3 View of GatherBy Dataset, Gathered by: id, outcome

I saw that the info can be obtained with last(gb.groups), but I don’t think that is a recommended procedure.
Later, having read that groupby and gatherby accept the output of other grouping functions as input, I tried (in every variation that came to mind), unsuccessfully, to do something like this:

combine(gb, (:) => x -> groupby(x, 1))

So I tried to build the subgroups by hand and, after a few attempts, I found this “solution”:


cgb1 = combine(gb, (:) => x -> [x], (:) => byrow(x -> Dataset(; zip([:a, :b, :c], x)...)), dropgroupcols = true)

At this point the question arises why the following “reduced” expression does not work:


cgb1 = combine(gb, (:) => byrow(x -> Dataset(; zip([:a, :b, :c], [x])...)), dropgroupcols = true)

I found this way to get more directly to the form I was looking for.

cgb1 = combine(gb, (1,2,3) => (x...)-> Dataset(; zip([:a,:b,:c], x) ...))

I don’t know if there are others even more direct.

So, at first thought, a combine method that passes the entire subgroup to a function for subsequent transformations would seem desirable to me, e.g.:
combine(gbds, x -> nrow(x))
or, in general:
combine(gbds, sg -> fun(sg))
But I don’t know whether it would have any unwanted side effects.

My take was that groupby can group the output of another groupby, as in groupby(groupby(ds, :x1), :x2).

My understanding is that 1:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:11 are the start and end rows of the groups, which means a total of 7 groups.
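
The count of 7 can be checked with a minimal plain-Julia sketch. It assumes, as the ranges above suggest, that with isgathered = true a new group starts whenever the (id, outcome) pair changes from one row to the next. It uses only Base and does not call gatherby; rowkeys, starts, stops and ranges are names made up for this illustration.

```julia
# Same id and outcome columns as in the example dataset.
id = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
outcome = [false, false, false, true, false, false,
           true, false, true, true, true]

# One (id, outcome) key per row.
rowkeys = collect(zip(id, outcome))

# Assumption: a run starts at row 1 and wherever the key differs from the previous row.
starts = [i for i in eachindex(rowkeys) if i == 1 || rowkeys[i] != rowkeys[i - 1]]
# Each run ends just before the next one starts; the last one ends at the final row.
stops = [starts[2:end] .- 1; length(rowkeys)]
ranges = [a:b for (a, b) in zip(starts, stops)]

println(ranges)          # row range of every run
println(length(ranges))  # number of runs, i.e. groups under the assumption above
```

For this data the sketch reproduces the seven ranges listed above. Whether gatherby computes its groups this way internally is not something this sketch shows.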

From the package documentation I have seen that in some cases this information is exposed.
I was wondering why it is not exposed in other cases, and whether the outputs should be standardized in this respect.
I believe the info is more important for gatherby than for groupby:

julia> groupby!(ds, 1)
6×2 Grouped Dataset with 2 groups
Grouped by: g
 Row │ g         x        
     │ identity  identity
     │ Int64?    Float64?
─────┼────────────────────
   1 │        1      12.0
   2 │        1      11.0
   3 │        1      15.0
   4 │        2      12.3
   5 │        2      13.0
   6 │        2      13.2

Yes, I understand it the same way. I just wanted to understand the possibilities of the package and how it works,
partly by reading the documentation (not much though :smiley:) and partly by running tests.
I still don’t quite understand what is passed to the functions in the part
cols => fun => newcols

in different situations.

I just look at col => fun as a pipe in Linux: “pipe col to fun”.
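
The pipe analogy has a literal counterpart in plain Julia: the |> operator feeds a value into a function. A small sketch in Base Julia only; it is not InMemoryDatasets code, and col is a made-up vector standing in for a column.

```julia
col = [1.5, 2.5, 3.0]           # hypothetical column values

# "Pipe col to fun": the value on the left becomes the argument of the function on the right.
total = col |> sum              # same as sum(col)
doubled = col |> x -> 2 .* x    # same as (x -> 2 .* x)(col)

println(total)
println(doubled)
```

This only illustrates the reading direction of col => fun. What exactly the package hands to fun (a whole column, several columns, or the values of one row) depends on the form used, which is the open question in this thread.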

In many cases I have been able to figure out how the parameters are passed.

Working cases:
ds=Dataset(a=rand(6),a_lim=0.75,b=rand(6),b_lim=0.33,c=rand(6),c_lim=0.25)

modify(compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless), 1:3=>byrow(count))
modify(compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless), 1:3=>byrow(sum))



modify(modify(compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless), 
          1:3=>x->Int.(x)), 1:3=>byrow(x->sum(x)))


cds=compare(ds[!, r"lim"], ds[!, Not(r"lim")], on = 1:3 .=> 1:3, eq = !isless)
mcds=modify(cds,1:3=>x->Int.(x))
modify(mcds, 1:3=>byrow(sum))
modify(mcds, 1:3=>byrow(x->sum(x)))

mmcds=modify(mcds, 1:3=>byrow(x->x .*2) =>:v1,1:3=>byrow(x->x .* 3)=>:v2)[:,[1,2,4,5]]
names(mmcds)
rename!(mmcds,[:a,:b,:v1,:v2])
modify(mmcds,[:v1,:v2]=>byrow(x->sum(x[1]))=>:sum_v1)
modify(mmcds,:v1=>byrow(x->sum(x))=>:sum_v1)
modify(mmcds,:v1=>byrow(x->mean(x))=>:mean_v1)

modify(mmcds,[:v1,:v2]=>byrow(x->mean(x))=>:mean_v1_v2)
modify(mmcds,[:a,:b]=>byrow(x->mean(x))=>:mean_a_b)


modify(mmcds,[:v1,:v2]=>byrow(x->sum(x)))
modify(mmcds,[:v1,:v2]=>byrow(sum))
modify(mmcds,[:v2]=>byrow(sum))
modify(mmcds,:v2=>byrow(sum))
modify(mmcds,:v2=>byrow(x->sum(x)))



modify(mmcds,[:a,:b]=>byrow(x->sum(x)))
modify(mmcds,[:a,:b]=>byrow(sum))
modify(mmcds,[:b]=>byrow(sum))

modify(mmcds,[:v1,:v2]=>byrow((x,y)->y .+ x[2]))

modify(mmcds,[:v1,:v2]=>byrow((x,y)->x .+y))
modify(mmcds,[:a,:b]=>byrow((x,y)->x .+y))

modify(mmcds,(:v1,:v2)=>(x...)->(x[2]))
modify(mmcds,(:a,:b)=>(x...)->(x[2]))

modify(mmcds,(:v1,:v2)=>(x,y)-> string(x...,y...)[end-30:end])
modify(mmcds,(:a,:b)=>(x,y)-> string(x...," | ",y...))

modify(mmcds,(:v1,:v2)=>(x,y)-> (x[3],y[3]))
modify(mmcds,(:a,:b)=>(x,y)-> (x[3],y[3]))

modify(mmcds,(:v1,:v2)=>(x,y)-> x[1:2])
modify(mmcds,(:a,:b)=>(x,y)-> x[1:2])

modify(mmcds,(:v1,:v2)=>(x,y)->sum.(zip(x,y)))

modify(mmcds,(:a,:b)=>(x,y)->sum.(zip(x,y)))

byrow(mmcds,sum, [:a,:b])




But in this case I don’t understand why the form with byrow() doesn’t work:

julia> mmcds=mmcds[:,[3,4]]
6×2 Dataset
 Row │ v1         v2        
     │ identity   identity
     │ Array…?    Array…?
─────┼──────────────────────
   1 │ [0, 2, 0]  [0, 3, 0]
   2 │ [2, 0, 0]  [3, 0, 0]
   3 │ [2, 0, 2]  [3, 0, 3]
   4 │ [2, 2, 0]  [3, 3, 0]
   5 │ [2, 0, 0]  [3, 0, 0]
   6 │ [2, 0, 0]  [3, 0, 0]

julia> modify(mmcds,(:v1,:v2)=>(x,y)->cor.(x,y))
6×3 Dataset
 Row │ v1         v2         function_v1_v2 
     │ identity   identity   identity
     │ Array…?    Array…?    Float64?
─────┼──────────────────────────────────────
   1 │ [0, 2, 0]  [0, 3, 0]             1.0
   2 │ [2, 0, 0]  [3, 0, 0]             1.0
   3 │ [2, 0, 2]  [3, 0, 3]             1.0
   4 │ [2, 2, 0]  [3, 3, 0]             1.0
   5 │ [2, 0, 0]  [3, 0, 0]             1.0
   6 │ [2, 0, 0]  [3, 0, 0]             1.0
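
The output above is consistent with the function receiving the two whole columns, each a vector of vectors, with the dot in cor.(x, y) broadcasting cor over the row pairs. A sketch of that reading in plain Julia, using only the Statistics standard library; v1 and v2 are typed in by hand from the first three rows, and this is an illustration, not a description of the package internals.

```julia
using Statistics

# First three rows of the v1 and v2 columns shown above.
v1 = [[0, 2, 0], [2, 0, 0], [2, 0, 2]]
v2 = [[0, 3, 0], [3, 0, 0], [3, 0, 3]]

# Broadcasting applies cor to each pair (v1[i], v2[i]) and returns one number per row.
rowwise = cor.(v1, v2)

println(rowwise)
println(length(rowwise))   # one correlation per row
```

Each v2 row is 1.5 times the matching v1 row, so every correlation comes out as 1 up to floating-point rounding.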

julia> modify(mmcds,(:v1,:v2)=>byrow(cor))
ERROR: MethodError: no method matching normalize_modify!(::InMemoryDatasets.

Use a vector of column names instead of a tuple:

modify(mmcds,[:v1,:v2]=>byrow(cor))