# Row number in DataFrames groupby query

Iβm trying to understand how to do data manipulations in DataFrames (+ Query or + DataFramesMeta or +LINQ), in particular today I am looking to get rownumber in a groupby query

For MWE consider

``````df = DataFrame(a=repeat([1,2,3], inner=[2]), b=randn(6))
``````

Lets say I want to group by `a`, get mean of `b` within `a`, and return something like:

``````df |>
@groupby(_.a) |>
@map({rows=_, m=mean(_.b), t=1:length(_)}) |>
@mapmany(_.t, {r=__, getindex(_.rows,__).a, getindex(_.rows,__).b, _.m}) |>
DataFrame
``````
``````6Γ4 DataFrame
Row β r      a      b          measure
β Int64  Int64  Float64    Float64
ββββββΌββββββββββββββββββββββββββββββββββββ
1 β     1      1   0.137441  -0.347454
2 β     2      1  -0.832349  -0.347454
3 β     1      2   0.775431   0.68382
4 β     2      2   0.592209   0.68382
5 β     1      3  -1.17639   -0.324448
6 β     2      3   0.527495  -0.324448
``````

Is there a better way to do this? in particular, my real world example I have too many variables for this `getindex` approach to work.

In R I would do something like this

``````> data.frame(a=rep(1:3,each=2), b=rnorm(2*3)) %>% group_by(a) %>% mutate(m=mean(b), r=row_number())
# A tibble: 6 x 4
# Groups:   a [3]
a      b      m     r
<int>  <dbl>  <dbl> <int>
1     1  0.474  0.391     1
2     1  0.307  0.391     2
3     2  0.500 -0.370     1
4     2 -1.24  -0.370     2
5     3 -1.33  -0.862     1
6     3 -0.398 -0.862     2
``````

The use of `_` and `__` is almost ineligible to me.

Kinda hacky but this works:

``````transform(groupby(df,:a),"b"=>mean,"b"=> (x->1:length(x)) =>"r")
``````

I just make that column be a `1:n` range where `n` is the number of elements in the group.

1 Like

How do you groupby multiple columns in this fashion?

`groupby(df,[:a1,:a2])`

1 Like

Maybe youβd be interested in other piping macro packages then, which try to cut down on repetitive syntax. For example Chain.jl

``````using Chain

@chain df begin
groupby(:a)
transform(:b => mean, :b => (x -> 1:length(x)) => :r)
end
``````

You donβt have to write _ if your dataframe is the first argument, a bit like in R

4 Likes

this is an intro to my recommended packages Best Julia packages for manipulating tabular (dataframe) data - YouTube

For this in R

``````data.frame(a=rep(1:3,each=2), b=rnorm(2*3)) %>%
group_by(a) %>%
mutate(m=mean(b), r=row_number())
``````

I would do like this where the `@c` macro means treat column `:b` as a vector and not apply the function by row which is normally what you want with `@transform`

``````using Chain, DataFrames, DataFrameMacros
using StatsBase: mean

df = DataFrame(DataFrame(a=repeat([1,2,3], inner=[2]), b=randn(6))
@chain df begin
@groupby(:a)
@transform(:m = @c(mean(:b)), :r = @c(1:length(:b)))
end
``````

so itβs just as simple as R once you know what packages to use.

1 Like