Row number in DataFrames groupby query

I’m trying to understand how to do data manipulations in DataFrames (+ Query or + DataFramesMeta or +LINQ), in particular today I am looking to get rownumber in a groupby query

For MWE consider

df = DataFrame(a=repeat([1,2,3], inner=[2]), b=randn(6))

Lets say I want to group by a, get mean of b within a, and return something like:

df |>
 @groupby(_.a) |> 
 @map({rows=_, m=mean(_.b), t=1:length(_)}) |> 
 @mapmany(_.t, {r=__, getindex(_.rows,__).a, getindex(_.rows,__).b, _.m}) |>
 DataFrame
6Γ—4 DataFrame
 Row β”‚ r      a      b          measure   
     β”‚ Int64  Int64  Float64    Float64   
─────┼────────────────────────────────────
   1 β”‚     1      1   0.137441  -0.347454
   2 β”‚     2      1  -0.832349  -0.347454
   3 β”‚     1      2   0.775431   0.68382
   4 β”‚     2      2   0.592209   0.68382
   5 β”‚     1      3  -1.17639   -0.324448
   6 β”‚     2      3   0.527495  -0.324448

Is there a better way to do this? in particular, my real world example I have too many variables for this getindex approach to work.

In R I would do something like this

> data.frame(a=rep(1:3,each=2), b=rnorm(2*3)) %>% group_by(a) %>% mutate(m=mean(b), r=row_number())
# A tibble: 6 x 4
# Groups:   a [3]
      a      b      m     r
  <int>  <dbl>  <dbl> <int>
1     1  0.474  0.391     1
2     1  0.307  0.391     2
3     2  0.500 -0.370     1
4     2 -1.24  -0.370     2
5     3 -1.33  -0.862     1
6     3 -0.398 -0.862     2

The use of _ and __ is almost ineligible to me.

Kinda hacky but this works:

transform(groupby(df,:a),"b"=>mean,"b"=> (x->1:length(x)) =>"r")

I just make that column be a 1:n range where n is the number of elements in the group.

1 Like

How do you groupby multiple columns in this fashion?

groupby(df,[:a1,:a2])

1 Like

Maybe you’d be interested in other piping macro packages then, which try to cut down on repetitive syntax. For example Chain.jl

using Chain

@chain df begin
    groupby(:a)
    transform(:b => mean, :b => (x -> 1:length(x)) => :r)
end

You don’t have to write _ if your dataframe is the first argument, a bit like in R

4 Likes

this is an intro to my recommended packages https://www.youtube.com/watch?v=rDvpLFxcL84

For this in R

data.frame(a=rep(1:3,each=2), b=rnorm(2*3)) %>% 
   group_by(a) %>% 
   mutate(m=mean(b), r=row_number())

I would do like this where the @c macro means treat column :b as a vector and not apply the function by row which is normally what you want with @transform

using Chain, DataFrames, DataFrameMacros
using StatsBase: mean


df = DataFrame(DataFrame(a=repeat([1,2,3], inner=[2]), b=randn(6))
@chain df begin
    @groupby(:a)
    @transform(:m = @c(mean(:b)), :r = @c(1:length(:b)))
end

so it’s just as simple as R once you know what packages to use.

1 Like