Row number in DataFrames groupby query

bdeonovic · March 16, 2021, 5:40pm

I’m trying to understand how to do data manipulations in DataFrames (+ Query or + DataFramesMeta or +LINQ), in particular today I am looking to get rownumber in a groupby query

For MWE consider

df = DataFrame(a=repeat([1,2,3], inner=[2]), b=randn(6))

Lets say I want to group by a, get mean of b within a, and return something like:

df |>
 @groupby(_.a) |> 
 @map({rows=_, m=mean(_.b), t=1:length(_)}) |> 
 @mapmany(_.t, {r=__, getindex(_.rows,__).a, getindex(_.rows,__).b, _.m}) |>
 DataFrame

6×4 DataFrame
 Row │ r      a      b          measure   
     │ Int64  Int64  Float64    Float64   
─────┼────────────────────────────────────
   1 │     1      1   0.137441  -0.347454
   2 │     2      1  -0.832349  -0.347454
   3 │     1      2   0.775431   0.68382
   4 │     2      2   0.592209   0.68382
   5 │     1      3  -1.17639   -0.324448
   6 │     2      3   0.527495  -0.324448

Is there a better way to do this? in particular, my real world example I have too many variables for this getindex approach to work.

In R I would do something like this

> data.frame(a=rep(1:3,each=2), b=rnorm(2*3)) %>% group_by(a) %>% mutate(m=mean(b), r=row_number())
# A tibble: 6 x 4
# Groups:   a [3]
      a      b      m     r
  <int>  <dbl>  <dbl> <int>
1     1  0.474  0.391     1
2     1  0.307  0.391     2
3     2  0.500 -0.370     1
4     2 -1.24  -0.370     2
5     3 -1.33  -0.862     1
6     3 -0.398 -0.862     2

The use of _ and __ is almost ineligible to me.

tbeason · March 16, 2021, 5:52pm

Kinda hacky but this works:

transform(groupby(df,:a),"b"=>mean,"b"=> (x->1:length(x)) =>"r")

I just make that column be a 1:n range where n is the number of elements in the group.

bdeonovic · March 16, 2021, 5:58pm

How do you groupby multiple columns in this fashion?

tbeason · March 16, 2021, 5:59pm

groupby(df,[:a1,:a2])

jules · March 16, 2021, 8:51pm

Maybe you’d be interested in other piping macro packages then, which try to cut down on repetitive syntax. For example Chain.jl

using Chain

@chain df begin
    groupby(:a)
    transform(:b => mean, :b => (x -> 1:length(x)) => :r)
end

You don’t have to write _ if your dataframe is the first argument, a bit like in R

xiaodai · August 13, 2021, 1:30pm

this is an intro to my recommended packages https://www.youtube.com/watch?v=rDvpLFxcL84

For this in R

data.frame(a=rep(1:3,each=2), b=rnorm(2*3)) %>% 
   group_by(a) %>% 
   mutate(m=mean(b), r=row_number())

I would do like this where the @c macro means treat column :b as a vector and not apply the function by row which is normally what you want with @transform

using Chain, DataFrames, DataFrameMacros
using StatsBase: mean


df = DataFrame(DataFrame(a=repeat([1,2,3], inner=[2]), b=randn(6))
@chain df begin
    @groupby(:a)
    @transform(:m = @c(mean(:b)), :r = @c(1:length(:b)))
end

so it’s just as simple as R once you know what packages to use.

Topic		Replies	Views
Number groups in dataframe Data question , dataframes	7	491	November 11, 2022
How to obtain the number of rows in each group using DataFramesMeta? Data first-steps	2	1341	March 17, 2019
Find unique row in DataFrame General Usage	5	1649	May 17, 2018
Group, Mutate, Ungroup Data question , dataframes	2	1763	February 4, 2018
DataFrames.jl: keep group by result in every row General Usage dataframes	4	1681	February 9, 2020

Row number in DataFrames groupby query

Related topics