Indices of a sub-dataframe

johnbb · March 27, 2023, 9:21am

Assume I have two dataframes A and B where A is a subset of B. How can I get the row indices of A within B?

bkamins · March 27, 2023, 9:47am

Do you mean that B is a DataFrame and A is a SubDataFrame whose parent is B, and you want to know which rows of B constitute A? If so, this is done using the parentindices function.
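
A minimal sketch of that case, assuming A was created as a view of B (the data and column names here are made up for illustration):

```julia
using DataFrames

# Hypothetical data, only for illustration
B = DataFrame(id = 1:6, x = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2])
A = view(B, [2, 5, 6], :)      # SubDataFrame whose parent is B

# For a SubDataFrame, parentindices returns a (rows, columns) tuple
rows, cols = parentindices(A)
rows                           # rows of B that make up A
```

This only applies when A really is a view; a data frame that was copied out of B carries no link back to its source rows.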

johnbb · March 27, 2023, 10:29am

My actual problem is that I want the row indices in df of the values returned by the following:

combine(groupby(df, :group), :myvar => u -> sort(u, rev=true)[1:3])

Is there an easy way?

rocco_sprmnt21 · March 27, 2023, 11:34am

try

indexin(eachrow(A),eachrow(B))

johnbb · March 27, 2023, 11:45am

Thanks, that seems to work. Only had to rename the column name created by combine in the example above.

rocco_sprmnt21 · March 27, 2023, 12:32pm

I had only read the first message and replied to that.
I don’t know what to say about this clarification, as it is not clear to me what you are trying to do and what you would like to achieve.
If you post a minimal example complete with data to think about, you will surely receive more precise and complete answers.

johnbb · March 27, 2023, 12:44pm

Ok. Below is an example including a solution where the aim is to find the indices of the three largest values in each group

using DataFrames

B = DataFrame(group = rand(1:4,40), values = rand(40))
# three largest values in each group (assumes every group has at least 3 rows)
A = combine(groupby(B, :group), :values => u -> sort(u, rev=true)[1:3])
rename!(A, :values_function => :values)
indexin(eachrow(A), eachrow(B))

bkamins · March 27, 2023, 12:45pm

There are several ways to do it, but the simplest is to add a df.id = axes(df, 1) column and then refer to it in the operations you perform.
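
One possible way to use such a column for the "three largest per group" question, as a sketch only (the data is made up, and passing :id alongside :values is just one of several options):

```julia
using DataFrames

# Hypothetical data, only for illustration
B = DataFrame(group  = [1, 1, 1, 1, 2, 2, 2, 2],
              values = [0.4, 0.9, 0.1, 0.7, 0.3, 0.8, 0.6, 0.2])
B.id = axes(B, 1)    # row numbers stored as an ordinary column

# Pass both columns to the function and return the ids of the 3 largest values
top3 = combine(groupby(B, :group),
    [:values, :id] => ((v, id) -> id[sortperm(v, rev = true)[1:3]]) => :id)
```

Because the row number travels with the data as a normal column, this does not depend on how the groups are represented internally.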

rocco_sprmnt21 · March 27, 2023, 1:02pm

perhaps not the simplest, …

grps=groupby(B,:group)

[partialsort(tuple.(parentindices(g)[1], g.values),1:3,by=last) for g in grps]

or

bestningroup(n)=g->partialsort(tuple.(parentindices(g)[1], g.values),1:n,by=last)

julia> combine(bestningroup(3), grps)
12×2 DataFrame
 Row │ group  x1
     │ Int64  Tuple…
─────┼────────────────────────
   1 │     1  (12, 0.100642)
   2 │     1  (19, 0.125845)
   3 │     1  (1, 0.253362)
   4 │     2  (22, 0.0399776)
   5 │     2  (39, 0.102041)
   6 │     2  (21, 0.194081)
   7 │     3  (40, 0.0169355)
   8 │     3  (32, 0.157626)
   9 │     3  (31, 0.187634)
  10 │     4  (5, 0.111986)
  11 │     4  (26, 0.123399)
  12 │     4  (27, 0.258254)

and for the largest n

julia> bestningroup(n, col)=g->partialsort(tuple.(parentindices(g)[1], g[:,col]),1:n,by=last,lt=!isless)
bestningroup (generic function with 2 methods)

julia> combine(bestningroup(3,:values), grps)
12×2 DataFrame
 Row │ group  x1
     │ Int64  Tuple…
─────┼───────────────────────
   1 │     1  (3, 0.79187)
   2 │     1  (9, 0.724257)
   3 │     1  (33, 0.692813)
   4 │     2  (36, 0.990465)
   5 │     2  (30, 0.818662)
   6 │     2  (6, 0.804834)
   7 │     3  (13, 0.760777)
   8 │     3  (10, 0.461795)
   9 │     3  (38, 0.431897)
  10 │     4  (15, 0.9416)
  11 │     4  (24, 0.731547)
  12 │     4  (20, 0.634004)

tbeason · March 27, 2023, 1:44pm

@bkamins Sort of related to your response: have you ever considered adding a rownum shortcut function sort of like nrow to do this sort of thing in the “minilanguage”?

Dan · March 27, 2023, 2:22pm

Inspired by @rocco_sprmnt21 answers:

combine(groupby(B, :group), 
  :values => u -> parentindices(u)[1][sortperm(u, rev=true)[1:3]])

P.S. partialsortperm is faster than sortperm; when performance is critical and the groups are large, it should be preferred.
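
A sketch of the same idea with partialsortperm, on made-up data (the output column name :row is an arbitrary choice):

```julia
using DataFrames

# Hypothetical data, only for illustration
B = DataFrame(group  = [1, 1, 1, 1, 2, 2, 2, 2],
              values = [0.4, 0.9, 0.1, 0.7, 0.3, 0.8, 0.6, 0.2])

# partialsortperm only orders the first 3 positions instead of the whole group
res = combine(groupby(B, :group),
    :values => (u -> parentindices(u)[1][partialsortperm(u, 1:3, rev = true)]) => :row)
```

Like the sortperm version, this assumes every group has at least three rows.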

rocco_sprmnt21 · March 27, 2023, 3:47pm

I wonder how this expression works.
I understood that in the col => fun syntax the function fun is passed only the vector of values of the column col. How can it get back to the parent indices in this case, having only the values?

Does this mean that in some cases, in addition to the values, some other information is propagated towards fun?
I ask because in some cases it would have been convenient to have the name of the column in addition to the values.

pdeffebach · March 27, 2023, 3:54pm

I don’t think this helps answer the question. The answer is no: the function acts on views of the column vectors, and views have parentindices defined. It has nothing to do with data frames.

julia> x = [1, 2, 3];

julia> y = view(x, 1:2);

julia> parentindices(y)
(1:2,)

rocco_sprmnt21 · March 27, 2023, 3:56pm

That certainly clarifies for me that parentindices is not a function specific to subgroups, as I had thought.
Thank you.

PS
I still have to figure out what exactly is being passed to fun in this case

pdeffebach · March 27, 2023, 4:05pm

It’s a view of the underlying column in the data frame; a SubDataFrame is basically a DataFrame of views.
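
A quick way to check this for yourself, as a sketch on made-up data (the column name :is_view is arbitrary):

```julia
using DataFrames

# Hypothetical data, only for illustration
B = DataFrame(group = [1, 2, 1, 2], values = [10, 20, 30, 40])

# Ask, inside the function, whether the argument is a SubArray (a view)
res = combine(groupby(B, :group),
    :values => (u -> u isa SubArray) => :is_view)
```

Each group contributes one row here, because the function returns a single Bool per group.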

rocco_sprmnt21 · March 27, 2023, 4:10pm

and the view of a vector is not a simple collection of values but a complex structure with lots of information, right?
Could it be something like a pointer to the start of the parent array and the view offsets!?!

pdeffebach · March 27, 2023, 4:17pm

Here is the documentation for views.

rocco_sprmnt21 · March 27, 2023, 4:22pm

thanks (I need to read more).
But the fact that pressing TAB twice after the . sometimes provides some answers feeds my laziness, so I read the manuals too little.

julia> x = [3,4,5, 2, 3];

julia> y = view(x, 1:2)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 3
 4

julia> y. #pressing TAB twice
indices  offset1  parent   stride1
julia> y.parent
5-element Vector{Int64}:
 3
 4
 5
 2
 3

julia> y.indices
(1:2,)

I tried it to see what happens

julia> y.indices = (1:3,)
ERROR: setfield!: immutable struct of type SubArray cannot be changed
Stacktrace:
 [1] setproperty!(x::SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, f::Symbol, v::Tuple{UnitRange{Int64}})

bkamins · March 27, 2023, 9:20pm

I understand you want the row number in the parent data frame? If you would find it useful we could consider adding it. It was not asked for before.

@pdeffebach - a small note.
Note that parentindices will not always return exactly what I assume was originally expected:

julia> using DataFrames

julia> df = DataFrame(id=[1,2,3,3,4,4])
6×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     3
   5 │     4
   6 │     4

julia> gdf = groupby(view(df, 3:6, :), :id)
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = 3
 Row │ id
     │ Int64
─────┼───────
   1 │     3
   2 │     3
⋮
Last Group (2 rows): id = 4
 Row │ id
     │ Int64
─────┼───────
   1 │     4
   2 │     4

julia> combine(gdf, :id => v -> parentindices(v)[1], sdf -> parentindices(sdf)[1])
4×3 DataFrame
 Row │ id     id_function  x1
     │ Int64  Int64        Int64
─────┼───────────────────────────
   1 │     3            3      3
   2 │     3            4      4
   3 │     4            5      5
   4 │     4            6      6

tbeason · March 28, 2023, 12:56pm

Yea I use it all the time! I typically pass a random column to x -> 1:length(x). Your axes(x,1) solution could be a bit better, but I think this is probably something that enough people use that it would make sense to have a performant shortcut.
