Indices of a sub-dataframe

Assume I have two dataframes A and B where A is a subset of B. How can I get the row indices of A?

Do you mean that B is a DataFrame and A is a SubDataFrame whose parent is B, and you want to know which rows of B constitute A? If so, this is done using the parentindices function.
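A minimal sketch of that case, assuming A is created as a view of B (the column names and the selected rows here are made up for illustration):

```julia
using DataFrames

B = DataFrame(id = 1:5, x = [10, 20, 30, 40, 50])
A = view(B, [2, 4], :)       # a SubDataFrame whose parent is B

# parentindices returns a (rows, columns) tuple; the first element
# holds the row numbers of A within B.
rows = parentindices(A)[1]
```

Note that this only applies when A really is a view; a DataFrame that was copied out of B does not remember where its rows came from.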

My actual problem is that I want the df indices of the following

combine(groupby(df, :group), :myvar => u -> sort(u, rev=true)[1:3])

Is there an easy way?



Thanks, that seems to work. Only had to rename the column name created by combine in the example above.

I had only read the first message and replied to that.
I don't know what to say about this clarification, as it is not clear to me what you are trying to do and what you would like to achieve.
If you post a minimal example complete with data to think about, you will surely receive more precise and complete answers.

Ok. Below is an example, including a solution, where the aim is to find the indices of the three largest values in each group:

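using DataFrames

B = DataFrame(group = rand(1:4,40), values = rand(40))
# three largest values per group (assumes every group has at least 3 rows)
A = combine(groupby(B, :group), :values => u -> sort(u, rev=true)[1:3])
rename!(A, :values_function => :values)
# match each row of A to the first identical row of B
indexin(eachrow(A), eachrow(B))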

There are several ways to do it, but the simplest is to add a column holding axes(df, 1) and then refer to it in the operations you perform.
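A rough sketch of the row-number-column idea, using small fixed data so the result is easy to check. The column name `row` is just a placeholder, and `collect` is used so that a plain vector is stored:

```julia
using DataFrames

B = DataFrame(group = repeat(1:2, 4),
              values = [0.1, 0.8, 0.5, 0.2, 0.9, 0.4, 0.3, 0.6])

# Store each row's position explicitly (`row` is a hypothetical column name).
B.row = collect(axes(B, 1))

# Per group, keep the stored row numbers of the (up to) three largest values.
top3 = combine(groupby(B, :group),
               [:row, :values] =>
                   ((r, v) -> r[partialsortperm(v, 1:min(3, length(v)), rev=true)]) =>
                   :row)
```

`B[top3.row, :]` then gives the corresponding rows of B. Because the row numbers travel with the data, this keeps working after filtering or reordering.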


perhaps not the simplest, …


grps = groupby(B, :group)  # assumed definition; grps is not defined in the original post
[partialsort(tuple.(parentindices(g)[1], g.values),1:3,by=last) for g in grps]


bestningroup(n)=g->partialsort(tuple.(parentindices(g)[1], g.values),1:n,by=last)
julia> combine(bestningroup(3), grps)
12×2 DataFrame
 Row │ group  x1
     │ Int64  Tuple…
─────┼────────────────────────
   1 │     1  (12, 0.100642)
   2 │     1  (19, 0.125845)
   3 │     1  (1, 0.253362)
   4 │     2  (22, 0.0399776)
   5 │     2  (39, 0.102041)
   6 │     2  (21, 0.194081)
   7 │     3  (40, 0.0169355)
   8 │     3  (32, 0.157626)
   9 │     3  (31, 0.187634)
  10 │     4  (5, 0.111986)
  11 │     4  (26, 0.123399)
  12 │     4  (27, 0.258254)

and for the largest n

julia> bestningroup(n, col)=g->partialsort(tuple.(parentindices(g)[1], g[:,col]),1:n,by=last,lt=!isless)
bestningroup (generic function with 2 methods)

julia> combine(bestningroup(3,:values), grps)
12×2 DataFrame
 Row │ group  x1
     │ Int64  Tuple…
─────┼──────────────────────
   1 │     1  (3, 0.79187)
   2 │     1  (9, 0.724257)
   3 │     1  (33, 0.692813)
   4 │     2  (36, 0.990465)
   5 │     2  (30, 0.818662)
   6 │     2  (6, 0.804834)
   7 │     3  (13, 0.760777)
   8 │     3  (10, 0.461795)
   9 │     3  (38, 0.431897)
  10 │     4  (15, 0.9416)
  11 │     4  (24, 0.731547)
  12 │     4  (20, 0.634004)

@bkamins Sort of related to your response: have you ever considered adding a rownum shortcut function, sort of like nrow, to do this sort of thing in the “minilanguage”?

Inspired by @rocco_sprmnt21's answers:

combine(groupby(B, :group), 
  :values => u -> parentindices(u)[1][sortperm(u, rev=true)[1:3]])

P.S. partialsortperm is faster than sortperm here, so it should be preferred when the code is time critical and the groups are large.
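For reference, a small sketch showing that the two calls select the same indices (the vector is made-up data with distinct values; no timing claim is made here):

```julia
v = [0.3, 0.9, 0.1, 0.7, 0.5]

# Full sort of all indices, then keep the first three.
full = sortperm(v, rev=true)[1:3]

# Only the three leading positions are put in order.
partial = partialsortperm(v, 1:3, rev=true)
```

Both give the indices of the three largest entries, so partialsortperm(u, 1:3, rev=true) can be swapped in for sortperm(u, rev=true)[1:3] in the combine call above.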


I wonder how this expression works.
I understood that in the col => fun syntax the function fun is passed only the vector of values of the column col. How, in this case, can it get back to the parentindices when it has only the values?

Does this mean that in some cases, in addition to the values, some other information is propagated towards fun?
I ask because in some cases it would have been convenient to have the name of the column in addition to the values.

I don't think this helps answer the question. The answer is no: the function acts on a view of the vectors, and views have parentindices defined. It has nothing to do with data frames.

julia> x = [1, 2, 3];

julia> y = view(x, 1:2);

julia> parentindices(y)
(1:2,)

That certainly clarifies for me that parentindices is not a function specific to subgroups, as I had thought.
Thank you.

I still have to figure out what exactly is being passed to fun in this case

It's a view of the underlying column in the data frame; a SubDataFrame is basically a DataFrame of views.
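A small sketch of what that means in practice, using a plain view of a data frame rather than a group (the data is illustrative):

```julia
using DataFrames

df = DataFrame(id = [1, 2, 3, 3, 4, 4])
sdf = view(df, 3:6, :)        # SubDataFrame over rows 3 to 6

col = sdf.id                  # a column taken from the SubDataFrame

is_view     = col isa SubArray           # the column is a view, not a copy
same_parent = parent(col) === df.id      # it wraps the original column vector
col_rows    = parentindices(col)[1]      # rows it covers in the original column
```

This is why a function used in col => fun can call parentindices on its argument: the argument carries its parent and indices with it.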


and the view of a vector is not a simple collection of values but a complex structure with lots of information, right?
Could it be something like a pointer to the start of the parent array and the view offsets!?!

Here is the documentation for views.
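As a quick illustration of the idea (a sketch in plain Julia, no DataFrames needed): a view keeps a reference to its parent array together with the indices it covers, and writes go through to the parent.

```julia
x = [3, 4, 5, 2, 3]
y = view(x, 2:4)

p   = parent(y)          # the very same array object as x
idx = parentindices(y)   # tuple of the indices y covers in x

y[1] = 40                # writes through to x[2]
```

So it is less a pointer plus offsets in the C sense and more a small wrapper holding the parent and the index tuple.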

Thanks (I need to read more).
But the fact that pressing TAB twice after the . sometimes provides answers feeds my laziness, so I read the manuals too little.

julia> x = [3,4,5, 2, 3];

julia> y = view(x, 1:2)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 3
 4

julia> y. #pressing TAB twice
indices  offset1  parent   stride1

julia> y.parent
5-element Vector{Int64}:
 3
 4
 5
 2
 3

julia> y.indices
(1:2,)

I tried it to see what happens :smile:

julia> y.indices = (1:3,)
ERROR: setfield!: immutable struct of type SubArray cannot be changed
 [1] setproperty!(x::SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, f::Symbol, v::Tuple{UnitRange{Int64}})

I understand you want the row number in the parent data frame? If you would find it useful we could consider adding it. It was not asked for before.

@pdeffebach - a small note.
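Note that parentindices will not always return exactly what I assume was originally expected. When the grouped object is itself a view, the indices refer to rows of the original parent data frame, not to rows of the view: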

julia> using DataFrames

julia> df = DataFrame(id=[1,2,3,3,4,4])
6×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     3
   5 │     4
   6 │     4

julia> gdf = groupby(view(df, 3:6, :), :id)
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = 3
 Row │ id
     │ Int64
─────┼───────
   1 │     3
   2 │     3
Last Group (2 rows): id = 4
 Row │ id
     │ Int64
─────┼───────
   1 │     4
   2 │     4

julia> combine(gdf, :id => v -> parentindices(v)[1], sdf -> parentindices(sdf)[1])
4×3 DataFrame
 Row │ id     id_function  x1
     │ Int64  Int64        Int64
─────┼───────────────────────────
   1 │     3            3      3
   2 │     3            4      4
   3 │     4            5      5
   4 │     4            6      6

Yea I use it all the time! I typically pass a random column to x -> 1:length(x). Your axes(x,1) solution could be a bit better, but I think this is probably something that enough people use that it would make sense to have a performant shortcut.