Assume I have two data frames A and B, where A is a subset of B. How can I get the row indices of A in B?
Do you mean that B is a DataFrame and A is a SubDataFrame whose parent is B, and you want to know which rows of B constitute A? If yes, then this is done using the parentindices function.
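A minimal sketch of that case, assuming A was created with view. The data and the name rows_in_B are made up for illustration:

```julia
using DataFrames

# Hypothetical data for illustration
B = DataFrame(group = [1, 2, 1, 2], values = [0.4, 0.1, 0.9, 0.3])

# A is a SubDataFrame: a view of rows 2 and 4 of B
A = view(B, [2, 4], :)

# parentindices returns a (rows, columns) tuple; the first element
# holds the rows of B that make up A
rows_in_B = parentindices(A)[1]
```

This only applies when A really is a view of B. A data frame produced by combine is a new DataFrame and has no parent.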
My actual problem is that I want the df indices of the following:
combine(groupby(df, :group), :myvar => u -> sort(u, rev=true)[1:3])
Is there an easy way?
try
indexin(eachrow(A),eachrow(B))
Thanks, that seems to work. I only had to rename the column created by combine in the example above.
I had only read the first message and replied to that.
I don't know what to say about this clarification, as it is not clear to me what you are trying to do and what you would like to achieve.
If you post a minimal example complete with data to think about, you will surely receive more precise and complete answers.
Ok. Below is an example including a solution, where the aim is to find the indices of the three largest values in each group:
using DataFrames
B = DataFrame(group = rand(1:4,40), values = rand(40))
A = combine(groupby(B, :group), :values => u -> sort(u, rev=true)[1:3])
rename!(A, :values_function => :values)
indexin(eachrow(A), eachrow(B))
There are several ways to do it, but the simplest is to add a df.id = axes(df, 1) column and then refer to it in the operations you perform.
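A minimal sketch of the id-column approach, under the assumption that the goal is the row numbers of the three largest values per group. The data and the name top3 are made up for illustration, and collect is used here only to store the ids as a plain vector:

```julia
using DataFrames

# Hypothetical data: rows 1, 3, 5, 7 are group 1; rows 2, 4, 6, 8 are group 2
B = DataFrame(group  = [1, 2, 1, 2, 1, 2, 1, 2],
              values = [0.5, 0.1, 0.9, 0.8, 0.3, 0.7, 0.2, 0.4])

# Store each row's position in B as an ordinary column
B.id = collect(axes(B, 1))

# Per group, pick the ids of the three largest values
top3 = combine(groupby(B, :group),
               [:id, :values] => ((id, v) -> id[partialsortperm(v, 1:3, rev=true)]) => :id)
```

Because the id travels with the data as a regular column, this does not depend on whether the function receives a view or a copy.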
perhaps not the simplest, …
grps=groupby(B,:group)
[partialsort(tuple.(parentindices(g)[1], g.values),1:3,by=last) for g in grps]
or
bestningroup(n)=g->partialsort(tuple.(parentindices(g)[1], g.values),1:n,by=last)
`combine(bestningroup(3), grps)`
julia> combine(bestningroup(3), grps)
12×2 DataFrame
 Row │ group  x1
     │ Int64  Tuple…
─────┼───────────────────────
   1 │     1  (12, 0.100642)
   2 │     1  (19, 0.125845)
   3 │     1  (1, 0.253362)
   4 │     2  (22, 0.0399776)
   5 │     2  (39, 0.102041)
   6 │     2  (21, 0.194081)
   7 │     3  (40, 0.0169355)
   8 │     3  (32, 0.157626)
   9 │     3  (31, 0.187634)
  10 │     4  (5, 0.111986)
  11 │     4  (26, 0.123399)
  12 │     4  (27, 0.258254)
and for the largest n
julia> bestningroup(n, col)=g->partialsort(tuple.(parentindices(g)[1], g[:,col]),1:n,by=last,rev=true)
bestningroup (generic function with 2 methods)
julia> combine(bestningroup(3,:values), grps)
12×2 DataFrame
 Row │ group  x1
     │ Int64  Tuple…
─────┼──────────────────────
   1 │     1  (3, 0.79187)
   2 │     1  (9, 0.724257)
   3 │     1  (33, 0.692813)
   4 │     2  (36, 0.990465)
   5 │     2  (30, 0.818662)
   6 │     2  (6, 0.804834)
   7 │     3  (13, 0.760777)
   8 │     3  (10, 0.461795)
   9 │     3  (38, 0.431897)
  10 │     4  (15, 0.9416)
  11 │     4  (24, 0.731547)
  12 │     4  (20, 0.634004)
@bkamins Sort of related to your response: have you ever considered adding a rownum shortcut function, sort of like nrow, to do this sort of thing in the "minilanguage"?
Inspired by @rocco_sprmnt21's answers:
combine(groupby(B, :group),
:values => u -> parentindices(u)[1][sortperm(u, rev=true)[1:3]])
P.S. partialsortperm is faster than sortperm; when time is critical and the groups are large, it should be preferred.
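A small sketch of the two calls side by side, using only Base functions. The vector and the names full and partial are made up for illustration:

```julia
# Hypothetical values with no ties
v = [0.5, 0.9, 0.3, 0.7, 0.2]

# Full sort permutation, then keep the first three positions
full = sortperm(v, rev=true)[1:3]

# Only determines the first three positions of the permutation
partial = partialsortperm(v, 1:3, rev=true)
```

Both give the positions of the three largest values. With tied values the two results may not be identical, so this sketch avoids ties.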
I wonder how this expression works.
I understood that in the col => fun syntax, the function fun is passed only the vector of values of the column col. How is it possible in this case to get back to the parent indices from the values alone?
Does this mean that in some cases, in addition to the values, some other information is propagated towards fun?
I ask because in some cases it would have been convenient to have the name of the column in addition to the values.
I don't think this helps answer the question. The answer is no: the function acts on a view of the vector, and views have parentindices defined. It has nothing to do with data frames.
julia> x = [1, 2, 3];
julia> y = view(x, 1:2);
julia> parentindices(y)
(1:2,)
That certainly clarifies for me that parentindices is not a function specific to subgroups, as I had thought.
Thank you.
PS
I still have to figure out what exactly is being passed to fun in this case
It's a view of the underlying column in the data frame; a SubDataFrame is basically a DataFrame of views.
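A short sketch of what that means in practice, assuming a SubDataFrame created with view. The data and the names sdf and col are made up for illustration:

```julia
using DataFrames

# Hypothetical data for illustration
B = DataFrame(group = [1, 1, 2, 2], values = [0.4, 0.1, 0.9, 0.3])

# A SubDataFrame covering rows 2 and 3 of B
sdf = view(B, 2:3, :)

# Accessing a column of the SubDataFrame gives a view into B's column
col = sdf.values

col isa SubArray        # the column is a view, not a copy
parentindices(col)[1]   # the rows of B.values that the view covers
parent(col) === B.values  # the view points at B's own column vector
```

This is why a function passed to combine can call parentindices on its argument: it receives such a view rather than a bare vector of values.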
and the view of a vector is not a simple collection of values but a complex structure with lots of information, right?
Could it be something like a pointer to the start of the parent array and the view offsets!?!
Here is the documentation for views.
thanks (I need to read more).
But the fact that pressing TAB twice after the . sometimes provides answers feeds my laziness, so I read the manuals too little.
julia> x = [3,4,5, 2, 3];
julia> y = view(x, 1:2)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
3
4
julia> y. #pressing TAB twice
indices offset1 parent stride1
julia> y.parent
5-element Vector{Int64}:
3
4
5
2
3
julia> y.indices
(1:2,)
I tried it to see what happens
julia> y.indices = (1:3,)
ERROR: setfield!: immutable struct of type SubArray cannot be changed
Stacktrace:
[1] setproperty!(x::SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, f::Symbol, v::Tuple{UnitRange{Int64}})
I understand you want the row number in the parent data frame? If you would find it useful we could consider adding it. It was not asked for before.
@pdeffebach - a small note.
Note that parentindices will not always return exactly what I assume was originally expected (below, the indices refer to rows of df, not of the view that was grouped):
julia> using DataFrames
julia> df = DataFrame(id=[1,2,3,3,4,4])
6×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     3
   5 │     4
   6 │     4
julia> gdf = groupby(view(df, 3:6, :), :id)
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = 3
 Row │ id
     │ Int64
─────┼───────
   1 │     3
   2 │     3
⋮
Last Group (2 rows): id = 4
 Row │ id
     │ Int64
─────┼───────
   1 │     4
   2 │     4
julia> combine(gdf, :id => v -> parentindices(v)[1], sdf -> parentindices(sdf)[1])
4×3 DataFrame
 Row │ id     id_function  x1
     │ Int64  Int64        Int64
─────┼───────────────────────────
   1 │     3            3      3
   2 │     3            4      4
   3 │     4            5      5
   4 │     4            6      6
Yea I use it all the time! I typically pass a random column to x -> 1:length(x). Your axes(x,1) solution could be a bit better, but I think this is probably something that enough people use that it would make sense to have a performant shortcut.