laut
July 21, 2022, 8:22pm
1
@bkamins
How do I get the inverted row index in DataFrames?
# make data frame
df = DataFrame(x=1:10, y=11:20)
# subset df (train set)
df2 = df[[1,3,9],:]
# invert subset (test set)
df3 = df[Not(df2,:]),:] # error
df3 = df[Not(rownumber(df2)),:] # error
It is simpler than you think
julia> using DataFrames
julia> df = DataFrame(x=1:10, y=11:20)
10×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
   7 │     7     17
   8 │     8     18
   9 │     9     19
  10 │    10     20
julia> df2 = df[[1,3,9],:]
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     3     13
   3 │     9     19
julia> df3 = df[Not([1,3,9]),:]
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20
laut
July 22, 2022, 9:07am
3
In my case I want to make df3 from df2, because I'm not saving the [1,3,9] used for df2 as an object (I'm using the sample function to randomly select rows), and it is actually a much longer vector.
If there isn't a way to do this, I can save the vector as an object and reference it in both data frames.
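If the index vector really is not kept around, one possible workaround is to carry each row's original position inside the data frame itself. This is only a minimal sketch: the column name row_id is a hypothetical helper introduced for illustration, not something from the posts in this thread.

```julia
using DataFrames
using StatsBase  # provides sample

df = DataFrame(x=1:10, y=11:20)

# Hypothetical helper column: remember each row's original position.
df.row_id = collect(1:nrow(df))

# Sample the training rows without storing the index vector separately.
df2 = df[sample(1:nrow(df), 3; replace=false), :]

# Recover the complement from the row ids carried inside df2.
df3 = df[Not(df2.row_id), :]
```

The extra column can be dropped afterwards with select!(df2, Not(:row_id)) if it should not reach the model.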
nilshg
July 22, 2022, 9:16am
4
Yes, saving the vector is the simplest way of doing it. I tend to write something like:
using StatsBase
train_rows = sample(1:nrow(df), round(Int, nrow(df)*0.8); replace = false)
test_rows = Not(train_rows)
and then you can do df[train_rows, :] to get the training data and df[test_rows, :] to get the testing data.
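The same pattern can be wrapped in a small function so the index vector is created once and used for both halves. This is a sketch under assumptions: train_test_split and its rng keyword are hypothetical names chosen here, not part of DataFrames.jl or StatsBase.

```julia
using DataFrames
using Random
using StatsBase

# Hypothetical helper: split df into train/test parts by sampling row indices.
function train_test_split(df::AbstractDataFrame, frac::Real; rng=Random.default_rng())
    n_train = round(Int, nrow(df) * frac)
    # Sample without replacement so no row lands in the training set twice.
    train_rows = sample(rng, 1:nrow(df), n_train; replace=false)
    # Not(train_rows) selects every row that was not sampled.
    return df[train_rows, :], df[Not(train_rows), :]
end

df = DataFrame(x=1:10, y=11:20)
train, test = train_test_split(df, 0.8; rng=MersenneTwister(42))
```

Passing a seeded generator makes the split reproducible. Note that the training rows come back in sampled order, while the test rows keep their original order.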
But how do you want to relate df2 to df? You can, for example, use antijoin if there is a key that uniquely identifies rows:
julia> antijoin(df, df2, on=names(df))
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   4 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20
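The requirement that the key uniquely identifies rows matters: antijoin drops every row of df whose key matches a row in df2, so exact duplicates of a sampled row disappear from the result as well. The sketch below uses made-up data with one repeated row to show this, and then a hypothetical row_id column (an assumption for illustration) as a unique key.

```julia
using DataFrames

# Illustrative data: rows 1 and 3 are identical.
df = DataFrame(x=[1, 2, 1, 4], y=[11, 12, 11, 14])
df2 = df[[1], :]  # subset holding only the first row

# Matching on all columns also removes the duplicate in row 3.
df3_all = antijoin(df, df2, on=names(df))

# With a unique (hypothetical) row_id key, only the sampled row is removed.
df.row_id = collect(1:nrow(df))
df2 = df[[1], :]
df3_id = antijoin(df, df2, on=:row_id)
```

Here df3_all keeps two rows, while df3_id keeps three.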
laut
July 22, 2022, 9:21am
6
With bigger data sets, I'm guessing @nilshg's suggestion to just create the vector object will likely run faster than the anti join. Thanks for the suggestions.
Yes, it will be faster. My suggestion was for the case where you did not or could not save this vector for some reason.