laut
July 21, 2022, 8:22pm
1
@bkamins
How do I get the inverted row index in DataFrames?
# make data frame
df = DataFrame(x=1:10, y=11:20)
# subset df (train set)
df2 = df[[1,3,9],:]
# invert subset (test set)
df3 = df[Not(df2,:]),:] # error
df3 = df[Not(rownumber(df2)),:] # error
It is simpler than you think
julia> using DataFrames
julia> df = DataFrame(x=1:10, y=11:20)
10×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
   7 │     7     17
   8 │     8     18
   9 │     9     19
  10 │    10     20
julia> df2 = df[[1,3,9],:]
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     3     13
   3 │     9     19
julia> df3 = df[Not([1,3,9]),:]
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20
laut
July 22, 2022, 9:07am
3
In my case I want to make df3 from df2 because I'm not saving the [1,3,9] used in df2 as an object (I'm using the "sample" function to randomly select rows) and it is actually a much longer vector.
If there isn't a way to do this, I can save the vector as an object and reference it in both data frames.
nilshg
July 22, 2022, 9:16am
4
Yes, that's the simplest way of doing it. I tend to write something like:
using StatsBase
train_rows = sample(1:nrow(df), round(Int, nrow(df)*0.8); replace = false)
test_rows = Not(train_rows)
and then you can do df[train_rows, :] to get the training data and df[test_rows, :] to get the testing data.
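For completeness, here is a minimal end-to-end sketch of the same idea. The seeded `MersenneTwister` and the names `rng`, `n_train`, `train` and `test` are my own additions for illustration, not part of the snippet above; the seed value is arbitrary.

```julia
using DataFrames
using Random
using StatsBase

df = DataFrame(x=1:10, y=11:20)

# Assumption: a fixed seed, only so that the split is reproducible
rng = MersenneTwister(42)

# Sample 80% of the row indices without replacement and keep them in a variable
n_train = round(Int, 0.8 * nrow(df))
train_rows = sample(rng, 1:nrow(df), n_train; replace=false)

# The saved index vector is reused for both subsets
train = df[train_rows, :]
test = df[Not(train_rows), :]
```

Because `train_rows` is stored, the test set is just the inverted selection and no join is needed.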
But how do you want to relate df2 to df? If there is a key that uniquely identifies rows, you can use antijoin on it, e.g.:
julia> antijoin(df, df2, on=names(df))
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20
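One caveat with `on=names(df)`: if df contains duplicated rows, every copy of a row that appears in df2 is dropped, not only the sampled one. A possible workaround, sketched below, is to add an explicit row identifier before subsetting and antijoin on that. The column name `:id` and the small data frame with a duplicated row are made up for illustration.

```julia
using DataFrames

# Rows 1 and 2 hold identical values, so the values alone are not a unique key
df = DataFrame(x=[1, 1, 2, 3], y=[10, 10, 20, 30])

# Assumption: a hypothetical :id column added before taking the subset
df.id = collect(1:nrow(df))

# The subset carries the :id column with it
df2 = df[[1, 3], :]

# Rows of df whose :id does not appear in df2 (ids 2 and 4)
df3 = antijoin(df, df2, on=:id)

# For comparison: matching on the values also drops row 2, the duplicate of row 1
by_value = antijoin(select(df, :x, :y), select(df2, :x, :y), on=[:x, :y])
```

With the `:id` key both unsampled rows survive, while the value-based antijoin keeps only the last row.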
laut
July 22, 2022, 9:21am
6
With bigger data sets, I'm guessing @nilshg's suggestion to just create the vector object will likely run faster than the antijoin. Thanks for the suggestions.
Yes, it will be faster. My suggestion was for the case where you did not or could not save this vector for some reason.