DataFrames invert index

@bkamins
How do I get the inverted row index in DataFrames?

# make data frame
df = DataFrame(x=1:10, y=11:20)

# subset df (train set)
df2 = df[[1,3,9],:]

# invert subset (test set)
df3 = df[Not(df2,:]),:] # error
df3 = df[Not(rownumber(df2)),:] # error

It is simpler than you think:

julia> using DataFrames

julia> df = DataFrame(x=1:10, y=11:20)
10×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
   7 │     7     17
   8 │     8     18
   9 │     9     19
  10 │    10     20

julia> df2 = df[[1,3,9],:]
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     3     13
   3 │     9     19

julia> df3 = df[Not([1,3,9]),:]
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20

In my case I want to make df3 from df2, because I'm not saving the [1,3,9] used for df2 as an object (I'm using the 'sample' function to randomly select rows), and it is actually a much longer vector.

If there isn't a way to do this, I can save the vector as an object and reference it in both data frames.

Yes, that's the simplest way of doing it. I tend to write something like:

using StatsBase
train_rows = sample(1:nrow(df), round(Int, nrow(df)*0.8); replace = false)
test_rows = Not(train_rows)

and then you can do df[train_rows, :] to get the training data and df[test_rows, :] to get the testing data.
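
As a minimal end-to-end sketch of the split described above, using the thread's example df: the seed and the `rng` variable are assumptions added here only to make the split reproducible, and the assertions simply check that the two parts cover every row exactly once.

```julia
using DataFrames, StatsBase, Random

df = DataFrame(x=1:10, y=11:20)

# Arbitrary seed (an assumption for this sketch), used only for reproducibility
rng = MersenneTwister(42)

# Draw 80% of the row numbers without replacement
train_rows = sample(rng, 1:nrow(df), round(Int, nrow(df) * 0.8); replace=false)

train = df[train_rows, :]
test = df[Not(train_rows), :]   # every row whose number is not in train_rows

# Together the two parts contain each row of df exactly once
@assert nrow(train) + nrow(test) == nrow(df)
@assert sort(vcat(train.x, test.x)) == df.x
```

Saving `train_rows` once and reusing it for both selections is what keeps the two data frames consistent with each other.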

But how do you want to relate df2 to df? You can, for example, use antijoin if there is a key that uniquely identifies rows:

julia> antijoin(df, df2, on=names(df))
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20
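
If no existing column identifies rows uniquely, one possible workaround is to add a row-number column before sampling and antijoin on it. This is only a sketch: the `id` column name and the small data frame with duplicate rows are made up for illustration. Without such a key, an antijoin on all columns would also drop unsampled rows that happen to equal a sampled one.

```julia
using DataFrames, StatsBase

# Illustrative data with duplicate rows (not from the thread)
df = DataFrame(x=[1, 1, 2, 2, 3], y=[10, 10, 20, 20, 30])

# Hypothetical helper column: a unique row number stored before sampling
df.id = collect(1:nrow(df))

# Sample the training rows without keeping the index vector around
df2 = df[sample(1:nrow(df), 3; replace=false), :]

# Rows of df whose id does not appear in df2
df3 = antijoin(df, df2, on=:id)

@assert nrow(df2) + nrow(df3) == nrow(df)
```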

With bigger data sets, I'm guessing @nilshg's suggestion to just create the vector object will run faster than the antijoin. Thanks for the suggestions.

Yes, it will be faster. My suggestion was for the case where you did not or could not save this vector for some reason.
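
One more option, not suggested in the thread itself: if df2 is created as a view rather than a copy, it remembers which rows of df it refers to, so the index vector does not have to be stored separately. A sketch under that assumption, using `view` and `parentindices` on the resulting SubDataFrame:

```julia
using DataFrames, StatsBase

df = DataFrame(x=1:10, y=11:20)

# A view keeps a reference to df together with the selected row numbers
df2 = view(df, sample(1:nrow(df), 3; replace=false), :)

# Row numbers of df2 within its parent df
train_rows = parentindices(df2)[1]

# Complement of the view: all rows of df not selected in df2
df3 = df[Not(train_rows), :]
```

Note that a view shares data with df, so mutating values through df2 changes df as well; copy it first if that is not wanted.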