DataFrames invert index

@bkamins
How do I get the inverted row index in DataFrames?

# make data frame
df = DataFrame(x=1:10, y=11:20)

# subset df (train set)
df2 = df[[1,3,9],:]

# invert subset (test set)
df3 = df[Not(df2,:]),:] # error
df3 = df[Not(rownumber(df2)),:] # error

It is simpler than you think:

julia> using DataFrames

julia> df = DataFrame(x=1:10, y=11:20)
10×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
   7 │     7     17
   8 │     8     18
   9 │     9     19
  10 │    10     20

julia> df2 = df[[1,3,9],:]
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     3     13
   3 │     9     19

julia> df3 = df[Not([1,3,9]),:]
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20

In my case I want to make df3 from df2, because I'm not saving the [1,3,9] used for df2 as an object (I'm using the 'sample' function to randomly select rows), and it is actually a much longer vector.

If there isn't a way to do this, I can save the vector as an object and reference it in both data frames.

Yes, that's the simplest way of doing it. For example, I tend to write something like:

using StatsBase
train_rows = sample(1:nrow(df), round(Int, nrow(df)*0.8); replace = false)
test_rows = Not(train_rows)

and then you can do df[train_rows, :] to get the training data and df[test_rows, :] to get the testing data.
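For completeness, here is a minimal end-to-end sketch of that split on the toy df from the question. The seeded MersenneTwister and the names n_train, train and test are my own additions for illustration, not something from the posts above; the seed value is arbitrary.

```julia
using DataFrames, StatsBase, Random

df = DataFrame(x=1:10, y=11:20)

# Assumption: a fixed seed, only so the split is reproducible between runs.
rng = MersenneTwister(42)

# Draw 80% of the row numbers without replacement and keep them in a variable.
n_train = round(Int, nrow(df) * 0.8)
train_rows = sample(rng, 1:nrow(df), n_train; replace=false)

# The saved vector selects the training rows; Not(...) selects all the others.
train = df[train_rows, :]
test = df[Not(train_rows), :]

# Sanity checks: the two parts cover df exactly once.
@assert nrow(train) + nrow(test) == nrow(df)
@assert isempty(intersect(train.x, test.x))
```

Because train_rows is kept around, the test set never has to be reconstructed from the training data frame itself.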

But how do you want to relate df2 to df? If there is a key that uniquely identifies rows, you can use antijoin, for example:

julia> antijoin(df, df2, on=names(df))
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20
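One caveat worth spelling out: joining on all columns only works if whole rows are unique. A rough sketch of what can go wrong, and one possible workaround, assuming you are free to add a helper column before subsetting. The id column and the small data frame with a duplicated row are hypothetical, made up for this illustration.

```julia
using DataFrames

# Toy data in which rows 1 and 2 are identical.
df = DataFrame(x=[1, 1, 2, 3], y=[10, 10, 20, 30])

# Hypothetical helper column: a row identifier added before subsetting.
df.id = 1:nrow(df)

# The subset carries `id` along, so it can be related back to df later.
df2 = df[[1, 3], :]

# Anti join on the identifier: keeps exactly the rows that are not in df2.
df3 = antijoin(df, df2, on=:id)
@assert sort(df3.id) == [2, 4]

# Anti join on the data columns only: the duplicate of row 1 is dropped too,
# because every row of df whose key matches a row of df2 is removed.
lossy = antijoin(select(df, Not(:id)), select(df2, Not(:id)), on=[:x, :y])
@assert nrow(lossy) == 1
```

With the identifier column, the anti join recovers the complement even when the data columns contain repeated rows.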

With bigger data sets, I'm guessing @nilshg's suggestion to just create the vector object will likely run faster than the anti join. Thanks for the suggestions.

Yes, it will be faster. My suggestion was for the case where you did not or could not save this vector for some reason.