DataFrames invert index

laut · July 21, 2022, 8:22pm

@bkamins
How do I get the inverted row index in DataFrames?

# make data frame
df = DataFrame(x=1:10, y=11:20)

# subset df (train set)
df2 = df[[1,3,9],:]

# invert subset (test set)
df3 = df[Not(df2,:]),:] # error
df3 = df[Not(rownumber(df2)),:] # error

bkamins · July 21, 2022, 10:00pm

It is simpler than you think:

julia> using DataFrames

julia> df = DataFrame(x=1:10, y=11:20)
10×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
   7 │     7     17
   8 │     8     18
   9 │     9     19
  10 │    10     20

julia> df2 = df[[1,3,9],:]
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     3     13
   3 │     9     19

julia> df3 = df[Not([1,3,9]),:]
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20

laut · July 22, 2022, 9:07am

In my case, I want to build df3 from df2, because I’m not saving the [1,3,9] used for df2 as an object (I’m using the ‘sample’ function to select rows at random), and the real vector is much longer.

If there isn’t a way to do this, I can save the vector as an object and reference it in both data frames.

nilshg · July 22, 2022, 9:16am

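Yes, that’s the simplest way of doing it. I tend to write something like:

using DataFrames, StatsBase
# draw 80% of the row numbers, without replacement, for the training set
train_rows = sample(1:nrow(df), round(Int, nrow(df)*0.8); replace = false)
# Not(...) selects every row that is not in train_rows
test_rows = Not(train_rows)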

and then you can do df[train_rows, :] to get the training data and df[test_rows, :] to get the testing data.
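As a minimal end-to-end sketch of the split described above, the snippet below uses the 10-row df from the question. The fixed seed and the `setdiff` line are additions for illustration only (they are not part of the suggestion above): `setdiff` shows one way to get the test rows as explicit integers if a plain index vector is preferred over `Not`.

```julia
using DataFrames, StatsBase, Random

Random.seed!(1234)  # assumption: seed fixed only so the sketch is reproducible

df = DataFrame(x=1:10, y=11:20)

# 80% of the row numbers, sampled without replacement
train_rows = sample(1:nrow(df), round(Int, nrow(df) * 0.8); replace=false)
test_rows = Not(train_rows)

train = df[train_rows, :]
test = df[test_rows, :]

# Illustrative alternative: the same test rows as an explicit integer vector
explicit_test_rows = setdiff(1:nrow(df), train_rows)
```

Both `df[test_rows, :]` and `df[explicit_test_rows, :]` should select the same rows here, since `Not` and `setdiff` each keep the remaining row numbers in ascending order.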

bkamins · July 22, 2022, 9:17am

But how do you want to relate df2 to df? You can, e.g., use antijoin if there is a key that uniquely identifies rows:

julia> antijoin(df, df2, on=names(df))
7×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     4     14
   3 │     5     15
   4 │     6     16
   5 │     7     17
   6 │     8     18
   7 │    10     20
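If the data can contain duplicate rows, joining on all columns would also drop the duplicates of any sampled row. One possible workaround, sketched below, is to add a row-number column before subsetting and use it as the key. The column name `id` and the small example table are hypothetical and chosen only for illustration.

```julia
using DataFrames

# Rows 1 and 2 are identical, so on=names(df) could not tell them apart
df = DataFrame(x=[1, 1, 2, 3], y=[10, 10, 20, 30])

# Hypothetical helper column: a unique key added before taking the subset
df.id = collect(1:nrow(df))

df2 = df[[1, 3], :]              # the subset carries `id` along
df3 = antijoin(df, df2, on=:id)  # rows of df whose `id` does not appear in df2
```

Here df3 keeps the rows with `id` 2 and 4, including the duplicate of row 1 that a join on `x` and `y` would have removed.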

laut · July 22, 2022, 9:21am

With bigger data sets, I’m guessing @nilshg’s suggestion to just create the vector object will run faster than the antijoin. Thanks for the suggestions.

bkamins · July 22, 2022, 9:37am

Yes, it will be faster. My suggestion was for the case where you did not or could not save this vector for some reason.
