How to sample rows from a dataframe?


#1

Hello.

I’ve created a DataFrame by reading a very long CSV file.

Now I would like to create another smaller dataframe with the same columns but with a few rows sampled randomly (without replacement) from the original.

Say we have this toy example and we want to create a new one with 3 rows. I’ve tried several alternatives.

using DataFrames
myDF = DataFrame(A = 1:10, B = 21:30)
sample(myDF, 3)                              # doesn't work
rand(myDF, 3)                                # doesn't work
randsubseq(myDF, 3)                          # doesn't work
myDF[sample(1:10, 3), :]                     # doesn't work
myDF[rand(1:size(myDF, 1), 3), :]            # works, but rand samples with replacement
newDF = myDF[rand(1:size(myDF, 1), 3), :]
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 22    │
│ 2   │ 4     │ 24    │
│ 3   │ 10    │ 30    │

Is this the best way to do it?
How would you do it?


#2

You can use StatsBase.sample to sample without replacement, and (optionally) keep the order:

using StatsBase, DataFrames
myDF = DataFrame(A = 1:10, B = 21:30)
myDF[sample(axes(myDF, 1), 3; replace = false, ordered = true), :]
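To see why replace = false matters, here is a small sketch that compares it with plain rand by drawing as many indices as the table has rows. The names idxNoRepl and idxRand and the seed value are only illustrative, not part of either package.

```julia
using StatsBase, DataFrames, Random

Random.seed!(1)                      # arbitrary seed, only for repeatable runs
myDF = DataFrame(A = 1:10, B = 21:30)
n = nrow(myDF)

# Without replacement: every row index appears at most once,
# so drawing n indices yields a permutation of 1:n
idxNoRepl = sample(axes(myDF, 1), n; replace = false)

# rand draws each index independently, so repeats are likely
idxRand = rand(axes(myDF, 1), n)

println(allunique(idxNoRepl))        # true by construction
println(allunique(idxRand))          # usually false for n = 10
```

Indexing with idxNoRepl gives a shuffled copy of the table, while indexing with idxRand will typically contain duplicated rows.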

#3

If your original DataFrame is large (millions of rows and hundreds of columns), you can use view(myDF, rows, :), where rows is the result from sample as @Tamas_Papp explained. This avoids copying the selected rows.
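A minimal sketch of the view approach on the toy data, assuming StatsBase and DataFrames are loaded. The names rows, subDF and newDF are only illustrative.

```julia
using StatsBase, DataFrames

myDF = DataFrame(A = 1:10, B = 21:30)

# Draw 3 distinct row indices, kept in ascending order
rows = sample(axes(myDF, 1), 3; replace = false, ordered = true)

# view returns a SubDataFrame that refers to myDF instead of copying rows
subDF = view(myDF, rows, :)

# Materialize an independent DataFrame only if one is needed later
newDF = copy(subDF)
```

Because subDF shares storage with myDF, changes made through the view are visible in the original table. Call copy when the sample has to be independent.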