How to sample a clustered group from a DataFrame?

Hello Julia users!

I have been trying to implement a bootstrap for panel data stored in a DataFrame. I think I got it working, but in a very inefficient way.
For example, once I draw a vector of keys, say keys=[1,2,1,3,4], I can do df[keys, :] to build a DataFrame in which the first row appears twice. Is there a way to do the same with a grouped DataFrame? Below is my attempt, which does not do that:

using DataFrames, CSV, Pipe, Parameters
using LinearAlgebra, Statistics, Distributions, Random
using StatsBase: sample
dfex  = DataFrame(pid=repeat([1:4;], inner = [4]), 
                  a = randn(16), 
                  b = rand(16))
groupedDF = groupby(dfex, :pid)
length(dfex.pid)
unique_pid = unique(dfex.pid)
n_pid = length(unique_pid)
# For each pid, pick all the rows with that pid.
Bstrap = unique_pid[sample(axes(unique_pid, 1), n_pid; replace = true, ordered = false), 1]
length(unique(Bstrap))

function GenBootstrapDF(groupedDF,Bstrap)
  BootstrapDf = DataFrame()
  for i in Bstrap
      Bootstrap_sample = groupedDF[(pid=i,)]
      BootstrapDf = vcat(BootstrapDf, Bootstrap_sample)
  end
  return BootstrapDf
end 
GenBootstrapDF(groupedDF,Bstrap)

This works the way I wanted, i.e. it samples the pid’s with replacement and then collects the SubDataFrame of each sampled pid from dfex. However, I would like to improve the performance. Can you give me any ideas, please?

Thank you for your input!

The BootstrapDf = vcat(BootstrapDf, Bootstrap_sample) pattern is generally better written as append!(BootstrapDf, Bootstrap_sample). The vcat version copies the whole accumulated data frame on every iteration, whereas append! grows the existing columns in place (internally Julia anticipates the next call by allocating more space than needed).
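As a minimal sketch of that change, assuming the same dfex and groupedDF as in the question (the helper name gen_bootstrap_df and the seed are only illustrative, not part of the original code):

```julia
using DataFrames, Random
using StatsBase: sample

Random.seed!(1)  # illustrative seed, only so the run is reproducible
dfex = DataFrame(pid = repeat(1:4, inner = 4), a = randn(16), b = rand(16))
groupedDF = groupby(dfex, :pid)
unique_pid = unique(dfex.pid)
Bstrap = sample(unique_pid, length(unique_pid); replace = true)

# Hypothetical helper: the same loop as in the question, but the result
# is grown in place with append! instead of being rebuilt with vcat.
function gen_bootstrap_df(groupedDF, Bstrap)
    out = DataFrame()
    for i in Bstrap
        append!(out, groupedDF[(pid = i,)])
    end
    return out
end

boot = gen_bootstrap_df(groupedDF, Bstrap)
```

Each sampled pid contributes its full block of rows, so boot should have as many rows as the sampled groups contain in total.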

Generally it’s also faster to use reduce(vcat, list_of_dfs), as it allows allocating the final data frame upfront. That requires storing a temporary list_of_dfs, but here its elements are SubDataFrame views, so they are cheap.
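A sketch of the reduce variant, again assuming the setup from the question (the name pieces and the seed are illustrative):

```julia
using DataFrames, Random
using StatsBase: sample

Random.seed!(1)  # illustrative seed, only so the run is reproducible
dfex = DataFrame(pid = repeat(1:4, inner = 4), a = randn(16), b = rand(16))
groupedDF = groupby(dfex, :pid)
Bstrap = sample(unique(dfex.pid), 4; replace = true)

# Each element is a SubDataFrame view into dfex, so building the list
# does not copy any of the underlying data.
pieces = [groupedDF[(pid = i,)] for i in Bstrap]

# One vcat over the whole list, instead of one vcat per iteration.
boot = reduce(vcat, pieces)
```

The only copy of the data happens in the final reduce call, which materializes the views into a regular DataFrame.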

Finally, it’s faster/simpler to draw a sample of indices in groupedDF than a sample of pid. In the end something like this should be enough:

reduce(vcat, [groupedDF[i] for i in sample(1:length(groupedDF), n_pid; replace = true, ordered = false)])

This is what I wanted. I was experimenting with comprehensions like [something[i] for i in something], and reduce was the piece I was missing. Thank you so much! For those who read this later, the following runs as is:

using DataFrames, CSV, Pipe, Parameters
using LinearAlgebra, Statistics, Distributions, Random
using StatsBase: sample

dfex  = DataFrame(pid=repeat([1:4;], inner = [4]), 
                  a = randn(16), 
                  b = rand(16))
groupedDF = groupby(dfex, :pid)
n_pid = 10000 # number of groups to draw with replacement; set it to whatever you need
A = reduce(vcat, [groupedDF[i] for i in sample(1:length(groupedDF), n_pid; replace = true, ordered = false)])