Hello Julia users!
I have been trying to implement a bootstrap for panel data stored in a DataFrame. I think I got it working, but I am doing it in a very inefficient way.
For example, once I draw a vector of keys, say keys = [1, 2, 1, 3, 4], I can do df[keys, :] to make a DataFrame in which the first row is automatically duplicated. Is there any way to do this with a grouped DataFrame? Below is my attempt, without doing so:
using DataFrames, CSV, Pipe, Parameters
using LinearAlgebra, Statistics, Distributions, Random
using StatsBase: sample
dfex = DataFrame(pid = repeat([1:4;], inner = [4]),
                 a = randn(16),
                 b = rand(16))
groupedDF = groupby(dfex, :pid)
length(dfex.pid)
unique_pid = unique(dfex.pid)
n_pid = length(unique_pid)
# Sample pids with replacement; for each sampled pid, all the rows with that pid are picked.
Bstrap = unique_pid[sample(axes(unique_pid, 1), n_pid; replace = true, ordered = false), 1]
length(unique(Bstrap))
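function GenBootstrapDF(groupedDF, Bstrap)
    BootstrapDf = DataFrame()
    for i in Bstrap
        # All rows belonging to the sampled pid
        Bootstrap_sample = groupedDF[(pid=i,)]
        # Append them to the result (this copies the accumulated DataFrame on every iteration)
        BootstrapDf = vcat(BootstrapDf, Bootstrap_sample)
    end
    return BootstrapDf
end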
GenBootstrapDF(groupedDF,Bstrap)
This works the way I wanted, i.e. it samples the pids with replacement and then collects the SubDataFrame of each sampled pid from dfex. However, I would like to improve the performance. Can you give me any ideas, please?
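One direction I have been considering, as a rough sketch only: since indexing rows with repeated keys duplicates them, I could precompute which rows belong to each pid, concatenate the row numbers of the sampled pids, and index the original DataFrame once instead of calling vcat in a loop. The helper names rows_by_pid and bootstrap_by_rows are my own and hypothetical, not part of DataFrames.jl, and I have not benchmarked this.

```julia
using DataFrames
using StatsBase: sample

# Same toy panel as in my attempt: 4 pids with 4 rows each.
dfex = DataFrame(pid = repeat(1:4, inner = 4), a = randn(16), b = rand(16))

# Hypothetical helper: map each pid to the row numbers it occupies.
function rows_by_pid(pid::AbstractVector)
    lookup = Dict{eltype(pid), Vector{Int}}()
    for (row, p) in enumerate(pid)
        push!(get!(() -> Int[], lookup, p), row)
    end
    return lookup
end

# Hypothetical helper: gather the row numbers of every drawn pid
# (repeated pids contribute their rows again), then index the DataFrame once.
function bootstrap_by_rows(df::AbstractDataFrame, lookup::AbstractDict, draw::AbstractVector)
    rows = reduce(vcat, [lookup[p] for p in draw])
    return df[rows, :]
end

lookup = rows_by_pid(dfex.pid)                                   # built once
draw = sample(unique(dfex.pid), length(lookup); replace = true)  # one bootstrap draw of pids
boot = bootstrap_by_rows(dfex, lookup, draw)
```

The lookup only needs to be built once and can be reused for every bootstrap replication. A smaller change to my loop might be to collect the SubDataFrames in a vector and call reduce(vcat, ...) on it once, but I am not sure which of the two is faster.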
Thank you for your input!