Stratified and weighted sampling in dataframes

I have a function that takes a random sample from a dataframe. However, I ran into an issue when working with a grouped dataframe: I have grouped the data by several columns and would like to sample within these groups. I cannot find an obvious way to go about it, or a detailed tutorial.
So I have the following,

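using DataFrames, StatsBase  # `sample` is provided by StatsBase
function take_a_sample(df, size)
    # draw `size` distinct row indices, kept in their original order
    df[sample(axes(df, 1), size; replace = false, ordered = true), :]
end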

Here df is the dataframe and size is the number of samples I would like. This works fine with a plain dataframe, but I then need to group the data and sample within the groups:
df_grouped = groupby(df, [:country, :date, :location])

take_a_sample(df_grouped,100)

I get ERROR: ArgumentError: GroupedDataFrame requires a single index

How can I sample uniformly within each group of a grouped dataframe?
Thanks.

I have not tried an MWE, but I think the problem is that df is a grouped data frame in the case where you get the error, while take_a_sample constructs an array of indices and then tries to index the grouped data frame with it. I believe the first index into a grouped data frame should be a scalar selecting the sub data frame you want. One way to fix this is to call take_a_sample in a loop over the sub data frames.
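
A loop over the sub data frames could look like the minimal sketch below. The toy data frame, its column names, and the sample size of 3 are assumptions made up for illustration; take_a_sample is the function from the question, restricted to AbstractDataFrame.

```julia
using DataFrames, StatsBase

# Function from the question: sample rows without replacement
take_a_sample(df::AbstractDataFrame, size) =
    df[sample(axes(df, 1), size; replace = false, ordered = true), :]

# Hypothetical toy data: two groups of five rows each
df = DataFrame(country = repeat(["A", "B"], inner = 5), value = 1:10)
gdf = groupby(df, :country)

# Iterating a GroupedDataFrame yields one SubDataFrame per group
samples = [take_a_sample(sdf, 3) for sdf in gdf]

# Stack the per-group samples back into a single data frame
result = reduce(vcat, samples)
```

Each group contributes 3 rows here, so result has 6 rows in total.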

take_a_sample(df::AbstractDataFrame, size) =
    df[sample(axes(df, 1), size; replace = false, ordered = true), :]
take_a_sample(gdf::GroupedDataFrame, size) =
    combine(gdf, x -> take_a_sample(x, size))

This is the simplest approach.


Many thanks. It looks like there is a small typo in your response (df_grouped should be gdf).

I ran into an issue when I tried to sample with a size > 1, e.g. take_a_sample(gdf, 10), on the grouped dataframe.

ERROR: LoadError: Cannot draw more samples without replacement.

    nested task error: Cannot draw more samples without replacement.
    Stacktrace:

Also, if the sample size is 1, it returns a dataframe with as many rows as there are groups in the grouping hierarchy. I think this is expected because it samples once per group? Sorry if I was not clear, but I want to sample multiple times within each group of the grouped dataframe.

This means that some groups in your data frame have fewer than 10 rows. The error is expected if you want to do sampling without replacement.
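
To see which groups are too small, one option is to count the rows per group before sampling. This is only a sketch: the toy data and the names n, group_sizes, and too_small are assumptions for illustration.

```julia
using DataFrames

# Hypothetical toy data: group "A" has three rows, group "B" has one
df = DataFrame(country = ["A", "A", "A", "B"], value = 1:4)
gdf = groupby(df, :country)

# One row per group; the row count is stored in a column named `nrow`
group_sizes = combine(gdf, nrow)

# Groups that cannot supply `n` distinct rows without replacement
n = 3
too_small = filter(:nrow => <(n), group_sizes)
```

Here too_small contains only group "B", since it has a single row.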

Getting one row per group when the sample size is 1 is also expected and correct.

Do you mean you want to do sampling with replacement? Then use replace=true.
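
As a rough sketch of that option, the same two methods can be written with replace = true. The name sample_with_replacement and the toy data are assumptions for illustration, not part of the thread.

```julia
using DataFrames, StatsBase

# Same pattern as take_a_sample, but rows may be drawn more than once
sample_with_replacement(df::AbstractDataFrame, n) =
    df[sample(axes(df, 1), n; replace = true, ordered = true), :]
sample_with_replacement(gdf::GroupedDataFrame, n) =
    combine(gdf, x -> sample_with_replacement(x, n))

# Hypothetical toy data: group "B" has only one row
df = DataFrame(country = ["A", "A", "A", "B"], value = 1:4)

# Asking for 5 rows per group no longer errors, but duplicates appear
result = sample_with_replacement(groupby(df, :country), 5)
```

Every group returns exactly 5 rows, so the single row of group "B" is repeated five times.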

Thank you so much for the detailed explanation and the quick response! Ideally I want to sample x rows within each group, but if a group has fewer rows than the requested sample size, I want to pick all the rows in that group and continue without an error. Sampling with replacement means that I could draw the same rows multiple times, i.e. I would get duplicated rows. This is OK when the number of samples is less than the group size (I could then remove the duplicates to keep only unique rows). However, I would not want to sample the same entry again while the group still has other rows available, so I want to keep replace=false.

@bkamins Slightly off-topic, but do you have any tutorials, books, or how-to lists that you could recommend?

take_a_sample(df::AbstractDataFrame, size) =
    df[sample(axes(df, 1), min(size, nrow(df)); replace = false, ordered = true), :]
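
Combined with the GroupedDataFrame method from earlier in the thread, the capped version can be tried on groups of unequal size. The toy data below is an assumption for illustration.

```julia
using DataFrames, StatsBase

# Cap the sample size at the number of rows in the (sub) data frame
take_a_sample(df::AbstractDataFrame, size) =
    df[sample(axes(df, 1), min(size, nrow(df)); replace = false, ordered = true), :]
take_a_sample(gdf::GroupedDataFrame, size) =
    combine(gdf, x -> take_a_sample(x, size))

# Hypothetical toy data: group "A" has four rows, group "B" has two
df = DataFrame(country = ["A", "A", "A", "A", "B", "B"], value = 1:6)

# Request 3 rows per group; "B" is smaller than that, so all its rows are kept
result = take_a_sample(groupby(df, :country), 3)
```

Group "A" contributes 3 rows and group "B" contributes both of its rows, with no error and no duplicates.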

On what topic? If you mean DataFrames.jl then please check out Introduction · DataFrames.jl.
