Hi all. I was wondering what the most efficient way to perform this task is:
I have a big DataFrame, and I need to take random samples from it (this is the easy part). But I need the column means of the sample to match the column means of another DataFrame. I am not really familiar with bootstrapping; is this something that could be solved with either DependentBootstrap or Bootstrap?
If not, I thought about making a function or a while loop, like this:
using DataFrames, Statistics, StatsBase

sample_function(df::AbstractDataFrame, size) =
    df[sample(axes(df, 1), size), :]

rows_sample = false
while !rows_sample
    global sampled = sample_function(df, nrow(df2))  # as many rows as the other DataFrame has
    # compare the sample's means (not the full df's) to df2's
    if mean(sampled.c1) == mean(df2.c1) && mean(sampled.c2) == mean(df2.c2)
        global rows_sample = true  # assignment; a bare `==` only compares
    end
end
But I am not entirely sure that this will work (or if there is a better way of doing it).
This is a hard question algorithmically. In general, unless your data has some specific structure, it might even be the case that such a sample does not exist. What exactly is the structure of your data frames? Also note that what you wrote, mean.(df.c1), is most likely not what you want.
Give a sense of what the contents of df.c1, df.c2, df2.c1, and df2.c2 are. It might be technically impossible: for example, if df has only large numbers in c1 and c2 and df2 has smaller numbers in these columns.
(Preview of the DataFrame's first rows omitted.) And so on for a lot more rows.
The other DataFrame, from which I want to match the mean values of c1 and c2, has only 889 rows (with no missings, of course). The mean values of both columns are 3.75 and 849.55, while the whole df1 DataFrame (with 16218 rows) has mean values of 5.66 and 548.65, respectively.
This is a Knapsack Problem.
It is NP-complete.
Proving this is equivalent to a knapsack problem is a little fiddly, but I am pretty confident it is.
In particular, it is at least as hard as the subset-sum problem.
Proof sketch:
1. Consider trying to solve the Subset Sum problem: we have a multiset S of integers and a target sum T, and we are trying to find out if there exists a subset of S, call it P, such that sum(P) == T.
2. Create a dataframe, df_S = DataFrame((s = s, t = T) for s in S). This has mean(df_S.t) == T, since every entry of t is T.
3. Run your sampling algorithm to find a subset of rows, call it df_P, that has mean(df_P.s) == mean(df_S.t).
4. If step 3 returns an answer (rather than a failure), then the answer to subset sum is true, else false.
If you can run the whole thing in polynomial time, then congratulations, you have proved P = NP.
It's a bit more fiddly to show this still applies if you allow a tolerance, but we can just set that tolerance to zero and it still holds.
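For concreteness, the construction in step 2 looks like the following; the values of S and T below are made up, purely to illustrate:

using DataFrames, Statistics

S, T = [3, 1, 4, 1, 5, 9], 11                # a hypothetical subset-sum instance
df_S = DataFrame((s = s, t = T) for s in S)  # one row per element of S, column t constant
mean(df_S.t) == T                            # true: every entry of t is T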
Anyway, just because it is NP-complete doesn't mean it is impossible.
You can write a JuMP MIP program to select rows and solve it with HiGHS or Cbc, as long as you don't have too many rows.
Or you can make a heuristic solution, like some kind of greedy search.
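A minimal sketch of the MIP route with JuMP and HiGHS, assuming the columns to match are c1 and c2 and the sample size should be nrow(df2); rather than forcing exact equality, it minimizes the deviation of the selected rows' means from the targets:

using JuMP, HiGHS, DataFrames, Statistics

function select_rows(df, df2; cols = [:c1, :c2])
    n = nrow(df2)
    model = Model(HiGHS.Optimizer)
    set_silent(model)
    @variable(model, x[1:nrow(df)], Bin)        # x[i] == 1 means row i is selected
    @constraint(model, sum(x) == n)             # select exactly as many rows as df2 has
    @variable(model, dev[1:length(cols)] >= 0)  # absolute deviation per column
    for (j, c) in enumerate(cols)
        target = n * mean(df2[!, c])            # matching means of n rows <=> matching sums
        expr = sum(df[i, c] * x[i] for i in 1:nrow(df))
        @constraint(model, expr - target <= dev[j])
        @constraint(model, target - expr <= dev[j])
    end
    @objective(model, Min, sum(dev))
    optimize!(model)
    return df[value.(x) .> 0.5, :]
end

Calling select_rows(df, df2) would then return one particular subset of rows; whether a problem with ~16k binary variables solves in reasonable time depends on the data and the solver.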
Yep, feels like a heuristic solution will work well here.
On the other hand:
Can we get a clearer picture of df.c1, df.c2, df2.c1, df2.c2 without the excessive editing? This is crucial for this problem. For example, are the values integers? What are the ranges of the values?
I agree with all the previous posters about a pragmatic approach, but I'm wondering about this part of your OP:
I need to take random samples
If you solve a Knapsack problem to select observations, the resulting sample will not be a random sample, so standard statistical inference will likely be off. Whether this is a problem of course depends on your application, but if this is about the representativeness of the sample in some sense, you might want to look at the causal inference literature addressing this sort of thing: matching estimators, inverse probability weighting (IPW) or doubly robust estimators, or synthetic controls.
Well… my degree is in molecular biology, so I do not have a lot of background in statistics. Given that we are on these topics, are there any paper, article, or book recommendations that you all reckon could help? Besides the Knapsack problem, which I should read more about.
Good, I will look into them. The sample is not going to be representative of the data it came from (for example, the mean and median are going to be way off), but it is going to match the means of the other data set.
Basically, to compare both data sets, I need to "control" certain parameters so the comparison is valid. If I simply take the big data set and compare it to the smaller one, the sheer weight of the sample is going to generate a bias in my test. That is why I need a sample that, while not being the original one, "behaves" like the original one. I think.
But the thing is that I am "forcing" the system, and not getting a random population that happens to have the same mean value, so I think it is not OK to do it.
For the reasons that others expanded on, this problem is hard, as I said. In general you would need to enumerate all solutions to an NP-hard problem and randomly pick one of them.
Given your description, I assume you are happy with an approximate solution.
For this you could use the following procedure:
For each row in the reference table, find rows in the original table that are similar to it (this should be doable assuming the data frame is not too wide; you can do it in several ways: either use a distance cutoff or some specified number of neighbors).
Then, for each row in the reference table, sample one of its neighbors.
Doing this you should get not only an approximate match of the means, but an approximate match of the whole joint distribution of the variables on which you match.
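A rough sketch of this neighbor-matching idea; the column names c1 and c2, the squared Euclidean distance on standardized columns, and k = 10 neighbors are all assumptions made for illustration:

using DataFrames, Statistics

function neighbor_sample(df, df2; cols = [:c1, :c2], k = 10)
    X  = Matrix{Float64}(df[:, cols])             # original table
    X2 = Matrix{Float64}(df2[:, cols])            # reference table
    mu, sd = mean(X, dims = 1), std(X, dims = 1)  # standardize so both columns count equally
    X  = (X .- mu) ./ sd
    X2 = (X2 .- mu) ./ sd
    picks = Int[]
    for i in 1:size(X2, 1)
        d = vec(sum(abs2, X .- X2[i:i, :], dims = 2))  # distance of every df row to df2 row i
        push!(picks, rand(partialsortperm(d, 1:k)))    # sample one of the k nearest rows
    end
    return df[picks, :]
end

One caveat: the same row of df can be picked for several reference rows; if that matters, the picked indices would need to be removed from the pool as you go.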
An alternative would be to build some propensity model of a row being included in the sample and perform propensity score matching.
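A hedged sketch of that propensity route using GLM.jl; again the column names c1 and c2 are assumptions, and this version samples rows by propensity weights rather than doing strict one-to-one matching:

using DataFrames, GLM, StatsBase

# label which table each row came from, then model that label from c1 and c2
ref = select(df2, :c1, :c2); ref.in_ref = fill(1.0, nrow(ref))
src = select(df,  :c1, :c2); src.in_ref = fill(0.0, nrow(src))
both = vcat(ref, src)

m  = glm(@formula(in_ref ~ c1 + c2), both, Binomial(), LogitLink())
ps = coalesce.(predict(m, select(df, :c1, :c2)), 0.0)  # propensity of each df row to "look like" df2
sampled = df[sample(1:nrow(df), Weights(ps), nrow(df2); replace = false), :]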
I'm spitballing here; this isn't guaranteed to solve the problem, but it at least feels like what you mean:
To pick random samples from x with a mean and variance similar to those of y, form p = exp.(-(x .- mean(y)).^2 ./ var(y)) and use that as your weight vector for sampling. You are likely to pick samples within mean(y) ± std(y), but you are guaranteed they are all from x.
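In code, that weighting could look like the following sketch; x and y here are hypothetical plain vectors standing in for the relevant columns, with sizes and means loosely echoing the 16218 and 889 rows mentioned above:

using Statistics, StatsBase

x = randn(16_218) .* 3 .+ 5.66            # stand-in for the big table's column
y = randn(889) .+ 3.75                    # stand-in for the reference column

p = exp.(-(x .- mean(y)) .^ 2 ./ var(y))  # weights peaked around mean(y)
s = sample(x, Weights(p), length(y); replace = false)

mean(s), mean(y)                          # the sample mean should land near mean(y)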