How to take samples from a DataFrame that have a specific mean

Hi all. I was wondering what the most efficient way to perform this task would be:

I have a big DataFrame, and I need to take random samples from it (this is the easy part). But I need the column means of the sample to match the column means of another DataFrame. I am not really familiar with bootstrapping; is this something that could be solved with either DependentBootstrap or Bootstrap?

If not, I thought about making a function or a while loop, like this:


using DataFrames, Statistics, StatsBase

sample_function(df::AbstractDataFrame, size) =
    df[sample(axes(df, 1), size), :]

rows_sample = false
while rows_sample == false
    s = sample_function(df, nrow(df2))  # draw as many rows as the other DataFrame has

    if mean(s.c1) == mean(df2.c1) && mean(s.c2) == mean(df2.c2)
        global rows_sample = true
    end
end


But I am not entirely sure that this will work (or if there is a better way of doing it).

Thanks in advance!
Cheers,

This is a hard question algorithmically. In general, unless your data has some specific structure, it might even be the case that such a sample does not exist. What exactly is the structure of your data frames? Also note that mean.(df.c1), as you wrote it, is most likely not what you want:

julia> df = DataFrame(c1=1:5)
5×1 DataFrame
 Row │ c1
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5

julia> mean.(df.c1)
5-element Vector{Float64}:
 1.0
 2.0
 3.0
 4.0
 5.0

Give a sense of the contents of df.c1, df.c2, df2.c1, df2.c2. It might be technically impossible, for example if df has only large numbers in c1 and c2 and df2 has smaller numbers in those columns.

Oops, yes, it was mean(df.c1), without vectorizing it. My bad!

The DataFrame that I want to sample from has 16218 rows in total, without missing values.
Basically it looks like this:

|id|c1|c2|
|--|--|--|
|a | 5| 2|
|b |10| 1|
|c |11| 5|
|d | 2| 3|
|e |16| 4|

And so on for a lot more rows.
The other dataframe, from which I want to match the mean values of c1 and c2, has only 889 rows (without missings, of course). The mean values of its two columns are 3.75 and 849.55, while the whole df1 DataFrame (with 16218 rows) has mean values of 5.66 and 548.65, respectively.

So… it might be possible! I think.

This is a Knapsack Problem.
It is NP-complete.
Proving this is equivalent to a knapsack problem is a little fiddly, but I am pretty confident it is.
In particular, it is at least as hard as the Subset Sum problem.

Proof sketch:

  1. Consider trying to solve the Subset Sum problem: given a multiset S of integers and a target sum T, we want to know whether there exists a subset of S, call it P, such that sum(P) == T.
  2. Create a dataframe, df_S = DataFrame((s=s, t=T) for s in S). This has mean(df_S.t)==T
  3. Run your sampling algorithm to find a subset of rows, call it df_P, that has mean(df_P.t) == mean(df_S.t).
  4. If step 3 returns an answer (rather than a failure), then the answer to Subset Sum is true; otherwise it is false.

If you can run the whole thing in polynomial time, then congratulations: you have proved P = NP.

It's a bit more fiddly to show this still applies if you allow a tolerance, but we can just set that tolerance to zero and it still holds.
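To make step 2 concrete, here is a tiny illustration with a made-up S and T (any values would do):

using DataFrames, Statistics

S = [3, 1, 4, 1, 5]                          # the multiset of integers (made up)
T = 6                                        # the target sum (made up)
df_S = DataFrame((s = s, t = T) for s in S)  # one row per element of S
mean(df_S.t) == T                            # true, as stated in step 2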


Anyway, just because it is NP-complete doesn't mean it is impossible.
You can write a JuMP MIP program to select rows and solve it with HiGHS or Cbc, as long as you don't have too many rows.
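For instance, here is a minimal JuMP feasibility sketch of that route. The column names (:length, :mobi_th50_dis_fraction), the fixed sample size, the tolerance tol, and the helper name mip_select are my own illustrative assumptions, not something prescribed above:

using JuMP, HiGHS, DataFrames, Statistics

# Illustrative sketch: pick nrow(df2) rows of df whose means are within `tol`
# of the reference means. Column names and `tol` are assumptions, not givens.
function mip_select(df, df2; tol = 0.01)
    n  = nrow(df2)                       # sample as many rows as the reference table has
    t1 = mean(df2.length)                # target means to match
    t2 = mean(df2.mobi_th50_dis_fraction)
    a, b = df.length, df.mobi_th50_dis_fraction

    model = Model(HiGHS.Optimizer)
    set_silent(model)
    @variable(model, x[1:nrow(df)], Bin)   # x[i] == 1 means row i is selected
    @constraint(model, sum(x) == n)        # fix the sample size
    # keep each sample mean within tol of its target
    @constraint(model, n * (t1 - tol) <= sum(a[i] * x[i] for i in 1:nrow(df)) <= n * (t1 + tol))
    @constraint(model, n * (t2 - tol) <= sum(b[i] * x[i] for i in 1:nrow(df)) <= n * (t2 + tol))
    optimize!(model)
    df[value.(x) .> 0.5, :]                # rows the solver picked
end

Note that this returns one feasible subset, not a random sample.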

Or you can make a heuristic solution, like some kind of greedy search.


Yep, feels like a heuristic solution will work well here.

On the other hand:

Can we get a clearer picture of df.c1, df.c2, df2.c1, df2.c2 without the excessive editing, because this is crucial for this problem? For example, are the values integers? What are the ranges of the values?

Heya, sorry, didn't notice that one. The big dataframe that I want to sample from has 16218 rows.
It looks like this:

16218×3 DataFrame
   Row │ uniprot_id  length  mobi_th50_dis_fraction
       │ String15    Int64   Float64
───────┼────────────────────────────────────────────
     1 │ P01189         267                   0.007
     2 │ P01241         217                   0.018
     3 │ P01275         180                   0.0
     4 │ P06850         196                   0.0
     5 │ P09681         153                   0.0
   ⋮   │     ⋮         ⋮               ⋮
 16215 │ Q9Y6X8         837                   0.002
 16216 │ Q9Y6Y8        1000                   0.0
 16217 │ Q9Y6Y9         160                   0.0
 16218 │ Q9Y6Z7         277                   0.0
                                  16209 rows omitted

c1 and c2 would be the columns that have Ints and Floats; the other one (uniprot_id) is not needed.

The other dataframe, the one against which I want to compare the sample from the one above, looks like this:

889×3 DataFrame
 Row │ uniprot_id  length  mobi_th50_dis_fraction
     │ String15    Int64   Float64
─────┼────────────────────────────────────────────
   1 │ A0A6E1VY70     674                   0.068
   2 │ A0AVF1         554                   0.058
   3 │ A0JP26         581                   0.384
   4 │ A1KZ92        1463                   0.088
   5 │ A2A3L6         582                   0.29
  ⋮  │     ⋮         ⋮               ⋮
 886 │ Q9Y6X3         613                   0.062
 887 │ Q9Y6X6        1858                   0.343
 888 │ Q9Y6Y0         642                   0.0
 889 │ Q9Y6Y1        1673                   0.361
                                   880 rows omitted

Please let me know if this is better!


I agree with all the previous posters about a pragmatic approach, but I'm wondering about this part of your OP:

I need to take random samples

If you solve a Knapsack problem to select observations, the resulting sample will not be a random sample, so standard statistical inference will likely be off. Whether this is a problem of course depends on your application, but if this is about representativeness of the sample in some sense, you might want to look at the causal inference literature addressing this: things like matching estimators, IPW or doubly robust estimators, or synthetic controls.


Well… my degree is in molecular biology, so I do not have a lot of background in statistics. Given that we are on these topics, are there any papers, articles, or book recommendations that you all reckon could help? Besides the Knapsack problem, which I should read more about.

Good, I will look into them. The sample is not going to be representative of the data it came from (for example, the mean and median are going to be way off), but it is going to match the other data set.

Basically, to compare both data sets, I need to "control" certain parameters so the comparison is valid. If I simply take the big data set and compare it to the smaller one, the sheer weight of the sample is going to generate a bias in my test. That is why I need a sample that, while not being the original one, "behaves" like the original one. I think.

One thing that I did was to filter the df for rows with values similar to the mean values of the other dataframe, like this:

filter!(row -> (row.c1 >= 700 && row.c1 <= 900) && (row.c2 >= 3 && row.c2 <= 4), df_t2)

But the thing is that I am "forcing" the system, and not getting a random population that happens to have the same mean value, so I think it is not OK to do it.

For the reasons that others expanded on, I said that this problem is hard. In general you would need to enumerate all solutions of an NP-hard problem and randomly pick one of them.

Given your description, I assume you are happy with an approximate solution.
For this you could use the following procedure:

  • For each row in the reference table, find rows in the original table that are similar to it (this should be doable assuming the data frame is not too wide; you can do it in several ways: either use a distance cutoff or some specified number of neighbors).
  • Then, for each row in the reference table, sample one of its neighbors.

Doing this, you should get not only an approximate match of the mean, but an approximate match of the whole joint distribution of the variables on which you match.
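A rough sketch of that procedure, under my own assumptions (matching on the two numeric columns, standardizing them, squared Euclidean distance, an arbitrary k nearest neighbors, and an illustrative helper name match_sample), could look like this:

using DataFrames, Statistics

# Illustrative sketch of neighbor matching; column names and k are assumptions.
function match_sample(df, df2; cols = [:length, :mobi_th50_dis_fraction], k = 10)
    # standardize both tables with the big table's mean and std so the columns
    # contribute comparably to the distance
    Z  = Matrix{Float64}(df[:, cols])
    Z2 = Matrix{Float64}(df2[:, cols])
    μ, σ = mean(Z, dims = 1), std(Z, dims = 1)
    Z  = (Z .- μ) ./ σ
    Z2 = (Z2 .- μ) ./ σ

    picked = Int[]
    for i in 1:size(Z2, 1)
        # squared Euclidean distance from reference row i to every row of the big table
        d = vec(sum((Z .- Z2[i:i, :]) .^ 2, dims = 2))
        neighbors = partialsortperm(d, 1:k)    # indices of the k closest rows
        push!(picked, rand(neighbors))         # sample one of them at random
    end
    df[picked, :]
end

Since each reference row samples independently, the same original row can be picked more than once; if you need distinct rows, you would have to drop already-picked rows as you go.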

An alternative would be to build some propensity model of a row being included in the sample and perform propensity score matching.
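If you want to try the propensity-score route, a crude sketch (assuming GLM.jl, a logistic model on the two numeric columns, nearest-propensity matching with replacement, and an illustrative helper name psm_sample; none of this is prescribed above) could be:

using DataFrames, GLM, Statistics

# Illustrative propensity-score matching sketch; model and matching rule are assumptions.
function psm_sample(df, df2)
    a = copy(df);  a.ref = zeros(Int, nrow(a))   # 0 = big table
    b = copy(df2); b.ref = ones(Int, nrow(b))    # 1 = reference table
    both = vcat(a, b)

    # propensity of a row "looking like" a reference row, given the two columns
    m  = glm(@formula(ref ~ length + mobi_th50_dis_fraction), both, Binomial(), LogitLink())
    ps = predict(m)
    ps_big, ps_ref = ps[both.ref .== 0], ps[both.ref .== 1]

    # for each reference row, pick the big-table row with the closest propensity
    picked = [argmin(abs.(ps_big .- p)) for p in ps_ref]
    df[picked, :]
end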

I'm spitballing here; this isn't guaranteed to solve the problem, but it at least feels like what you mean:

To pick random samples from x with a mean and variance similar to those of y, form p = exp.(-(x .- mean(y)).^2 ./ var(y)) and use that as your weight vector for sampling. You are likely to pick samples within mean(y) ± std(y), but you are guaranteed they are all from x.
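As a concrete sketch of this idea, assuming x = df.length and y = df2.length (matching on a single column) and sampling without replacement:

using DataFrames, Statistics, StatsBase

x, y = df.length, df2.length                 # assumed columns; adjust to your data
p = exp.(-(x .- mean(y)) .^ 2 ./ var(y))     # rows near mean(y) get larger weights
idx = sample(axes(df, 1), Weights(p), nrow(df2); replace = false)
df_sample = df[idx, :]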