Filtering one DataFrame using parameters of another, with different lengths

Hi, I have the following issue:

I have a pretty big df that basicly stores coordinates, that looks like this:

|coordinate_gen|
|--------------|
|        752566|
|        842013|
|        903426|
|        903428|
|      59033249|

This has 1233013 rows, so it is quite big.

And then I have another df, that only has 20.600 rows, and three rows basically:

| ID|  start|  ends|
|---|-------|------|
|amd| 752540|752589|
|dmc| 903420|903429|
|chv|     10|    15|

What I am trying to do (and failing) is to filter all the rows from the bigger df1 (the one that stores de coordinate_gen) so I get every row that has a value between the values of df2 start and ends, and add a column with the corresponding id.

The result should look like this:

|coordinate_gen|     ID|
|--------------|-------|
|        752566|    amd| # bigger than 752540 and smaller than 752589
|        842013|missing| # can't be set between any values
|        903426|    dmc| # bigger than 903420and smaller than 903429
|        903428|    dmc| # bigger than 903420 and smaller than 903429
|      59033249|missing| # can't be set between any values

I thought about simply using filter, but I don’t know how to add the condition that comes from another dataframe with a different length.

Any help is welcome,

Thanks a lot
J

You are probably looking for Asof join operation, followed by filter. [ANN] FlexiJoins.jl: fresh take on joining datasets might be of use

1 Like

You might want to look at FlexiJoins

Basically you are doing a leftjoin of small_df onto big_df, where the predicate is coordinate_gen in start:ends

Thanks!

I just read the documentation, I think it will help me solve my problem.

The only issue that I have is that is not clarified how can you use two conditions, something like by_pred(:coordinate_gen, >=, :gene_start, & :coordinate_gen, >=, :gene_end), as this gives a mistake, and FlexiJoins.innerjoin((df1, df2), by_pred(:Physical_Position, in, [:gene_start, :gene_end)) also does not work, probably because I am doing the collection wrong.

using DataFrames, IntervalSets

cg=[
        752566,
        842013,
        903426,
        903428,
      59033249]

df1=DataFrame(;cg)

 ID=[ 
"amd",
"dmc",
"chv"]
      
      
 start=[ 
 752540,
 903420,
     10]

ends=[
752589,
903429,
    15]


df2=DataFrame(;ID,start,ends)


ints=ClosedInterval{Int64}.(df2.start,df2.ends)

idx=map(g->findfirst(i->∈(g,i), ints), df1.cg)

df1.IDgen=[!isnothing(i) ? ID[i] : missing for i in idx]

This doesn’t look like valid julia syntax at all.
In FlexiJoins, you generally pass functions that extract join keys from dataset entries. Symbols to denote property names in just a convenient shortcut that works at the top level.
Eg, to create an interval, use regular IntervalSets syntax:

by_pred(:Physical_Position, in, x -> x.gene_start..x.gene_end)  # closed interval