Filtering one DataFrame using parameters of another, with different lengths

Juan_Mac_Donagh · March 8, 2023, 2:57pm

Hi, I have the following issue:

I have a pretty big df that basicly stores coordinates, that looks like this:

|coordinate_gen|
|--------------|
|        752566|
|        842013|
|        903426|
|        903428|
|      59033249|

This has 1233013 rows, so it is quite big.

And then I have another df, that only has 20.600 rows, and three rows basically:

| ID|  start|  ends|
|---|-------|------|
|amd| 752540|752589|
|dmc| 903420|903429|
|chv|     10|    15|

What I am trying to do (and failing) is to filter all the rows from the bigger df1 (the one that stores de coordinate_gen) so I get every row that has a value between the values of df2 start and ends, and add a column with the corresponding id.

The result should look like this:

|coordinate_gen|     ID|
|--------------|-------|
|        752566|    amd| # bigger than 752540 and smaller than 752589
|        842013|missing| # can't be set between any values
|        903426|    dmc| # bigger than 903420and smaller than 903429
|        903428|    dmc| # bigger than 903420 and smaller than 903429
|      59033249|missing| # can't be set between any values

I thought about simply using filter, but I don’t know how to add the condition that comes from another dataframe with a different length.

Any help is welcome,

Thanks a lot
J

artemsolod · March 8, 2023, 3:06pm

You are probably looking for Asof join operation, followed by filter. [ANN] FlexiJoins.jl: fresh take on joining datasets might be of use

nilshg · March 8, 2023, 3:07pm

You might want to look at FlexiJoins

Basically you are doing a leftjoin of small_df onto big_df, where the predicate is coordinate_gen in start:ends

Juan_Mac_Donagh · March 8, 2023, 4:45pm

Thanks!

I just read the documentation, I think it will help me solve my problem.

The only issue that I have is that is not clarified how can you use two conditions, something like by_pred(:coordinate_gen, >=, :gene_start, & :coordinate_gen, >=, :gene_end), as this gives a mistake, and FlexiJoins.innerjoin((df1, df2), by_pred(:Physical_Position, in, [:gene_start, :gene_end)) also does not work, probably because I am doing the collection wrong.

rocco_sprmnt21 · March 8, 2023, 5:25pm

using DataFrames, IntervalSets

cg=[
        752566,
        842013,
        903426,
        903428,
      59033249]

df1=DataFrame(;cg)

 ID=[ 
"amd",
"dmc",
"chv"]
      
      
 start=[ 
 752540,
 903420,
     10]

ends=[
752589,
903429,
    15]


df2=DataFrame(;ID,start,ends)


ints=ClosedInterval{Int64}.(df2.start,df2.ends)

idx=map(g->findfirst(i->∈(g,i), ints), df1.cg)

df1.IDgen=[!isnothing(i) ? ID[i] : missing for i in idx]

aplavin · March 8, 2023, 5:29pm

This doesn’t look like valid julia syntax at all.
In FlexiJoins, you generally pass functions that extract join keys from dataset entries. Symbols to denote property names in just a convenient shortcut that works at the top level.
Eg, to create an interval, use regular IntervalSets syntax:

by_pred(:Physical_Position, in, x -> x.gene_start..x.gene_end)  # closed interval

Topic		Replies	Views
Conditional left join 2 dataframes when none of the columns are common General Usage dataframes , flexijoins	26	1979	May 26, 2022
Merge dataframes where one value is between two others General Usage question	7	11753	November 22, 2020
Asof join support in DataFrames.jl Data dataframes	9	919	October 13, 2022
Is it possible to use filter(:col => myFilter, df) with additional constant input variables? New to Julia question , dataframes	12	790	January 5, 2022
How to filter data conditional on two columns General Usage data , dataframes	17	1779	January 11, 2022

Filtering one DataFrame using parameters of another, with different lengths

Related topics