What I am trying to do (and failing) is to filter the rows of the bigger df1 (the one that stores the coordinate_gen) so that I get every row whose value lies between the start and end values in df2, and add a column with the corresponding id.
The result should look like this:
|coordinate_gen|ID|Reason|
|--------------|-------|-------|
| 752566| amd| bigger than 752540 and smaller than 752589|
| 842013|missing| can't be set between any values|
| 903426| dmc| bigger than 903420 and smaller than 903429|
| 903428| dmc| bigger than 903420 and smaller than 903429|
| 59033249|missing| can't be set between any values|
I thought about simply using filter, but I don’t know how to add the condition that comes from another dataframe with a different length.
I just read the documentation, and I think it will help me solve my problem.
The only issue I have is that it does not explain how to use two conditions. Something like by_pred(:coordinate_gen, >=, :gene_start, & :coordinate_gen, >=, :gene_end) throws an error, and FlexiJoins.innerjoin((df1, df2), by_pred(:Physical_Position, in, [:gene_start, :gene_end)) does not work either, probably because I am building the collection wrong.
This doesn’t look like valid Julia syntax at all.
In FlexiJoins, you generally pass functions that extract join keys from dataset entries. Using symbols to denote property names is just a convenient shortcut that works at the top level.
E.g., to create an interval, use regular IntervalSets syntax:
by_pred(:Physical_Position, in, x -> x.gene_start..x.gene_end) # closed interval
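
To check the expected result independently of FlexiJoins, here is a minimal sketch in plain DataFrames.jl. It is not the join itself, just a linear lookup. It assumes df2 has columns named ID, gene_start and gene_end (the ID column name and the interval bounds are taken from the comments in the question's table, not from the real data). lookup_id is a hypothetical helper, and it treats the interval as closed, like the by_pred call above.

```julia
using DataFrames

# Toy data reconstructed from the question (assumed, not the real tables).
df1 = DataFrame(coordinate_gen = [752566, 842013, 903426, 903428, 59033249])
df2 = DataFrame(ID = ["amd", "dmc"],
                gene_start = [752540, 903420],
                gene_end = [752589, 903429])

# Hypothetical helper: return the ID of the first interval that contains x,
# or `missing` when x falls outside every interval.
function lookup_id(x, genes)
    i = findfirst(r -> r.gene_start <= x <= r.gene_end, eachrow(genes))
    return i === nothing ? missing : genes.ID[i]
end

# Add the ID column; this scans df2 once per row of df1.
df1.ID = [lookup_id(x, df2) for x in df1.coordinate_gen]
```

Coordinates outside every interval get missing instead of the string "missing". Since every coordinate is compared against every interval, this only suits small tables; for large ones the interval join is probably the better fit.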