Is merging data frames by range supported now?
Do you mean asofjoin?
If yes then it is not implemented yet. We are internally discussing with @nalimilan about the best API design for this feature. I expect that it will be implemented in main
branch this year.
For example, dataframe A has two columns id and date, and dataframe B has three columns id, date1, and date2. merge A and B based on (id = id, date1 < date < date2)
do you want all rows from the second table that meet this condition or only the closest one?
all rows.
FlexiJoins.jl is the most general package for joining datasets, to my knowledge. It does efficient joins by predicates such as the one you need, and also supports DataFrames
merge A and B based on (id = id, date1 < date < date2)
using FlexiJoins, IntervalSets
innerjoin((A, B), by_key(:id) & by_pred(:date, ∈, x -> x.date1..x.date2))
x → x.date1…x.date2
Does this range include date1 and date2?
Thanks for this package. Is it as fast as DataFrames?
Yes, this is a closed interval from IntervalSets.jl. See their documentation for other kinds of intervals - open and halfopen. They all are supported in FlexiJoins
.
I believe DataFrames efficiently optimize more cases on equijoins (join by key), but in simple cases the performance seems similar - see some (not documented) benchmarks.
As for nonequijoins, like the range join discussed here, only FlexiJoins support it directly. And this should be much faster than looping over all L-R pairs, filtering them.
In package InMemoryDataSets, the function innerjoin performs exactly the same. In my experience, amongst other packages, it is the fastest and the most efficient one.
DataFrameIntervals.jl is another option for this.