Say I have two dataframes but I wish to join them such that val2 from Table 2 is merge on to Table 1 if and only if Table2.date < Table1.date and Table2.date is the closest to Table1.date
It is not completely clear to me how the logic of the join is in the most geneal case of several different dates between df1 and df2.
Does something like the following make your logic?
using DataFrames, Dates
df1=DataFrame(date=Date.(["2021-12-16","2021-12-16","2021-12-16","2021-12-17","2021-12-17","2021-12-17"]),
val1=[1,2,3,1,2,3])
df2=DataFrame(date=Date.(["2021-12-14","2021-12-15","2021-12-16"]), val2=[1,2,3])
g=groupby(df1,:date)
transform(g, :date=>(d->df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:].val2[1])=>:val2)
#### or ####
transform(g, :date=>(d->only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:].val2))=>:val2)
transform(g, :date=>(d->only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:]))=>[:d2,:val2])
transform(g, :date=>(d->values(only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:])))=>:t_d_v)
transform(g, :date=>(d->df2[argmax(filter(x-> x < d[1], df2.date)),:val2])=>:v2)
What youβre trying to do sounds like an βas ofβ join, which is used very commonly when working with financial time series. Normally itβs not implemented with strict inequality as you described, but a simple shift of the time series will change that.
It can be implemented quite efficiently when the 2nd table is sorted. There is already an open issue for Dataframes.jl and Iβm sure theyβd be interested in a pull request: https://github.com/JuliaData/DataFrames.jl/issues/2738