Functions to merge DataFrames by nearest date that is earlier?

Say I have two dataframes but I wish to join them such that val2 from Table 2 is merge on to Table 1 if and only if Table2.date < Table1.date and Table2.date is the closest to Table1.date

Table 1

date, val1
2021-12-16, 1
2021-12-16, 2
2021-12-16, 3

Table 2

date, val2
2021-12-14, 1
2021-12-15, 2
2021-12-16, 3

In the above the result table after the merge will be

Result

date,       val1, val2
2021-12-16, 1,    2
2021-12-16, 2,    2
2021-12-16, 3,    2

This type of join is quite tricky, so I am wondering if a package exists for this already.

1 Like

One way:
(edited function inputs including dataframe columns as symbols)

using DataFrames, Dates

function joinclosestolder(df1, df2, date1::Symbol, date2::Symbol, val2::Symbol)
    if any(df1.date > df2.date)
        d2 = [maximum(df2[:,date1][df2[:,date2] .< d1]) for d1 in df1[:,date1]]
        df1[!,val2] = [df2[:,val2][argmax(d .== df2[:,date2])] for d in d2]
    end
    return df1
end

df1 = DataFrame(date = Date.(["2021-12-16","2021-12-16","2021-12-16"]), val1 = [1,2,3])
df2 = DataFrame(date = Date.(["2021-12-14","2021-12-15","2021-12-16"]), val2 = [1,2,3])
joinclosestolder(df1, df2, :date, :date, :val2)

3×3 DataFrame
 Row │ date        val1   val2  
     │ Date        Int64  Int64
─────┼──────────────────────────
   1 │ 2021-12-16      1      2
   2 │ 2021-12-16      2      2
   3 │ 2021-12-16      3      2

Similar previous discussion here:

1 Like

It is not completely clear to me how the logic of the join is in the most geneal case of several different dates between df1 and df2.
Does something like the following make your logic?

using DataFrames, Dates

df1=DataFrame(date=Date.(["2021-12-16","2021-12-16","2021-12-16","2021-12-17","2021-12-17","2021-12-17"]), 
              val1=[1,2,3,1,2,3])


df2=DataFrame(date=Date.(["2021-12-14","2021-12-15","2021-12-16"]), val2=[1,2,3])

g=groupby(df1,:date)

transform(g, :date=>(d->df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:].val2[1])=>:val2)


#### or ####


transform(g, :date=>(d->only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:].val2))=>:val2)

transform(g, :date=>(d->only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:]))=>[:d2,:val2])

transform(g, :date=>(d->values(only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:])))=>:t_d_v)

transform(g, :date=>(d->df2[argmax(filter(x-> x < d[1], df2.date)),:val2])=>:v2)


What you’re trying to do sounds like an “as of” join, which is used very commonly when working with financial time series. Normally it’s not implemented with strict inequality as you described, but a simple shift of the time series will change that.

It can be implemented quite efficiently when the 2nd table is sorted. There is already an open issue for Dataframes.jl and I’m sure they’d be interested in a pull request:

There seems to be an implementation already for IndexedTables, but I’ve not used it:
https://juliadb.juliadata.org/latest/api/#IndexedTables.asofjoin-Tuple{NDSparse,NDSparse}