Functions to merge DataFrames by nearest date that is earlier?

xiaodai · December 16, 2021, 10:54am

Say I have two dataframes but I wish to join them such that val2 from Table 2 is merge on to Table 1 if and only if Table2.date < Table1.date and Table2.date is the closest to Table1.date

Table 1

date, val1
2021-12-16, 1
2021-12-16, 2
2021-12-16, 3

Table 2

date, val2
2021-12-14, 1
2021-12-15, 2
2021-12-16, 3

In the above the result table after the merge will be

Result

date,       val1, val2
2021-12-16, 1,    2
2021-12-16, 2,    2
2021-12-16, 3,    2

This type of join is quite tricky, so I am wondering if a package exists for this already.

rafael.guerra · December 16, 2021, 12:13pm

One way:
(edited function inputs including dataframe columns as symbols)

using DataFrames, Dates

function joinclosestolder(df1, df2, date1::Symbol, date2::Symbol, val2::Symbol)
    if any(df1.date > df2.date)
        d2 = [maximum(df2[:,date1][df2[:,date2] .< d1]) for d1 in df1[:,date1]]
        df1[!,val2] = [df2[:,val2][argmax(d .== df2[:,date2])] for d in d2]
    end
    return df1
end

df1 = DataFrame(date = Date.(["2021-12-16","2021-12-16","2021-12-16"]), val1 = [1,2,3])
df2 = DataFrame(date = Date.(["2021-12-14","2021-12-15","2021-12-16"]), val2 = [1,2,3])
joinclosestolder(df1, df2, :date, :date, :val2)

3×3 DataFrame
 Row │ date        val1   val2  
     │ Date        Int64  Int64
─────┼──────────────────────────
   1 │ 2021-12-16      1      2
   2 │ 2021-12-16      2      2
   3 │ 2021-12-16      3      2

nilshg · December 16, 2021, 1:36pm

Similar previous discussion here:

rocco_sprmnt21 · December 19, 2021, 11:12pm

It is not completely clear to me how the logic of the join is in the most geneal case of several different dates between df1 and df2.
Does something like the following make your logic?

using DataFrames, Dates

df1=DataFrame(date=Date.(["2021-12-16","2021-12-16","2021-12-16","2021-12-17","2021-12-17","2021-12-17"]), 
              val1=[1,2,3,1,2,3])


df2=DataFrame(date=Date.(["2021-12-14","2021-12-15","2021-12-16"]), val2=[1,2,3])

g=groupby(df1,:date)

transform(g, :date=>(d->df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:].val2[1])=>:val2)


#### or ####


transform(g, :date=>(d->only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:].val2))=>:val2)

transform(g, :date=>(d->only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:]))=>[:d2,:val2])

transform(g, :date=>(d->values(only(df2[df2.date .== maximum(filter(x-> x < d[1], df2.date)),:])))=>:t_d_v)

transform(g, :date=>(d->df2[argmax(filter(x-> x < d[1], df2.date)),:val2])=>:v2)

robsmith11 · December 20, 2021, 3:58am

What you’re trying to do sounds like an “as of” join, which is used very commonly when working with financial time series. Normally it’s not implemented with strict inequality as you described, but a simple shift of the time series will change that.

It can be implemented quite efficiently when the 2nd table is sorted. There is already an open issue for Dataframes.jl and I’m sure they’d be interested in a pull request:
https://github.com/JuliaData/DataFrames.jl/issues/2738

There seems to be an implementation already for IndexedTables, but I’ve not used it:
https://juliadb.juliadata.org/latest/api/#IndexedTables.asofjoin-Tuple{NDSparse,NDSparse}

Topic		Replies	Views
Asof merge in Julia General Usage dataframes , merge	12	1575	March 23, 2022
Fuzzy inexact merge General Usage	5	1526	July 29, 2020
Merge dataframes where one value is between two others General Usage question	7	11748	November 22, 2020
DataFrame query General Usage dataframes	7	317	August 27, 2022
Adding information from one DataFrame to another New to Julia	11	511	July 29, 2021

Functions to merge DataFrames by nearest date that is earlier?

Related topics