Asof merge in Julia

Hi, I am trying to do an ‘as of’ merge in Julia with two large Dataframes, based on the time and the equipment_id provided. As a manner of example, let me introduce the two df.head()

using DataFrames

df1 = DataFrame(
    date = DateTime.(["2022-02-20T08:05:22", "2022-02-20T08:05:42", 
                      "2022-02-20T08:05:52", "2022-02-20T08:07:32", 
                      "2022-02-20T08:07:32", "2022-02-20T08:08:05"]),
    equipment_id = 165,
    loaded = "f",
    percent_grade = [0,-1,-3,-1,0,-8]
)

df2 = DataFrame(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    value = [149.85,55.85,0,61.9,81.8,0]
)

Where the expected result has to looks like this:

df3 = DataFrame(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    loaded = "f",
    value = [149.85,55.85,0,61.9,81.8,0],
    percent_grade = [0,0,0,0,-1,-1]
)

As you noted, the “date” column present in both DataFrames aren’t able me to do an exact match join. That’s why I am trying to do an as of merge. The closest solution that I have found is this. However, the “double cursor” thing makes the problem even more complicated.
Hope you can help me with this.
Thanks in advance.
Regards

Check this solution using the SortMerge.jl package, it should help.

What is the threshold for a match on the date columns?
If multiple match occurs, how do you choose the best?

Thanks for the suggestion, but it did not deliver the expected result. I do also tried to understand the code and the documentation provided, still is too far from what I want.

FYI: There’s a feature request to add asof to DataFrames.jl and it was put on the 1. milestone:
https://github.com/JuliaData/DataFrames.jl/issues/2738

You can also do it in Pandas in by calling with PythonCall.jl or PyCall.jl, or with the Pandas.jl wrapper I guess.

Also might be helpful (but others seem to have pointed to a package that would be easier?) asof join with Julia data tools - Stack Overflow

@ameresv, if you could answer @gcalderone’s questions he might be able to provide a Julia solution with his package.

I was not aware of this type of merge as a function in other languages, but seems it does exist. I have done something similar manually but it involved computing the minimum distance between the timestamps.

You are right! In most cases, I use “xlookup” from excel; however, the amount of data used here makes it way difficult to handle for them.

Thanks for the question:
I would rather use a 5 seconds threshold. And in case to have multiple matches, the next item has to be used in this case.
Regards

I found this Which is pretty the same what I want to achieve.

I don’t get the result you expect, but I don’t know how else to interpret the conditions you described in a generic way.
Could something like this be right for you?


gdate1=groupby(rightjoin(df1,df2,on=:equipment_id, makeunique=true),:date_1)

subset(gdate1, [:date,:date_1]=>(x,y)->minimum(abs.(x-y)).==abs.(x-y))

julia> subset(gdate1, [:date,:date_1]=>(x,y)->minimum(abs.(x-y)).==abs.(x-y))
6×7 DataFrame
 Row │ date                 equipment_id  loaded   percent_grade  date_1               truck   value   
     │ DateTime?            Int64         String?  Int64?         DateTime             String  Float64 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:24  T67      149.85
   2 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:29  T67       55.85
   3 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:34  T67        0.0
   4 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:39  T67       61.9
   5 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:44  T67       81.8
   6 │ 2022-02-20T08:05:52           165  f                   -3  2022-02-20T08:05:49  T67        0.0

or maybe that’s what you’re looking for

julia> subset(gdate1, [:date,:date_1]=>(x,y)->maximum(filter(<=(Second(0)),x-y)).==x-y)
6×7 DataFrame
 Row │ date                 equipment_id  loaded   percent_grade  date_1               truck   value   
     │ DateTime?            Int64         String?  Int64?         DateTime             String  Float64 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:24  T67      149.85 
   2 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:29  T67       55.85 
   3 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:34  T67        0.0  
   4 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:39  T67       61.9  
   5 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:44  T67       81.8  
   6 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:49  T67        0.0  


or this

subset(gdate1, [:date,:date_1]=>(x,y)->closestlower(x,y[1]).==x)

#where

closestlower(x,l)=reduce((c,sup)->  c< sup <= l ? sup : c ,x, init=typemin(l))


Such a good expression. Thanks a lot for that and sorry for the delay, I am on the mine site with internet problems.
Regards

use closejoin or closejoin! function in InMemoryDatasets

using InMemoryDatasets

df1 = Dataset(
    date = DateTime.(["2022-02-20T08:05:22", "2022-02-20T08:05:42", 
                      "2022-02-20T08:05:52", "2022-02-20T08:07:32", 
                      "2022-02-20T08:07:32", "2022-02-20T08:08:05"]),
    equipment_id = 165,
    loaded = "f",
    percent_grade = [0,-1,-3,-1,0,-8]
)

df2 = Dataset(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    value = [149.85,55.85,0,61.9,81.8,0]
)

df3 = Dataset(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    loaded = "f",
    value = [149.85,55.85,0,61.9,81.8,0],
    percent_grade = [0,0,0,0,-1,-1]
)

closejoin(df2, df1, on = [:equipment_id => :equipment_id, :date => :date])