Asof merge in Julia

Hi, I am trying to do an ‘as of’ merge in Julia with two large Dataframes, based on the time and the equipment_id provided. As a manner of example, let me introduce the two df.head()

using DataFrames

df1 = DataFrame(
    date = DateTime.(["2022-02-20T08:05:22", "2022-02-20T08:05:42", 
                      "2022-02-20T08:05:52", "2022-02-20T08:07:32", 
                      "2022-02-20T08:07:32", "2022-02-20T08:08:05"]),
    equipment_id = 165,
    loaded = "f",
    percent_grade = [0,-1,-3,-1,0,-8]
)

df2 = DataFrame(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    value = [149.85,55.85,0,61.9,81.8,0]
)

Where the expected result has to looks like this:

df3 = DataFrame(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    loaded = "f",
    value = [149.85,55.85,0,61.9,81.8,0],
    percent_grade = [0,0,0,0,-1,-1]
)

As you noted, the “date” column present in both DataFrames aren’t able me to do an exact match join. That’s why I am trying to do an as of merge. The closest solution that I have found is this. However, the “double cursor” thing makes the problem even more complicated.
Hope you can help me with this.
Thanks in advance.
Regards

1 Like

Check this solution using the SortMerge.jl package, it should help.

2 Likes

What is the threshold for a match on the date columns?
If multiple match occurs, how do you choose the best?

1 Like

Thanks for the suggestion, but it did not deliver the expected result. I do also tried to understand the code and the documentation provided, still is too far from what I want.

FYI: There’s a feature request to add asof to DataFrames.jl and it was put on the 1. milestone:
https://github.com/JuliaData/DataFrames.jl/issues/2738

You can also do it in Pandas in by calling with PythonCall.jl or PyCall.jl, or with the Pandas.jl wrapper I guess.

Also might be helpful (but others seem to have pointed to a package that would be easier?) asof join with Julia data tools - Stack Overflow

@ameresv, if you could answer @gcalderone’s questions he might be able to provide a Julia solution with his package.

I was not aware of this type of merge as a function in other languages, but seems it does exist. I have done something similar manually but it involved computing the minimum distance between the timestamps.

You are right! In most cases, I use “xlookup” from excel; however, the amount of data used here makes it way difficult to handle for them.

Thanks for the question:
I would rather use a 5 seconds threshold. And in case to have multiple matches, the next item has to be used in this case.
Regards

I found this Which is pretty the same what I want to achieve.

I don’t get the result you expect, but I don’t know how else to interpret the conditions you described in a generic way.
Could something like this be right for you?


gdate1=groupby(rightjoin(df1,df2,on=:equipment_id, makeunique=true),:date_1)

subset(gdate1, [:date,:date_1]=>(x,y)->minimum(abs.(x-y)).==abs.(x-y))

julia> subset(gdate1, [:date,:date_1]=>(x,y)->minimum(abs.(x-y)).==abs.(x-y))
6×7 DataFrame
 Row │ date                 equipment_id  loaded   percent_grade  date_1               truck   value   
     │ DateTime?            Int64         String?  Int64?         DateTime             String  Float64 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:24  T67      149.85
   2 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:29  T67       55.85
   3 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:34  T67        0.0
   4 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:39  T67       61.9
   5 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:44  T67       81.8
   6 │ 2022-02-20T08:05:52           165  f                   -3  2022-02-20T08:05:49  T67        0.0

or maybe that’s what you’re looking for

julia> subset(gdate1, [:date,:date_1]=>(x,y)->maximum(filter(<=(Second(0)),x-y)).==x-y)
6×7 DataFrame
 Row │ date                 equipment_id  loaded   percent_grade  date_1               truck   value   
     │ DateTime?            Int64         String?  Int64?         DateTime             String  Float64 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:24  T67      149.85 
   2 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:29  T67       55.85 
   3 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:34  T67        0.0  
   4 │ 2022-02-20T08:05:22           165  f                    0  2022-02-20T08:05:39  T67       61.9  
   5 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:44  T67       81.8  
   6 │ 2022-02-20T08:05:42           165  f                   -1  2022-02-20T08:05:49  T67        0.0  


or this

subset(gdate1, [:date,:date_1]=>(x,y)->closestlower(x,y[1]).==x)

#where

closestlower(x,l)=reduce((c,sup)->  c< sup <= l ? sup : c ,x, init=typemin(l))


2 Likes

Such a good expression. Thanks a lot for that and sorry for the delay, I am on the mine site with internet problems.
Regards

use closejoin or closejoin! function in InMemoryDatasets

using InMemoryDatasets

df1 = Dataset(
    date = DateTime.(["2022-02-20T08:05:22", "2022-02-20T08:05:42", 
                      "2022-02-20T08:05:52", "2022-02-20T08:07:32", 
                      "2022-02-20T08:07:32", "2022-02-20T08:08:05"]),
    equipment_id = 165,
    loaded = "f",
    percent_grade = [0,-1,-3,-1,0,-8]
)

df2 = Dataset(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    value = [149.85,55.85,0,61.9,81.8,0]
)

df3 = Dataset(
    date = DateTime.(["2022-02-20T08:05:24","2022-02-20T08:05:29",
                      "2022-02-20T08:05:34","2022-02-20T08:05:39",
                      "2022-02-20T08:05:44","2022-02-20T08:05:49"]),
    truck = "T67",
    equipment_id = 165,
    loaded = "f",
    value = [149.85,55.85,0,61.9,81.8,0],
    percent_grade = [0,0,0,0,-1,-1]
)

closejoin(df2, df1, on = [:equipment_id => :equipment_id, :date => :date])
3 Likes