Fuzzy inexact merge

I am looking for a tool/approach to perform an inexact merge. I have two (or more) dataframes with datetime columns (and value columns) that do not correspond.

I imagine specifying a tolerance value (say 5 minutes) for the merge.

Is there a package (or a best practice approach) that can help me with this?
I realize that this needs some assumptions/simplifications (about missing data, about which element to pick if several are are “close” or equally close, etc.)

Background: I want to visualize homeassistant data from multiple sensors (table „states“) in grafana. I understand that grafana can only access a single table for one graph.

EDIT: added MWE below.

using DataFrames
using CSV

url="""https://gist.githubusercontent.com/kafisatz/ed1f42fb4b5fd1ce1cf3a8a12b8e80b6/raw/70df7040c550bbdedd168c90a78caf4ee2808881/hadata.csv"""
fi = download(url)
df0 = CSV.File(fi) |> DataFrame

df1 = filter(x->x.entity_id == "kitchen_temperature",df0)
df2 = filter(x->x.entity_id == "buero_temperature",df0)

#how to merge these on datetime
#I have an approach, but it is very verbose (and feels somewhat complex)
?leftjoin

The quick & dirty solution would be binning to 5-minute blocks then. Of course you lose some matches that span block boundaries.

As a more involved but still quite easy solution, I could also imagine

  1. taking all unique timestamps from one dataframe,
  2. sorting them, call this ts
  3. for each timestamp in the other dataframe, look up the closest one in ts and replace it with that if the distance is <5min, otherwise leave as is
  4. then join

Thanks. 1 to 4 is pretty much what I sketched already.

This is not apparent to me from the code — sorry if I missed something.

The SortMerge.jl package is designed exactly for this purpose:

using DataFrames, CSV, SortMerge, Dates

url="""https://gist.githubusercontent.com/kafisatz/ed1f42fb4b5fd1ce1cf3a8a12b8e80b6/raw/70df7040c550bbdedd168c90a78caf4ee2808881/hadata.csv"""
fi = download(url)
df0 = CSV.File(fi) |> DataFrame

df1 = filter(x->x.entity_id == "kitchen_temperature",df0)
df2 = filter(x->x.entity_id == "buero_temperature",df0)

j = sortmerge(df1.datetime, df2.datetime, Minute(5),
              sd=(v1, v2, i1, i2, threshold) -> begin
              diff = v1[i1] - v2[i2]
              (abs(diff) > threshold)  &&  (return sign(diff))
              return 0
              end)

# Rows common to both dataframes (i.e., inner join)
tmp = hcat(df1[j[1], :], df2[j[2], :], makeunique=true)

# Print maximum clock difference between matched rows
println(Minute(maximum(abs.(df1[j[1], :datetime] .- df2[j[2], :datetime]))))

# Rows from first dataframe without matching rows in the second one:
tmp = df1[countmatch(j, 1) .== 0, :]

# Rows from second dataframe without matching rows in the first one:
tmp = df2[countmatch(j, 2) .== 0, :]

2 Likes

ah thank you.
After running into GitHub - giordano/StarWarsArrays.jl: Arrays indexed as the order of Star Wars movies the other day, I am rather convinced that there is a Julia package for everything :slight_smile:

1 Like