Fuzzy inexact merge

bernhard · July 25, 2020, 12:57pm

I am looking for a tool/approach to perform an inexact merge. I have two (or more) dataframes with datetime columns (and value columns) that do not correspond.

I imagine specifying a tolerance value (say 5 minutes) for the merge.

Is there a package (or a best practice approach) that can help me with this?
I realize that this needs some assumptions/simplifications (about missing data, about which element to pick if several are are “close” or equally close, etc.)

Background: I want to visualize homeassistant data from multiple sensors (table „states“) in grafana. I understand that grafana can only access a single table for one graph.

EDIT: added MWE below.

using DataFrames
using CSV

url="""https://gist.githubusercontent.com/kafisatz/ed1f42fb4b5fd1ce1cf3a8a12b8e80b6/raw/70df7040c550bbdedd168c90a78caf4ee2808881/hadata.csv"""
fi = download(url)
df0 = CSV.File(fi) |> DataFrame

df1 = filter(x->x.entity_id == "kitchen_temperature",df0)
df2 = filter(x->x.entity_id == "buero_temperature",df0)

#how to merge these on datetime
#I have an approach, but it is very verbose (and feels somewhat complex)
?leftjoin

Tamas_Papp · July 25, 2020, 1:52pm

The quick & dirty solution would be binning to 5-minute blocks then. Of course you lose some matches that span block boundaries.

As a more involved but still quite easy solution, I could also imagine

taking all unique timestamps from one dataframe,
sorting them, call this ts
for each timestamp in the other dataframe, look up the closest one in ts and replace it with that if the distance is <5min, otherwise leave as is
then join

bernhard · July 25, 2020, 2:05pm

Thanks. 1 to 4 is pretty much what I sketched already.

Tamas_Papp · July 25, 2020, 2:23pm

This is not apparent to me from the code — sorry if I missed something.

gcalderone · July 29, 2020, 9:11am

The SortMerge.jl package is designed exactly for this purpose:

using DataFrames, CSV, SortMerge, Dates

url="""https://gist.githubusercontent.com/kafisatz/ed1f42fb4b5fd1ce1cf3a8a12b8e80b6/raw/70df7040c550bbdedd168c90a78caf4ee2808881/hadata.csv"""
fi = download(url)
df0 = CSV.File(fi) |> DataFrame

df1 = filter(x->x.entity_id == "kitchen_temperature",df0)
df2 = filter(x->x.entity_id == "buero_temperature",df0)

j = sortmerge(df1.datetime, df2.datetime, Minute(5),
              sd=(v1, v2, i1, i2, threshold) -> begin
              diff = v1[i1] - v2[i2]
              (abs(diff) > threshold)  &&  (return sign(diff))
              return 0
              end)

# Rows common to both dataframes (i.e., inner join)
tmp = hcat(df1[j[1], :], df2[j[2], :], makeunique=true)

# Print maximum clock difference between matched rows
println(Minute(maximum(abs.(df1[j[1], :datetime] .- df2[j[2], :datetime]))))

# Rows from first dataframe without matching rows in the second one:
tmp = df1[countmatch(j, 1) .== 0, :]

# Rows from second dataframe without matching rows in the first one:
tmp = df2[countmatch(j, 2) .== 0, :]

bernhard · July 29, 2020, 11:04am

ah thank you.
After running into GitHub - giordano/StarWarsArrays.jl: Arrays indexed as the order of Star Wars movies the other day, I am rather convinced that there is a Julia package for everything

Topic		Replies	Views
Asof merge in Julia General Usage dataframes , merge	12	1628	March 23, 2022
Functions to merge DataFrames by nearest date that is earlier? Data dataframes	4	1547	December 20, 2021
Merge dataframes where one value is between two others General Usage question	7	11844	November 22, 2020
Join Dataframes on different IDs within date range New to Julia query , dataframes	1	509	October 4, 2020
Two Feature Requests for `merge` Internals & Design	10	1163	May 25, 2018

Fuzzy inexact merge

Related topics