I am looking for a tool/approach to perform an inexact merge. I have two (or more) dataframes with datetime columns (and value columns) that do not correspond.
I imagine specifying a tolerance value (say 5 minutes) for the merge.
Is there a package (or a best practice approach) that can help me with this?
I realize that this needs some assumptions/simplifications (about missing data, about which element to pick if several are are “close” or equally close, etc.)
Background: I want to visualize homeassistant data from multiple sensors (table „states“) in grafana. I understand that grafana can only access a single table for one graph.
EDIT: added MWE below.
using DataFrames
using CSV
url="""https://gist.githubusercontent.com/kafisatz/ed1f42fb4b5fd1ce1cf3a8a12b8e80b6/raw/70df7040c550bbdedd168c90a78caf4ee2808881/hadata.csv"""
fi = download(url)
df0 = CSV.File(fi) |> DataFrame
df1 = filter(x->x.entity_id == "kitchen_temperature",df0)
df2 = filter(x->x.entity_id == "buero_temperature",df0)
#how to merge these on datetime
#I have an approach, but it is very verbose (and feels somewhat complex)
?leftjoin
The quick & dirty solution would be binning to 5-minute blocks then. Of course you lose some matches that span block boundaries.
As a more involved but still quite easy solution, I could also imagine
- taking all
unique
timestamps from one dataframe,
- sorting them, call this
ts
- for each timestamp in the other dataframe, look up the closest one in
ts
and replace it with that if the distance is <5min, otherwise leave as is
- then join
Thanks. 1 to 4 is pretty much what I sketched already.
This is not apparent to me from the code — sorry if I missed something.
The SortMerge.jl package is designed exactly for this purpose:
using DataFrames, CSV, SortMerge, Dates
url="""https://gist.githubusercontent.com/kafisatz/ed1f42fb4b5fd1ce1cf3a8a12b8e80b6/raw/70df7040c550bbdedd168c90a78caf4ee2808881/hadata.csv"""
fi = download(url)
df0 = CSV.File(fi) |> DataFrame
df1 = filter(x->x.entity_id == "kitchen_temperature",df0)
df2 = filter(x->x.entity_id == "buero_temperature",df0)
j = sortmerge(df1.datetime, df2.datetime, Minute(5),
sd=(v1, v2, i1, i2, threshold) -> begin
diff = v1[i1] - v2[i2]
(abs(diff) > threshold) && (return sign(diff))
return 0
end)
# Rows common to both dataframes (i.e., inner join)
tmp = hcat(df1[j[1], :], df2[j[2], :], makeunique=true)
# Print maximum clock difference between matched rows
println(Minute(maximum(abs.(df1[j[1], :datetime] .- df2[j[2], :datetime]))))
# Rows from first dataframe without matching rows in the second one:
tmp = df1[countmatch(j, 1) .== 0, :]
# Rows from second dataframe without matching rows in the first one:
tmp = df2[countmatch(j, 2) .== 0, :]
2 Likes
ah thank you.
After running into GitHub - giordano/StarWarsArrays.jl: Arrays indexed as the order of Star Wars movies the other day, I am rather convinced that there is a Julia package for everything
1 Like