Thank you, it works now
Upgraded the MNWE above to a MWE.
The problem is that the values in both dataframes can be missing, not only in the second one.
By the way, is it possible to see those rows, that break the cardinality assumption, in the error message?
Yes, I see⦠You can use by_pred(f_L, !isdisjoint, f_R)
where f_L
and f_R
create these 0-or-Inf intervals.
This isnβt as efficient as possible with a dedicated implementation, but should work already.
The ideal interface seems to be by_pred(:col_L, isequal_with_missing_wildcard, :col_R)
, probably with a nicer name (: This could directly generalize to isequal_with_wildcard(nothing)
or isequal_with_wildcard(NaN)
β¦
Btw, is a function such as isequal_with_missing_wildcard
already defined somewhere?
Not that I know about. I myself defined a function
equal_missing(x,y) = any(ismissing, (x,y)) ? true : x==y
And used it inside the predicate.
Ideal would be to add some keyword that would modify the behaviour of all by_key
to do this kind of check. Writing up by_pred
for every variable with missing values that you want to join on is too cumbersome.
I agree, maybe by_key(f_L, f_R, isequal=isequal_with_wildcard)
β¦
It should be, for convenience β just isnβt implemented.
What I typically do after getting a cardinality error in
innerjoin((;L, R), by_..., cardinality=(1, 1))
, is run
@p outerjoin((;L, R), by_..., groupby=:L) |> filter(length(_.R) != 1)
(+ same in the other direction, groupby=:R
)
and explore the result.
Is it possible to do asof joins with maximum distance? For example, I want the most recent event that occurred at most 3 months ago. This is possible after the fact, but can it occur as part of the join?
For example, with the data:
d1 = DataFrame(event=[Date(2020), Date(2021)])
d2 = DataFrame(time=[Date(2019, 12), Date(2020, 6), Date(2019, 11)])
You can run
leftjoin(
(d1, d2),
by_pred(:event, >, :time),
multi=(identity, closest,)
)
2Γ2 DataFrame
Row β event time
β Date Date
ββββββΌββββββββββββββββββββββββ
1 β 2020-01-01 2019-12-01
2 β 2021-01-01 2020-06-01 # this is past the 3 month mark
Or you can make sure it is within the 3 month mark but then it is not for sure only 1 observation:
leftjoin(
(d1, d2),
by_pred(x -> x.event - Month(3) .. x.event, β, :time),
)
3Γ2 DataFrame
Row β event time
β Date Date?
ββββββΌββββββββββββββββββββββββ
1 β 2020-01-01 2019-11-01
2 β 2020-01-01 2019-12-01
3 β 2021-01-01 missing
multi=closest
should be supported for > < >= <=
as you noticed, and for by_distance
β e.g. by_distance(:event, :time, Euclidean(), <=(3))
.
But I donβt think thereβs anything out of the box for one-sided closest
with a restriction on the max distanceβ¦ Itβs not more difficult, just not clear how the user-facing interface should look. For by_pred(x -> x.event - Month(3) .. x.event, β, :time)
, itβs not clear what βclosestβ would mean.
If you have suggestions for a natural interface, please share
But also just filtering after join should work and be basically as efficient.
My thought on an interface would be something like:
leftjoin(
(d1, d2),
by_pred(:event, >, :time),
multi=(identity, closest(:event, :time))
)
I actually tried that when I was first looking at the package since it matches the by_key(...)
interface. As an additional piece, it would be interesting if it accepted custom functions (closest(:event, fun, :time)
). I think this could also allow items that are not already in the conditions to matter for picking the closest (e.g., find the closest time to an event without a restriction on how far away that time is).
An alternative would be moving the closest
out of a keyword argument from the parent function and into the by_pred
or by_distance
:
leftjoin(
(d1, d2),
by_pred(:event, >, :time; closest=true)
# or closest(by_pred(:event, >, :time))
)
To me, this clarifies what closest is being run on for the user.
A final alternative would be to combine those two:
leftjoin(
(d1, d2),
by_pred(:event, >, :time) & closest(:event, :time)
)
I am not sure about the feasibility of those options, but all of those to me fit in what is already done within the package.
Thanks for the suggestions! I agree that this functionality should be added as itβs clearly useful.
The main issue with putting closest
into the condition itself, no matter how exactly itβs done, is that it doesnβt specify which side the βclosestβ correspond to. Currently,
innerjoin((;A, B), ..., multi=(;B=closest))
means βfor each a in A
, take the single closest b from B
β. The A/B distinction is important as βfor each b in B
, take the single closest a from A
β would give a different result.
I guess I did not understand that from the docs.
So as a keyword argument, I think you could still do:
leftjoin(
(; A=d1, B=d2),
by_pred(:event, >, :time),
multi=(; B=closest(:event, :time))
)
Or by_pred
and by_distance
could accept keyword arguments and do something like:
leftjoin(
(; A=d1, B=d2),
by_pred(:event, >, :time; mutli=(;B=closest))
)
This would be the most consistent with the current code.
Another thought would be to borrow the left/right terminology, (e.g., leftclosest
and rightclosest
or left_closest
), so something like:
leftjoin(
(d1, d2),
by_pred(:event, >, :time) & leftclosest(:event, :time)
)
Regarding keyword vs positional arg β the idea is that one can use names anywhere, if desired. Both named and unnamed should be supported everywhere if datasets are passed as named args (leftjoin((; A, B), ...)
):
by_key(:key_in_A, :key_in_B)
orby_key(A=:key_in_A, B=:key_in_B)
multi=(identity, closest)
ormulti=(A=identity, B=closest)
ormulti=(;B=closest)
I think I like the clarity of putting multi
into the condition itself, as in by_pred(..., multi=(;B=closest))
or by_key(..., multi=(;B=first))
etc. Iβll try seeing how well it fits with other piecesβ¦
In particular, probably there should be a check that only one multi
argument is provided in case of several conditions ... & ... & ...
.